mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
* feat(energy-atlas): GEM pipeline import infrastructure (PR 1, plan U1-U4)
Lands the parser, dedup helper, validator extensions, and operator runbook
for the Global Energy Monitor (CC-BY 4.0) pipeline-data refresh, closing
the ~3.6× Energy Atlas pipeline-scale gap once the operator runs the
import.
Per docs/plans/2026-04-25-003-feat-energy-parity-pushup-plan.md PR 1.
U1 — Validator + schema extensions:
- Add `'gem'` to VALID_SOURCES in scripts/_pipeline-registry.mjs and to the
evidence-bearing-source whitelist in derivePipelinePublicBadge so GEM-
sourced offline rows derive a `disputed` badge via the external-signal
rule (parity with `press`/`satellite`/`ais-relay`).
- Export VALID_SOURCES so tests assert against the same source-of-truth
the validator uses (matches the VALID_OIL_PRODUCT_CLASSES pattern from
PR #3383).
- Floor bump (MIN_PIPELINES_PER_REGISTRY 8→200) intentionally DEFERRED
to the follow-up data PR. Bumping it now would leave the existing 75+75
hand-curated rows below the new floor and break seeder publishes
before the GEM data lands.
U2 — GEM parser (test-first):
- scripts/import-gem-pipelines.mjs reads a local JSON file (operator pre-
converts GEM Excel externally — no `xlsx` dependency added). Schema-
drift sentinel throws on missing columns. Status mapping covers
Operating/Construction/Cancelled/Mothballed/Idle/Shut-in. ProductClass
mapping covers Crude Oil / Refined Products / mixed-flow notes.
Capacity-unit conversion handles bcm/y, bbl/d, Mbd, kbd.
- 22 tests in tests/import-gem-pipelines.test.mjs cover schema sentinel,
fuel split, status mapping, productClass mapping, capacity conversion,
minimum-viable-evidence shape, registry-shape conformance, and bad-
coordinate rejection.
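A minimal sketch of what a schema-drift sentinel of this kind could look like. The column names and the function name `assertGemSchema` are hypothetical placeholders; the real GEM export headers and the parser's actual helper are not shown in this message.

```javascript
// Hypothetical column names, for illustration only. The real GEM headers may differ.
const REQUIRED_COLUMNS = ['ProjectID', 'PipelineName', 'Status', 'Fuel'];

// Throws as soon as any row lacks a required column, so an upstream
// header rename fails loudly instead of producing half-empty candidates.
function assertGemSchema(rows, required = REQUIRED_COLUMNS) {
  if (!Array.isArray(rows) || rows.length === 0) {
    throw new Error('schema sentinel: no rows to inspect');
  }
  rows.forEach((row, index) => {
    const isObject = row !== null && typeof row === 'object';
    const missing = required.filter(
      (col) => !isObject || !Object.prototype.hasOwnProperty.call(row, col),
    );
    if (missing.length > 0) {
      throw new Error(
        `schema drift: row ${index} is missing column(s): ${missing.join(', ')}`,
      );
    }
  });
}
```

Checking every row rather than only the first is a design choice of this sketch; a header-only check would be cheaper but would miss ragged rows.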
U3 — Deduplication (pure deterministic):
- scripts/_pipeline-dedup.mjs: dedupePipelines(existing, candidates) →
{ toAdd, skippedDuplicates }. Match rule: haversine ≤5km AND name
Jaccard ≥0.6 (BOTH required). Reverse-direction-pair-aware.
- 19 tests cover internal helpers, match logic, id collision, determinism,
and empty inputs.
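The match rule above can be sketched roughly as follows. This assumes each pipeline carries a `name` and a `start` point with `lat`/`lon` (a hypothetical shape), tokenizes names on non-alphanumeric characters, and leaves out the reverse-direction handling; the real helper in scripts/_pipeline-dedup.mjs may differ on all three points.

```javascript
const EARTH_RADIUS_KM = 6371;
const toRad = (deg) => (deg * Math.PI) / 180;

// Great-circle distance between two { lat, lon } points, in kilometres.
function haversineKm(a, b) {
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}

// Lowercased word tokens of a pipeline name, as a Set.
function nameTokens(name) {
  return new Set(name.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Jaccard similarity: |intersection| / |union| of the two token sets.
function nameJaccard(a, b) {
  const ta = nameTokens(a);
  const tb = nameTokens(b);
  if (ta.size === 0 && tb.size === 0) return 0;
  let shared = 0;
  for (const token of ta) if (tb.has(token)) shared += 1;
  return shared / (ta.size + tb.size - shared);
}

// BOTH conditions are required: within 5 km AND name similarity of at least 0.6.
function isSamePipeline(a, b) {
  return haversineKm(a.start, b.start) <= 5 && nameJaccard(a.name, b.name) >= 0.6;
}
```

Requiring both conditions keeps co-located but differently named lines apart (for example, two names that share only two of four distinct tokens score 0.5 and do not match), while still merging spelling variants of the same name.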
U4 — Operator runbook (data import deferred):
- docs/methodology/pipelines.mdx: 7-step runbook for the operator to
download GEM, pre-convert Excel→JSON, dry-run with --print-candidates,
merge with --merge, bump the registry floor, and commit with
provenance metadata.
- The actual data import is intentionally OUT OF SCOPE for this agent-
authored PR because GEM downloads are registration-gated. A follow-up
PR will commit the imported scripts/data/pipelines-{gas,oil}.json +
bump MIN_PIPELINES_PER_REGISTRY → 200 + record the GEM release SHA256.
Tests: typecheck clean; 67 tests pass across the three test files.
Codex-approved through 8 review rounds against origin/main @ 050073354.
* fix(energy-atlas): wire --merge to dedupePipelines + within-batch dedup (PR1 review)
P1 — --merge was a TODO no-op (import-gem-pipelines.mjs:291):
- Previously exited with code 2 + a "TODO: wire dedup once U3 lands"
message. The PR body and the methodology runbook both advertised
--merge as the operator path.
- Add mergeIntoRegistry(filename, candidates) helper that loads the
existing envelope, runs dedupePipelines() against the candidate
list, sorts new entries alphabetically by id (stable diff on rerun),
validates the merged registry via validateRegistry(), and writes
to disk only after validation passes. CLI --merge now invokes it
for both gas and oil + prints a per-fuel summary.
- Source attribution: the registry envelope's `source` field is
upgraded to mention GEM (CC-BY 4.0) on first merge so the data file
itself documents provenance.
P2 — dedup transitive-match bug (_pipeline-dedup.mjs:120):
- Pre-fix loop checked each candidate ONLY against the original
`existing` array. Two GEM rows that match each other but not anything
in `existing` would BOTH be added, defeating the dedup contract for
same-batch duplicates (real example: a primary GEM entry plus a
duplicate row from a regional supplemental sheet).
- Now compares against existing FIRST (existing wins on cross-set
match — preserves richer hand-curated evidence), then falls back to
the already-accepted toAdd set. Within-batch matches retain the FIRST
accepted candidate (deterministic by candidate-list order).
Tests: 22 in tests/pipeline-dedup.test.mjs (3 new) cover the
within-batch dedup, transitive collapse, and existing-wins-over-
already-accepted scenarios. typecheck clean.
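The fixed loop order can be illustrated with a small sketch. The name `dedupeSketch`, the injected `isMatch` predicate, and the shape of the `skippedDuplicates` entries are assumptions made for this example; the real dedupePipelines(existing, candidates) takes only two arguments.

```javascript
// Check `existing` first (existing wins), then the candidates already
// accepted in this batch. The first accepted candidate is retained.
function dedupeSketch(existing, candidates, isMatch) {
  const toAdd = [];
  const skippedDuplicates = [];
  for (const candidate of candidates) {
    const hit =
      existing.find((row) => isMatch(row, candidate)) ??
      toAdd.find((row) => isMatch(row, candidate));
    if (hit) {
      skippedDuplicates.push({ candidateId: candidate.id, matchedId: hit.id });
    } else {
      toAdd.push(candidate);
    }
  }
  return { toAdd, skippedDuplicates };
}

// Usage: c2 duplicates c1 within the batch, c3 duplicates an existing row.
const sameKey = (a, b) => a.key === b.key;
const sketchResult = dedupeSketch(
  [{ id: 'e1', key: 'A' }],
  [{ id: 'c1', key: 'B' }, { id: 'c2', key: 'B' }, { id: 'c3', key: 'A' }],
  sameKey,
);
```

Because the result depends only on the order of `candidates`, rerunning with the same inputs yields the same `toAdd` list.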
* fix(energy-atlas): cross-file-atomic --merge (PR1 review #2)
P1 — partial-import on disk if oil validation fails after gas writes
(import-gem-pipelines.mjs:329 / :350):
- Previous flow ran `mergeIntoRegistry('pipelines-gas.json', gas)` which
wrote to disk, then `mergeIntoRegistry('pipelines-oil.json', oil)`. If
oil validation failed, the operator was left with a half-imported
state: gas had GEM rows committed to disk but oil didn't.
- Refactor into a two-phase API:
1. prepareMerge(filename, candidates) — pure, no disk I/O. Builds the
merged envelope, validates it, throws on validation failure.
2. mergeBothRegistries(gasCandidates, oilCandidates) — calls
prepareMerge for BOTH fuels first; only writes to disk after BOTH
pass validation. If oil's prepareMerge throws, gas was never
touched on disk.
- CLI --merge now invokes mergeBothRegistries. The atomicity guarantee
is documented inline in the helper.
typecheck clean. No new tests because the existing dedup + validate
suites cover the underlying logic; the change is purely about call
ordering for atomicity.
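A rough sketch of the prepare-then-write ordering, under stated assumptions: the envelope is taken to hold a `pipelines` array, and `validate` and `write` are injected so the example involves no disk I/O. The names `prepareMergeSketch` and `mergeBothSketch` and their signatures are hypothetical and differ from the real prepareMerge(filename, candidates) and mergeBothRegistries(gasCandidates, oilCandidates).

```javascript
// Phase 1 helper: pure. Builds the merged envelope and throws if it is invalid.
// `validate` is assumed to return an array of error strings (empty when valid).
function prepareMergeSketch(envelope, candidates, validate) {
  const merged = { ...envelope, pipelines: [...envelope.pipelines, ...candidates] };
  const problems = validate(merged);
  if (problems.length > 0) {
    throw new Error(`validation failed: ${problems.join('; ')}`);
  }
  return merged;
}

// Both fuels are prepared before anything is written. If the oil step
// throws, execution never reaches the write loop, so gas stays untouched.
function mergeBothSketch(gas, oil, { validate, write }) {
  const prepared = [
    ['pipelines-gas.json', prepareMergeSketch(gas.envelope, gas.candidates, validate)],
    ['pipelines-oil.json', prepareMergeSketch(oil.envelope, oil.candidates, validate)],
  ];
  // Phase 2: writes happen only after both validations have passed.
  for (const [filename, merged] of prepared) write(filename, merged);
  return prepared.map(([filename, merged]) => ({
    filename,
    count: merged.pipelines.length,
  }));
}
```

In this sketch the guarantee covers validation failures only; an I/O error between the two writes would need separate handling, such as writing to temporary files and renaming.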
* fix(energy-atlas): deterministic lastEvidenceUpdate + clarify test comment (PR1 review #3)
P2 — lastEvidenceUpdate was non-deterministic (Greptile P2):
- Previous code used new Date().toISOString() per parser run, so two runs
of parseGemPipelines on the same input on different days produced
byte-different output. Quarterly re-imports would produce noisy
full-row diffs even when the upstream GEM data hadn't changed.
- New: resolveEvidenceTimestamp(envelope) derives the timestamp from
envelope.downloadedAt (the operator-recorded date) or sourceVersion
if it parses as ISO. Falls back to a 1970-01-01 sentinel when neither
is set. The sentinel is deliberately ugly so reviewers spot the missing
field in the data file diff instead of silently getting today's date.
- Computed once per parse run so every emitted candidate gets the
same timestamp.
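One way such a resolver might look, as a sketch only. The strict ISO pattern, the full-ISO output format, and the name `resolveEvidenceTimestampSketch` are assumptions of this example, not the actual implementation. Date-times without an explicit offset are rejected here because JavaScript parses them in local time, which would reintroduce non-determinism across machines.

```javascript
const EPOCH_SENTINEL = '1970-01-01T00:00:00.000Z';

// Accept a date-only string, or a date-time with an explicit Z or ±hh:mm offset.
const ISO_PATTERN =
  /^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?$/;

function resolveEvidenceTimestampSketch(envelope) {
  // Prefer the operator-recorded download date, then sourceVersion.
  for (const value of [envelope?.downloadedAt, envelope?.sourceVersion]) {
    if (typeof value !== 'string' || !ISO_PATTERN.test(value)) continue;
    const ms = Date.parse(value);
    if (!Number.isNaN(ms)) return new Date(ms).toISOString();
  }
  // Neither field is usable: return the loud sentinel, never "now".
  return EPOCH_SENTINEL;
}
```

Calling this once per parse run and reusing the result gives every emitted candidate the same value, so a re-import of unchanged data produces no timestamp-only diffs.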
P2 — misleading test comment (Greptile P2):
- Comment in tests/import-gem-pipelines.test.mjs:136 said "400_000 bbl/d
÷ 1000 = 400 Mbd" while the assertion correctly expects 0.4 (because
the convention is millions, not thousands). Rewrote the comment to
state the actual rule + arithmetic clearly.
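The arithmetic behind that comment, sketched under the assumption that oil capacity is normalized to Mbd meaning millions of barrels per day, and that kbd means thousands of barrels per day. The function name `toMbd` and the unit table are hypothetical, and gas units such as bcm/y are left out.

```javascript
// Barrels per day represented by one unit of each supported label.
const BBL_PER_DAY = new Map([
  ['bbl/d', 1],
  ['kbd', 1e3],
  ['Mbd', 1e6],
]);

// Normalize an oil capacity to millions of barrels per day.
function toMbd(value, unit) {
  const factor = BBL_PER_DAY.get(unit);
  if (factor === undefined) {
    throw new Error(`unsupported oil capacity unit: ${unit}`);
  }
  return (value * factor) / 1e6;
}

// 400_000 bbl/d ÷ 1_000_000 = 0.4 Mbd (millions, not thousands).
const fromBarrels = toMbd(400_000, 'bbl/d');
const fromKbd = toMbd(400, 'kbd');
```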
3 new tests for determinism: (a) two parser runs produce identical
output, (b) timestamp derives from downloadedAt, (c) missing date
yields the epoch sentinel (loud failure mode).