mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
main
2 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
8655bd81bc |
feat(energy-atlas): GEM pipeline data import — gas 297, oil 334 (#3406)
* feat(energy-atlas): GEM pipeline data import — gas 75→297, oil 75→334 (parity-push closure) Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure was built for. Per docs/methodology/pipelines.mdx operator runbook. Source releases (CC-BY 4.0, attribution preserved in registry envelope): - GEM-GGIT-Gas-Pipelines-2025-11.xlsx SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6 URL: https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx - GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2 URL: https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) → canonical operator-shape JSON via /tmp/gem-import/convert.py. Filter knobs: - status ∈ {operating, construction} - length ≥ 750 km (gas) / 400 km (oil) — asymmetric per-fuel trunk-class - capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day, bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors. - Country names → ISO 3166-1 alpha-2 via pycountry + alias table. Merge results (via scripts/import-gem-pipelines.mjs --merge): gas: +222 added, 15 duplicates skipped (haversine ≤ 5km AND token Jaccard ≥ 0.6) oil: +259 added, 16 duplicates skipped Final: 297 gas / 334 oil. Hand-curated 75+75 preserved with full evidence; GEM rows ship physicalStateSource='gem', classifierConfidence=0.4, operatorStatement=null, sanctionRefs=[]. Floor bump: scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200. Live counts (297/334) leave ~100 rows of jitter headroom so a partial re-import or coverage-narrowing release fails loud rather than halving the registry silently. Tests: - tests/pipelines-registry.test.mts: bumped synthetic-registry Array.from({length:8}) → length:210 to clear new floor; added 'gem' to the evidence-source whitelist for non-flowing badges (parity with the derivePipelinePublicBadge audit done in PR #3397 U1). - tests/import-gem-pipelines.test.mjs: bumped registry-conformance loop 3 → 70 to clear new floor. - 51/51 pipeline tests pass; tsc --noEmit clean. vs peer reference site (281 gas + 265 oil): we now match (gas 297) and exceed (oil 334). Functional + visual + data parity for the energy variant is closed; remaining gaps are editorial-cadence (weekly briefing) which is intentionally out of scope per the parity-push plan. * docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh PR #3406 imported the data but didn't land the conversion script that produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py so future operators can reproduce the import deterministically, and rewrites the docs/methodology/pipelines.mdx runbook to match what actually works: - Use GeoJSON (not XLSX) — the XLSX has properties but no lat/lon columns; only the GIS .zip's GeoJSON has both. The original runbook said to download XLSX which would fail at the lat/lon validation step. - Cadence: quarterly refresh, with concrete signals (peer-site comparison, 90-day calendar reminder). - Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so future operators don't re-request the wrong dataset (the Extraction Tracker = wells/fields, NOT pipelines — ours requires the Infrastructure Trackers). - Last-known-good URLs documented + URL pattern explained as fallback when GEM rotates per release. - Filter knob defaults documented inline (gas ≥ 750km, oil ≥ 400km, status ∈ {operating, construction}, capacity unit conversion table). - Failure-mode table mapping common errors to fixes. Converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON, GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths so it works for any release without code edits. * fix(energy-atlas): close PR #3406 review findings — dedup + zero-length + test Three Greptile findings on PR #3406: P1 — Dedup miss (Dampier-Bunbury): Same physical pipeline existed in both registries — curated `dampier-bunbury` and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au` — because GEM digitized only the southern 60% of the line. The shared Bunbury terminus matched at 13.7 km but the average-endpoint distance was 287 km, just over the 5 km gate. Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit — if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint pairings is ≤ 25 km, treat as duplicate. The 25 km anchor preserves the existing "name collision in different ocean → still added" contract. Added regression test: identical Dampier-Bunbury inputs → 0 added, 1 skipped, matched against `dampier-bunbury`. P1 — Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.): GEM source GeoJSON occasionally has a Point geometry or single-coord LineString, producing pipelines where startPoint == endPoint. They render as map-point artifacts and skew aggregate-length stats. Fix (defense in depth): - scripts/_gem-geojson-to-canonical.py drops at conversion time (`zero_length` reason in drop log). - scripts/_pipeline-registry.mjs validateRegistry rejects defensively so even a hand-curated row with degenerate geometry fails loud. P2 — Test repetition coupled to fixture row count: Hardcoded `for (let i = 0; i < 70; i++)` × 3 fixture rows = 210 silently breaks if fixture is trimmed below 3. Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives reps from the floor and current fixture length. Re-run --merge with all fixes applied: gas: 75 → 293 (+218 added, 17 deduped — was 222/15 before; +2 catches via name-set-identity short-circuit; -2 zero-length never imported) oil: 75 → 325 (+250 added, 18 deduped — was 259/16; +2 catches; -7 zero-length) Tests: 74/74 pipeline tests pass; tsc --noEmit clean. |
||
|
|
d9a1f6a0f8 |
feat(energy-atlas): GEM pipeline import infrastructure (parity PR 1, plan U1-U4) (#3397)
* feat(energy-atlas): GEM pipeline import infrastructure (PR 1, plan U1-U4)
Lands the parser, dedup helper, validator extensions, and operator runbook
for the Global Energy Monitor (CC-BY 4.0) pipeline-data refresh — closing
~3.6× of the Energy Atlas pipeline-scale gap once the operator runs the
import.
Per docs/plans/2026-04-25-003-feat-energy-parity-pushup-plan.md PR 1.
U1 — Validator + schema extensions:
- Add `'gem'` to VALID_SOURCES in scripts/_pipeline-registry.mjs and to the
evidence-bearing-source whitelist in derivePipelinePublicBadge so GEM-
sourced offline rows derive a `disputed` badge via the external-signal
rule (parity with `press`/`satellite`/`ais-relay`).
- Export VALID_SOURCES so tests assert against the same source-of-truth
the validator uses (matches the VALID_OIL_PRODUCT_CLASSES pattern from
PR #3383).
- Floor bump (MIN_PIPELINES_PER_REGISTRY 8→200) intentionally DEFERRED
to the follow-up data PR — bumping it now would gate the existing 75+75
hand-curated rows below the new floor and break seeder publishes
before the GEM data lands.
U2 — GEM parser (test-first):
- scripts/import-gem-pipelines.mjs reads a local JSON file (operator pre-
converts GEM Excel externally — no `xlsx` dependency added). Schema-
drift sentinel throws on missing columns. Status mapping covers
Operating/Construction/Cancelled/Mothballed/Idle/Shut-in. ProductClass
mapping covers Crude Oil / Refined Products / mixed-flow notes.
Capacity-unit conversion handles bcm/y, bbl/d, Mbd, kbd.
- 22 tests in tests/import-gem-pipelines.test.mjs cover schema sentinel,
fuel split, status mapping, productClass mapping, capacity conversion,
minimum-viable-evidence shape, registry-shape conformance, and bad-
coordinate rejection.
U3 — Deduplication (pure deterministic):
- scripts/_pipeline-dedup.mjs: dedupePipelines(existing, candidates) →
{ toAdd, skippedDuplicates }. Match rule: haversine ≤5km AND name
Jaccard ≥0.6 (BOTH required). Reverse-direction-pair-aware.
- 19 tests cover internal helpers, match logic, id collision, determinism,
and empty inputs.
U4 — Operator runbook (data import deferred):
- docs/methodology/pipelines.mdx: 7-step runbook for the operator to
download GEM, pre-convert Excel→JSON, dry-run with --print-candidates,
merge with --merge, bump the registry floor, and commit with
provenance metadata.
- The actual data import is intentionally OUT OF SCOPE for this agent-
authored PR because GEM downloads are registration-gated. A follow-up
PR will commit the imported scripts/data/pipelines-{gas,oil}.json +
bump MIN_PIPELINES_PER_REGISTRY → 200 + record the GEM release SHA256.
Tests: typecheck clean; 67 tests pass across the three test files.
Codex-approved through 8 review rounds against origin/main @
|