* feat(energy-atlas): GEM pipeline data import — gas 75→297, oil 75→334 (parity-push closure)
Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure
was built for, following the operator runbook in docs/methodology/pipelines.mdx.
Source releases (CC-BY 4.0, attribution preserved in registry envelope):
- GEM-GGIT-Gas-Pipelines-2025-11.xlsx
SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6
URL: https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx
- GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2
URL: https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) →
canonical operator-shape JSON via /tmp/gem-import/convert.py. Filter knobs:
- status ∈ {operating, construction}
- length ≥ 750 km (gas) / 400 km (oil): asymmetric per-fuel trunk-class thresholds
- capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day,
bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors.
- Country names → ISO 3166-1 alpha-2 via pycountry + alias table.
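A minimal sketch of the status and length knobs listed above, for readers who want to see the filter as code. The thresholds and status set come from this message; the `Candidate` shape and the `keep` helper are hypothetical and may not match what convert.py actually does.

```python
from dataclasses import dataclass

# Values quoted in this commit message.
ALLOWED_STATUS = {"operating", "construction"}
MIN_LENGTH_KM = {"gas": 750.0, "oil": 400.0}


@dataclass
class Candidate:
    # Hypothetical row shape, for illustration only.
    name: str
    fuel: str  # "gas" or "oil"
    status: str
    length_km: float


def keep(c: Candidate) -> bool:
    """True when the row passes both the status and the per-fuel length filter."""
    return c.status.lower() in ALLOWED_STATUS and c.length_km >= MIN_LENGTH_KM[c.fuel]


rows = [
    Candidate("A", "gas", "operating", 800.0),
    Candidate("B", "gas", "operating", 500.0),     # too short for gas
    Candidate("C", "oil", "construction", 450.0),
    Candidate("D", "oil", "proposed", 900.0),      # status filtered out
]
kept = [c.name for c in rows if keep(c)]
```

Note that the 500 km gas row is dropped while a 450 km oil row is kept, which is the asymmetry the length knob describes.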
Merge results (via scripts/import-gem-pipelines.mjs --merge):
gas: +222 added, 15 duplicates skipped (haversine ≤ 5km AND token Jaccard ≥ 0.6)
oil: +259 added, 16 duplicates skipped
Final: 297 gas / 334 oil. Hand-curated 75+75 preserved with full evidence;
GEM rows ship physicalStateSource='gem', classifierConfidence=0.4,
operatorStatement=null, sanctionRefs=[].
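The duplicate gate quoted above (haversine ≤ 5 km AND token Jaccard ≥ 0.6) can be sketched as follows. This is an illustration under assumptions, not the code in scripts/import-gem-pipelines.mjs: the tokenizer, the orientation handling, and the dict shape are hypothetical, and the distance is taken as the average over both endpoints.

```python
import math
import re


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def tokens(name):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def avg_endpoint_km(p, q):
    # Assumption: a line may be digitized in either direction, so try both.
    same = (haversine_km(p["start"], q["start"]) + haversine_km(p["end"], q["end"])) / 2
    flipped = (haversine_km(p["start"], q["end"]) + haversine_km(p["end"], q["start"])) / 2
    return min(same, flipped)


def is_duplicate(p, q, max_km=5.0, min_jaccard=0.6):
    return avg_endpoint_km(p, q) <= max_km and jaccard(p["name"], q["name"]) >= min_jaccard


# Synthetic rows, not real pipelines.
curated = {"name": "Alpha Beta Pipeline", "start": (10.0, 20.0), "end": (12.0, 22.0)}
near = {"name": "Alpha Beta Gas Pipeline", "start": (10.01, 20.01), "end": (12.0, 22.01)}
far = {"name": "Alpha Beta Pipeline", "start": (-30.0, 150.0), "end": (-32.0, 152.0)}
```

Requiring both conditions is what keeps a same-named line on another continent (`far`) from being skipped.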
Floor bump:
scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200.
Live counts (297/334) leave ~100 rows of jitter headroom so a partial
re-import or coverage-narrowing release fails loud rather than halving
the registry silently.
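As a rough illustration of the floor, assuming the check is a simple row-count comparison (the real validateRegistry in scripts/_pipeline-registry.mjs may do more), the headroom works out like this. The `check_floor` helper is hypothetical.

```python
MIN_PIPELINES_PER_REGISTRY = 200  # new floor from this commit


def check_floor(rows, floor=MIN_PIPELINES_PER_REGISTRY):
    """Raise when the registry is below the floor; otherwise return the headroom."""
    if len(rows) < floor:
        raise ValueError(f"registry has {len(rows)} rows, below floor {floor}")
    return len(rows) - floor


gas_headroom = check_floor(list(range(297)))  # 97 rows above the floor
oil_headroom = check_floor(list(range(334)))  # 134 rows above the floor
```

A re-import that halved either registry (about 150 rows) would raise instead of shipping.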
Tests:
- tests/pipelines-registry.test.mts: bumped synthetic-registry
Array.from({length:8}) → length:210 to clear new floor; added 'gem' to
the evidence-source whitelist for non-flowing badges (parity with the
derivePipelinePublicBadge audit done in PR #3397 U1).
- tests/import-gem-pipelines.test.mjs: bumped registry-conformance loop
3 → 70 to clear new floor.
- 51/51 pipeline tests pass; tsc --noEmit clean.
vs peer reference site (281 gas + 265 oil): we now exceed it on both gas (297)
and oil (334). Functional + visual + data parity for the energy variant
is closed; the remaining gap is editorial cadence (weekly briefing), which
is intentionally out of scope per the parity-push plan.
* docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh
PR #3406 imported the data but didn't land the conversion script that
produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py
so future operators can reproduce the import deterministically, and rewrites
the docs/methodology/pipelines.mdx runbook to match what actually works:
- Use GeoJSON (not XLSX) — the XLSX has properties but no lat/lon columns;
only the GIS .zip's GeoJSON has both. The original runbook said to download
the XLSX, which would fail at the lat/lon validation step.
- Cadence: quarterly refresh, with concrete signals (peer-site comparison,
90-day calendar reminder).
- Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so
future operators don't re-request the wrong dataset (the Extraction
Tracker = wells/fields, NOT pipelines — ours requires the Infrastructure
Trackers).
- Last-known-good URLs documented + URL pattern explained as fallback when
GEM rotates per release.
- Filter knob defaults documented inline (gas ≥ 750km, oil ≥ 400km, status
∈ {operating, construction}, capacity unit conversion table).
- Failure-mode table mapping common errors to fixes.
Converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON,
GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths so it
works for any release without code edits.
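A sketch of how a converter might read those four variables. The variable names are the ones listed above; treating all four as required, and the `load_config` helper itself, are assumptions rather than a description of scripts/_gem-geojson-to-canonical.py.

```python
import os
from pathlib import Path

REQUIRED = (
    "GEM_GAS_GEOJSON",
    "GEM_OIL_GEOJSON",
    "GEM_DOWNLOADED_AT",
    "GEM_SOURCE_VERSION",
)


def load_config(env=os.environ):
    """Collect converter settings from the environment, failing on any gap."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {
        "gas_geojson": Path(env["GEM_GAS_GEOJSON"]),
        "oil_geojson": Path(env["GEM_OIL_GEOJSON"]),
        "downloaded_at": env["GEM_DOWNLOADED_AT"],
        "source_version": env["GEM_SOURCE_VERSION"],
    }


# Example values are placeholders, not real release paths.
example = load_config({
    "GEM_GAS_GEOJSON": "gas.geojson",
    "GEM_OIL_GEOJSON": "oil.geojson",
    "GEM_DOWNLOADED_AT": "2025-01-01",
    "GEM_SOURCE_VERSION": "example-release",
})
```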
* fix(energy-atlas): close PR #3406 review findings — dedup + zero-length + test
Three Greptile findings on PR #3406:
P1 — Dedup miss (Dampier-Bunbury):
Same physical pipeline existed in both registries — curated `dampier-bunbury`
and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au` — because GEM
digitized only the southern 60% of the line. The shared Bunbury terminus
matched at 13.7 km, but the average-endpoint distance was 287 km, far over
the 5 km gate.
Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit —
if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint
pairings is ≤ 25 km, treat as duplicate. The 25 km anchor preserves the
existing "name collision in different ocean → still added" contract.
Added regression test: identical Dampier-Bunbury inputs → 0 added, 1
skipped, matched against `dampier-bunbury`.
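The short-circuit described in this fix can be sketched as below. The stopword list, tokenizer, and coordinates are hypothetical (the real ones live in scripts/_pipeline-dedup.mjs); only the two conditions, Jaccard == 1.0 after stopword removal and any of the 4 endpoint pairings ≤ 25 km, come from this message.

```python
import math
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"to", "natural", "gas", "oil", "pipeline"}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def name_set(name):
    return set(re.findall(r"[a-z0-9]+", name.lower())) - STOPWORDS


def min_pairing_km(p, q):
    # All 4 endpoint pairings: start-start, start-end, end-start, end-end.
    return min(haversine_km(a, b)
               for a in (p["start"], p["end"])
               for b in (q["start"], q["end"]))


def is_name_identity_duplicate(p, q, anchor_km=25.0):
    names_p, names_q = name_set(p["name"]), name_set(q["name"])
    return bool(names_p) and names_p == names_q and min_pairing_km(p, q) <= anchor_km


# Synthetic coordinates: the candidate covers only part of the curated line.
curated = {"name": "Dampier Bunbury", "start": (0.0, 0.0), "end": (10.0, 0.0)}
partial = {"name": "Dampier to Bunbury Natural Gas Pipeline",
           "start": (4.0, 0.0), "end": (10.1, 0.0)}
elsewhere = {"name": "Dampier Bunbury Pipeline", "start": (50.0, 100.0), "end": (60.0, 100.0)}
```

In this synthetic case one terminus is about 11 km apart while the other is several hundred km apart, so an average-based gate misses it but the single-anchor check catches it; the `elsewhere` row still fails the 25 km anchor and would be added.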
P1 — Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.):
GEM source GeoJSON occasionally has a Point geometry or single-coord
LineString, producing pipelines where startPoint == endPoint. They render
as map-point artifacts and skew aggregate-length stats.
Fix (defense in depth):
- scripts/_gem-geojson-to-canonical.py drops at conversion time
(`zero_length` reason in drop log).
- scripts/_pipeline-registry.mjs validateRegistry rejects defensively
so even a hand-curated row with degenerate geometry fails loud.
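One way to detect the degenerate geometries described here, as a sketch only. The GeoJSON `type` and `coordinates` members are standard; the helper names and the simplified MultiLineString handling (first point of the first part, last point of the last part) are assumptions, not the converter's actual logic.

```python
def endpoints(geometry):
    """Return (start, end) coordinate tuples, or None if no line can be formed."""
    kind = geometry.get("type")
    coords = geometry.get("coordinates") or []
    if kind == "LineString":
        points = coords
    elif kind == "MultiLineString":
        # Simplification: flatten the parts in order.
        points = [pt for part in coords for pt in part]
    else:
        return None  # Point and other non-line geometries
    if len(points) < 2:
        return None  # single-coordinate LineString
    return tuple(points[0]), tuple(points[-1])


def is_zero_length(geometry):
    """True when the geometry would yield startPoint == endPoint or no line at all."""
    ends = endpoints(geometry)
    return ends is None or ends[0] == ends[1]


point = {"type": "Point", "coordinates": [115.0, -33.0]}
single = {"type": "LineString", "coordinates": [[115.0, -33.0]]}
line = {"type": "LineString", "coordinates": [[115.0, -33.0], [116.0, -21.0]]}
```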
P2 — Test repetition coupled to fixture row count:
The hardcoded `for (let i = 0; i < 70; i++)` × 3 fixture rows = 210 silently
drops below the floor if the fixture is trimmed below 3 rows.
Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives reps from
the floor and current fixture length.
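The arithmetic behind that expression, transcribed to Python as a worked check. `REGISTRY_FLOOR = 200` is taken from the floor bump earlier in this message; the function name is hypothetical.

```python
import math

REGISTRY_FLOOR = 200


def reps_for(fixture_len, floor=REGISTRY_FLOOR, margin=5):
    """Repetitions needed so reps * fixture_len clears the floor with a margin."""
    return math.ceil(floor / fixture_len) + margin


# With the current 3-row fixture: ceil(200 / 3) + 5 = 72 reps, i.e. 216 rows.
rows_with_three = reps_for(3) * 3
# If the fixture shrinks to 2 rows: ceil(200 / 2) + 5 = 105 reps, i.e. 210 rows.
rows_with_two = reps_for(2) * 2
```

Because the ceiling already guarantees `reps * fixture_len >= floor`, the count stays above the floor for any non-empty fixture.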
Re-run --merge with all fixes applied:
gas: 75 → 293 (+218 added, 17 deduped — was 222/15 before; +2 catches via
name-set-identity short-circuit; -2 zero-length never imported)
oil: 75 → 325 (+250 added, 18 deduped — was 259/16; +2 catches; -7 zero-length)
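The re-run numbers above reconcile with the first import as follows. This is only a bookkeeping check on the figures quoted in this message; the `rerun` helper is hypothetical.

```python
def rerun(prev_added, prev_deduped, new_dedup_catches, zero_length_dropped, curated=75):
    """Recompute merge totals after the dedup and zero-length fixes."""
    added = prev_added - new_dedup_catches - zero_length_dropped
    deduped = prev_deduped + new_dedup_catches
    return {"added": added, "deduped": deduped, "total": curated + added}


gas = rerun(prev_added=222, prev_deduped=15, new_dedup_catches=2, zero_length_dropped=2)
oil = rerun(prev_added=259, prev_deduped=16, new_dedup_catches=2, zero_length_dropped=7)
```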
Tests: 74/74 pipeline tests pass; tsc --noEmit clean.