worldmonitor/scripts/_pipeline-registry.mjs
Elie Habib 8655bd81bc feat(energy-atlas): GEM pipeline data import — gas 297, oil 334 (#3406)
* feat(energy-atlas): GEM pipeline data import — gas 75→297, oil 75→334 (parity-push closure)

Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure
was built for. Per docs/methodology/pipelines.mdx operator runbook.

Source releases (CC-BY 4.0, attribution preserved in registry envelope):
  - GEM-GGIT-Gas-Pipelines-2025-11.xlsx
    SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6
    URL:    https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx
  - GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
    SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2
    URL:    https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx

Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) →
canonical operator-shape JSON via /tmp/gem-import/convert.py. Filter knobs:
  - status ∈ {operating, construction}
  - length ≥ 750 km (gas) / 400 km (oil) — asymmetric per-fuel trunk-class
  - capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day,
    bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors.
  - Country names → ISO 3166-1 alpha-2 via pycountry + alias table.
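
A minimal sketch of the capacity normalization listed above, written in JavaScript for consistency with the registry module (the converter itself is Python and is not reproduced here). The factor tables and the `toBcmPerYear` / `toBblPerDay` names are assumptions for illustration. Only factors that follow directly from unit definitions are shown; `mtpa` and `Mb/d` are left out because their factors depend on conventions this message does not state.

```javascript
// Hypothetical factor tables, not the converter's actual values.
const M3_PER_CUBIC_FOOT = 0.028316846592; // exact, from the definition of the foot
const DAYS_PER_YEAR = 365;                // assumption: non-leap year

// Multipliers that turn one unit of the source value into bcm/y.
const GAS_TO_BCM_PER_YEAR = new Map([
  ['bcm/y', 1],
  ['MMcf/d', (1e6 * M3_PER_CUBIC_FOOT * DAYS_PER_YEAR) / 1e9],
  ['MMSCMD', (1e6 * DAYS_PER_YEAR) / 1e9],
  ['m3/day', DAYS_PER_YEAR / 1e9],
]);

// Multipliers that turn one unit of the source value into bbl/d.
const OIL_TO_BBL_PER_DAY = new Map([
  ['bpd', 1],
  ['kbd', 1000],
]);

// Returns null for unknown units or non-positive values, so the caller
// can drop or flag the row instead of shipping a guessed capacity.
function convert(value, unit, table) {
  const factor = table.get(unit);
  if (factor === undefined || !Number.isFinite(value) || value <= 0) return null;
  return value * factor;
}

const toBcmPerYear = (value, unit) => convert(value, unit, GAS_TO_BCM_PER_YEAR);
const toBblPerDay = (value, unit) => convert(value, unit, OIL_TO_BBL_PER_DAY);

console.log(toBcmPerYear(1000, 'MMcf/d')); // roughly 10.34 bcm/y
console.log(toBblPerDay(450, 'kbd'));      // 450000
```

Returning `null` instead of throwing keeps the sketch close to a drop-log style of handling, where unconvertible rows are recorded and skipped.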

Merge results (via scripts/import-gem-pipelines.mjs --merge):
  gas: +222 added, 15 duplicates skipped (haversine ≤ 5km AND token Jaccard ≥ 0.6)
  oil: +259 added, 16 duplicates skipped
  Final: 297 gas / 334 oil. Hand-curated 75+75 preserved with full evidence;
  GEM rows ship physicalStateSource='gem', classifierConfidence=0.4,
  operatorStatement=null, sanctionRefs=[].

Floor bump:
  scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200.
  Live counts (297/334) leave ~100 rows of jitter headroom so a partial
  re-import or coverage-narrowing release fails loud rather than halving
  the registry silently.

Tests:
  - tests/pipelines-registry.test.mts: bumped synthetic-registry
    Array.from({length:8}) → length:210 to clear new floor; added 'gem' to
    the evidence-source whitelist for non-flowing badges (parity with the
    derivePipelinePublicBadge audit done in PR #3397 U1).
  - tests/import-gem-pipelines.test.mjs: bumped registry-conformance loop
    3 → 70 to clear new floor.
  - 51/51 pipeline tests pass; tsc --noEmit clean.

vs peer reference site (281 gas + 265 oil): we now match (gas 297) and
exceed (oil 334). Functional + visual + data parity for the energy variant
is closed; remaining gaps are editorial-cadence (weekly briefing) which
is intentionally out of scope per the parity-push plan.

* docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh

PR #3406 imported the data but didn't land the conversion script that
produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py
so future operators can reproduce the import deterministically, and rewrites
the docs/methodology/pipelines.mdx runbook to match what actually works:

- Use GeoJSON (not XLSX) — the XLSX has properties but no lat/lon columns;
  only the GIS .zip's GeoJSON has both. The original runbook said to download
  XLSX which would fail at the lat/lon validation step.
- Cadence: quarterly refresh, with concrete signals (peer-site comparison,
  90-day calendar reminder).
- Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so
  future operators don't re-request the wrong dataset (the Extraction
  Tracker = wells/fields, NOT pipelines — ours requires the Infrastructure
  Trackers).
- Last-known-good URLs documented + URL pattern explained as fallback when
  GEM rotates per release.
- Filter knob defaults documented inline (gas ≥ 750km, oil ≥ 400km, status
  ∈ {operating, construction}, capacity unit conversion table).
- Failure-mode table mapping common errors to fixes.

Converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON,
GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths so it
works for any release without code edits.

* fix(energy-atlas): close PR #3406 review findings — dedup + zero-length + test

Three Greptile findings on PR #3406:

P1 — Dedup miss (Dampier-Bunbury):
  Same physical pipeline existed in both registries — curated `dampier-bunbury`
  and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au` — because GEM
  digitized only the southern 60% of the line. The shared Bunbury terminus
  matched at 13.7 km but the average-endpoint distance was 287 km, far over
  the 5 km gate.
  Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit —
  if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint
  pairings is ≤ 25 km, treat as duplicate. The 25 km anchor preserves the
  existing "name collision in different ocean → still added" contract.
  Added regression test: identical Dampier-Bunbury inputs → 0 added, 1
  skipped, matched against `dampier-bunbury`.
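
The two-stage rule described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of scripts/_pipeline-dedup.mjs: the helper names, the stopword list, the tokenizer, and the way the average-endpoint distance is taken (best of the two orientations) are all hypothetical. The names and coordinates in the example are approximate and are not registry data.

```javascript
const EARTH_RADIUS_KM = 6371;
// Hypothetical stopword list.
const STOPWORDS = new Set(['pipeline', 'natural', 'gas', 'oil', 'to', 'the', 'line']);

// Great-circle distance between two {lat, lon} points, in km.
function haversineKm(a, b) {
  const rad = (deg) => (deg * Math.PI) / 180;
  const dLat = rad(b.lat - a.lat);
  const dLon = rad(b.lon - a.lon);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(a.lat)) * Math.cos(rad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Lowercased alphanumeric tokens with stopwords removed.
function nameTokens(name) {
  const tokens = name.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  return new Set(tokens.filter((t) => !STOPWORDS.has(t)));
}

// Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const t of a) if (b.has(t)) shared++;
  return shared / (a.size + b.size - shared);
}

function isDuplicate(candidate, existing) {
  const d = {
    ss: haversineKm(candidate.startPoint, existing.startPoint),
    ee: haversineKm(candidate.endPoint, existing.endPoint),
    se: haversineKm(candidate.startPoint, existing.endPoint),
    es: haversineKm(candidate.endPoint, existing.startPoint),
  };
  const similarity = jaccard(nameTokens(candidate.name), nameTokens(existing.name));
  // Base rule: average endpoint distance <= 5 km AND Jaccard >= 0.6.
  const avgKm = Math.min((d.ss + d.ee) / 2, (d.se + d.es) / 2);
  if (avgKm <= 5 && similarity >= 0.6) return true;
  // Short-circuit: identical token sets AND any endpoint pairing <= 25 km.
  return similarity === 1 && Math.min(d.ss, d.ee, d.se, d.es) <= 25;
}

const curated = {
  name: 'Dampier-Bunbury',
  startPoint: { lat: -20.66, lon: 116.71 },
  endPoint: { lat: -33.33, lon: 115.64 },
};
const partialTrace = {
  name: 'Dampier to Bunbury Natural Gas Pipeline',
  startPoint: { lat: -28.0, lon: 115.0 },
  endPoint: { lat: -33.3, lon: 115.7 },
};
console.log(isDuplicate(partialTrace, curated)); // true, via the short-circuit
```

The 25 km anchor is what keeps a pure name collision from matching: two rows with identical token sets but no endpoint pairing within 25 km fall through both rules and are still added.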

P1 — Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.):
  GEM source GeoJSON occasionally has a Point geometry or single-coord
  LineString, producing pipelines where startPoint == endPoint. They render
  as map-point artifacts and skew aggregate-length stats.
  Fix (defense in depth):
    - scripts/_gem-geojson-to-canonical.py drops at conversion time
      (`zero_length` reason in drop log).
    - scripts/_pipeline-registry.mjs validateRegistry rejects defensively
      so even a hand-curated row with degenerate geometry fails loud.
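
A sketch of the conversion-time drop, in JavaScript for consistency with the registry module (the converter is Python). `endpointsOf` and its return shape are hypothetical. Using the first and last position of a flattened MultiLineString as the endpoints is a simplification assumed here, not something the source confirms.

```javascript
// GeoJSON positions are [lon, lat]. Returns null when the geometry cannot
// yield two distinct endpoints (the `zero_length` case), else the endpoints.
function endpointsOf(geometry) {
  if (!geometry) return null;
  let coords;
  if (geometry.type === 'LineString') {
    coords = geometry.coordinates;
  } else if (geometry.type === 'MultiLineString') {
    coords = geometry.coordinates.flat(); // one level: lines -> positions
  } else {
    return null; // Point and other types carry no route
  }
  if (!Array.isArray(coords) || coords.length < 2) return null;
  const [startLon, startLat] = coords[0];
  const [endLon, endLat] = coords[coords.length - 1];
  if (startLon === endLon && startLat === endLat) return null;
  return {
    startPoint: { lat: startLat, lon: startLon },
    endPoint: { lat: endLat, lon: endLon },
  };
}

console.log(endpointsOf({ type: 'Point', coordinates: [10, 50] }));        // null
console.log(endpointsOf({ type: 'LineString', coordinates: [[10, 50]] })); // null
console.log(endpointsOf({ type: 'LineString', coordinates: [[10, 50], [12, 51]] }));
```

The validator-side check is the same equality test applied to the already-converted `startPoint` and `endPoint`, so a degenerate row is rejected regardless of how it entered the registry.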

P2 — Test repetition coupled to fixture row count:
  Hardcoded `for (let i = 0; i < 70; i++)` × 3 fixture rows = 210 silently
  breaks if fixture is trimmed below 3.
  Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives reps from
  the floor and current fixture length.
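
The arithmetic behind the derived repetition count, as a small sketch. `REGISTRY_FLOOR` here is a local stand-in for the floor of 200, and `repsToClearFloor` is a hypothetical helper name, not the test's actual code.

```javascript
// Local stand-in for the registry floor (200 per this PR).
const REGISTRY_FLOOR = 200;

// Repetitions needed so fixtureLength * reps clears the floor, plus a
// margin of 5 repetitions.
function repsToClearFloor(fixtureLength, floor = REGISTRY_FLOOR) {
  if (!Number.isInteger(fixtureLength) || fixtureLength < 1) {
    throw new RangeError('fixture must contain at least one row');
  }
  return Math.ceil(floor / fixtureLength) + 5;
}

for (const n of [3, 2, 1]) {
  const reps = repsToClearFloor(n);
  console.log(`${n} fixture rows x ${reps} reps = ${n * reps} rows`);
}
// 3 fixture rows x 72 reps = 216 rows
// 2 fixture rows x 105 reps = 210 rows
// 1 fixture rows x 205 reps = 205 rows
```

With the old hardcoded 70 repetitions, trimming the fixture to 2 rows would have produced 140 rows, below the floor of 200.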

Re-run --merge with all fixes applied:
  gas: 75 → 293 (+218 added, 17 deduped — was 222/15 before; +2 catches via
       name-set-identity short-circuit; -2 zero-length never imported)
  oil: 75 → 325 (+250 added, 18 deduped — was 259/16; +2 catches; -7 zero-length)

Tests: 74/74 pipeline tests pass; tsc --noEmit clean.
2026-04-25 18:59:46 +04:00

// @ts-check
//
// Shared utility module for the oil & gas pipeline registries used by the
// Energy Atlas. NOT an entry point — see seed-pipelines-gas.mjs and
// seed-pipelines-oil.mjs for the two runSeed invocations. These are split
// because runSeed() hard-exits the process on its terminal paths (_seed-utils
// has ~9 process.exit sites), so two runSeed calls in one process would leave
// the second key unwritten.
//
// Data lives in scripts/data/pipelines-{gas,oil}.json: hand-curated rows plus
// rows imported from GEM (evidence.physicalStateSource === 'gem').
// Schema + evidence model documented in docs/methodology/pipelines.mdx.
//
// Public badge is DERIVED server-side from the evidence bundle. We publish
// the raw evidence here; the derivation lives in the supply-chain handler
// (upcoming). See §4 of the Global Energy Flow parity plan for why we do
// not publish bare conclusion labels.
import { readFileSync } from 'node:fs';
import { dirname, resolve } from 'node:path';
import { fileURLToPath } from 'node:url';
export const GAS_CANONICAL_KEY = 'energy:pipelines:gas:v1';
export const OIL_CANONICAL_KEY = 'energy:pipelines:oil:v1';
// Per §13.2 of the parity plan: registry fields (geometry/operator/capacity)
// target freshness = 35 days. The cron refreshes weekly (7-day interval) and
// the TTL is 3× that interval (21 days), so a missed cycle still leaves data
// readable. Registry is near-static; badge derivation happens at read time.
export const PIPELINES_TTL_SECONDS = 21 * 24 * 3600;
const VALID_PHYSICAL_STATES = new Set(['flowing', 'reduced', 'offline', 'unknown']);
const VALID_COMMERCIAL_STATES = new Set(['under_contract', 'expired', 'suspended', 'unknown']);
// `gem` covers rows imported from Global Energy Monitor's Oil & Gas
// Infrastructure Trackers (CC-BY 4.0). Treated as an evidence-bearing source
// for non-flowing badges in the same way as `press` / `satellite` / `ais-relay`,
// since GEM is an academic/curated dataset with traceable provenance — not a
// silent default. Exported alongside VALID_OIL_PRODUCT_CLASSES so test suites
// can assert against the same source of truth the validator uses.
export const VALID_SOURCES = new Set(['operator', 'regulator', 'press', 'satellite', 'ais-relay', 'gem']);
// Required on every oil pipeline. `crude` = crude-oil lines (default),
// `products` = refined-product lines (gasoline/diesel/jet), `mixed` =
// dual-use bridges moving both. Gas pipelines don't carry this field
// (commodity is its own class). Exported so the test suite can assert
// against the SAME source of truth the validator uses — otherwise an
// inline copy in tests could silently drift when the enum is extended.
export const VALID_OIL_PRODUCT_CLASSES = new Set(['crude', 'products', 'mixed']);
// Minimum viable registry size. Post-GEM-import floor: 200. Live counts after
// the 2025-11 GGIT + 2025-03 GOIT merge (with the PR #3406 review fixes
// applied) are 293 gas / 325 oil; 200 leaves roughly 90-125 rows of jitter
// headroom so a partial GEM re-import or a coverage-narrowing release fails
// loud rather than silently halving the registry.
const MIN_PIPELINES_PER_REGISTRY = 200;
function loadRegistry(filename) {
  const __dirname = dirname(fileURLToPath(import.meta.url));
  const raw = readFileSync(resolve(__dirname, 'data', filename), 'utf-8');
  return JSON.parse(raw);
}
/**
 * @param {unknown} data
 * @returns {boolean}
 */
export function validateRegistry(data) {
  if (!data || typeof data !== 'object') return false;
  const obj = /** @type {Record<string, unknown>} */ (data);
  if (!obj.pipelines || typeof obj.pipelines !== 'object') return false;
  const pipelines = /** @type {Record<string, any>} */ (obj.pipelines);
  const entries = Object.entries(pipelines);
  if (entries.length < MIN_PIPELINES_PER_REGISTRY) return false;
  const seenIds = new Set();
  for (const [key, p] of entries) {
    if (seenIds.has(key)) return false;
    seenIds.add(key);
    if (p.id !== key) return false;
    if (typeof p.name !== 'string' || p.name.length === 0) return false;
    if (typeof p.operator !== 'string') return false;
    if (p.commodityType !== 'oil' && p.commodityType !== 'gas') return false;
    // Oil pipelines must declare a productClass from the enum; gas pipelines
    // must NOT carry one (commodity is its own class there).
    if (p.commodityType === 'oil') {
      if (!VALID_OIL_PRODUCT_CLASSES.has(p.productClass)) return false;
    } else if (p.productClass !== undefined) {
      return false;
    }
    if (typeof p.fromCountry !== 'string' || !/^[A-Z]{2}$/.test(p.fromCountry)) return false;
    if (typeof p.toCountry !== 'string' || !/^[A-Z]{2}$/.test(p.toCountry)) return false;
    if (!Array.isArray(p.transitCountries)) return false;
    for (const t of p.transitCountries) {
      if (typeof t !== 'string' || !/^[A-Z]{2}$/.test(t)) return false;
    }
    const hasCapacity =
      (p.commodityType === 'gas' && typeof p.capacityBcmYr === 'number' && p.capacityBcmYr > 0) ||
      (p.commodityType === 'oil' && typeof p.capacityMbd === 'number' && p.capacityMbd > 0);
    if (!hasCapacity) return false;
    if (!p.startPoint || typeof p.startPoint.lat !== 'number' || typeof p.startPoint.lon !== 'number') return false;
    if (!p.endPoint || typeof p.endPoint.lat !== 'number' || typeof p.endPoint.lon !== 'number') return false;
    if (!isValidLatLon(p.startPoint.lat, p.startPoint.lon)) return false;
    if (!isValidLatLon(p.endPoint.lat, p.endPoint.lon)) return false;
    // Reject degenerate routes where startPoint == endPoint. PR #3406 review
    // surfaced 9 GEM rows (incl. Trans-Alaska, Enbridge Line 3, Ichthys)
    // whose source GeoJSON had a Point geometry or a single-coord LineString,
    // producing zero-length pipelines that render as map-point artifacts and
    // skew aggregate-length statistics. Defense in depth: the converter also
    // drops these, but the validator gate makes the contract explicit.
    if (p.startPoint.lat === p.endPoint.lat && p.startPoint.lon === p.endPoint.lon) return false;
    if (!p.evidence || typeof p.evidence !== 'object') return false;
    const ev = p.evidence;
    if (!VALID_PHYSICAL_STATES.has(ev.physicalState)) return false;
    if (!VALID_SOURCES.has(ev.physicalStateSource)) return false;
    if (!VALID_COMMERCIAL_STATES.has(ev.commercialState)) return false;
    if (!Array.isArray(ev.sanctionRefs)) return false;
    if (typeof ev.lastEvidenceUpdate !== 'string') return false;
    if (typeof ev.classifierVersion !== 'string') return false;
    if (typeof ev.classifierConfidence !== 'number' ||
        ev.classifierConfidence < 0 || ev.classifierConfidence > 1) return false;
    // Every non-`flowing` badge requires at least one evidence field with signal.
    // This prevents shipping an `offline` label with zero supporting evidence.
    // `gem` joins the evidence-bearing sources because GEM is a curated
    // academic dataset with traceable provenance, not a silent default.
    if (ev.physicalState !== 'flowing') {
      const hasEvidence =
        ev.operatorStatement != null ||
        ev.sanctionRefs.length > 0 ||
        ev.physicalStateSource === 'ais-relay' ||
        ev.physicalStateSource === 'satellite' ||
        ev.physicalStateSource === 'press' ||
        ev.physicalStateSource === 'gem';
      if (!hasEvidence) return false;
    }
  }
  return true;
}
function isValidLatLon(lat, lon) {
  return Number.isFinite(lat) && Number.isFinite(lon) &&
    lat >= -90 && lat <= 90 && lon >= -180 && lon <= 180;
}
export function buildGasPayload() {
  const registry = loadRegistry('pipelines-gas.json');
  return { ...registry, updatedAt: new Date().toISOString() };
}
export function buildOilPayload() {
  const registry = loadRegistry('pipelines-oil.json');
  return { ...registry, updatedAt: new Date().toISOString() };
}
/**
 * @param {any} data
 * @returns {number}
 */
export function recordCount(data) {
  return Object.keys(data?.pipelines ?? {}).length;
}
/**
 * @param {any} data
 * @returns {number}
 */
export function declareRecords(data) {
  return recordCount(data);
}
// maxStaleMin per health-maxstalemin-write-cadence skill: registry cron runs
// weekly (7 days = 10_080 min). 2× cadence = 20_160 min. Registry fields are
// slow-moving so a 2× headroom is sufficient.
export const MAX_STALE_MIN = 20_160;