mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
* feat(energy-atlas): GEM pipeline data import, gas 75→297, oil 75→334 (parity-push closure)

  Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure was built for. Per the docs/methodology/pipelines.mdx operator runbook.

  Source releases (CC-BY 4.0, attribution preserved in registry envelope):
  - GEM-GGIT-Gas-Pipelines-2025-11.xlsx
    SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6
    URL: https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx
  - GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
    SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2
    URL: https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx

  Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) → canonical operator-shape JSON via /tmp/gem-import/convert.py.

  Filter knobs:
  - status ∈ {operating, construction}
  - length ≥ 750 km (gas) / 400 km (oil): asymmetric per-fuel trunk-class thresholds
  - capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day, bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors
  - country names → ISO 3166-1 alpha-2 via pycountry + alias table

  Merge results (via scripts/import-gem-pipelines.mjs --merge):
  - gas: +222 added, 15 duplicates skipped (haversine ≤ 5km AND token Jaccard ≥ 0.6)
  - oil: +259 added, 16 duplicates skipped

  Final: 297 gas / 334 oil. Hand-curated 75+75 preserved with full evidence; GEM rows ship physicalStateSource='gem', classifierConfidence=0.4, operatorStatement=null, sanctionRefs=[].

  Floor bump: scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200. Live counts (297/334) leave ~100 rows of jitter headroom so a partial re-import or coverage-narrowing release fails loud rather than halving the registry silently.

  Tests:
  - tests/pipelines-registry.test.mts: bumped synthetic-registry Array.from({length:8}) → length:210 to clear new floor; added 'gem' to the evidence-source whitelist for non-flowing badges (parity with the derivePipelinePublicBadge audit done in PR #3397 U1).
  - tests/import-gem-pipelines.test.mjs: bumped registry-conformance loop 3 → 70 to clear new floor.
  - 51/51 pipeline tests pass; tsc --noEmit clean.

  vs peer reference site (281 gas + 265 oil): we now match (gas 297) and exceed (oil 334). Functional + visual + data parity for the energy variant is closed; remaining gaps are editorial-cadence (weekly briefing), which is intentionally out of scope per the parity-push plan.

* docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh

  PR #3406 imported the data but didn't land the conversion script that produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py so future operators can reproduce the import deterministically, and rewrites the docs/methodology/pipelines.mdx runbook to match what actually works:
  - Use GeoJSON (not XLSX): the XLSX has properties but no lat/lon columns; only the GIS .zip's GeoJSON has both. The original runbook said to download XLSX, which would fail at the lat/lon validation step.
  - Cadence: quarterly refresh, with concrete signals (peer-site comparison, 90-day calendar reminder).
  - Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so future operators don't re-request the wrong dataset (the Extraction Tracker = wells/fields, NOT pipelines; ours requires the Infrastructure Trackers).
  - Last-known-good URLs documented + URL pattern explained as fallback when GEM rotates per release.
  - Filter knob defaults documented inline (gas ≥ 750km, oil ≥ 400km, status ∈ {operating, construction}, capacity unit conversion table).
  - Failure-mode table mapping common errors to fixes.

  Converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON, GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths so it works for any release without code edits.

* fix(energy-atlas): close PR #3406 review findings (dedup + zero-length + test)

  Three Greptile findings on PR #3406:

  P1: Dedup miss (Dampier-Bunbury)
  The same physical pipeline existed in both registries (curated `dampier-bunbury` and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au`) because GEM digitized only the southern 60% of the line. The shared Bunbury terminus matched at 13.7 km, but the average-endpoint distance was 287 km, far over the 5 km gate.
  Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit: if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint pairings is ≤ 25 km, treat as duplicate. The 25 km anchor preserves the existing "name collision in different ocean → still added" contract.
  Added regression test: identical Dampier-Bunbury inputs → 0 added, 1 skipped, matched against `dampier-bunbury`.

  P1: Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.)
  GEM source GeoJSON occasionally has a Point geometry or single-coord LineString, producing pipelines where startPoint == endPoint. They render as map-point artifacts and skew aggregate-length stats.
  Fix (defense in depth):
  - scripts/_gem-geojson-to-canonical.py drops at conversion time (`zero_length` reason in drop log).
  - scripts/_pipeline-registry.mjs validateRegistry rejects defensively so even a hand-curated row with degenerate geometry fails loud.

  P2: Test repetition coupled to fixture row count
  Hardcoded `for (let i = 0; i < 70; i++)` × 3 fixture rows = 210 silently breaks if the fixture is trimmed below 3.
  Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives reps from the floor and current fixture length.

  Re-run --merge with all fixes applied:
  - gas: 75 → 293 (+218 added, 17 deduped; was 222/15 before; +2 catches via name-set-identity short-circuit; -2 zero-length never imported)
  - oil: 75 → 325 (+250 added, 18 deduped; was 259/16; +2 catches; -7 zero-length)

  Tests: 74/74 pipeline tests pass; tsc --noEmit clean.
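The dedup rule described in the commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of scripts/_pipeline-dedup.mjs: the function names, the stopword list, the tokenizer, the orientation handling for the average-endpoint distance, and the sample coordinates are all hypothetical. Only the thresholds (5 km, Jaccard 0.6, Jaccard 1.0 with a 25 km anchor) come from the message above.

```javascript
// Sketch only: helper names, stopwords and tokenization are assumptions.
const EARTH_RADIUS_KM = 6371;
const STOPWORDS = new Set(['pipeline', 'natural', 'gas', 'oil', 'to', 'the']); // hypothetical list

// Great-circle distance between two {lat, lon} points, in kilometres.
function haversineKm(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}

// Lowercased name tokens with stopwords removed.
function nameTokens(name) {
  return new Set(
    name.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t && !STOPWORDS.has(t)),
  );
}

// Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const t of a) if (b.has(t)) shared++;
  return shared / (a.size + b.size - shared);
}

function isDuplicate(candidate, existing) {
  const sim = jaccard(nameTokens(candidate.name), nameTokens(existing.name));
  const d = {
    ss: haversineKm(candidate.startPoint, existing.startPoint),
    ee: haversineKm(candidate.endPoint, existing.endPoint),
    se: haversineKm(candidate.startPoint, existing.endPoint),
    es: haversineKm(candidate.endPoint, existing.startPoint),
  };
  // Base gate: average endpoint distance (better of the two orientations,
  // an assumption) within 5 km AND token Jaccard of at least 0.6.
  const avgKm = Math.min((d.ss + d.ee) / 2, (d.se + d.es) / 2);
  if (avgKm <= 5 && sim >= 0.6) return true;
  // Short-circuit: identical name token sets AND any one of the four
  // endpoint pairings within 25 km.
  return sim === 1 && Math.min(d.ss, d.ee, d.se, d.es) <= 25;
}

// Illustrative rows with approximate, made-up coordinates: the imported row
// covers only the southern part of the line but shares the southern terminus.
const curated = {
  name: 'Dampier-Bunbury',
  startPoint: { lat: -20.66, lon: 116.71 },
  endPoint: { lat: -33.33, lon: 115.64 },
};
const gemRow = {
  name: 'Dampier to Bunbury Natural Gas Pipeline',
  startPoint: { lat: -28.0, lon: 115.0 },
  endPoint: { lat: -33.3, lon: 115.65 },
};
console.log(isDuplicate(gemRow, curated));
```

In this sketch the partial-geometry row fails the 5 km average gate but is caught by the name-identity short-circuit, while a row with the same name and no endpoint within 25 km would still be added.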
173 lines
8.1 KiB
JavaScript
// @ts-check
//
// Shared utility module for the oil & gas pipeline registries used by the
// Energy Atlas. NOT an entry point: see seed-pipelines-gas.mjs and
// seed-pipelines-oil.mjs for the two runSeed invocations. These are split
// because runSeed() hard-exits the process on its terminal paths (_seed-utils
// has ~9 process.exit sites), so two runSeed calls in one process would leave
// the second key unwritten.
//
// Data lives in scripts/data/pipelines-{gas,oil}.json: hand-curated rows plus
// rows imported from Global Energy Monitor (physicalStateSource 'gem').
// Schema + evidence model documented in docs/methodology/pipelines.mdx.
//
// Public badge is DERIVED server-side from the evidence bundle. We publish
// the raw evidence here; the derivation lives in the supply-chain handler
// (upcoming). See §4 of the Global Energy Flow parity plan for why we do
// not publish bare conclusion labels.

import { readFileSync } from 'node:fs';
import { dirname, resolve } from 'node:path';
import { fileURLToPath } from 'node:url';

export const GAS_CANONICAL_KEY = 'energy:pipelines:gas:v1';
export const OIL_CANONICAL_KEY = 'energy:pipelines:oil:v1';
// Per §13.2 of the parity plan: registry fields (geometry/operator/capacity)
// target freshness = 35 days. The cron runs weekly (7-day interval) and the
// TTL is 3× that interval (21 days), so a missed cycle still leaves data
// readable. Registry is near-static; badge derivation happens at read time.
export const PIPELINES_TTL_SECONDS = 21 * 24 * 3600;

const VALID_PHYSICAL_STATES = new Set(['flowing', 'reduced', 'offline', 'unknown']);
const VALID_COMMERCIAL_STATES = new Set(['under_contract', 'expired', 'suspended', 'unknown']);
// `gem` covers rows imported from Global Energy Monitor's Oil & Gas
// Infrastructure Trackers (CC-BY 4.0). Treated as an evidence-bearing source
// for non-flowing badges in the same way as `press` / `satellite` / `ais-relay`,
// since GEM is an academic/curated dataset with traceable provenance, not a
// silent default. Exported alongside VALID_OIL_PRODUCT_CLASSES so test suites
// can assert against the same source of truth the validator uses.
export const VALID_SOURCES = new Set(['operator', 'regulator', 'press', 'satellite', 'ais-relay', 'gem']);
// Required on every oil pipeline. `crude` = crude-oil lines (default),
// `products` = refined-product lines (gasoline/diesel/jet), `mixed` =
// dual-use bridges moving both. Gas pipelines don't carry this field
// (commodity is its own class). Exported so the test suite can assert
// against the SAME source of truth the validator uses; otherwise an
// inline copy in tests could silently drift when the enum is extended.
export const VALID_OIL_PRODUCT_CLASSES = new Set(['crude', 'products', 'mixed']);

// Minimum viable registry size. Post-GEM-import floor: 200. Live counts after
// the 2025-11 GGIT + 2025-03 GOIT merge, re-run with the PR #3406 review
// fixes, are 293 gas / 325 oil (297 / 334 before those fixes); 200 leaves
// roughly 90-125 rows of jitter headroom so a partial GEM re-import or a
// coverage-narrowing release fails loud rather than silently halving the
// registry.
const MIN_PIPELINES_PER_REGISTRY = 200;

function loadRegistry(filename) {
  const __dirname = dirname(fileURLToPath(import.meta.url));
  const raw = readFileSync(resolve(__dirname, 'data', filename), 'utf-8');
  return JSON.parse(raw);
}

/**
 * @param {unknown} data
 * @returns {boolean}
 */
export function validateRegistry(data) {
  if (!data || typeof data !== 'object') return false;
  const obj = /** @type {Record<string, unknown>} */ (data);
  if (!obj.pipelines || typeof obj.pipelines !== 'object') return false;
  const pipelines = /** @type {Record<string, any>} */ (obj.pipelines);
  const entries = Object.entries(pipelines);
  if (entries.length < MIN_PIPELINES_PER_REGISTRY) return false;

  const seenIds = new Set();
  for (const [key, p] of entries) {
    if (seenIds.has(key)) return false;
    seenIds.add(key);
    // Guard against null / non-object rows so a malformed entry fails
    // validation instead of throwing on property access.
    if (!p || typeof p !== 'object') return false;
    if (p.id !== key) return false;
    if (typeof p.name !== 'string' || p.name.length === 0) return false;
    if (typeof p.operator !== 'string') return false;
    if (p.commodityType !== 'oil' && p.commodityType !== 'gas') return false;
    // Oil pipelines must declare a productClass from the enum; gas pipelines
    // must NOT carry one (commodity is its own class there).
    if (p.commodityType === 'oil') {
      if (!VALID_OIL_PRODUCT_CLASSES.has(p.productClass)) return false;
    } else if (p.productClass !== undefined) {
      return false;
    }
    if (typeof p.fromCountry !== 'string' || !/^[A-Z]{2}$/.test(p.fromCountry)) return false;
    if (typeof p.toCountry !== 'string' || !/^[A-Z]{2}$/.test(p.toCountry)) return false;
    if (!Array.isArray(p.transitCountries)) return false;
    for (const t of p.transitCountries) {
      if (typeof t !== 'string' || !/^[A-Z]{2}$/.test(t)) return false;
    }
    const hasCapacity =
      (p.commodityType === 'gas' && typeof p.capacityBcmYr === 'number' && p.capacityBcmYr > 0) ||
      (p.commodityType === 'oil' && typeof p.capacityMbd === 'number' && p.capacityMbd > 0);
    if (!hasCapacity) return false;

    if (!p.startPoint || typeof p.startPoint.lat !== 'number' || typeof p.startPoint.lon !== 'number') return false;
    if (!p.endPoint || typeof p.endPoint.lat !== 'number' || typeof p.endPoint.lon !== 'number') return false;
    if (!isValidLatLon(p.startPoint.lat, p.startPoint.lon)) return false;
    if (!isValidLatLon(p.endPoint.lat, p.endPoint.lon)) return false;
    // Reject degenerate routes where startPoint == endPoint. PR #3406 review
    // surfaced 9 GEM rows (incl. Trans-Alaska, Enbridge Line 3, Ichthys)
    // whose source GeoJSON had a Point geometry or a single-coord LineString,
    // producing zero-length pipelines that render as map-point artifacts and
    // skew aggregate-length statistics. Defense in depth: the converter also
    // drops these, but the validator gate makes the contract explicit.
    if (p.startPoint.lat === p.endPoint.lat && p.startPoint.lon === p.endPoint.lon) return false;

    if (!p.evidence || typeof p.evidence !== 'object') return false;
    const ev = p.evidence;
    if (!VALID_PHYSICAL_STATES.has(ev.physicalState)) return false;
    if (!VALID_SOURCES.has(ev.physicalStateSource)) return false;
    if (!VALID_COMMERCIAL_STATES.has(ev.commercialState)) return false;
    if (!Array.isArray(ev.sanctionRefs)) return false;
    if (typeof ev.lastEvidenceUpdate !== 'string') return false;
    if (typeof ev.classifierVersion !== 'string') return false;
    if (typeof ev.classifierConfidence !== 'number' ||
        ev.classifierConfidence < 0 || ev.classifierConfidence > 1) return false;

    // Every non-`flowing` badge requires at least one evidence field with signal.
    // This prevents shipping an `offline` label with zero supporting evidence.
    // `gem` joins the evidence-bearing sources because GEM is a curated
    // academic dataset with traceable provenance, not a silent default.
    if (ev.physicalState !== 'flowing') {
      const hasEvidence =
        ev.operatorStatement != null ||
        ev.sanctionRefs.length > 0 ||
        ev.physicalStateSource === 'ais-relay' ||
        ev.physicalStateSource === 'satellite' ||
        ev.physicalStateSource === 'press' ||
        ev.physicalStateSource === 'gem';
      if (!hasEvidence) return false;
    }
  }
  return true;
}

function isValidLatLon(lat, lon) {
  return Number.isFinite(lat) && Number.isFinite(lon) &&
    lat >= -90 && lat <= 90 && lon >= -180 && lon <= 180;
}

export function buildGasPayload() {
  const registry = loadRegistry('pipelines-gas.json');
  return { ...registry, updatedAt: new Date().toISOString() };
}

export function buildOilPayload() {
  const registry = loadRegistry('pipelines-oil.json');
  return { ...registry, updatedAt: new Date().toISOString() };
}

/**
 * @param {any} data
 * @returns {number}
 */
export function recordCount(data) {
  return Object.keys(data?.pipelines ?? {}).length;
}

/**
 * @param {any} data
 * @returns {number}
 */
export function declareRecords(data) {
  return recordCount(data);
}

// maxStaleMin per health-maxstalemin-write-cadence skill: registry cron runs
// weekly (7 days = 10_080 min). 2× cadence = 20_160 min. Registry fields are
// slow-moving so a 2× headroom is sufficient.
export const MAX_STALE_MIN = 20_160;
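
The registry floor also constrains test fixtures: a synthetic registry has to clear MIN_PIPELINES_PER_REGISTRY regardless of how many fixture rows it is built from. A minimal sketch of the repetition formula quoted in the commit message, `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5`, might look like the following. `REGISTRY_FLOOR`, `buildSyntheticRegistry`, the id-suffix scheme and the placeholder fixture rows are assumptions made for illustration; the real tests may build their rows differently.

```javascript
// Hypothetical names: REGISTRY_FLOOR mirrors MIN_PIPELINES_PER_REGISTRY (200);
// buildSyntheticRegistry and the fixture rows are illustrative only.
const REGISTRY_FLOOR = 200;

function buildSyntheticRegistry(fixtureRows, floor = REGISTRY_FLOOR) {
  if (fixtureRows.length === 0) {
    throw new Error('fixture must contain at least one row');
  }
  // Derive repetitions from the floor and the current fixture size, with a
  // small margin, instead of hardcoding a loop count.
  const reps = Math.ceil(floor / fixtureRows.length) + 5;
  const pipelines = {};
  for (let i = 0; i < reps; i++) {
    for (const row of fixtureRows) {
      // Suffix the id so every synthetic row keeps a unique key that still
      // matches its own `id` field.
      const id = `${row.id}-${i}`;
      pipelines[id] = { ...row, id };
    }
  }
  return { pipelines };
}

// Placeholder rows; a real fixture row would carry the full validated shape.
const fixture = [{ id: 'a' }, { id: 'b' }, { id: 'c' }];
const synthetic = buildSyntheticRegistry(fixture);
console.log(Object.keys(synthetic.pipelines).length >= REGISTRY_FLOOR);
```

Because the repetition count is computed from the fixture length, trimming the fixture to one or two rows still yields a registry above the floor.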