mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
* feat(energy-atlas): GEM pipeline data import - gas 75→297, oil 75→334 (parity-push closure)

  Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure was built for. Per the docs/methodology/pipelines.mdx operator runbook.

  Source releases (CC-BY 4.0, attribution preserved in registry envelope):
  - GEM-GGIT-Gas-Pipelines-2025-11.xlsx
    SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6
    URL: https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx
  - GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
    SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2
    URL: https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx

  Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) → canonical operator-shape JSON via /tmp/gem-import/convert.py.

  Filter knobs:
  - status ∈ {operating, construction}
  - length ≥ 750 km (gas) / 400 km (oil): asymmetric per-fuel trunk-class thresholds
  - capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day, bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors
  - country names → ISO 3166-1 alpha-2 via pycountry + alias table

  Merge results (via scripts/import-gem-pipelines.mjs --merge):
    gas: +222 added, 15 duplicates skipped (haversine ≤ 5km AND token Jaccard ≥ 0.6)
    oil: +259 added, 16 duplicates skipped
  Final: 297 gas / 334 oil. Hand-curated 75+75 preserved with full evidence; GEM rows ship physicalStateSource='gem', classifierConfidence=0.4, operatorStatement=null, sanctionRefs=[].

  Floor bump: scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200. Live counts (297/334) leave ~100 rows of jitter headroom so a partial re-import or coverage-narrowing release fails loud rather than halving the registry silently.

  Tests:
  - tests/pipelines-registry.test.mts: bumped synthetic-registry Array.from({length:8}) → length:210 to clear the new floor; added 'gem' to the evidence-source whitelist for non-flowing badges (parity with the derivePipelinePublicBadge audit done in PR #3397 U1).
  - tests/import-gem-pipelines.test.mjs: bumped registry-conformance loop 3 → 70 to clear the new floor.
  - 51/51 pipeline tests pass; tsc --noEmit clean.

  vs peer reference site (281 gas + 265 oil): we now match (gas 297) and exceed (oil 334). Functional + visual + data parity for the energy variant is closed; the remaining gap is editorial cadence (weekly briefing), which is intentionally out of scope per the parity-push plan.

* docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh

  PR #3406 imported the data but didn't land the conversion script that produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py so future operators can reproduce the import deterministically, and rewrites the docs/methodology/pipelines.mdx runbook to match what actually works:
  - Use GeoJSON (not XLSX): the XLSX has properties but no lat/lon columns; only the GIS .zip's GeoJSON has both. The original runbook said to download the XLSX, which would fail at the lat/lon validation step.
  - Cadence: quarterly refresh, with concrete signals (peer-site comparison, 90-day calendar reminder).
  - Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so future operators don't re-request the wrong dataset (the Extraction Tracker covers wells/fields, NOT pipelines; ours requires the Infrastructure Trackers).
  - Last-known-good URLs documented + URL pattern explained as a fallback when GEM rotates per release.
  - Filter knob defaults documented inline (gas ≥ 750km, oil ≥ 400km, status ∈ {operating, construction}, capacity unit conversion table).
  - Failure-mode table mapping common errors to fixes.

  Converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON, GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths so it works for any release without code edits.

* fix(energy-atlas): close PR #3406 review findings (dedup + zero-length + test)

  Three Greptile findings on PR #3406:

  P1 - Dedup miss (Dampier-Bunbury): The same physical pipeline existed in both registries, as curated `dampier-bunbury` and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au`, because GEM digitized only the southern 60% of the line. The shared Bunbury terminus matched at 13.7 km, but the average-endpoint distance was 287 km, far over the 5 km gate.
  Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit: if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint pairings is ≤ 25 km, treat as duplicate. The 25 km anchor preserves the existing "name collision in different ocean → still added" contract. Added regression test: identical Dampier-Bunbury inputs → 0 added, 1 skipped, matched against `dampier-bunbury`.

  P1 - Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.): GEM source GeoJSON occasionally has a Point geometry or single-coord LineString, producing pipelines where startPoint == endPoint. They render as map-point artifacts and skew aggregate-length stats.
  Fix (defense in depth):
  - scripts/_gem-geojson-to-canonical.py drops them at conversion time (`zero_length` reason in drop log).
  - scripts/_pipeline-registry.mjs validateRegistry rejects them defensively, so even a hand-curated row with degenerate geometry fails loud.

  P2 - Test repetition coupled to fixture row count: Hardcoded `for (let i = 0; i < 70; i++)` × 3 fixture rows = 210 silently breaks if the fixture is trimmed below 3 rows.
  Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives reps from the floor and the current fixture length.

  Re-run --merge with all fixes applied:
    gas: 75 → 293 (+218 added, 17 deduped; was 222/15 before; +2 catches via name-set-identity short-circuit; -2 zero-length never imported)
    oil: 75 → 325 (+250 added, 18 deduped; was 259/16; +2 catches; -7 zero-length)

  Tests: 74/74 pipeline tests pass; tsc --noEmit clean.
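The P2 fix can be checked with a few lines of arithmetic. The sketch below is illustrative only: the floor value (200) and the formula come from the commit message, while the function names `repsToClearFloor` and `syntheticRowCount` are hypothetical and not taken from the test file.

```javascript
// Sketch only: the floor value (200) and the formula come from the commit
// message; the function names here are hypothetical.
const REGISTRY_FLOOR = 200;

// Number of times the fixture must be repeated so the synthetic registry
// clears the floor, with 5 extra repetitions of headroom.
function repsToClearFloor(fixtureLength, floor = REGISTRY_FLOOR) {
  if (!Number.isInteger(fixtureLength) || fixtureLength < 1) {
    throw new RangeError('fixture must contain at least one row');
  }
  return Math.ceil(floor / fixtureLength) + 5;
}

// Total synthetic rows produced by repeating the fixture that many times.
function syntheticRowCount(fixtureLength, floor = REGISTRY_FLOOR) {
  return repsToClearFloor(fixtureLength, floor) * fixtureLength;
}

for (const len of [3, 2, 1]) {
  console.log(len, repsToClearFloor(len), syntheticRowCount(len));
}
// 3 72 216
// 2 105 210
// 1 205 205
```

With the earlier hardcoded 70 repetitions, a fixture trimmed to 2 rows would yield only 140 synthetic rows, below the floor of 200; the derived count stays above the floor for any non-empty fixture.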
224 lines
8.3 KiB
JavaScript
// @ts-check
//
// Pure deterministic deduplication for the GEM pipeline import. NOT an entry
// point; see scripts/import-gem-pipelines.mjs for the orchestrator.
//
// Primary match rule (BOTH must hold):
//   1. Average endpoint distance ≤ 5 km (haversine, route-direction-flipped
//      pair-aware so Mozyr→Adamowo and Adamowo→Mozyr count as the same).
//   2. Name token Jaccard ≥ 0.6 (lowercased word tokens, stopwords removed).
// Name-identical short-circuit: Jaccard == 1.0 AND any endpoint pairing
// within 25 km also counts as a match (see MATCH_NAME_IDENTICAL_DISTANCE_KM).
//
// Conflict resolution: existing row WINS. Hand-curated rows have richer
// evidence (operator statements, sanction refs, classifier confidence ≥ 0.7)
// that GEM's minimum-viable evidence shouldn't overwrite. The dedup function
// returns { toAdd, skippedDuplicates } so the caller can audit which GEM
// candidates were absorbed by existing rows.
//
// Determinism: zero Date.now() / Math.random() / Set ordering reliance. Two
// invocations on identical inputs produce identical outputs.

const STOPWORDS = new Set([
  'pipeline', 'pipelines', 'system', 'systems', 'line', 'lines', 'network',
  'route', 'project', 'the', 'and', 'of', 'a', 'an',
]);

const MATCH_DISTANCE_KM = 5;
const MATCH_JACCARD_MIN = 0.6;
// When the candidate's tokenized name equals the existing row's tokenized
// name (Jaccard == 1.0 after stopword removal), accept the match if ANY
// endpoint pairing is within MATCH_NAME_IDENTICAL_DISTANCE_KM. Catches PR
// #3406 review's Dampier-Bunbury case: GEM digitized only the southern
// 60% of the line, so the average-endpoint distance was 287km but the
// shared Bunbury terminus matched within 13.7km. A pure name-only rule
// would false-positive on coincidental collisions in different oceans
// (e.g. unrelated "Nord Stream 1" in the Pacific), so we still require
// SOME geographic anchor.
const MATCH_NAME_IDENTICAL_DISTANCE_KM = 25;
const EARTH_RADIUS_KM = 6371;

/**
 * Haversine great-circle distance in km between two lat/lon points.
 */
function haversineKm(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const lat1 = toRad(a.lat);
  const lat2 = toRad(b.lat);
  const x =
    Math.sin(dLat / 2) ** 2 +
    Math.sin(dLon / 2) ** 2 * Math.cos(lat1) * Math.cos(lat2);
  const c = 2 * Math.atan2(Math.sqrt(x), Math.sqrt(1 - x));
  return EARTH_RADIUS_KM * c;
}

/**
 * Average endpoint distance between two pipelines, considering both forward
 * and reversed pairings. The smaller of the two is returned so a route
 * direction flip doesn't appear as a different pipeline.
 */
function averageEndpointDistanceKm(a, b) {
  const forward =
    (haversineKm(a.startPoint, b.startPoint) + haversineKm(a.endPoint, b.endPoint)) / 2;
  const reversed =
    (haversineKm(a.startPoint, b.endPoint) + haversineKm(a.endPoint, b.startPoint)) / 2;
  return Math.min(forward, reversed);
}

/**
 * Minimum of all four cross-pairings between candidate and existing endpoints.
 * Used by the name-identical short-circuit: if the candidate digitizes a
 * different segment of the same physical pipeline, only ONE endpoint pair
 * may match closely (e.g. Dampier-Bunbury: shared Bunbury terminus 13.7 km,
 * other end 560 km away because GEM stopped at Onslow vs the full Dampier
 * route). A tight average would miss this; the min of the four pairings
 * doesn't.
 */
function minPairwiseEndpointDistanceKm(a, b) {
  return Math.min(
    haversineKm(a.startPoint, b.startPoint),
    haversineKm(a.startPoint, b.endPoint),
    haversineKm(a.endPoint, b.startPoint),
    haversineKm(a.endPoint, b.endPoint),
  );
}

/**
 * Tokenize a name: lowercased word tokens, ASCII-only word boundaries,
 * stopwords removed. Stable across invocations.
 */
function tokenize(name) {
  return name
    .toLowerCase()
    .normalize('NFKD')
    // Strip combining marks (diacritics) so "Limón" → "limon", not "limo'n".
    // Range U+0300-U+036F covers Combining Diacritical Marks per Unicode.
    .replace(/[\u0300-\u036f]/g, '')
    .replace(/[^a-z0-9 ]+/g, ' ')
    .split(/\s+/)
    .filter((t) => t.length > 0 && !STOPWORDS.has(t));
}

/**
 * Jaccard similarity = |A ∩ B| / |A ∪ B| over token sets.
 */
function jaccard(a, b) {
  const setA = new Set(tokenize(a));
  const setB = new Set(tokenize(b));
  if (setA.size === 0 && setB.size === 0) return 0;
  let intersection = 0;
  for (const t of setA) if (setB.has(t)) intersection++;
  const unionSize = setA.size + setB.size - intersection;
  return unionSize === 0 ? 0 : intersection / unionSize;
}

/**
 * Decide if a candidate matches an existing row.
 *
 * Two acceptance paths:
 *   (a) Token sets are IDENTICAL (Jaccard == 1.0 after stopword removal)
 *       AND at least one of the four endpoint pairings is within
 *       MATCH_NAME_IDENTICAL_DISTANCE_KM (25 km). Catches the
 *       Dampier-Bunbury case (PR #3406 review): GEM's GeoJSON terminus
 *       was 13.7 km from the curated Bunbury terminus, but the average
 *       endpoint distance was 287 km (far over the 5 km distance gate).
 *       Both names tokenize to {dampier, to, bunbury, natural, gas}, so
 *       they are clearly the same physical pipeline.
 *   (b) Average endpoint distance ≤ 5 km AND Jaccard ≥ 0.6: the original
 *       conjunctive rule for slight name-variation cases (e.g. a
 *       three-token name that gains one extra token scores 3/4 = 0.75).
 */
function isDuplicate(candidate, existing) {
  const sim = jaccard(candidate.name, existing.name);
  // Path (a): identical token-set + at least one endpoint pair within 25 km.
  // The geographic anchor distinguishes the Dampier-Bunbury case from a
  // theoretical name-collision in a different ocean.
  if (sim >= 1.0) {
    const minDist = minPairwiseEndpointDistanceKm(candidate, existing);
    if (minDist <= MATCH_NAME_IDENTICAL_DISTANCE_KM) return true;
    // Identical names but no endpoints near each other → distinct pipelines
    // sharing a name (rare but real). Fall through to the conjunctive rule
    // below, which returns false: if every endpoint pairing is > 25 km
    // apart, both averaged pairings are > 25 km too, well over the 5 km gate.
  }
  const dist = averageEndpointDistanceKm(candidate, existing);
  if (dist > MATCH_DISTANCE_KM) return false;
  return sim >= MATCH_JACCARD_MIN;
}

/**
 * Disambiguate a candidate's id against existing ids by appending -2, -3, ...
 * until unique. Stable: same input → same output.
 */
function uniqueId(baseId, takenIds) {
  if (!takenIds.has(baseId)) return baseId;
  let n = 2;
  while (takenIds.has(`${baseId}-${n}`)) n++;
  return `${baseId}-${n}`;
}

/**
 * Pure dedup function.
 *
 * @param {Array<{ id: string, name: string, startPoint: {lat:number,lon:number}, endPoint: {lat:number,lon:number} }>} existing
 * @param {Array<{ id: string, name: string, startPoint: {lat:number,lon:number}, endPoint: {lat:number,lon:number} }>} candidates
 * @returns {{ toAdd: any[], skippedDuplicates: Array<{ candidate: any, matchedExistingId: string, distanceKm: number, jaccard: number }> }}
 */
export function dedupePipelines(existing, candidates) {
  const taken = new Set(existing.map((p) => p.id));
  const toAdd = [];
  const skippedDuplicates = [];

  for (const cand of candidates) {
    // Compare against BOTH existing rows AND candidates already accepted
    // into toAdd. Without this, two GEM rows that match each other but
    // not anything in `existing` would both be added (a duplicate-import
    // bug). Existing rows still win on cross-set match (they have richer
    // hand-curated evidence); within-toAdd matches retain the FIRST
    // accepted candidate (deterministic by candidate-list order).
    let matched = null;
    for (const ex of existing) {
      if (isDuplicate(cand, ex)) {
        matched = ex;
        break;
      }
    }
    if (!matched) {
      for (const earlier of toAdd) {
        if (isDuplicate(cand, earlier)) {
          matched = earlier;
          break;
        }
      }
    }
    if (matched) {
      skippedDuplicates.push({
        candidate: cand,
        matchedExistingId: matched.id,
        distanceKm: averageEndpointDistanceKm(cand, matched),
        jaccard: jaccard(cand.name, matched.name),
      });
      continue;
    }
    const finalId = uniqueId(cand.id, taken);
    taken.add(finalId);
    toAdd.push({ ...cand, id: finalId });
  }

  return { toAdd, skippedDuplicates };
}

// Internal exports for test coverage; not part of the public surface.
export const _internal = {
  haversineKm,
  averageEndpointDistanceKm,
  minPairwiseEndpointDistanceKm,
  tokenize,
  jaccard,
  isDuplicate,
  uniqueId,
  STOPWORDS,
  MATCH_DISTANCE_KM,
  MATCH_JACCARD_MIN,
  MATCH_NAME_IDENTICAL_DISTANCE_KM,
};
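To make the two acceptance paths easier to follow, here is a minimal standalone sketch that applies the same thresholds to precomputed metrics instead of calling the module. The helper `classifyMatch` and its input shape are hypothetical, not part of this file; the Dampier-Bunbury numbers (13.7 km and 560 km) are the ones quoted in the comments above, and the other cases use made-up values for illustration.

```javascript
// Thresholds copied from the module above.
const MATCH_DISTANCE_KM = 5;
const MATCH_JACCARD_MIN = 0.6;
const MATCH_NAME_IDENTICAL_DISTANCE_KM = 25;

// Hypothetical helper: takes already-computed similarity and distances
// rather than pipeline rows, and reports which path (if any) accepts.
function classifyMatch({ jaccard, averageKm, minPairKm }) {
  // Path (a): identical token sets plus one nearby endpoint pairing.
  if (jaccard >= 1.0 && minPairKm <= MATCH_NAME_IDENTICAL_DISTANCE_KM) {
    return 'duplicate (name-identical)';
  }
  // Path (b): tight average endpoint distance AND similar-enough names.
  if (averageKm <= MATCH_DISTANCE_KM && jaccard >= MATCH_JACCARD_MIN) {
    return 'duplicate (conjunctive)';
  }
  return 'distinct';
}

// Dampier-Bunbury: one terminus 13.7 km apart, the other 560 km apart.
const dampier = { jaccard: 1.0, averageKm: (13.7 + 560) / 2, minPairKm: 13.7 };
// Made-up values: same name, but no endpoint anywhere near the other row.
const otherOcean = { jaccard: 1.0, averageKm: 9000, minPairKm: 8900 };
// Made-up values: slight name variation (3 of 4 tokens shared), same route.
const nameVariant = { jaccard: 0.75, averageKm: 2, minPairKm: 1 };
// Made-up values: same route geometry but unrelated names.
const unrelated = { jaccard: 0.2, averageKm: 1, minPairKm: 0.5 };

console.log(classifyMatch(dampier));     // duplicate (name-identical)
console.log(classifyMatch(otherOcean));  // distinct
console.log(classifyMatch(nameVariant)); // duplicate (conjunctive)
console.log(classifyMatch(unrelated));   // distinct
```

The Dampier-Bunbury case shows why path (a) uses the minimum pairing: its average of roughly 287 km fails the 5 km gate, while the 13.7 km shared terminus passes the 25 km anchor.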