worldmonitor/scripts/_pipeline-dedup.mjs
Elie Habib 8655bd81bc feat(energy-atlas): GEM pipeline data import — gas 297, oil 334 (#3406)
* feat(energy-atlas): GEM pipeline data import — gas 75→297, oil 75→334 (parity-push closure)

Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure
was built for. Per docs/methodology/pipelines.mdx operator runbook.

Source releases (CC-BY 4.0, attribution preserved in registry envelope):
  - GEM-GGIT-Gas-Pipelines-2025-11.xlsx
    SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6
    URL:    https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx
  - GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
    SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2
    URL:    https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx

Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) →
canonical operator-shape JSON via /tmp/gem-import/convert.py. Filter knobs:
  - status ∈ {operating, construction}
  - length ≥ 750 km (gas) / 400 km (oil) — asymmetric per-fuel trunk-class
  - capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day,
    bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors.
  - Country names → ISO 3166-1 alpha-2 via pycountry + alias table.
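The converter's own factor table is not reproduced in this message. As a rough sketch of what three of the listed conversions involve, assuming standard physical constants and a 365-day year (the helper names are hypothetical, and the converter may use different rounding or 365.25 days):

```javascript
// Hypothetical helpers; constants are standard unit definitions.
const M3_PER_FT3 = 0.0283168466;
const M3_PER_BBL = 0.158987294928;
const DAYS_PER_YEAR = 365; // assumption: the real converter may differ

// MMcf/d (million cubic feet per day) to bcm/y (billion cubic metres per year).
function mmcfdToBcmPerYear(mmcfd) {
  return (mmcfd * 1e6 * M3_PER_FT3 * DAYS_PER_YEAR) / 1e9;
}

// m3/day to bbl/d.
function m3PerDayToBblPerDay(m3PerDay) {
  return m3PerDay / M3_PER_BBL;
}

// kbd (thousand barrels per day) to bbl/d.
function kbdToBblPerDay(kbd) {
  return kbd * 1000;
}

console.log(mmcfdToBcmPerYear(1000).toFixed(2)); // about 10.34 bcm/y
```

Mass-based units such as mtpa depend on an assumed fluid density, so they are left out of this sketch.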

Merge results (via scripts/import-gem-pipelines.mjs --merge):
  gas: +222 added, 15 duplicates skipped (haversine ≤ 5km AND token Jaccard ≥ 0.6)
  oil: +259 added, 16 duplicates skipped
  Final: 297 gas / 334 oil. Hand-curated 75+75 preserved with full evidence;
  GEM rows ship physicalStateSource='gem', classifierConfidence=0.4,
  operatorStatement=null, sanctionRefs=[].
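To make the name half of the duplicate gate concrete, here is a minimal sketch of token Jaccard against the 0.6 threshold. It uses a simplified tokenizer and an abbreviated stopword list as stand-ins for the real ones in scripts/_pipeline-dedup.mjs, and the name pairs are illustrative only:

```javascript
// Simplified stand-in tokenizer: lowercase, split on non-alphanumerics,
// drop a short (abbreviated) stopword list.
const STOPWORDS = new Set(['pipeline', 'pipelines', 'system', 'line', 'the', 'of']);

function tokenSet(name) {
  return new Set(
    name.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t && !STOPWORDS.has(t)),
  );
}

function jaccard(a, b) {
  const setA = tokenSet(a);
  const setB = tokenSet(b);
  let shared = 0;
  for (const t of setA) if (setB.has(t)) shared++;
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

const nameVariant = jaccard('Trans Adriatic Pipeline', 'Trans Adriatic Gas Pipeline');
const siblingLines = jaccard('Nord Stream 1', 'Nord Stream 2');
console.log(nameVariant >= 0.6); // true: 2 shared tokens out of 3
console.log(siblingLines >= 0.6); // false: 2 shared tokens out of 4
```

A pair that clears the name gate is still only skipped as a duplicate if the endpoint-distance condition also holds.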

Floor bump:
  scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200.
  Live counts (297/334) leave ~100 rows of jitter headroom so a partial
  re-import or coverage-narrowing release fails loud rather than halving
  the registry silently.
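The headroom argument can be checked with simple arithmetic. This sketch assumes the floor is a plain row-count comparison (the function name is hypothetical; the real validation in scripts/_pipeline-registry.mjs does more than count rows):

```javascript
const OLD_FLOOR = 8;
const MIN_PIPELINES_PER_REGISTRY = 200;

// Hypothetical count-only check.
function meetsFloor(rowCount, floor) {
  return rowCount >= floor;
}

const liveCounts = { gas: 297, oil: 334 };
for (const [fuel, count] of Object.entries(liveCounts)) {
  const halved = Math.floor(count / 2);
  console.log(
    fuel,
    meetsFloor(count, MIN_PIPELINES_PER_REGISTRY), // full import passes
    meetsFloor(halved, MIN_PIPELINES_PER_REGISTRY), // halved import fails
    meetsFloor(halved, OLD_FLOOR), // the old floor would have let it through
  );
}
```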

Tests:
  - tests/pipelines-registry.test.mts: bumped synthetic-registry
    Array.from({length:8}) → length:210 to clear new floor; added 'gem' to
    the evidence-source whitelist for non-flowing badges (parity with the
    derivePipelinePublicBadge audit done in PR #3397 U1).
  - tests/import-gem-pipelines.test.mjs: bumped registry-conformance loop
    3 → 70 to clear new floor.
  - 51/51 pipeline tests pass; tsc --noEmit clean.

vs peer reference site (281 gas + 265 oil): we now match (gas 297) and
exceed (oil 334). Functional + visual + data parity for the energy variant
is closed; remaining gaps are editorial-cadence (weekly briefing) which
is intentionally out of scope per the parity-push plan.

* docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh

PR #3406 imported the data but didn't land the conversion script that
produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py
so future operators can reproduce the import deterministically, and rewrites
the docs/methodology/pipelines.mdx runbook to match what actually works:

- Use GeoJSON (not XLSX) — the XLSX has properties but no lat/lon columns;
  only the GIS .zip's GeoJSON has both. The original runbook said to download
  XLSX which would fail at the lat/lon validation step.
- Cadence: quarterly refresh, with concrete signals (peer-site comparison,
  90-day calendar reminder).
- Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so
  future operators don't re-request the wrong dataset (the Extraction
  Tracker = wells/fields, NOT pipelines — ours requires the Infrastructure
  Trackers).
- Last-known-good URLs documented + URL pattern explained as fallback when
  GEM rotates per release.
- Filter knob defaults documented inline (gas ≥ 750km, oil ≥ 400km, status
  ∈ {operating, construction}, capacity unit conversion table).
- Failure-mode table mapping common errors to fixes.

Converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON,
GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths so it
works for any release without code edits.

* fix(energy-atlas): close PR #3406 review findings — dedup + zero-length + test

Three Greptile findings on PR #3406:

P1 — Dedup miss (Dampier-Bunbury):
  Same physical pipeline existed in both registries — curated `dampier-bunbury`
  and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au` — because GEM
  digitized only the southern 60% of the line. The shared Bunbury terminus
  matched at 13.7 km but the average-endpoint distance was 287 km, far over
  the 5 km gate.
  Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit —
  if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint
  pairings is ≤ 25 km, treat as duplicate. The 25 km anchor preserves the
  existing "name collision in different ocean → still added" contract.
  Added regression test: identical Dampier-Bunbury inputs → 0 added, 1
  skipped, matched against `dampier-bunbury`.
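The difference between the two distance measures can be illustrated with a partial-segment case. The coordinates below are rough, hypothetical stand-ins (not the registry rows), so the distances will not reproduce the 13.7 km and 287 km figures above; the point is only that the average fails the 5 km gate while the minimum pairing stays inside 25 km:

```javascript
const EARTH_RADIUS_KM = 6371;

function haversineKm(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const x =
    Math.sin(dLat / 2) ** 2 +
    Math.sin(dLon / 2) ** 2 * Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat));
  return EARTH_RADIUS_KM * 2 * Math.atan2(Math.sqrt(x), Math.sqrt(1 - x));
}

// Hypothetical endpoints: the candidate covers only the southern part of the route.
const curated = {
  startPoint: { lat: -20.66, lon: 116.71 },
  endPoint: { lat: -33.33, lon: 115.64 },
};
const candidate = {
  startPoint: { lat: -21.64, lon: 115.11 },
  endPoint: { lat: -33.25, lon: 115.70 },
};

const forward =
  (haversineKm(curated.startPoint, candidate.startPoint) +
    haversineKm(curated.endPoint, candidate.endPoint)) / 2;
const reversed =
  (haversineKm(curated.startPoint, candidate.endPoint) +
    haversineKm(curated.endPoint, candidate.startPoint)) / 2;
const averageKm = Math.min(forward, reversed);

const minKm = Math.min(
  haversineKm(curated.startPoint, candidate.startPoint),
  haversineKm(curated.startPoint, candidate.endPoint),
  haversineKm(curated.endPoint, candidate.startPoint),
  haversineKm(curated.endPoint, candidate.endPoint),
);

console.log(averageKm > 5); // true: the conjunctive rule rejects the pair
console.log(minKm <= 25); // true: the name-identical short-circuit accepts it
```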

P1 — Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.):
  GEM source GeoJSON occasionally has a Point geometry or single-coord
  LineString, producing pipelines where startPoint == endPoint. They render
  as map-point artifacts and skew aggregate-length stats.
  Fix (defense in depth):
    - scripts/_gem-geojson-to-canonical.py drops at conversion time
      (`zero_length` reason in drop log).
    - scripts/_pipeline-registry.mjs validateRegistry rejects defensively
      so even a hand-curated row with degenerate geometry fails loud.
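A degenerate-geometry guard of this kind can be very small. The predicate below is a hypothetical sketch assuming the `startPoint`/`endPoint` shape used by the dedup module; the actual check inside validateRegistry is not shown in this commit message:

```javascript
// Hypothetical predicate: true when both endpoints are the same coordinate.
function isZeroLength(pipeline) {
  const { startPoint, endPoint } = pipeline;
  return startPoint.lat === endPoint.lat && startPoint.lon === endPoint.lon;
}

// Illustrative rows, not registry data.
const pointArtifact = {
  startPoint: { lat: 61.1, lon: -146.3 },
  endPoint: { lat: 61.1, lon: -146.3 },
};
const realLine = {
  startPoint: { lat: 70.2, lon: -148.4 },
  endPoint: { lat: 61.1, lon: -146.3 },
};

console.log(isZeroLength(pointArtifact)); // true
console.log(isZeroLength(realLine)); // false
```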

P2 — Test repetition coupled to fixture row count:
  Hardcoded `for (let i = 0; i < 70; i++)` × 3 fixture rows = 210 silently
  breaks if fixture is trimmed below 3.
  Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives reps from
  the floor and current fixture length.
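A quick check of the derived repetition count, assuming `REGISTRY_FLOOR` equals the new 200-row minimum (the helper name is hypothetical):

```javascript
const REGISTRY_FLOOR = 200; // assumption: same value as MIN_PIPELINES_PER_REGISTRY

// Repetitions needed so reps * fixtureLength clears the floor with some margin.
function repsFor(fixtureLength) {
  return Math.ceil(REGISTRY_FLOOR / fixtureLength) + 5;
}

for (const fixtureLength of [3, 2, 1]) {
  const derivedRows = repsFor(fixtureLength) * fixtureLength;
  const hardcodedRows = 70 * fixtureLength;
  console.log(fixtureLength, derivedRows >= REGISTRY_FLOOR, hardcodedRows >= REGISTRY_FLOOR);
}
```

With three fixture rows both approaches clear the floor; with two or one, only the derived count does.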

Re-run --merge with all fixes applied:
  gas: 75 → 293 (+218 added, 17 deduped — was 222/15 before; +2 catches via
       name-set-identity short-circuit; -2 zero-length never imported)
  oil: 75 → 325 (+250 added, 18 deduped — was 259/16; +2 catches; -7 zero-length)

Tests: 74/74 pipeline tests pass; tsc --noEmit clean.
2026-04-25 18:59:46 +04:00

// @ts-check
//
// Pure deterministic deduplication for the GEM pipeline import. NOT an entry
// point — see scripts/import-gem-pipelines.mjs for the orchestrator.
//
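// Match rule (either path accepts):
//   1. Average endpoint distance ≤ 5 km (haversine; both the forward and
//      the reversed endpoint pairing are tried, so Mozyr→Adamowo and
//      Adamowo→Mozyr count as the same) AND name token Jaccard ≥ 0.6
//      (lowercased word tokens, stopwords removed).
//   2. Identical token sets (Jaccard == 1.0) AND at least one endpoint
//      pairing ≤ 25 km. See MATCH_NAME_IDENTICAL_DISTANCE_KM below.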
//
// Conflict resolution: existing row WINS. Hand-curated rows have richer
// evidence (operator statements, sanction refs, classifier confidence ≥ 0.7)
// that GEM's minimum-viable evidence shouldn't overwrite. The dedup function
// returns { toAdd, skippedDuplicates } so the caller can audit which GEM
// candidates were absorbed by existing rows.
//
// Determinism: zero Date.now() / Math.random() / Set ordering reliance. Two
// invocations on identical inputs produce identical outputs.
const STOPWORDS = new Set([
  'pipeline', 'pipelines', 'system', 'systems', 'line', 'lines', 'network',
  'route', 'project', 'the', 'and', 'of', 'a', 'an',
]);
const MATCH_DISTANCE_KM = 5;
const MATCH_JACCARD_MIN = 0.6;
// When the candidate's tokenized name equals the existing row's tokenized
// name (Jaccard == 1.0 after stopword removal), accept the match if ANY
// endpoint pairing is within MATCH_NAME_IDENTICAL_DISTANCE_KM. Catches PR
// #3406 review's Dampier-Bunbury case: GEM digitized only the southern
// 60% of the line, so the average-endpoint distance was 287km but the
// shared Bunbury terminus matched within 13.7km. A pure name-only rule
// would false-positive on coincidental collisions in different oceans
// (e.g. unrelated "Nord Stream 1" in the Pacific), so we still require
// SOME geographic anchor.
const MATCH_NAME_IDENTICAL_DISTANCE_KM = 25;
const EARTH_RADIUS_KM = 6371;
/**
 * Haversine great-circle distance in km between two lat/lon points.
 */
function haversineKm(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const lat1 = toRad(a.lat);
  const lat2 = toRad(b.lat);
  const x =
    Math.sin(dLat / 2) ** 2 +
    Math.sin(dLon / 2) ** 2 * Math.cos(lat1) * Math.cos(lat2);
  const c = 2 * Math.atan2(Math.sqrt(x), Math.sqrt(1 - x));
  return EARTH_RADIUS_KM * c;
}
/**
 * Average endpoint distance between two pipelines, considering both forward
 * and reversed pairings. The smaller of the two is returned so a route
 * direction flip doesn't appear as a different pipeline.
 */
function averageEndpointDistanceKm(a, b) {
  const forward =
    (haversineKm(a.startPoint, b.startPoint) + haversineKm(a.endPoint, b.endPoint)) / 2;
  const reversed =
    (haversineKm(a.startPoint, b.endPoint) + haversineKm(a.endPoint, b.startPoint)) / 2;
  return Math.min(forward, reversed);
}
/**
 * Minimum of all four cross-pairings between candidate and existing endpoints.
 * Used by the name-identical short-circuit: if the candidate digitizes a
 * different segment of the same physical pipeline, only ONE endpoint pair
 * may match closely (e.g. Dampier-Bunbury: shared Bunbury terminus 13.7 km,
 * other end 560 km away because GEM stopped at Onslow vs the full Dampier
 * route). A tight average would miss this; the min of the four pairings
 * doesn't.
 */
function minPairwiseEndpointDistanceKm(a, b) {
  return Math.min(
    haversineKm(a.startPoint, b.startPoint),
    haversineKm(a.startPoint, b.endPoint),
    haversineKm(a.endPoint, b.startPoint),
    haversineKm(a.endPoint, b.endPoint),
  );
}
/**
 * Tokenize a name: lowercased word tokens, ASCII-only word boundaries,
 * stopwords removed. Stable across invocations.
 */
function tokenize(name) {
  return name
    .toLowerCase()
    .normalize('NFKD')
    // Strip combining marks (diacritics) so "Limón" → "limon" instead of
    // being split at the accent by the ASCII filter on the next line.
    // Range U+0300–U+036F is the Combining Diacritical Marks block.
    .replace(/[\u0300-\u036f]/g, '')
    .replace(/[^a-z0-9 ]+/g, ' ')
    .split(/\s+/)
    .filter((t) => t.length > 0 && !STOPWORDS.has(t));
}
/**
 * Jaccard similarity = |A ∩ B| / |A ∪ B| over token sets.
 */
function jaccard(a, b) {
  const setA = new Set(tokenize(a));
  const setB = new Set(tokenize(b));
  if (setA.size === 0 && setB.size === 0) return 0;
  let intersection = 0;
  for (const t of setA) if (setB.has(t)) intersection++;
  const unionSize = setA.size + setB.size - intersection;
  return unionSize === 0 ? 0 : intersection / unionSize;
}
/**
 * Decide if a candidate matches an existing row.
 *
 * Two acceptance paths:
 *   (a) Token sets are IDENTICAL (Jaccard == 1.0 after stopword removal)
 *       AND at least one of the four endpoint pairings is within 25 km.
 *       Catches the Dampier-Bunbury case (PR #3406 review): the shared
 *       Bunbury terminus was 13.7 km from the curated terminus, but the
 *       average endpoint distance was 287 km, far over the 5 km gate.
 *       Both names tokenize to {dampier, to, bunbury, natural, gas}, so
 *       they are clearly the same physical pipeline.
 *   (b) Average endpoint distance ≤ 5 km AND Jaccard ≥ 0.6: the original
 *       conjunctive rule for slight name-variation cases (e.g.
 *       "Trans Adriatic Pipeline" vs "Trans Adriatic Gas Pipeline",
 *       Jaccard 2/3).
 */
function isDuplicate(candidate, existing) {
  const sim = jaccard(candidate.name, existing.name);
  // Path (a): identical token-set + at least one endpoint pair within 25 km.
  // The geographic anchor distinguishes the Dampier-Bunbury case from a
  // theoretical name-collision in a different ocean.
  if (sim >= 1.0) {
    const minDist = minPairwiseEndpointDistanceKm(candidate, existing);
    if (minDist <= MATCH_NAME_IDENTICAL_DISTANCE_KM) return true;
    // Identical names but no endpoint pair within 25 km → distinct pipelines
    // sharing a name (rare but real). Fall through to the conjunctive rule
    // below, which returns false here: if every pairing exceeds 25 km, both
    // the forward and reversed averages exceed 25 km, and so the 5 km gate.
  }
  const dist = averageEndpointDistanceKm(candidate, existing);
  if (dist > MATCH_DISTANCE_KM) return false;
  return sim >= MATCH_JACCARD_MIN;
}
/**
 * Disambiguate a candidate's id against existing ids by appending -2, -3, ...
 * until unique. Stable: same input → same output.
 */
function uniqueId(baseId, takenIds) {
  if (!takenIds.has(baseId)) return baseId;
  let n = 2;
  while (takenIds.has(`${baseId}-${n}`)) n++;
  return `${baseId}-${n}`;
}
/**
 * Pure dedup function.
 *
 * @param {Array<{ id: string, name: string, startPoint: {lat:number,lon:number}, endPoint: {lat:number,lon:number} }>} existing
 * @param {Array<{ id: string, name: string, startPoint: {lat:number,lon:number}, endPoint: {lat:number,lon:number} }>} candidates
 * @returns {{ toAdd: any[], skippedDuplicates: Array<{ candidate: any, matchedExistingId: string, distanceKm: number, jaccard: number }> }}
 */
export function dedupePipelines(existing, candidates) {
  const taken = new Set(existing.map((p) => p.id));
  const toAdd = [];
  const skippedDuplicates = [];
  for (const cand of candidates) {
    // Compare against BOTH existing rows AND candidates already accepted
    // into toAdd. Without this, two GEM rows that match each other but
    // not anything in `existing` would both be added (a duplicate-import
    // bug). Existing rows still win on cross-set match (they have richer
    // hand-curated evidence); within-toAdd matches retain the FIRST
    // accepted candidate (deterministic by candidate-list order).
    let matched = null;
    for (const ex of existing) {
      if (isDuplicate(cand, ex)) {
        matched = ex;
        break;
      }
    }
    if (!matched) {
      for (const earlier of toAdd) {
        if (isDuplicate(cand, earlier)) {
          matched = earlier;
          break;
        }
      }
    }
    if (matched) {
      skippedDuplicates.push({
        candidate: cand,
        matchedExistingId: matched.id,
        distanceKm: averageEndpointDistanceKm(cand, matched),
        jaccard: jaccard(cand.name, matched.name),
      });
      continue;
    }
    const finalId = uniqueId(cand.id, taken);
    taken.add(finalId);
    toAdd.push({ ...cand, id: finalId });
  }
  return { toAdd, skippedDuplicates };
}
// Internal exports for test coverage; not part of the public surface.
export const _internal = {
  haversineKm,
  averageEndpointDistanceKm,
  minPairwiseEndpointDistanceKm,
  tokenize,
  jaccard,
  isDuplicate,
  uniqueId,
  STOPWORDS,
  MATCH_DISTANCE_KM,
  MATCH_JACCARD_MIN,
  MATCH_NAME_IDENTICAL_DISTANCE_KM,
};