worldmonitor/tests/pipeline-dedup.test.mjs
Elie Habib 8655bd81bc feat(energy-atlas): GEM pipeline data import — gas 297, oil 334 (#3406)
* feat(energy-atlas): GEM pipeline data import — gas 75→297, oil 75→334 (parity-push closure)

Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure
was built for. Per docs/methodology/pipelines.mdx operator runbook.

Source releases (CC-BY 4.0, attribution preserved in registry envelope):
  - GEM-GGIT-Gas-Pipelines-2025-11.xlsx
    SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6
    URL:    https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx
  - GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
    SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2
    URL:    https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx

Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) →
canonical operator-shape JSON via /tmp/gem-import/convert.py. Filter knobs:
  - status ∈ {operating, construction}
  - length ≥ 750 km (gas) / 400 km (oil) — asymmetric per-fuel trunk-class
  - capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day,
    bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors.
  - Country names → ISO 3166-1 alpha-2 via pycountry + alias table.
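
As a sketch, the unit normalization amounts to a per-fuel factor table keyed by source unit. The numeric factors below are standard engineering conversions, but the table layout and the `toCanonical` helper are illustrative only — the real logic lives in the Python converter:

```javascript
// Illustrative per-fuel conversion tables. Gas normalizes to bcm/y,
// oil to bbl/d, matching the canonical units named in the commit.
const GAS_TO_BCM_Y = {
  'bcm/y': 1,
  'MMcf/d': (1e6 * 0.0283168 * 365) / 1e9, // 1 cf = 0.0283168 m³ → ≈ 0.01034
  'MMSCMD': (1e6 * 365) / 1e9,             // million std m³/day → 0.365
  'mtpa': 1.36, // ≈ bcm gas per Mt LNG per year (approximate LNG convention)
};
const OIL_TO_BBL_D = {
  'bpd': 1,
  'bbl/d': 1,
  'Mb/d': 1000, // thousand barrels per day
  'kbd': 1000,
  'm3/day': 6.2898, // barrels per cubic metre
};

// Hypothetical helper: normalize a (value, unit) pair to the fuel's
// canonical unit, failing loud on unknown units.
function toCanonical(value, unit, fuel) {
  const table = fuel === 'gas' ? GAS_TO_BCM_Y : OIL_TO_BBL_D;
  const factor = table[unit];
  if (factor === undefined) throw new Error(`unknown ${fuel} unit: ${unit}`);
  return value * factor;
}
```

Failing loud on an unrecognized unit (rather than passing the value through) keeps a new unit string in a future GEM release from silently corrupting capacity stats.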

Merge results (via scripts/import-gem-pipelines.mjs --merge):
  gas: +222 added, 15 duplicates skipped (haversine ≤ 5km AND token Jaccard ≥ 0.6)
  oil: +259 added, 16 duplicates skipped
  Final: 297 gas / 334 oil. Hand-curated 75+75 preserved with full evidence;
  GEM rows ship physicalStateSource='gem', classifierConfidence=0.4,
  operatorStatement=null, sanctionRefs=[].

Floor bump:
  scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200.
  Live counts (297/334) leave ~100 rows of jitter headroom so a partial
  re-import or coverage-narrowing release fails loud rather than halving
  the registry silently.
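
A minimal sketch of the fail-loud gate — the constant name comes from this commit, but the function shape here is an assumption, not the real `validateRegistry` API:

```javascript
// Floor gate: refuse to publish a registry that has silently shrunk below
// the expected scale (e.g. a partial re-import or narrowed coverage).
const MIN_PIPELINES_PER_REGISTRY = 200;

function assertRegistryFloor(rows, label) {
  if (rows.length < MIN_PIPELINES_PER_REGISTRY) {
    throw new Error(
      `${label}: ${rows.length} pipelines < floor ${MIN_PIPELINES_PER_REGISTRY}` +
      ' — refusing to publish a silently-shrunken registry',
    );
  }
}
```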

Tests:
  - tests/pipelines-registry.test.mts: bumped synthetic-registry
    Array.from({length:8}) → length:210 to clear new floor; added 'gem' to
    the evidence-source whitelist for non-flowing badges (parity with the
    derivePipelinePublicBadge audit done in PR #3397 U1).
  - tests/import-gem-pipelines.test.mjs: bumped registry-conformance loop
    3 → 70 to clear new floor.
  - 51/51 pipeline tests pass; tsc --noEmit clean.

vs peer reference site (281 gas + 265 oil): we now meet or exceed both
(gas 297 vs 281, oil 334 vs 265). Functional + visual + data parity for the
energy variant is closed; the remaining gap is editorial cadence (weekly
briefing), which is intentionally out of scope per the parity-push plan.

* docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh

PR #3406 imported the data but didn't land the conversion script that
produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py
so future operators can reproduce the import deterministically, and rewrites
the docs/methodology/pipelines.mdx runbook to match what actually works:

- Use GeoJSON (not XLSX) — the XLSX has properties but no lat/lon columns;
  only the GIS .zip's GeoJSON has both. The original runbook said to download
  XLSX which would fail at the lat/lon validation step.
- Cadence: quarterly refresh, with concrete signals (peer-site comparison,
  90-day calendar reminder).
- Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so
  future operators don't re-request the wrong dataset (the Extraction
  Tracker = wells/fields, NOT pipelines — ours requires the Infrastructure
  Trackers).
- Last-known-good URLs documented + URL pattern explained as fallback when
  GEM rotates per release.
- Filter knob defaults documented inline (gas ≥ 750km, oil ≥ 400km, status
  ∈ {operating, construction}, capacity unit conversion table).
- Failure-mode table mapping common errors to fixes.

Converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON,
GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths so it
works for any release without code edits.

* fix(energy-atlas): close PR #3406 review findings — dedup + zero-length + test

Three Greptile findings on PR #3406:

P1 — Dedup miss (Dampier-Bunbury):
  Same physical pipeline existed in both registries — curated `dampier-bunbury`
  and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au` — because GEM
  digitized only the southern 60% of the line. The shared Bunbury terminus
  matched at 13.7 km but the average-endpoint distance was 287 km, just over
  the 5 km gate.
  Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit —
  if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint
  pairings is ≤ 25 km, treat as duplicate. The 25 km anchor preserves the
  existing "name collision in different ocean → still added" contract.
  Added regression test: identical Dampier-Bunbury inputs → 0 added, 1
  skipped, matched against `dampier-bunbury`.
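
A sketch of that short-circuit; the helper names (`haversineKm`, `minEndpointPairingKm`, `nameSetIdentityMatch`) are assumptions and may differ from the real `_pipeline-dedup.mjs` internals:

```javascript
// Great-circle distance between two {lat, lon} points, in km.
function haversineKm(a, b) {
  const R = 6371; // mean Earth radius, km
  const rad = (d) => (d * Math.PI) / 180;
  const dLat = rad(b.lat - a.lat);
  const dLon = rad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(a.lat)) * Math.cos(rad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Try all 4 start/end pairings so reversed or partially digitized routes
// (like GEM's southern-60% Dampier-Bunbury trace) still register the one
// terminus they share.
function minEndpointPairingKm(p, q) {
  const pairs = [
    [p.startPoint, q.startPoint],
    [p.startPoint, q.endPoint],
    [p.endPoint, q.startPoint],
    [p.endPoint, q.endPoint],
  ];
  return Math.min(...pairs.map(([a, b]) => haversineKm(a, b)));
}

// Short-circuit: identical token sets AND any terminus within 25 km.
function nameSetIdentityMatch(p, q, jaccard) {
  return jaccard === 1.0 && minEndpointPairingKm(p, q) <= 25;
}
```

On the Dampier-Bunbury coordinates from the regression test, the closest of the four pairings is the shared Bunbury terminus at roughly 13.7 km — inside the 25 km anchor — while the average-endpoint distance stays far over the 5 km gate.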

P1 — Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.):
  GEM source GeoJSON occasionally has a Point geometry or single-coord
  LineString, producing pipelines where startPoint == endPoint. They render
  as map-point artifacts and skew aggregate-length stats.
  Fix (defense in depth):
    - scripts/_gem-geojson-to-canonical.py drops at conversion time
      (`zero_length` reason in drop log).
    - scripts/_pipeline-registry.mjs validateRegistry rejects defensively
      so even a hand-curated row with degenerate geometry fails loud.
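
Both layers reduce to the same degenerate-geometry predicate, sketched below (the predicate name is illustrative; the real checks live in the Python converter and in validateRegistry):

```javascript
// A Point geometry or single-coord LineString collapses to a pipeline
// whose start and end coincide — a map-point artifact with zero length.
function isZeroLength(p) {
  return (
    p.startPoint.lat === p.endPoint.lat &&
    p.startPoint.lon === p.endPoint.lon
  );
}
```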

P2 — Test repetition coupled to fixture row count:
  The hardcoded `for (let i = 0; i < 70; i++)` × 3 fixture rows = 210 is
  implicitly coupled to fixture size: trim the fixture below 3 rows and the
  synthetic registry quietly drops under the floor, failing the test for a
  non-obvious reason.
  Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives reps from
  the floor and current fixture length.
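
For illustration, with today's floor and fixture size the derivation works out as:

```javascript
// Derive repetitions from the floor and the live fixture length instead of
// hardcoding 70. REGISTRY_FLOOR mirrors MIN_PIPELINES_PER_REGISTRY.
const REGISTRY_FLOOR = 200;
const fixture = [{}, {}, {}]; // stand-in for the 3 synthetic fixture rows
const reps = Math.ceil(REGISTRY_FLOOR / fixture.length) + 5; // 67 + 5 = 72
// 72 reps × 3 rows = 216 synthetic pipelines — above the floor, and the
// arithmetic re-derives itself if the fixture grows or shrinks.
```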

Re-run --merge with all fixes applied:
  gas: 75 → 293 (+218 added, 17 deduped — was 222/15 before; +2 catches via
       name-set-identity short-circuit; -2 zero-length never imported)
  oil: 75 → 325 (+250 added, 18 deduped — was 259/16; +2 catches; -7 zero-length)

Tests: 74/74 pipeline tests pass; tsc --noEmit clean.
2026-04-25 18:59:46 +04:00

// @ts-check
//
// Tests for scripts/_pipeline-dedup.mjs — the haversine + Jaccard dedup
// helper. Both criteria (≤5km AND ≥0.6) must hold for a match. Existing rows
// always win to preserve hand-curated evidence.
import { strict as assert } from 'node:assert';
import { test, describe } from 'node:test';
import { dedupePipelines, _internal } from '../scripts/_pipeline-dedup.mjs';
const { jaccard, averageEndpointDistanceKm, tokenize, uniqueId } = _internal;
function makePipeline(id, name, startLat, startLon, endLat, endLon) {
  return {
    id,
    name,
    startPoint: { lat: startLat, lon: startLon },
    endPoint: { lat: endLat, lon: endLon },
  };
}
describe('pipeline-dedup — internal helpers', () => {
  test('tokenize lowercases, splits, drops stopwords', () => {
    const tokens = tokenize('Trans-Siberian Pipeline System');
    assert.deepEqual(tokens.sort(), ['siberian', 'trans']);
  });
  test('tokenize removes punctuation and accents', () => {
    const tokens = tokenize('Caño Limón–Coveñas Pipeline');
    // After NFKD normalization + ascii-only filter, accented chars survive
    // as their base letter; we accept either exact or close behaviour.
    assert.ok(tokens.includes('limon') || tokens.includes('lim'),
      `expected Limón to tokenize; got ${tokens.join(',')}`);
  });
  test('jaccard returns 1.0 for identical token sets', () => {
    assert.equal(jaccard('Test Pipeline System', 'Test Pipeline'), 1.0);
  });
  test('jaccard returns 0 for fully disjoint names', () => {
    assert.equal(jaccard('Druzhba North', 'Nord Stream'), 0);
  });
  test('jaccard returns 1/3 when one of three union tokens is shared', () => {
    assert.equal(jaccard('Trans Adriatic', 'Trans Caspian'), 1 / 3);
  });
  test('haversine distance is sub-km for tiny coordinate offsets', () => {
    const a = makePipeline('a', 'A', 60, 30, 54, 13);
    const b = makePipeline('b', 'B', 60.001, 30.001, 54.001, 13.001);
    assert.ok(averageEndpointDistanceKm(a, b) < 1, 'sub-km on tiny offsets');
  });
  test('haversine distance for far-apart pipelines is large', () => {
    const a = makePipeline('a', 'A', 60, 30, 54, 13); // RU→DE
    const b = makePipeline('b', 'B', 30, -90, 25, -85); // Gulf of Mexico
    assert.ok(averageEndpointDistanceKm(a, b) > 5000);
  });
  test('uniqueId preserves base when free, suffixes when taken', () => {
    const taken = new Set(['foo', 'foo-2']);
    assert.equal(uniqueId('bar', taken), 'bar');
    assert.equal(uniqueId('foo', taken), 'foo-3');
  });
});
describe('pipeline-dedup — match logic', () => {
  test('happy path: completely-different name + far endpoints → added', () => {
    const existing = [makePipeline('druzhba-north', 'Druzhba Pipeline (Northern Branch)',
      52.6, 49.4, 52.32, 14.06)];
    const candidates = [makePipeline('nord-stream-1', 'Nord Stream 1',
      60.08, 29.05, 54.14, 13.66)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1);
    assert.equal(skippedDuplicates.length, 0);
  });
  test('match by both criteria: close endpoints + similar name → skipped (existing wins)', () => {
    const existing = [makePipeline('druzhba-north', 'Druzhba Pipeline',
      52.6, 49.4, 52.32, 14.06)];
    const candidates = [makePipeline('druzhba-import', 'Druzhba Pipeline',
      52.601, 49.401, 52.321, 14.061)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
    assert.equal(skippedDuplicates[0].matchedExistingId, 'druzhba-north');
  });
  test('identical names + one shared terminus (≤25 km) → deduped (PR #3406 Dampier-Bunbury regression)', () => {
    // Real-world case from PR #3406 review: GEM digitized only the southern
    // 60% of the line, so the shared Bunbury terminus matched at 13.7 km
    // but the average-endpoint distance was 287 km (over the 5 km gate).
    // Identical token sets + ≥1 close pairing = same physical pipeline.
    const existing = [makePipeline('dampier-bunbury', 'Dampier to Bunbury Natural Gas Pipeline',
      -20.68, 116.72, -33.33, 115.63)];
    const candidates = [makePipeline('dampier-to-bunbury-natural-gas-pipeline-au',
      'Dampier to Bunbury Natural Gas Pipeline',
      -33.265797, 115.755682, -24.86854, 113.674968)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
    assert.equal(skippedDuplicates[0].matchedExistingId, 'dampier-bunbury');
  });
  test('name-match only (endpoints in different ocean) → added', () => {
    const existing = [makePipeline('nord-stream-1', 'Nord Stream 1',
      60.08, 29.05, 54.14, 13.66)];
    const candidates = [makePipeline('imposter', 'Nord Stream 1',
      40.0, -100.0, 35.0, -90.0)]; // different continent
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1, 'low haversine confidence overrides high name match');
    assert.equal(skippedDuplicates.length, 0);
  });
  test('endpoint-match only (different name) → added (real distinct pipelines can share endpoints)', () => {
    const existing = [makePipeline('yamal-europe', 'Yamal–Europe',
      67.0, 75.0, 52.0, 14.0)];
    const candidates = [makePipeline('different-route', 'Trans-Siberian Coal Slurry',
      67.001, 75.001, 52.001, 14.001)];
    const { toAdd } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1, 'name disambiguates: same endpoints, different infrastructure');
  });
  test('reverse-direction match: candidate endpoints flipped → still detected', () => {
    const existing = [makePipeline('druzhba', 'Druzhba',
      52.6, 49.4, 52.32, 14.06)];
    // Same pipeline, route described in reverse direction
    const candidates = [makePipeline('druzhba-flipped', 'Druzhba',
      52.32, 14.06, 52.6, 49.4)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
  });
  test('stopword-only difference: "Pipeline System" vs "Line" → matches by Jaccard', () => {
    const existing = [makePipeline('trans-sib', 'Trans-Siberian Pipeline System',
      55, 30, 60, 90)];
    const candidates = [makePipeline('trans-sib-cand', 'Trans-Siberian Line',
      55.001, 30.001, 60.001, 90.001)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
    assert.ok(skippedDuplicates[0].jaccard >= 0.6);
  });
});
describe('pipeline-dedup — id collision', () => {
  test('candidate with id colliding existing gets suffixed -2', () => {
    const existing = [makePipeline('foo', 'Foo Pipeline', 0, 0, 1, 1)];
    const candidates = [makePipeline('foo', 'Bar Pipeline', 50, 50, 60, 60)];
    const { toAdd } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1);
    assert.equal(toAdd[0].id, 'foo-2');
  });
  test('three candidates colliding the same existing id get -2, -3, -4', () => {
    const existing = [makePipeline('foo', 'Foo Pipeline', 0, 0, 1, 1)];
    const candidates = [
      makePipeline('foo', 'Bar Pipeline', 50, 50, 60, 60),
      makePipeline('foo', 'Baz Pipeline', 70, 70, 80, 80),
      makePipeline('foo', 'Qux Pipeline', 30, -30, 40, -40),
    ];
    const { toAdd } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 3);
    assert.deepEqual(
      toAdd.map((p) => p.id).sort(),
      ['foo-2', 'foo-3', 'foo-4'],
    );
  });
});
describe('pipeline-dedup — determinism', () => {
  test('two invocations on identical inputs produce identical output', () => {
    const existing = [
      makePipeline('a', 'Alpha Pipeline', 10, 10, 20, 20),
      makePipeline('b', 'Beta Pipeline', 30, 30, 40, 40),
    ];
    const candidates = [
      makePipeline('a', 'Alpha Pipeline', 10.001, 10.001, 20.001, 20.001),
      makePipeline('c', 'Gamma Pipeline', 50, 50, 60, 60),
    ];
    const r1 = dedupePipelines(existing, candidates);
    const r2 = dedupePipelines(existing, candidates);
    assert.deepEqual(
      r1.toAdd.map((p) => p.id),
      r2.toAdd.map((p) => p.id),
    );
    assert.deepEqual(
      r1.skippedDuplicates.map((d) => d.matchedExistingId),
      r2.skippedDuplicates.map((d) => d.matchedExistingId),
    );
  });
});
describe('pipeline-dedup — within-batch dedup (review fix)', () => {
  test('two candidates that match each other but not any existing → only first is added', () => {
    // Regression: pre-fix, dedup compared each candidate ONLY against the
    // original `existing` array, so two GEM rows for the same pipeline (e.g.
    // a primary entry and a duplicate from a different source spreadsheet)
    // would BOTH end up in the registry.
    const candidates = [
      makePipeline('east-west-saudi', 'East-West Crude Pipeline', 25, 49, 24, 38),
      // Same pipeline, slightly different name + endpoints (within match
      // tolerance). Should be skipped as a duplicate of the first candidate.
      makePipeline('saudi-petroline', 'East-West Crude', 25.001, 49.001, 24.001, 38.001),
    ];
    const { toAdd, skippedDuplicates } = dedupePipelines([], candidates);
    assert.equal(toAdd.length, 1, 'second matching candidate must be skipped');
    assert.equal(skippedDuplicates.length, 1);
    assert.equal(toAdd[0].id, 'east-west-saudi', 'first-accepted candidate wins (deterministic)');
    assert.equal(skippedDuplicates[0].matchedExistingId, 'east-west-saudi',
      'skipped candidate matches the earlier-accepted one, not anything in `existing`');
  });
  test('three candidates with transitive matches collapse to one', () => {
    const candidates = [
      makePipeline('a', 'Druzhba', 52.6, 49.4, 52.32, 14.06),
      makePipeline('b', 'Druzhba Pipeline', 52.601, 49.401, 52.321, 14.061),
      makePipeline('c', 'Druzhba Line', 52.602, 49.402, 52.322, 14.062),
    ];
    const { toAdd } = dedupePipelines([], candidates);
    assert.equal(toAdd.length, 1, 'three matching candidates must collapse to the first one accepted');
  });
  test('existing wins over already-accepted candidate', () => {
    // If a candidate matches an existing row, it must be reported as
    // matching the existing row (existing-vs-toAdd precedence). Names
    // chosen so Jaccard exceeds 0.6 after stopword removal.
    const existing = [makePipeline('canon', 'Druzhba Northern', 52.6, 49.4, 52.32, 14.06)];
    const candidates = [
      makePipeline('cand-1', 'Druzhba Northern', 60, 30, 50, 14), // doesn't match existing (far endpoints)
      makePipeline('cand-2', 'Druzhba Northern', 52.601, 49.401, 52.321, 14.061), // matches existing (near + Jaccard=1)
    ];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1, 'cand-1 added; cand-2 skipped against existing');
    assert.equal(skippedDuplicates[0].matchedExistingId, 'canon',
      'cand-2 should be reported as matching the existing canon, not the earlier candidate');
  });
});
describe('pipeline-dedup — empty inputs', () => {
  test('empty existing + N candidates → all N added, none skipped', () => {
    const candidates = [
      makePipeline('a', 'A', 0, 0, 1, 1),
      makePipeline('b', 'B', 5, 5, 6, 6),
    ];
    const { toAdd, skippedDuplicates } = dedupePipelines([], candidates);
    assert.equal(toAdd.length, 2);
    assert.equal(skippedDuplicates.length, 0);
  });
  test('N existing + empty candidates → empty result', () => {
    const existing = [makePipeline('a', 'A', 0, 0, 1, 1)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, []);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 0);
  });
});