mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
* feat(energy-atlas): GEM pipeline data import — gas 75→297, oil 75→334 (parity-push closure)

  Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure was built for. Per the docs/methodology/pipelines.mdx operator runbook.

  Source releases (CC-BY 4.0, attribution preserved in registry envelope):
  - GEM-GGIT-Gas-Pipelines-2025-11.xlsx
    SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6
    URL: https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx
  - GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
    SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2
    URL: https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx

  Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) → canonical operator-shape JSON via /tmp/gem-import/convert.py.

  Filter knobs:
  - status ∈ {operating, construction}
  - length ≥ 750 km (gas) / 400 km (oil) — asymmetric per-fuel trunk-class cutoffs
  - capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day, bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors
  - country names → ISO 3166-1 alpha-2 via pycountry + alias table

  Merge results (via scripts/import-gem-pipelines.mjs --merge):
  - gas: +222 added, 15 duplicates skipped (haversine ≤ 5 km AND token Jaccard ≥ 0.6)
  - oil: +259 added, 16 duplicates skipped
  - final: 297 gas / 334 oil

  Hand-curated 75+75 preserved with full evidence; GEM rows ship physicalStateSource='gem', classifierConfidence=0.4, operatorStatement=null, sanctionRefs=[].

  Floor bump: scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200. Live counts (297/334) leave ~100 rows of jitter headroom, so a partial re-import or coverage-narrowing release fails loud rather than silently halving the registry.
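The capacity normalization described in the filter knobs can be sketched as a per-fuel factor table. This is an illustrative sketch only: the factor values below are standard physical approximations, not the import's actual canonical table (which lives in the converter), and `normalizeCapacity` is a hypothetical helper name.

```javascript
// Sketch of per-fuel capacity normalization (assumed factors, hypothetical names).
// Gas units collapse to bcm/y; oil units collapse to bbl/d.
const GAS_TO_BCM_PER_YEAR = {
  'bcm/y': 1,
  'MMcf/d': (1e6 * 0.0283168 * 365) / 1e9, // million cubic feet/day ≈ 0.01034 bcm/y
  'MMSCMD': (1e6 * 365) / 1e9,             // million standard m³/day → bcm/y
  'm3/day': 365 / 1e9,
  'mtpa': 1.36,                            // LNG rule of thumb: 1 mtpa ≈ 1.36 bcm/y
};

const OIL_TO_BBL_PER_DAY = {
  'bpd': 1,
  'Mb/d': 1000, // oilfield notation: M = thousand barrels/day
  'kbd': 1000,
};

function normalizeCapacity(value, unit, fuel) {
  const table = fuel === 'gas' ? GAS_TO_BCM_PER_YEAR : OIL_TO_BBL_PER_DAY;
  const factor = table[unit];
  if (factor === undefined) throw new Error(`unknown ${fuel} unit: ${unit}`);
  return value * factor;
}

console.log(normalizeCapacity(500, 'kbd', 'oil')); // 500000
```

Keeping the factors in one table per fuel makes an unrecognized unit fail loud at conversion time instead of silently passing a raw number through.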
  Tests:
  - tests/pipelines-registry.test.mts: bumped the synthetic registry Array.from({length:8}) → length:210 to clear the new floor; added 'gem' to the evidence-source whitelist for non-flowing badges (parity with the derivePipelinePublicBadge audit done in PR #3397 U1).
  - tests/import-gem-pipelines.test.mjs: bumped the registry-conformance loop 3 → 70 to clear the new floor.
  - 51/51 pipeline tests pass; tsc --noEmit clean.

  vs the peer reference site (281 gas + 265 oil): we now exceed both (gas 297 vs 281, oil 334 vs 265). Functional + visual + data parity for the energy variant is closed; the remaining gap is editorial cadence (weekly briefing), which is intentionally out of scope per the parity-push plan.

* docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh

  PR #3406 imported the data but didn't land the conversion script that produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py so future operators can reproduce the import deterministically, and rewrites the docs/methodology/pipelines.mdx runbook to match what actually works:
  - Use GeoJSON (not XLSX) — the XLSX has properties but no lat/lon columns; only the GIS .zip's GeoJSON has both. The original runbook said to download the XLSX, which would fail at the lat/lon validation step.
  - Cadence: quarterly refresh, with concrete signals (peer-site comparison, 90-day calendar reminder).
  - Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so future operators don't re-request the wrong dataset (the Extraction Tracker = wells/fields, NOT pipelines — ours requires the Infrastructure Trackers).
  - Last-known-good URLs documented, plus the URL pattern explained as a fallback when GEM rotates it per release.
  - Filter-knob defaults documented inline (gas ≥ 750 km, oil ≥ 400 km, status ∈ {operating, construction}, capacity unit conversion table).
  - Failure-mode table mapping common errors to fixes.
  The converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON, GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths, so it works for any release without code edits.

* fix(energy-atlas): close PR #3406 review findings — dedup + zero-length + test

  Three Greptile findings on PR #3406:

  P1 — Dedup miss (Dampier-Bunbury): the same physical pipeline existed in both registries — curated `dampier-bunbury` and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au` — because GEM digitized only the southern 60% of the line. The shared Bunbury terminus matched at 13.7 km, but the average-endpoint distance was 287 km, well over the 5 km gate. Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit — if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint pairings is ≤ 25 km, treat as a duplicate. The 25 km anchor preserves the existing "name collision in a different ocean → still added" contract. Added a regression test: identical Dampier-Bunbury inputs → 0 added, 1 skipped, matched against `dampier-bunbury`.

  P1 — Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.): GEM source GeoJSON occasionally has a Point geometry or a single-coordinate LineString, producing pipelines where startPoint == endPoint. They render as map-point artifacts and skew aggregate-length stats. Fix (defense in depth):
  - scripts/_gem-geojson-to-canonical.py drops them at conversion time (`zero_length` reason in the drop log).
  - scripts/_pipeline-registry.mjs validateRegistry rejects them defensively, so even a hand-curated row with degenerate geometry fails loud.

  P2 — Test repetition coupled to fixture row count: the hardcoded `for (let i = 0; i < 70; i++)` × 3 fixture rows = 210 silently breaks if the fixture is trimmed below 3 rows. Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives the repetition count from the floor and the current fixture length.
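The name-set-identity short-circuit from the Dampier-Bunbury fix can be sketched as follows. This is a minimal sketch under assumed shapes — `nameSetIdentityMatch` and the inline haversine are illustrative; the real logic lives in scripts/_pipeline-dedup.mjs.

```javascript
// Sketch of the name-set-identity short-circuit (hypothetical helper names).
// A candidate is treated as a duplicate when the token sets are identical
// (Jaccard == 1.0) AND at least one of the four endpoint pairings is within
// 25 km — even when the average endpoint distance blows past the 5 km gate.
const EARTH_RADIUS_KM = 6371;

function haversineKm(a, b) {
  const toRad = (d) => (d * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

function nameSetIdentityMatch(existing, candidate, jaccardScore) {
  if (jaccardScore < 1.0) return false;
  // All 4 endpoint pairings: start↔start, start↔end, end↔start, end↔end.
  const pairings = [
    [existing.startPoint, candidate.startPoint],
    [existing.startPoint, candidate.endPoint],
    [existing.endPoint, candidate.startPoint],
    [existing.endPoint, candidate.endPoint],
  ];
  return pairings.some(([p, q]) => haversineKm(p, q) <= 25);
}

// Dampier-Bunbury shape: shared Bunbury terminus ~14 km apart, far other ends.
const curated = {
  startPoint: { lat: -20.68, lon: 116.72 },
  endPoint: { lat: -33.33, lon: 115.63 },
};
const gem = {
  startPoint: { lat: -33.27, lon: 115.76 },
  endPoint: { lat: -24.87, lon: 113.67 },
};
console.log(nameSetIdentityMatch(curated, gem, 1.0)); // true — shared terminus ≤ 25 km
```

Gating on exact token-set identity (not merely Jaccard ≥ 0.6) is what keeps the 25 km anchor from reviving the "name collision in a different ocean → still added" cases.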
  Re-run --merge with all fixes applied:
  - gas: 75 → 293 (+218 added, 17 deduped — was 222/15; +2 catches via the name-set-identity short-circuit; -2 zero-length rows never imported)
  - oil: 75 → 325 (+250 added, 18 deduped — was 259/16; +2 catches; -7 zero-length)

  Tests: 74/74 pipeline tests pass; tsc --noEmit clean.
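The zero-length geometry guard from the fix commit above can be sketched as a simple partition over the converted rows. This is an assumed row shape and hypothetical helper names; the real checks live in scripts/_gem-geojson-to-canonical.py (conversion-time drop) and validateRegistry (defensive reject).

```javascript
// Sketch of the degenerate-geometry guard (assumed shapes, hypothetical names).
// A pipeline whose start and end coincide came from a Point geometry or a
// single-coordinate LineString and should be dropped, not rendered as a dot.
function isZeroLength(pipeline, epsilonDeg = 1e-6) {
  const { startPoint: s, endPoint: e } = pipeline;
  return Math.abs(s.lat - e.lat) < epsilonDeg && Math.abs(s.lon - e.lon) < epsilonDeg;
}

function partitionByGeometry(rows) {
  const kept = [];
  const dropped = []; // mirrors the converter's `zero_length` drop-log reason
  for (const row of rows) {
    (isZeroLength(row) ? dropped : kept).push(row);
  }
  return { kept, dropped };
}

const rows = [
  { id: 'ok', startPoint: { lat: 61.1, lon: -149.9 }, endPoint: { lat: 61.2, lon: -146.4 } },
  { id: 'degenerate', startPoint: { lat: 61.1, lon: -149.9 }, endPoint: { lat: 61.1, lon: -149.9 } },
];
const { kept, dropped } = partitionByGeometry(rows);
console.log(kept.length, dropped.length); // 1 1
```

Running the same predicate at both conversion and registry-validation time is the defense in depth: a degenerate row that slips past (or never went through) the converter still fails loud at the registry gate.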
259 lines
12 KiB
JavaScript
// @ts-check
//
// Tests for scripts/_pipeline-dedup.mjs — the haversine + Jaccard dedup
// helper. Both criteria (≤5km AND ≥0.6) must hold for a match. Existing rows
// always win to preserve hand-curated evidence.

import { strict as assert } from 'node:assert';
import { test, describe } from 'node:test';
import { dedupePipelines, _internal } from '../scripts/_pipeline-dedup.mjs';

const { jaccard, averageEndpointDistanceKm, tokenize, uniqueId } = _internal;

function makePipeline(id, name, startLat, startLon, endLat, endLon) {
  return {
    id,
    name,
    startPoint: { lat: startLat, lon: startLon },
    endPoint: { lat: endLat, lon: endLon },
  };
}

describe('pipeline-dedup — internal helpers', () => {
  test('tokenize lowercases, splits, drops stopwords', () => {
    const tokens = tokenize('Trans-Siberian Pipeline System');
    assert.deepEqual(tokens.sort(), ['siberian', 'trans']);
  });

  test('tokenize removes punctuation and accents', () => {
    const tokens = tokenize('Caño Limón–Coveñas Pipeline');
    // After NFKD normalization + ascii-only filter, accented chars survive
    // as their base letter; we accept either exact or close behaviour.
    assert.ok(tokens.includes('limon') || tokens.includes('lim'),
      `expected Limón to tokenize; got ${tokens.join(',')}`);
  });

  test('jaccard returns 1.0 for identical token sets', () => {
    assert.equal(jaccard('Test Pipeline System', 'Test Pipeline'), 1.0);
  });

  test('jaccard returns 0 for fully disjoint names', () => {
    assert.equal(jaccard('Druzhba North', 'Nord Stream'), 0);
  });

  test('jaccard 1/3 for one shared token out of three', () => {
    assert.equal(jaccard('Trans Adriatic', 'Trans Caspian'), 1 / 3);
  });

  test('haversine distance is sub-km for near-identical endpoints', () => {
    const a = makePipeline('a', 'A', 60, 30, 54, 13);
    const b = makePipeline('b', 'B', 60.001, 30.001, 54.001, 13.001);
    assert.ok(averageEndpointDistanceKm(a, b) < 1, 'sub-km on tiny offsets');
  });

  test('haversine distance for far-apart pipelines is large', () => {
    const a = makePipeline('a', 'A', 60, 30, 54, 13); // RU→DE
    const b = makePipeline('b', 'B', 30, -90, 25, -85); // Gulf of Mexico
    assert.ok(averageEndpointDistanceKm(a, b) > 5000);
  });

  test('uniqueId preserves base when free, suffixes when taken', () => {
    const taken = new Set(['foo', 'foo-2']);
    assert.equal(uniqueId('bar', taken), 'bar');
    assert.equal(uniqueId('foo', taken), 'foo-3');
  });
});

describe('pipeline-dedup — match logic', () => {
  test('happy path: completely-different name + far endpoints → added', () => {
    const existing = [makePipeline('druzhba-north', 'Druzhba Pipeline (Northern Branch)',
      52.6, 49.4, 52.32, 14.06)];
    const candidates = [makePipeline('nord-stream-1', 'Nord Stream 1',
      60.08, 29.05, 54.14, 13.66)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1);
    assert.equal(skippedDuplicates.length, 0);
  });

  test('match by both criteria: close endpoints + similar name → skipped (existing wins)', () => {
    const existing = [makePipeline('druzhba-north', 'Druzhba Pipeline',
      52.6, 49.4, 52.32, 14.06)];
    const candidates = [makePipeline('druzhba-import', 'Druzhba Pipeline',
      52.601, 49.401, 52.321, 14.061)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
    assert.equal(skippedDuplicates[0].matchedExistingId, 'druzhba-north');
  });

  test('identical names + one shared terminus (≤25 km) → deduped (PR #3406 Dampier-Bunbury regression)', () => {
    // Real-world case from PR #3406 review: GEM digitized only the southern
    // 60% of the line, so the shared Bunbury terminus matched at 13.7 km
    // but the average-endpoint distance was 287 km (over the 5 km gate).
    // Identical token sets + ≥1 close pairing = same physical pipeline.
    const existing = [makePipeline('dampier-bunbury', 'Dampier to Bunbury Natural Gas Pipeline',
      -20.68, 116.72, -33.33, 115.63)];
    const candidates = [makePipeline('dampier-to-bunbury-natural-gas-pipeline-au',
      'Dampier to Bunbury Natural Gas Pipeline',
      -33.265797, 115.755682, -24.86854, 113.674968)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
    assert.equal(skippedDuplicates[0].matchedExistingId, 'dampier-bunbury');
  });

  test('name-match only (endpoints in different ocean) → added', () => {
    const existing = [makePipeline('nord-stream-1', 'Nord Stream 1',
      60.08, 29.05, 54.14, 13.66)];
    const candidates = [makePipeline('imposter', 'Nord Stream 1',
      40.0, -100.0, 35.0, -90.0)]; // different continent
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1, 'low haversine confidence overrides high name match');
    assert.equal(skippedDuplicates.length, 0);
  });

  test('endpoint-match only (different name) → added (real distinct pipelines can share endpoints)', () => {
    const existing = [makePipeline('yamal-europe', 'Yamal–Europe',
      67.0, 75.0, 52.0, 14.0)];
    const candidates = [makePipeline('different-route', 'Trans-Siberian Coal Slurry',
      67.001, 75.001, 52.001, 14.001)];
    const { toAdd } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1, 'name disambiguates: same endpoints, different infrastructure');
  });

  test('reverse-direction match: candidate endpoints flipped → still detected', () => {
    const existing = [makePipeline('druzhba', 'Druzhba',
      52.6, 49.4, 52.32, 14.06)];
    // Same pipeline, route described in reverse direction
    const candidates = [makePipeline('druzhba-flipped', 'Druzhba',
      52.32, 14.06, 52.6, 49.4)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
  });

  test('stopword-only difference: "Pipeline System" vs "Line" → matches by Jaccard', () => {
    const existing = [makePipeline('trans-sib', 'Trans-Siberian Pipeline System',
      55, 30, 60, 90)];
    const candidates = [makePipeline('trans-sib-cand', 'Trans-Siberian Line',
      55.001, 30.001, 60.001, 90.001)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
    assert.ok(skippedDuplicates[0].jaccard >= 0.6);
  });
});

describe('pipeline-dedup — id collision', () => {
  test('candidate with id colliding existing gets suffixed -2', () => {
    const existing = [makePipeline('foo', 'Foo Pipeline', 0, 0, 1, 1)];
    const candidates = [makePipeline('foo', 'Bar Pipeline', 50, 50, 60, 60)];
    const { toAdd } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1);
    assert.equal(toAdd[0].id, 'foo-2');
  });

  test('three candidates colliding the same existing id get -2, -3, -4', () => {
    const existing = [makePipeline('foo', 'Foo Pipeline', 0, 0, 1, 1)];
    const candidates = [
      makePipeline('foo', 'Bar Pipeline', 50, 50, 60, 60),
      makePipeline('foo', 'Baz Pipeline', 70, 70, 80, 80),
      makePipeline('foo', 'Qux Pipeline', 30, -30, 40, -40),
    ];
    const { toAdd } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 3);
    assert.deepEqual(
      toAdd.map((p) => p.id).sort(),
      ['foo-2', 'foo-3', 'foo-4'],
    );
  });
});

describe('pipeline-dedup — determinism', () => {
  test('two invocations on identical inputs produce identical output', () => {
    const existing = [
      makePipeline('a', 'Alpha Pipeline', 10, 10, 20, 20),
      makePipeline('b', 'Beta Pipeline', 30, 30, 40, 40),
    ];
    const candidates = [
      makePipeline('a', 'Alpha Pipeline', 10.001, 10.001, 20.001, 20.001),
      makePipeline('c', 'Gamma Pipeline', 50, 50, 60, 60),
    ];
    const r1 = dedupePipelines(existing, candidates);
    const r2 = dedupePipelines(existing, candidates);
    assert.deepEqual(
      r1.toAdd.map((p) => p.id),
      r2.toAdd.map((p) => p.id),
    );
    assert.deepEqual(
      r1.skippedDuplicates.map((d) => d.matchedExistingId),
      r2.skippedDuplicates.map((d) => d.matchedExistingId),
    );
  });
});

describe('pipeline-dedup — within-batch dedup (review fix)', () => {
  test('two candidates that match each other but not any existing → only first is added', () => {
    // Regression: pre-fix, dedup compared each candidate ONLY against the
    // original `existing` array, so two GEM rows for the same pipeline (e.g.
    // a primary entry and a duplicate from a different source spreadsheet)
    // would BOTH end up in the registry.
    const candidates = [
      makePipeline('east-west-saudi', 'East-West Crude Pipeline', 25, 49, 24, 38),
      // Same pipeline, slightly different name + endpoints (within match
      // tolerance). Should be skipped as a duplicate of the first candidate.
      makePipeline('saudi-petroline', 'East-West Crude', 25.001, 49.001, 24.001, 38.001),
    ];
    const { toAdd, skippedDuplicates } = dedupePipelines([], candidates);
    assert.equal(toAdd.length, 1, 'second matching candidate must be skipped');
    assert.equal(skippedDuplicates.length, 1);
    assert.equal(toAdd[0].id, 'east-west-saudi', 'first-accepted candidate wins (deterministic)');
    assert.equal(skippedDuplicates[0].matchedExistingId, 'east-west-saudi',
      'skipped candidate matches the earlier-accepted one, not anything in `existing`');
  });

  test('three candidates with transitive matches collapse to one', () => {
    const candidates = [
      makePipeline('a', 'Druzhba', 52.6, 49.4, 52.32, 14.06),
      makePipeline('b', 'Druzhba Pipeline', 52.601, 49.401, 52.321, 14.061),
      makePipeline('c', 'Druzhba Line', 52.602, 49.402, 52.322, 14.062),
    ];
    const { toAdd } = dedupePipelines([], candidates);
    assert.equal(toAdd.length, 1, 'three matching candidates must collapse to the first one accepted');
  });

  test('existing wins over already-accepted candidate', () => {
    // If a candidate matches an existing row, it must be reported as
    // matching the existing row (existing-vs-toAdd precedence). Names
    // chosen so Jaccard exceeds 0.6 after stopword removal.
    const existing = [makePipeline('canon', 'Druzhba Northern', 52.6, 49.4, 52.32, 14.06)];
    const candidates = [
      makePipeline('cand-1', 'Druzhba Northern', 60, 30, 50, 14), // doesn't match existing (far endpoints)
      makePipeline('cand-2', 'Druzhba Northern', 52.601, 49.401, 52.321, 14.061), // matches existing (near + Jaccard=1)
    ];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1, 'cand-1 added; cand-2 skipped against existing');
    assert.equal(skippedDuplicates[0].matchedExistingId, 'canon',
      'cand-2 should be reported as matching the existing canon, not the earlier candidate');
  });
});

describe('pipeline-dedup — empty inputs', () => {
  test('empty existing + N candidates → all N added, none skipped', () => {
    const candidates = [
      makePipeline('a', 'A', 0, 0, 1, 1),
      makePipeline('b', 'B', 5, 5, 6, 6),
    ];
    const { toAdd, skippedDuplicates } = dedupePipelines([], candidates);
    assert.equal(toAdd.length, 2);
    assert.equal(skippedDuplicates.length, 0);
  });

  test('N existing + empty candidates → empty result', () => {
    const existing = [makePipeline('a', 'A', 0, 0, 1, 1)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, []);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 0);
  });
});