worldmonitor/tests/import-gem-pipelines.test.mjs
Elie Habib 8655bd81bc feat(energy-atlas): GEM pipeline data import — gas 297, oil 334 (#3406)
* feat(energy-atlas): GEM pipeline data import — gas 75→297, oil 75→334 (parity-push closure)

Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure
was built for, following the docs/methodology/pipelines.mdx operator runbook.

Source releases (CC-BY 4.0, attribution preserved in registry envelope):
  - GEM-GGIT-Gas-Pipelines-2025-11.xlsx
    SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6
    URL:    https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx
  - GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
    SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2
    URL:    https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx

Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) →
canonical operator-shape JSON via /tmp/gem-import/convert.py. Filter knobs:
  - status ∈ {operating, construction}
  - length ≥ 750 km (gas) / 400 km (oil) — asymmetric per-fuel trunk-class thresholds
  - capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day,
    bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors.
  - Country names → ISO 3166-1 alpha-2 via pycountry + alias table.
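The unit normalization above can be sketched in JavaScript. This is a minimal sketch: the function and table names are assumptions, not code from the repo; the MMcf/d, MMSCMD, and m3/day factors are plain unit arithmetic, while the mtpa factor is a commonly used LNG equivalence included for illustration only.

```javascript
// Illustrative sketch of the capacity normalization described above.
// Factors for MMcf/d, MMSCMD, and m3/day are exact unit arithmetic; the
// mtpa factor is a common LNG equivalence and is illustrative only.
const GAS_TO_BCM_PER_YEAR = {
  'bcm/y': 1,                                // native unit, pass through
  'MMcf/d': (1e6 * 0.0283168 * 365) / 1e9,   // million cubic feet per day
  'MMSCMD': (1e6 * 365) / 1e9,               // million standard m3 per day
  'm3/day': 365 / 1e9,
  'mtpa': 1.36,                              // ~1.36 bcm/y per mtpa (illustrative)
};

// Oil registry convention: capacityMbd holds MILLIONS of bbl/d.
const OIL_TO_CAPACITY_MBD = {
  'bbl/d': 1e-6,
  'kbd': 1e-3,   // thousand bbl/d
};

function convertCapacity(value, unit, table) {
  const factor = table[unit];
  if (factor === undefined) throw new Error(`unknown capacity unit: ${unit}`);
  return value * factor;
}
```

Under this sketch a row carrying `capacity: 400000, capacityUnit: 'bbl/d'` would emit `capacityMbd: 0.4`, the same convention the test file exercises.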

Merge results (via scripts/import-gem-pipelines.mjs --merge):
  gas: +222 added, 15 duplicates skipped (haversine ≤ 5km AND token Jaccard ≥ 0.6)
  oil: +259 added, 16 duplicates skipped
  Final: 297 gas / 334 oil. Hand-curated 75+75 preserved with full evidence;
  GEM rows ship physicalStateSource='gem', classifierConfidence=0.4,
  operatorStatement=null, sanctionRefs=[].
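The duplicate gate quoted above (haversine ≤ 5 km AND token Jaccard ≥ 0.6) can be illustrated with a short sketch. Helper names are hypothetical; the real logic lives in scripts/_pipeline-dedup.mjs and compares more endpoint pairings than shown here.

```javascript
// Great-circle distance between two lat/lon points (standard haversine).
const EARTH_RADIUS_KM = 6371;

function haversineKm(lat1, lon1, lat2, lon2) {
  const rad = (d) => (d * Math.PI) / 180;
  const dLat = rad(lat2 - lat1);
  const dLon = rad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(lat1)) * Math.cos(rad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// Jaccard similarity over lowercase name tokens.
function tokenJaccard(a, b) {
  const tok = (s) => new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  const ta = tok(a);
  const tb = tok(b);
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Both conditions must hold: geometric proximity AND name similarity.
function isDuplicate(candidate, existing) {
  return (
    haversineKm(candidate.startLat, candidate.startLon, existing.startLat, existing.startLon) <= 5 &&
    tokenJaccard(candidate.name, existing.name) >= 0.6
  );
}
```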

Floor bump:
  scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200.
  Live counts (297/334) leave ~100 rows of jitter headroom so a partial
  re-import or coverage-narrowing release fails loud rather than halving
  the registry silently.
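The loud-failure contract is just a floor assertion. A minimal sketch follows; the constant name matches the commit text, but the function body is illustrative, not the repo's validateRegistry.

```javascript
// Illustrative floor check: a partial re-import that halves the registry
// should throw, not ship silently.
const MIN_PIPELINES_PER_REGISTRY = 200;

function assertRegistryFloor(registry) {
  const n = Object.keys(registry.pipelines).length;
  if (n < MIN_PIPELINES_PER_REGISTRY) {
    throw new Error(
      `registry has ${n} pipelines; floor is ${MIN_PIPELINES_PER_REGISTRY}`,
    );
  }
}
```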

Tests:
  - tests/pipelines-registry.test.mts: bumped synthetic-registry
    Array.from({length:8}) → length:210 to clear new floor; added 'gem' to
    the evidence-source whitelist for non-flowing badges (parity with the
    derivePipelinePublicBadge audit done in PR #3397 U1).
  - tests/import-gem-pipelines.test.mjs: bumped registry-conformance loop
    3 → 70 to clear new floor.
  - 51/51 pipeline tests pass; tsc --noEmit clean.

Vs. the peer reference site (281 gas + 265 oil): we now exceed on both fuels
(gas 297, oil 334). Functional + visual + data parity for the energy variant
is closed; the remaining gap is editorial cadence (weekly briefing), which is
intentionally out of scope per the parity-push plan.

* docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh

PR #3406 imported the data but didn't land the conversion script that
produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py
so future operators can reproduce the import deterministically, and rewrites
the docs/methodology/pipelines.mdx runbook to match what actually works:

- Use GeoJSON (not XLSX) — the XLSX has properties but no lat/lon columns;
  only the GIS .zip's GeoJSON has both. The original runbook said to download
  XLSX which would fail at the lat/lon validation step.
- Cadence: quarterly refresh, with concrete signals (peer-site comparison,
  90-day calendar reminder).
- Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so
  future operators don't re-request the wrong dataset (the Extraction
  Tracker = wells/fields, NOT pipelines — ours requires the Infrastructure
  Trackers).
- Last-known-good URLs documented + URL pattern explained as fallback when
  GEM rotates per release.
- Filter knob defaults documented inline (gas ≥ 750km, oil ≥ 400km, status
  ∈ {operating, construction}, capacity unit conversion table).
- Failure-mode table mapping common errors to fixes.

Converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON,
GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths so it
works for any release without code edits.
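The env-var contract can be sketched in JavaScript for illustration (the converter itself is Python; `requireEnv` and `loadConfig` are hypothetical names, while the four variable names come from the commit text):

```javascript
// Hypothetical sketch of the converter's env-driven configuration. Each
// variable is required so a forgotten value fails loud instead of falling
// back to a stale hardcoded path.
function requireEnv(name) {
  const value = process.env[name];
  if (!value) throw new Error(`missing required env var: ${name}`);
  return value;
}

function loadConfig() {
  return {
    gasGeojson: requireEnv('GEM_GAS_GEOJSON'),
    oilGeojson: requireEnv('GEM_OIL_GEOJSON'),
    downloadedAt: requireEnv('GEM_DOWNLOADED_AT'),
    sourceVersion: requireEnv('GEM_SOURCE_VERSION'),
  };
}
```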

* fix(energy-atlas): close PR #3406 review findings — dedup + zero-length + test

Three Greptile findings on PR #3406:

P1 — Dedup miss (Dampier-Bunbury):
  Same physical pipeline existed in both registries — curated `dampier-bunbury`
  and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au` — because GEM
  digitized only the southern 60% of the line. The closest endpoint pairing
  (the shared Bunbury terminus) was 13.7 km apart, just over the 5 km gate,
  and the average-endpoint distance was 287 km.
  Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit —
  if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint
  pairings is ≤ 25 km, treat as duplicate. The 25 km anchor preserves the
  existing "name collision in different ocean → still added" contract.
  Added regression test: identical Dampier-Bunbury inputs → 0 added, 1
  skipped, matched against `dampier-bunbury`.
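The short-circuit can be sketched as follows. The stopword list here is illustrative, and `haversineKm` is assumed to be the repo's great-circle helper, injected as a parameter so the sketch stays self-contained:

```javascript
// Illustrative stopword list; the repo's actual list may differ.
const STOPWORDS = new Set(['to', 'the', 'natural', 'gas', 'oil', 'pipeline']);

function nameTokens(name) {
  return new Set(
    name.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t && !STOPWORDS.has(t)),
  );
}

function setsIdentical(a, b) {
  return a.size === b.size && [...a].every((t) => b.has(t));
}

// Name-set identity (Jaccard == 1.0 after stopword removal) plus ANY of the
// 4 endpoint pairings within 25 km → duplicate, even when digitized extents
// differ. haversineKm(lat1, lon1, lat2, lon2) is injected by the caller.
function nameIdentityShortCircuit(p, q, haversineKm) {
  if (!setsIdentical(nameTokens(p.name), nameTokens(q.name))) return false;
  const endpoints = (r) => [[r.startLat, r.startLon], [r.endLat, r.endLon]];
  for (const [latA, lonA] of endpoints(p)) {
    for (const [latB, lonB] of endpoints(q)) {
      if (haversineKm(latA, lonA, latB, lonB) <= 25) return true;
    }
  }
  return false;
}
```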

P1 — Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.):
  GEM source GeoJSON occasionally has a Point geometry or single-coord
  LineString, producing pipelines where startPoint == endPoint. They render
  as map-point artifacts and skew aggregate-length stats.
  Fix (defense in depth):
    - scripts/_gem-geojson-to-canonical.py drops at conversion time
      (`zero_length` reason in drop log).
    - scripts/_pipeline-registry.mjs validateRegistry rejects defensively
      so even a hand-curated row with degenerate geometry fails loud.
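The defensive half of the fix can be sketched as (hypothetical helper names; the real rejection lives inside validateRegistry):

```javascript
// A pipeline whose start and end coincide came from a Point geometry or a
// single-coordinate LineString; it renders as a map-point artifact.
function hasDegenerateGeometry(p) {
  return p.startLat === p.endLat && p.startLon === p.endLon;
}

function assertNoZeroLengthRows(registry) {
  for (const [id, p] of Object.entries(registry.pipelines)) {
    if (hasDegenerateGeometry(p)) {
      throw new Error(`registry row ${id}: startPoint == endPoint (zero-length geometry)`);
    }
  }
}
```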

P2 — Test repetition coupled to fixture row count:
  The hardcoded `for (let i = 0; i < 70; i++)` assumed exactly 3 fixture rows
  (70 × 3 = 210) and silently falls below the 200-row floor if the fixture is
  trimmed below 3 rows.
  Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives reps from
  the floor and current fixture length.

Re-run --merge with all fixes applied:
  gas: 75 → 293 (+218 added, 17 deduped — was 222/15 before; +2 catches via
       name-set-identity short-circuit; -2 zero-length never imported)
  oil: 75 → 325 (+250 added, 18 deduped — was 259/16; +2 catches; -7 zero-length)

Tests: 74/74 pipeline tests pass; tsc --noEmit clean.
2026-04-25 18:59:46 +04:00

282 lines
10 KiB
JavaScript

// @ts-check
//
// Tests for scripts/import-gem-pipelines.mjs — the GEM Oil & Gas Infrastructure
// Tracker → registry-shape parser. Test-first per the plan's Execution note: the
// schema-sentinel + status/productClass/capacity-unit mapping is the highest-
// risk failure mode, so coverage for it lands before the implementation does.
//
// Fixture: tests/fixtures/gem-pipelines-sample.json — operator-shape JSON
// (Excel pre-converted externally; the parser is local-file-only, no xlsx
// dep, no runtime URL fetch).
import { strict as assert } from 'node:assert';
import { test, describe } from 'node:test';
import { readFileSync } from 'node:fs';
import { resolve, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
import { parseGemPipelines, REQUIRED_COLUMNS } from '../scripts/import-gem-pipelines.mjs';
import { validateRegistry } from '../scripts/_pipeline-registry.mjs';
const __dirname = dirname(fileURLToPath(import.meta.url));
const fixturePath = resolve(__dirname, 'fixtures/gem-pipelines-sample.json');
const fixture = JSON.parse(readFileSync(fixturePath, 'utf-8'));
describe('import-gem-pipelines — schema sentinel', () => {
  test('REQUIRED_COLUMNS is exported and non-empty', () => {
    assert.ok(Array.isArray(REQUIRED_COLUMNS));
    assert.ok(REQUIRED_COLUMNS.length >= 5);
  });

  test('throws on missing required column', () => {
    const broken = {
      ...fixture,
      pipelines: fixture.pipelines.map((p) => {
        const { name: _drop, ...rest } = p;
        return rest;
      }),
    };
    assert.throws(
      () => parseGemPipelines(broken),
      /missing|name|schema/i,
      'parser must throw on column drift, not silently accept',
    );
  });

  test('throws on non-object input', () => {
    assert.throws(() => parseGemPipelines(null), /input/i);
    assert.throws(() => parseGemPipelines([]), /input|pipelines/i);
  });

  test('throws when pipelines field is missing', () => {
    assert.throws(() => parseGemPipelines({ source: 'test' }), /pipelines/i);
  });
});
describe('import-gem-pipelines — fuel split', () => {
  test('splits gas + oil into two arrays', () => {
    const { gas, oil } = parseGemPipelines(fixture);
    assert.equal(gas.length, 3, 'fixture has 3 gas rows');
    assert.equal(oil.length, 3, 'fixture has 3 oil rows');
  });

  test('gas pipelines do NOT carry productClass (gas registry forbids it)', () => {
    const { gas } = parseGemPipelines(fixture);
    for (const p of gas) {
      assert.equal(p.productClass, undefined, `${p.name}: gas should not have productClass`);
    }
  });

  test('every oil pipeline declares a productClass from the enum', () => {
    const { oil } = parseGemPipelines(fixture);
    for (const p of oil) {
      assert.ok(
        ['crude', 'products', 'mixed'].includes(p.productClass),
        `${p.name} has invalid productClass: ${p.productClass}`,
      );
    }
  });
});
describe('import-gem-pipelines — status mapping', () => {
  test("'Operating' maps to physicalState='flowing'", () => {
    const { gas, oil } = parseGemPipelines(fixture);
    const op = [...gas, ...oil].filter((p) => p.name.includes('Operating'));
    assert.ok(op.length > 0);
    for (const p of op) {
      assert.equal(p.evidence.physicalState, 'flowing');
    }
  });

  test("'Construction' maps to physicalState='unknown' (planned/not commissioned)", () => {
    const { gas } = parseGemPipelines(fixture);
    const ctr = gas.find((p) => p.name.includes('Construction'));
    assert.ok(ctr);
    assert.equal(ctr.evidence.physicalState, 'unknown');
  });

  test("'Cancelled' / 'Mothballed' map to physicalState='offline'", () => {
    const { gas, oil } = parseGemPipelines(fixture);
    const cancelled = gas.find((p) => p.name.includes('Cancelled'));
    const mothballed = oil.find((p) => p.name.includes('Mothballed'));
    assert.ok(cancelled);
    assert.ok(mothballed);
    assert.equal(cancelled.evidence.physicalState, 'offline');
    assert.equal(mothballed.evidence.physicalState, 'offline');
  });
});
describe('import-gem-pipelines — productClass mapping', () => {
  test("'Crude Oil' product → productClass='crude'", () => {
    const { oil } = parseGemPipelines(fixture);
    const crude = oil.find((p) => p.name.includes('Crude Oil Trunk'));
    assert.ok(crude);
    assert.equal(crude.productClass, 'crude');
  });

  test("'Refined Products' product → productClass='products'", () => {
    const { oil } = parseGemPipelines(fixture);
    const refined = oil.find((p) => p.name.includes('Refined Products'));
    assert.ok(refined);
    assert.equal(refined.productClass, 'products');
  });
});
describe('import-gem-pipelines — capacity-unit conversion', () => {
  test('gas capacity in bcm/y is preserved unchanged', () => {
    const { gas } = parseGemPipelines(fixture);
    const opGas = gas.find((p) => p.name.includes('Operating'));
    assert.ok(opGas);
    assert.equal(opGas.capacityBcmYr, 24);
  });

  test('oil capacity in bbl/d is converted to capacityMbd (millions of bbl/d, despite the name)', () => {
    const { oil } = parseGemPipelines(fixture);
    const crude = oil.find((p) => p.name.includes('Crude Oil Trunk'));
    assert.ok(crude);
    // Schema convention: the field is named `capacityMbd` (the customary
    // industry abbreviation) but the VALUE is in millions of barrels per
    // day, NOT thousands — matching the existing on-main hand-curated rows
    // (e.g. CPC pipeline ships as `capacityMbd: 1.4` for 1.4M bbl/d).
    // So 400_000 bbl/d ÷ 1_000_000 = 0.4 capacityMbd.
    assert.equal(crude.capacityMbd, 0.4);
  });

  test('oil capacity already in Mbd is preserved unchanged', () => {
    const { oil } = parseGemPipelines(fixture);
    const refined = oil.find((p) => p.name.includes('Refined Products'));
    assert.ok(refined);
    assert.equal(refined.capacityMbd, 0.65);
  });
});
describe('import-gem-pipelines — minimum-viable evidence', () => {
  test('every emitted candidate has physicalStateSource=gem', () => {
    const { gas, oil } = parseGemPipelines(fixture);
    for (const p of [...gas, ...oil]) {
      assert.equal(p.evidence.physicalStateSource, 'gem');
    }
  });

  test('every emitted candidate has classifierVersion=gem-import-v1', () => {
    const { gas, oil } = parseGemPipelines(fixture);
    for (const p of [...gas, ...oil]) {
      assert.equal(p.evidence.classifierVersion, 'gem-import-v1');
    }
  });

  test('every emitted candidate has classifierConfidence ≤ 0.5', () => {
    const { gas, oil } = parseGemPipelines(fixture);
    for (const p of [...gas, ...oil]) {
      assert.ok(p.evidence.classifierConfidence <= 0.5);
      assert.ok(p.evidence.classifierConfidence >= 0);
    }
  });

  test('every emitted candidate has empty sanctionRefs and null operatorStatement', () => {
    const { gas, oil } = parseGemPipelines(fixture);
    for (const p of [...gas, ...oil]) {
      assert.deepEqual(p.evidence.sanctionRefs, []);
      assert.equal(p.evidence.operatorStatement, null);
    }
  });
});
describe('import-gem-pipelines — registry-shape conformance', () => {
  // Compute the repeat count from the floor + the fixture row count so this
  // test stays correct if the fixture is trimmed or the floor is raised. The
  // hardcoded `for (let i = 0; i < 70; i++)` was fragile — Greptile P2 on PR
  // #3406. +5 over the floor leaves a safety margin without inflating the test.
  const REGISTRY_FLOOR = 200;

  test('emitted gas registry passes validateRegistry', () => {
    const { gas } = parseGemPipelines(fixture);
    const reps = Math.ceil(REGISTRY_FLOOR / gas.length) + 5;
    const repeated = [];
    for (let i = 0; i < reps; i++) {
      for (const p of gas) repeated.push({ ...p, id: `${p.id}-rep${i}` });
    }
    const reg = {
      pipelines: Object.fromEntries(repeated.map((p) => [p.id, p])),
    };
    assert.equal(validateRegistry(reg), true);
  });

  test('emitted oil registry passes validateRegistry', () => {
    const { oil } = parseGemPipelines(fixture);
    const reps = Math.ceil(REGISTRY_FLOOR / oil.length) + 5;
    const repeated = [];
    for (let i = 0; i < reps; i++) {
      for (const p of oil) repeated.push({ ...p, id: `${p.id}-rep${i}` });
    }
    const reg = {
      pipelines: Object.fromEntries(repeated.map((p) => [p.id, p])),
    };
    assert.equal(validateRegistry(reg), true);
  });
});
describe('import-gem-pipelines — determinism (review-fix #3)', () => {
  test('two parser runs on identical input produce identical output', () => {
    // Regression: pre-fix, lastEvidenceUpdate used new Date() per run, so
    // re-running parseGemPipelines on the same JSON on different days
    // produced different output → noisy diffs every quarterly re-import.
    // Now derived from envelope.downloadedAt, so output is byte-identical.
    const r1 = JSON.stringify(parseGemPipelines(fixture));
    const r2 = JSON.stringify(parseGemPipelines(fixture));
    assert.equal(r1, r2);
  });

  test('lastEvidenceUpdate derives from envelope.downloadedAt', () => {
    // Fixture has downloadedAt: 2026-04-25 → emitted as 2026-04-25T00:00:00Z.
    const { gas } = parseGemPipelines(fixture);
    for (const p of gas) {
      assert.equal(p.evidence.lastEvidenceUpdate, '2026-04-25T00:00:00Z');
    }
  });

  test('missing downloadedAt → epoch sentinel (loud failure, not silent today)', () => {
    // If the operator forgets the date field, the emitted timestamp should
    // be obviously wrong rather than today's wall clock — surfaces the
    // gap in code review of the data file.
    const noDate = { ...fixture };
    delete noDate.downloadedAt;
    delete noDate.sourceVersion;
    const { gas } = parseGemPipelines(noDate);
    for (const p of gas) {
      assert.equal(p.evidence.lastEvidenceUpdate, '1970-01-01T00:00:00Z');
    }
  });
});
describe('import-gem-pipelines — coordinate validity', () => {
  test('rows with invalid lat/lon are dropped (not silently kept with lat=0)', () => {
    const broken = {
      ...fixture,
      pipelines: [
        ...fixture.pipelines,
        {
          name: 'Test Bad Coords',
          operator: 'X',
          fuel: 'Natural Gas',
          product: '',
          fromCountry: 'XX',
          toCountry: 'YY',
          transitCountries: [],
          capacity: 5,
          capacityUnit: 'bcm/y',
          lengthKm: 100,
          status: 'Operating',
          startYear: 2020,
          startLat: 200, // out of range
          startLon: 0,
          endLat: 0,
          endLon: 0,
        },
      ],
    };
    const { gas } = parseGemPipelines(broken);
    const bad = gas.find((p) => p.name.includes('Bad Coords'));
    assert.equal(bad, undefined, 'row with out-of-range lat must be dropped, not coerced');
  });
});