Files
worldmonitor/tests/import-gem-pipelines.test.mjs
Elie Habib d9a1f6a0f8 feat(energy-atlas): GEM pipeline import infrastructure (parity PR 1, plan U1-U4) (#3397)
* feat(energy-atlas): GEM pipeline import infrastructure (PR 1, plan U1-U4)

Lands the parser, dedup helper, validator extensions, and operator runbook
for the Global Energy Monitor (CC-BY 4.0) pipeline-data refresh — closing
the ~3.6× Energy Atlas pipeline-scale gap once the operator runs the
import.

Per docs/plans/2026-04-25-003-feat-energy-parity-pushup-plan.md PR 1.

U1 — Validator + schema extensions:
- Add `'gem'` to VALID_SOURCES in scripts/_pipeline-registry.mjs and to the
  evidence-bearing-source whitelist in derivePipelinePublicBadge so GEM-
  sourced offline rows derive a `disputed` badge via the external-signal
  rule (parity with `press`/`satellite`/`ais-relay`).
- Export VALID_SOURCES so tests assert against the same source-of-truth
  the validator uses (matches the VALID_OIL_PRODUCT_CLASSES pattern from
  PR #3383).
- Floor bump (MIN_PIPELINES_PER_REGISTRY 8→200) intentionally DEFERRED
  to the follow-up data PR — bumping it now would gate the existing 75+75
  hand-curated rows below the new floor and break seeder publishes
  before the GEM data lands.
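
The U1 wiring can be sketched roughly as follows — a minimal sketch, not the real `_pipeline-registry.mjs`: the badge name other than `disputed` and the exact rule shape are illustrative assumptions; only `'gem'` joining `VALID_SOURCES` and the evidence-bearing whitelist is the point.

```javascript
// Sketch only — the real validator is richer. `VALID_SOURCES` and
// `derivePipelinePublicBadge` are exported from _pipeline-registry.mjs
// in the actual module.
const VALID_SOURCES = ['press', 'satellite', 'ais-relay', 'gem'];

// External-signal rule: an offline row whose physicalStateSource is an
// evidence-bearing source derives the public `disputed` badge.
const EVIDENCE_BEARING = new Set(['press', 'satellite', 'ais-relay', 'gem']);

function derivePipelinePublicBadge(row) {
  const { physicalState, physicalStateSource } = row.evidence ?? {};
  if (physicalState === 'offline' && EVIDENCE_BEARING.has(physicalStateSource)) {
    return 'disputed';
  }
  return 'unverified'; // placeholder for the other badge paths (hypothetical)
}
```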

U2 — GEM parser (test-first):
- scripts/import-gem-pipelines.mjs reads a local JSON file (operator pre-
  converts GEM Excel externally — no `xlsx` dependency added). Schema-
  drift sentinel throws on missing columns. Status mapping covers
  Operating/Construction/Cancelled/Mothballed/Idle/Shut-in. ProductClass
  mapping covers Crude Oil / Refined Products / mixed-flow notes.
  Capacity-unit conversion handles bcm/y, bbl/d, Mbd, kbd.
- 22 tests in tests/import-gem-pipelines.test.mjs cover schema sentinel,
  fuel split, status mapping, productClass mapping, capacity conversion,
  minimum-viable-evidence shape, registry-shape conformance, and bad-
  coordinate rejection.
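
The oil side of the capacity-unit conversion can be sketched like this (`normalizeOilCapacity` is a hypothetical helper name; the unit spellings follow the commit message, and the millions-not-thousands `capacityMbd` convention follows the existing hand-curated rows):

```javascript
// Hypothetical helper — the real parser's internals may differ.
// Oil capacities normalize to capacityMbd, whose VALUE is millions of
// barrels per day (e.g. 1.4 means 1.4M bbl/d); gas stays in bcm/y.
function normalizeOilCapacity(value, unit) {
  switch (unit) {
    case 'Mbd':
      return value;                 // already millions of bbl/d
    case 'kbd':
      return value / 1_000;         // thousand bbl/d → millions
    case 'bbl/d':
      return value / 1_000_000;     // bbl/d → millions
    default:
      throw new Error(`unknown oil capacity unit: ${unit}`);
  }
}
```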

U3 — Deduplication (pure deterministic):
- scripts/_pipeline-dedup.mjs: dedupePipelines(existing, candidates) →
  { toAdd, skippedDuplicates }. Match rule: haversine ≤5km AND name
  Jaccard ≥0.6 (BOTH required). Reverse-direction-pair-aware.
- 19 tests cover internal helpers, match logic, id collision, determinism,
  and empty inputs.
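
The match rule (haversine ≤ 5 km AND name-token Jaccard ≥ 0.6, both required) can be sketched as follows; the helper names and tokenization are assumptions, not the real `_pipeline-dedup.mjs` internals:

```javascript
// Sketch of the two-signal duplicate test. Distance alone or name alone
// is not enough — BOTH thresholds must pass.
const EARTH_RADIUS_KM = 6371;

function haversineKm(lat1, lon1, lat2, lon2) {
  const toRad = (d) => (d * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

function nameJaccard(a, b) {
  const tok = (s) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const [sa, sb] = [tok(a), tok(b)];
  const inter = [...sa].filter((t) => sb.has(t)).length;
  const union = new Set([...sa, ...sb]).size;
  return union === 0 ? 0 : inter / union;
}

function isDuplicate(p, q) {
  return haversineKm(p.startLat, p.startLon, q.startLat, q.startLon) <= 5 &&
    nameJaccard(p.name, q.name) >= 0.6;
}
```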

U4 — Operator runbook (data import deferred):
- docs/methodology/pipelines.mdx: 7-step runbook for the operator to
  download GEM, pre-convert Excel→JSON, dry-run with --print-candidates,
  merge with --merge, bump the registry floor, and commit with
  provenance metadata.
- The actual data import is intentionally OUT OF SCOPE for this agent-
  authored PR because GEM downloads are registration-gated. A follow-up
  PR will commit the imported scripts/data/pipelines-{gas,oil}.json +
  bump MIN_PIPELINES_PER_REGISTRY → 200 + record the GEM release SHA256.

Tests: typecheck clean; 67 tests pass across the three test files.

Codex-approved through 8 review rounds against origin/main @ 050073354.

* fix(energy-atlas): wire --merge to dedupePipelines + within-batch dedup (PR1 review)

P1 — --merge was a TODO no-op (import-gem-pipelines.mjs:291):
- Previously exited with code 2 + a "TODO: wire dedup once U3 lands"
  message. The PR body and the methodology runbook both advertised
  --merge as the operator path.
- Add mergeIntoRegistry(filename, candidates) helper that loads the
  existing envelope, runs dedupePipelines() against the candidate
  list, sorts new entries alphabetically by id (stable diff on rerun),
  validates the merged registry via validateRegistry(), and writes
  to disk only after validation passes. CLI --merge now invokes it
  for both gas and oil + prints a per-fuel summary.
- Source attribution: the registry envelope's `source` field is
  upgraded to mention GEM (CC-BY 4.0) on first merge so the data file
  itself documents provenance.

P2 — dedup transitive-match bug (_pipeline-dedup.mjs:120):
- Pre-fix loop checked each candidate ONLY against the original
  `existing` array. Two GEM rows that match each other but not anything
  in `existing` would BOTH be added, defeating the dedup contract for
  same-batch duplicates (real example: a primary GEM entry plus a
  duplicate row from a regional supplemental sheet).
- Now compares against existing FIRST (existing wins on cross-set
  match — preserves richer hand-curated evidence), then falls back to
  the already-accepted toAdd set. Within-batch matches retain the FIRST
  accepted candidate (deterministic by candidate-list order).
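
The corrected loop order can be sketched as follows (illustrative shape only; `isDuplicate` is injected here for clarity, whereas the real module applies its own haversine + Jaccard rule internally):

```javascript
// Sketch: existing wins on a cross-set match, then the already-accepted
// toAdd set is checked so within-batch duplicates collapse to the FIRST
// accepted candidate (deterministic by candidate-list order).
function dedupePipelines(existing, candidates, isDuplicate) {
  const toAdd = [];
  const skippedDuplicates = [];
  for (const cand of candidates) {
    const match =
      existing.find((e) => isDuplicate(e, cand)) ?? // cross-set: existing wins
      toAdd.find((a) => isDuplicate(a, cand));      // within-batch fallback
    if (match) {
      skippedDuplicates.push({ candidate: cand.id, matched: match.id });
    } else {
      toAdd.push(cand);
    }
  }
  return { toAdd, skippedDuplicates };
}
```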

Tests: 22 in tests/pipeline-dedup.test.mjs (3 new) cover the
within-batch dedup, transitive collapse, and existing-wins-over-
already-accepted scenarios. typecheck clean.

* fix(energy-atlas): cross-file-atomic --merge (PR1 review #2)

P1 — partial-import on disk if oil validation fails after gas writes
(import-gem-pipelines.mjs:329 / :350):
- Previous flow ran `mergeIntoRegistry('pipelines-gas.json', gas)` which
  wrote to disk, then `mergeIntoRegistry('pipelines-oil.json', oil)`. If
  oil validation failed, the operator was left with a half-imported
  state: gas had GEM rows committed to disk but oil didn't.
- Refactor into a two-phase API:
  1. prepareMerge(filename, candidates) — pure, no disk I/O. Builds the
     merged envelope, validates it, throws on validation failure.
  2. mergeBothRegistries(gasCandidates, oilCandidates) — calls
     prepareMerge for BOTH fuels first; only writes to disk after BOTH
     pass validation. If oil's prepareMerge throws, gas was never
     touched on disk.
- CLI --merge now invokes mergeBothRegistries. The atomicity guarantee
  is documented inline in the helper.
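
The two-phase shape can be sketched like this — hypothetical signatures: the real helpers load the registry files themselves and call the module's validateRegistry, while here both are injected so the phase boundary stays visible:

```javascript
// Phase 1 — pure: build + validate the merged envelope, no disk I/O.
// Throws on validation failure, leaving nothing written.
function prepareMerge(filename, envelope, newEntries, validateRegistry) {
  const merged = { ...envelope, pipelines: { ...envelope.pipelines } };
  for (const p of newEntries) merged.pipelines[p.id] = p;
  if (validateRegistry(merged) !== true) {
    throw new Error(`${filename}: merged registry failed validation`);
  }
  return { filename, merged };
}

// Phase 2 — write only after EVERY plan validated. If oil's prepareMerge
// threw, control never reaches here and gas is untouched on disk.
function mergeBothRegistries(plans, writeFile) {
  for (const { filename, merged } of plans) {
    writeFile(filename, JSON.stringify(merged, null, 2) + '\n');
  }
}
```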

typecheck clean. No new tests because the existing dedup + validate
suites cover the underlying logic; the change is purely about call
ordering for atomicity.

* fix(energy-atlas): deterministic lastEvidenceUpdate + clarify test comment (PR1 review #3)

P2 — lastEvidenceUpdate was non-deterministic (Greptile P2):
- Previous code used new Date().toISOString() per parser run, so two runs
  of parseGemPipelines on the same input on different days produced
  byte-different output. Quarterly re-imports would produce noisy
  full-row diffs even when the upstream GEM data hadn't changed.
- New: resolveEvidenceTimestamp(envelope) derives the timestamp from
  envelope.downloadedAt (the operator-recorded date) or sourceVersion
  if it parses as ISO. Falls back to 1970-01-01 sentinel when neither
  is set — deliberately ugly so reviewers spot the missing field in
  the data-file diff rather than silently getting today's date.
- Computed once per parse run so every emitted candidate gets the
  same timestamp.
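
The derivation rule can be sketched as follows (the function name is from the commit; the exact date parsing and normalization are assumptions):

```javascript
// Sketch of the deterministic timestamp rule: operator-recorded date
// first, sourceVersion if it parses as a date, else a loud epoch sentinel.
const EPOCH_SENTINEL = '1970-01-01T00:00:00Z';

function resolveEvidenceTimestamp(envelope) {
  for (const field of [envelope.downloadedAt, envelope.sourceVersion]) {
    if (typeof field === 'string' && !Number.isNaN(Date.parse(field))) {
      // Normalize a bare date like "2026-04-25" to a full UTC instant.
      return new Date(field).toISOString().replace(/\.\d{3}Z$/, 'Z');
    }
  }
  // Deliberately ugly: a 1970 date in the data-file diff is loud, whereas
  // silently stamping today's date would hide the missing field.
  return EPOCH_SENTINEL;
}
```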

P2 — misleading test comment (Greptile P2):
- Comment in tests/import-gem-pipelines.test.mjs:136 said "400_000 bbl/d
  ÷ 1000 = 400 Mbd" while the assertion correctly expects 0.4 (because
  the convention is millions, not thousands). Rewrote the comment to
  state the actual rule + arithmetic clearly.

3 new tests for determinism: (a) two parser runs produce identical
output, (b) timestamp derives from downloadedAt, (c) missing date
yields the epoch sentinel (loud failure mode).
2026-04-25 17:55:45 +04:00


// @ts-check
//
// Tests for scripts/import-gem-pipelines.mjs — the GEM Oil & Gas Infrastructure
// Tracker → registry-shape parser. Test-first per the plan's Execution note: the
// schema-sentinel + status/productClass/capacity-unit mapping is the highest-
// risk failure mode, so coverage for it lands before the implementation does.
//
// Fixture: tests/fixtures/gem-pipelines-sample.json — operator-shape JSON
// (Excel pre-converted externally; the parser is local-file-only, no xlsx
// dep, no runtime URL fetch).
import { strict as assert } from 'node:assert';
import { test, describe } from 'node:test';
import { readFileSync } from 'node:fs';
import { resolve, dirname } from 'node:path';
import { fileURLToPath } from 'node:url';
import { parseGemPipelines, REQUIRED_COLUMNS } from '../scripts/import-gem-pipelines.mjs';
import { validateRegistry } from '../scripts/_pipeline-registry.mjs';
const __dirname = dirname(fileURLToPath(import.meta.url));
const fixturePath = resolve(__dirname, 'fixtures/gem-pipelines-sample.json');
const fixture = JSON.parse(readFileSync(fixturePath, 'utf-8'));

describe('import-gem-pipelines — schema sentinel', () => {
  test('REQUIRED_COLUMNS is exported and non-empty', () => {
    assert.ok(Array.isArray(REQUIRED_COLUMNS));
    assert.ok(REQUIRED_COLUMNS.length >= 5);
  });

  test('throws on missing required column', () => {
    const broken = {
      ...fixture,
      pipelines: fixture.pipelines.map((p) => {
        const { name: _drop, ...rest } = p;
        return rest;
      }),
    };
    assert.throws(
      () => parseGemPipelines(broken),
      /missing|name|schema/i,
      'parser must throw on column drift, not silently accept',
    );
  });

  test('throws on non-object input', () => {
    assert.throws(() => parseGemPipelines(null), /input/i);
    assert.throws(() => parseGemPipelines([]), /input|pipelines/i);
  });

  test('throws when pipelines field is missing', () => {
    assert.throws(() => parseGemPipelines({ source: 'test' }), /pipelines/i);
  });
});

describe('import-gem-pipelines — fuel split', () => {
  test('splits gas + oil into two arrays', () => {
    const { gas, oil } = parseGemPipelines(fixture);
    assert.equal(gas.length, 3, 'fixture has 3 gas rows');
    assert.equal(oil.length, 3, 'fixture has 3 oil rows');
  });

  test('gas pipelines do NOT carry productClass (gas registry forbids it)', () => {
    const { gas } = parseGemPipelines(fixture);
    for (const p of gas) {
      assert.equal(p.productClass, undefined, `${p.name}: gas should not have productClass`);
    }
  });

  test('every oil pipeline declares a productClass from the enum', () => {
    const { oil } = parseGemPipelines(fixture);
    for (const p of oil) {
      assert.ok(
        ['crude', 'products', 'mixed'].includes(p.productClass),
        `${p.name} has invalid productClass: ${p.productClass}`,
      );
    }
  });
});

describe('import-gem-pipelines — status mapping', () => {
  test("'Operating' maps to physicalState='flowing'", () => {
    const { gas, oil } = parseGemPipelines(fixture);
    const op = [...gas, ...oil].filter((p) => p.name.includes('Operating'));
    assert.ok(op.length > 0);
    for (const p of op) {
      assert.equal(p.evidence.physicalState, 'flowing');
    }
  });

  test("'Construction' maps to physicalState='unknown' (planned/not commissioned)", () => {
    const { gas } = parseGemPipelines(fixture);
    const ctr = gas.find((p) => p.name.includes('Construction'));
    assert.ok(ctr);
    assert.equal(ctr.evidence.physicalState, 'unknown');
  });

  test("'Cancelled' / 'Mothballed' map to physicalState='offline'", () => {
    const { gas, oil } = parseGemPipelines(fixture);
    const cancelled = gas.find((p) => p.name.includes('Cancelled'));
    const mothballed = oil.find((p) => p.name.includes('Mothballed'));
    assert.ok(cancelled);
    assert.ok(mothballed);
    assert.equal(cancelled.evidence.physicalState, 'offline');
    assert.equal(mothballed.evidence.physicalState, 'offline');
  });
});

describe('import-gem-pipelines — productClass mapping', () => {
  test("'Crude Oil' product → productClass='crude'", () => {
    const { oil } = parseGemPipelines(fixture);
    const crude = oil.find((p) => p.name.includes('Crude Oil Trunk'));
    assert.ok(crude);
    assert.equal(crude.productClass, 'crude');
  });

  test("'Refined Products' product → productClass='products'", () => {
    const { oil } = parseGemPipelines(fixture);
    const refined = oil.find((p) => p.name.includes('Refined Products'));
    assert.ok(refined);
    assert.equal(refined.productClass, 'products');
  });
});

describe('import-gem-pipelines — capacity-unit conversion', () => {
  test('gas capacity in bcm/y is preserved unchanged', () => {
    const { gas } = parseGemPipelines(fixture);
    const opGas = gas.find((p) => p.name.includes('Operating'));
    assert.ok(opGas);
    assert.equal(opGas.capacityBcmYr, 24);
  });

  test('oil capacity in bbl/d is converted to Mbd (million barrels per day)', () => {
    const { oil } = parseGemPipelines(fixture);
    const crude = oil.find((p) => p.name.includes('Crude Oil Trunk'));
    assert.ok(crude);
    // Schema convention: the field is named `capacityMbd` (the customary
    // industry abbreviation) but the VALUE is in millions of barrels per
    // day, NOT thousands — matching the existing on-main hand-curated rows
    // (e.g. CPC pipeline ships as `capacityMbd: 1.4` for 1.4M bbl/d).
    // So 400_000 bbl/d ÷ 1_000_000 = 0.4 capacityMbd.
    assert.equal(crude.capacityMbd, 0.4);
  });

  test('oil capacity already in Mbd is preserved unchanged', () => {
    const { oil } = parseGemPipelines(fixture);
    const refined = oil.find((p) => p.name.includes('Refined Products'));
    assert.ok(refined);
    assert.equal(refined.capacityMbd, 0.65);
  });
});

describe('import-gem-pipelines — minimum-viable evidence', () => {
  test('every emitted candidate has physicalStateSource=gem', () => {
    const { gas, oil } = parseGemPipelines(fixture);
    for (const p of [...gas, ...oil]) {
      assert.equal(p.evidence.physicalStateSource, 'gem');
    }
  });

  test('every emitted candidate has classifierVersion=gem-import-v1', () => {
    const { gas, oil } = parseGemPipelines(fixture);
    for (const p of [...gas, ...oil]) {
      assert.equal(p.evidence.classifierVersion, 'gem-import-v1');
    }
  });

  test('every emitted candidate has classifierConfidence ≤ 0.5', () => {
    const { gas, oil } = parseGemPipelines(fixture);
    for (const p of [...gas, ...oil]) {
      assert.ok(p.evidence.classifierConfidence <= 0.5);
      assert.ok(p.evidence.classifierConfidence >= 0);
    }
  });

  test('every emitted candidate has empty sanctionRefs and null operatorStatement', () => {
    const { gas, oil } = parseGemPipelines(fixture);
    for (const p of [...gas, ...oil]) {
      assert.deepEqual(p.evidence.sanctionRefs, []);
      assert.equal(p.evidence.operatorStatement, null);
    }
  });
});

describe('import-gem-pipelines — registry-shape conformance', () => {
  test('emitted gas registry passes validateRegistry', () => {
    // Build a synthetic registry of just the GEM-emitted gas rows; meets the
    // validator's MIN_PIPELINES_PER_REGISTRY=8 floor by repeating the 3 fixture
    // rows so we exercise the schema, not the count.
    const { gas } = parseGemPipelines(fixture);
    const repeated = [];
    for (let i = 0; i < 3; i++) {
      for (const p of gas) repeated.push({ ...p, id: `${p.id}-rep${i}` });
    }
    const reg = {
      pipelines: Object.fromEntries(repeated.map((p) => [p.id, p])),
    };
    assert.equal(validateRegistry(reg), true);
  });

  test('emitted oil registry passes validateRegistry', () => {
    const { oil } = parseGemPipelines(fixture);
    const repeated = [];
    for (let i = 0; i < 3; i++) {
      for (const p of oil) repeated.push({ ...p, id: `${p.id}-rep${i}` });
    }
    const reg = {
      pipelines: Object.fromEntries(repeated.map((p) => [p.id, p])),
    };
    assert.equal(validateRegistry(reg), true);
  });
});

describe('import-gem-pipelines — determinism (review-fix #3)', () => {
  test('two parser runs on identical input produce identical output', () => {
    // Regression: pre-fix, lastEvidenceUpdate used new Date() per run, so
    // re-running parseGemPipelines on the same JSON on different days
    // produced different output → noisy diffs every quarterly re-import.
    // Now derived from envelope.downloadedAt, so output is byte-identical.
    const r1 = JSON.stringify(parseGemPipelines(fixture));
    const r2 = JSON.stringify(parseGemPipelines(fixture));
    assert.equal(r1, r2);
  });

  test('lastEvidenceUpdate derives from envelope.downloadedAt', () => {
    // Fixture has downloadedAt: 2026-04-25 → emitted as 2026-04-25T00:00:00Z.
    const { gas } = parseGemPipelines(fixture);
    for (const p of gas) {
      assert.equal(p.evidence.lastEvidenceUpdate, '2026-04-25T00:00:00Z');
    }
  });

  test('missing downloadedAt → epoch sentinel (loud failure, not silent today)', () => {
    // If the operator forgets the date field, the emitted timestamp should
    // be obviously wrong rather than today's wall clock — surfaces the
    // gap in code review of the data file.
    const noDate = { ...fixture };
    delete noDate.downloadedAt;
    delete noDate.sourceVersion;
    const { gas } = parseGemPipelines(noDate);
    for (const p of gas) {
      assert.equal(p.evidence.lastEvidenceUpdate, '1970-01-01T00:00:00Z');
    }
  });
});

describe('import-gem-pipelines — coordinate validity', () => {
  test('rows with invalid lat/lon are dropped (not silently kept with lat=0)', () => {
    const broken = {
      ...fixture,
      pipelines: [
        ...fixture.pipelines,
        {
          name: 'Test Bad Coords',
          operator: 'X',
          fuel: 'Natural Gas',
          product: '',
          fromCountry: 'XX',
          toCountry: 'YY',
          transitCountries: [],
          capacity: 5,
          capacityUnit: 'bcm/y',
          lengthKm: 100,
          status: 'Operating',
          startYear: 2020,
          startLat: 200, // out of range
          startLon: 0,
          endLat: 0,
          endLon: 0,
        },
      ],
    };
    const { gas } = parseGemPipelines(broken);
    const bad = gas.find((p) => p.name.includes('Bad Coords'));
    assert.equal(bad, undefined, 'row with out-of-range lat must be dropped, not coerced');
  });
});