Files
worldmonitor/tests/pipeline-dedup.test.mjs
Elie Habib d9a1f6a0f8 feat(energy-atlas): GEM pipeline import infrastructure (parity PR 1, plan U1-U4) (#3397)
* feat(energy-atlas): GEM pipeline import infrastructure (PR 1, plan U1-U4)

Lands the parser, dedup helper, validator extensions, and operator runbook
for the Global Energy Monitor (CC-BY 4.0) pipeline-data refresh, closing
the ~3.6× Energy Atlas pipeline-scale gap once the operator runs the
import.

Per docs/plans/2026-04-25-003-feat-energy-parity-pushup-plan.md PR 1.

U1 — Validator + schema extensions:
- Add `'gem'` to VALID_SOURCES in scripts/_pipeline-registry.mjs and to the
  evidence-bearing-source whitelist in derivePipelinePublicBadge so GEM-
  sourced offline rows derive a `disputed` badge via the external-signal
  rule (parity with `press`/`satellite`/`ais-relay`).
- Export VALID_SOURCES so tests assert against the same source-of-truth
  the validator uses (matches the VALID_OIL_PRODUCT_CLASSES pattern from
  PR #3383).
- Floor bump (MIN_PIPELINES_PER_REGISTRY 8→200) intentionally DEFERRED
  to the follow-up data PR — bumping it now would gate the existing 75+75
  hand-curated rows below the new floor and break seeder publishes
  before the GEM data lands.
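The external-signal badge rule reads roughly like the sketch below. The real `derivePipelinePublicBadge` lives in `scripts/_pipeline-registry.mjs`; the row shape and the simplified function here are illustrative assumptions, not its actual signature.

```javascript
// Hedged sketch of the evidence-bearing-source whitelist rule; the real
// implementation is in scripts/_pipeline-registry.mjs and may differ in shape.
const EVIDENCE_BEARING_SOURCES = new Set(['press', 'satellite', 'ais-relay', 'gem']);

function deriveBadgeSketch(row) {
  // Offline rows whose evidence comes from an external signal surface as
  // `disputed` rather than being presented as settled fact.
  if (row.status === 'offline' && EVIDENCE_BEARING_SOURCES.has(row.source)) {
    return 'disputed';
  }
  return 'default';
}
```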

U2 — GEM parser (test-first):
- scripts/import-gem-pipelines.mjs reads a local JSON file (operator pre-
  converts GEM Excel externally — no `xlsx` dependency added). Schema-
  drift sentinel throws on missing columns. Status mapping covers
  Operating/Construction/Cancelled/Mothballed/Idle/Shut-in. ProductClass
  mapping covers Crude Oil / Refined Products / mixed-flow notes.
  Capacity-unit conversion handles bcm/y, bbl/d, Mbd, kbd.
- 22 tests in tests/import-gem-pipelines.test.mjs cover schema sentinel,
  fuel split, status mapping, productClass mapping, capacity conversion,
  minimum-viable-evidence shape, registry-shape conformance, and bad-
  coordinate rejection.
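The capacity-unit normalization can be sketched as follows. The unit list and the millions-of-barrels-per-day (Mbd) convention come from this PR; the function name and return shape are hypothetical stand-ins, not the parser's actual API.

```javascript
// Hedged sketch: oil capacities normalize to Mbd (millions of barrels per
// day), gas stays in bcm/y. The real conversion lives in
// scripts/import-gem-pipelines.mjs.
function normalizeCapacitySketch(value, unit) {
  switch (unit) {
    case 'bcm/y': return { value, unit: 'bcm/y' };                  // gas: already canonical
    case 'Mbd':   return { value, unit: 'Mbd' };                    // oil: already canonical
    case 'kbd':   return { value: value / 1000, unit: 'Mbd' };      // thousand bbl/d
    case 'bbl/d': return { value: value / 1_000_000, unit: 'Mbd' }; // raw bbl/d
    default: throw new Error(`unknown capacity unit: ${unit}`);     // schema-drift sentinel
  }
}
```

For example, 400_000 bbl/d normalizes to 0.4 Mbd, matching the convention clarified in the review-round comment fix below.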

U3 — Deduplication (pure deterministic):
- scripts/_pipeline-dedup.mjs: dedupePipelines(existing, candidates) →
  { toAdd, skippedDuplicates }. Match rule: haversine ≤5km AND name
  Jaccard ≥0.6 (BOTH required). Reverse-direction-pair-aware.
- 19 tests cover internal helpers, match logic, id collision, determinism,
  and empty inputs.
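The match rule can be sketched with simplified stand-ins for the real helpers in `scripts/_pipeline-dedup.mjs` (note: the real tokenizer also drops stopwords, which this sketch omits):

```javascript
// Great-circle distance between two coordinates, in km.
function haversineKm(aLat, aLon, bLat, bLon) {
  const toRad = (d) => (d * Math.PI) / 180;
  const dLat = toRad(bLat - aLat);
  const dLon = toRad(bLon - aLon);
  const h = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(aLat)) * Math.cos(toRad(bLat)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(h));
}

// Reverse-direction-pair-aware: try forward and flipped pairings, take the min.
function endpointDistanceKm(a, b) {
  const forward = (haversineKm(a.startPoint.lat, a.startPoint.lon, b.startPoint.lat, b.startPoint.lon) +
                   haversineKm(a.endPoint.lat, a.endPoint.lon, b.endPoint.lat, b.endPoint.lon)) / 2;
  const flipped = (haversineKm(a.startPoint.lat, a.startPoint.lon, b.endPoint.lat, b.endPoint.lon) +
                   haversineKm(a.endPoint.lat, a.endPoint.lon, b.startPoint.lat, b.startPoint.lon)) / 2;
  return Math.min(forward, flipped);
}

// Token-set Jaccard similarity (simplified: no stopword removal here).
function nameJaccard(a, b) {
  const tok = (s) => new Set(s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  const ta = tok(a), tb = tok(b);
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}

// Match rule: BOTH criteria must hold.
function isDuplicate(a, b) {
  return endpointDistanceKm(a, b) <= 5 && nameJaccard(a.name, b.name) >= 0.6;
}
```

Requiring both criteria means a name collision across continents, or a shared corridor with a different name, is not treated as a duplicate.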

U4 — Operator runbook (data import deferred):
- docs/methodology/pipelines.mdx: 7-step runbook for the operator to
  download GEM, pre-convert Excel→JSON, dry-run with --print-candidates,
  merge with --merge, bump the registry floor, and commit with
  provenance metadata.
- The actual data import is intentionally OUT OF SCOPE for this agent-
  authored PR because GEM downloads are registration-gated. A follow-up
  PR will commit the imported scripts/data/pipelines-{gas,oil}.json +
  bump MIN_PIPELINES_PER_REGISTRY → 200 + record the GEM release SHA256.

Tests: typecheck clean; 67 tests pass across the three test files.

Codex-approved through 8 review rounds against origin/main @ 050073354.

* fix(energy-atlas): wire --merge to dedupePipelines + within-batch dedup (PR1 review)

P1 — --merge was a TODO no-op (import-gem-pipelines.mjs:291):
- Previously exited with code 2 + a "TODO: wire dedup once U3 lands"
  message. The PR body and the methodology runbook both advertised
  --merge as the operator path.
- Add mergeIntoRegistry(filename, candidates) helper that loads the
  existing envelope, runs dedupePipelines() against the candidate
  list, sorts new entries alphabetically by id (stable diff on rerun),
  validates the merged registry via validateRegistry(), and writes
  to disk only after validation passes. CLI --merge now invokes it
  for both gas and oil + prints a per-fuel summary.
- Source attribution: the registry envelope's `source` field is
  upgraded to mention GEM (CC-BY 4.0) on first merge so the data file
  itself documents provenance.

P2 — dedup transitive-match bug (_pipeline-dedup.mjs:120):
- Pre-fix loop checked each candidate ONLY against the original
  `existing` array. Two GEM rows that match each other but not anything
  in `existing` would BOTH be added, defeating the dedup contract for
  same-batch duplicates (real example: a primary GEM entry plus a
  duplicate row from a regional supplemental sheet).
- Now compares against existing FIRST (existing wins on cross-set
  match — preserves richer hand-curated evidence), then falls back to
  the already-accepted toAdd set. Within-batch matches retain the FIRST
  accepted candidate (deterministic by candidate-list order).
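The fixed comparison order amounts to the sketch below. `dedupeSketch` and the injected `isDuplicate` predicate are hypothetical; the real implementation lives in `scripts/_pipeline-dedup.mjs`.

```javascript
// Hedged sketch of the post-fix loop: existing is checked FIRST (existing
// wins on a cross-set match), then the already-accepted toAdd set.
function dedupeSketch(existing, candidates, isDuplicate) {
  const toAdd = [];
  const skippedDuplicates = [];
  for (const cand of candidates) {
    // Cross-set match: existing wins, preserving hand-curated evidence.
    const hitExisting = existing.find((e) => isDuplicate(e, cand));
    if (hitExisting) {
      skippedDuplicates.push({ candidate: cand, matchedExistingId: hitExisting.id });
      continue;
    }
    // Pre-fix, the loop stopped at the existing check; this fallback is the
    // review fix that catches two same-batch rows matching each other.
    const hitAccepted = toAdd.find((t) => isDuplicate(t, cand));
    if (hitAccepted) {
      skippedDuplicates.push({ candidate: cand, matchedExistingId: hitAccepted.id });
      continue;
    }
    toAdd.push(cand); // first-accepted candidate wins, by candidate-list order
  }
  return { toAdd, skippedDuplicates };
}
```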

Tests: 22 in tests/pipeline-dedup.test.mjs (3 new) cover the
within-batch dedup, transitive collapse, and existing-wins-over-
already-accepted scenarios. typecheck clean.

* fix(energy-atlas): cross-file-atomic --merge (PR1 review #2)

P1 — partial-import on disk if oil validation fails after gas writes
(import-gem-pipelines.mjs:329 / :350):
- Previous flow ran `mergeIntoRegistry('pipelines-gas.json', gas)` which
  wrote to disk, then `mergeIntoRegistry('pipelines-oil.json', oil)`. If
  oil validation failed, the operator was left with a half-imported
  state: gas had GEM rows committed to disk but oil didn't.
- Refactor into a two-phase API:
  1. prepareMerge(filename, candidates) — pure, no disk I/O. Builds the
     merged envelope, validates it, throws on validation failure.
  2. mergeBothRegistries(gasCandidates, oilCandidates) — calls
     prepareMerge for BOTH fuels first; only writes to disk after BOTH
     pass validation. If oil's prepareMerge throws, gas was never
     touched on disk.
- CLI --merge now invokes mergeBothRegistries. The atomicity guarantee
  is documented inline in the helper.
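The two-phase call ordering can be sketched like this. `prepareMerge` and `writeFile` are injected so the sketch stays self-contained and testable; the real helper in `import-gem-pipelines.mjs` does its own disk I/O.

```javascript
// Hedged sketch of the cross-file-atomic merge: validate BOTH fuels in
// memory first, touch disk only after both pass.
function mergeBothSketch(prepareMerge, writeFile, gasCandidates, oilCandidates) {
  // Phase 1: pure, no disk I/O. A validation throw here (for either fuel)
  // leaves both files untouched — no half-imported state.
  const gas = prepareMerge('pipelines-gas.json', gasCandidates);
  const oil = prepareMerge('pipelines-oil.json', oilCandidates);
  // Phase 2: both envelopes validated, so writing cannot partially import.
  writeFile('pipelines-gas.json', gas);
  writeFile('pipelines-oil.json', oil);
}
```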

typecheck clean. No new tests because the existing dedup + validate
suites cover the underlying logic; the change is purely about call
ordering for atomicity.

* fix(energy-atlas): deterministic lastEvidenceUpdate + clarify test comment (PR1 review #3)

P2 — lastEvidenceUpdate was non-deterministic (Greptile P2):
- Previous code used new Date().toISOString() per parser run, so two runs
  of parseGemPipelines on the same input on different days produced
  byte-different output. Quarterly re-imports would produce noisy
  full-row diffs even when the upstream GEM data hadn't changed.
- New: resolveEvidenceTimestamp(envelope) derives the timestamp from
  envelope.downloadedAt (the operator-recorded date) or sourceVersion
  if it parses as ISO. Falls back to 1970-01-01 sentinel when neither
  is set — deliberately ugly so reviewers spot the missing field in
  the data file diff rather than getting silent today's date.
- Computed once per parse run so every emitted candidate gets the
  same timestamp.
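The derivation order can be sketched as below. The field names `downloadedAt` and `sourceVersion` come from the PR text; the parsing details are assumptions (`Date.parse` accepts more than strict ISO, so the real helper may be stricter).

```javascript
// Hedged sketch of the deterministic timestamp derivation.
const EPOCH_SENTINEL = '1970-01-01T00:00:00.000Z';

function resolveEvidenceTimestampSketch(envelope) {
  // Prefer the operator-recorded download date, then the source version.
  for (const raw of [envelope.downloadedAt, envelope.sourceVersion]) {
    const parsed = Date.parse(raw ?? '');
    if (!Number.isNaN(parsed)) return new Date(parsed).toISOString();
  }
  // Deliberately ugly: the epoch sentinel makes a missing date loud in the
  // data-file diff instead of silently stamping today's date.
  return EPOCH_SENTINEL;
}
```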

P2 — misleading test comment (Greptile P2):
- Comment in tests/import-gem-pipelines.test.mjs:136 said "400_000 bbl/d
  ÷ 1000 = 400 Mbd" while the assertion correctly expects 0.4 (because
  the convention is millions, not thousands). Rewrote the comment to
  state the actual rule + arithmetic clearly.

3 new tests for determinism: (a) two parser runs produce identical
output, (b) timestamp derives from downloadedAt, (c) missing date
yields the epoch sentinel (loud failure mode).
2026-04-25 17:55:45 +04:00

// @ts-check
//
// Tests for scripts/_pipeline-dedup.mjs — the haversine + Jaccard dedup
// helper. Both criteria (≤5km AND ≥0.6) must hold for a match. Existing rows
// always win to preserve hand-curated evidence.
import { strict as assert } from 'node:assert';
import { test, describe } from 'node:test';
import { dedupePipelines, _internal } from '../scripts/_pipeline-dedup.mjs';
const { jaccard, averageEndpointDistanceKm, tokenize, uniqueId } = _internal;
function makePipeline(id, name, startLat, startLon, endLat, endLon) {
  return {
    id,
    name,
    startPoint: { lat: startLat, lon: startLon },
    endPoint: { lat: endLat, lon: endLon },
  };
}
describe('pipeline-dedup — internal helpers', () => {
  test('tokenize lowercases, splits, drops stopwords', () => {
    const tokens = tokenize('Trans-Siberian Pipeline System');
    assert.deepEqual(tokens.sort(), ['siberian', 'trans']);
  });

  test('tokenize removes punctuation and accents', () => {
    const tokens = tokenize('Caño Limón-Coveñas Pipeline');
    // After NFKD normalization + ascii-only filter, accented chars survive
    // as their base letter; we accept either exact or close behaviour.
    assert.ok(tokens.includes('limon') || tokens.includes('lim'),
      `expected Limón to tokenize; got ${tokens.join(',')}`);
  });

  test('jaccard returns 1.0 for identical token sets', () => {
    assert.equal(jaccard('Test Pipeline System', 'Test Pipeline'), 1.0);
  });

  test('jaccard returns 0 for fully disjoint names', () => {
    assert.equal(jaccard('Druzhba North', 'Nord Stream'), 0);
  });

  test('jaccard returns 1/3 for one shared token out of three', () => {
    assert.equal(jaccard('Trans Adriatic', 'Trans Caspian'), 1 / 3);
  });

  test('haversine distance is sub-km for tiny coordinate offsets', () => {
    const a = makePipeline('a', 'A', 60, 30, 54, 13);
    const b = makePipeline('b', 'B', 60.001, 30.001, 54.001, 13.001);
    assert.ok(averageEndpointDistanceKm(a, b) < 1, 'sub-km on tiny offsets');
  });

  test('haversine distance for far-apart pipelines is large', () => {
    const a = makePipeline('a', 'A', 60, 30, 54, 13); // RU→DE
    const b = makePipeline('b', 'B', 30, -90, 25, -85); // Gulf of Mexico
    assert.ok(averageEndpointDistanceKm(a, b) > 5000);
  });

  test('uniqueId preserves base when free, suffixes when taken', () => {
    const taken = new Set(['foo', 'foo-2']);
    assert.equal(uniqueId('bar', taken), 'bar');
    assert.equal(uniqueId('foo', taken), 'foo-3');
  });
});
describe('pipeline-dedup — match logic', () => {
  test('happy path: completely-different name + far endpoints → added', () => {
    const existing = [makePipeline('druzhba-north', 'Druzhba Pipeline (Northern Branch)',
      52.6, 49.4, 52.32, 14.06)];
    const candidates = [makePipeline('nord-stream-1', 'Nord Stream 1',
      60.08, 29.05, 54.14, 13.66)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1);
    assert.equal(skippedDuplicates.length, 0);
  });

  test('match by both criteria: close endpoints + similar name → skipped (existing wins)', () => {
    const existing = [makePipeline('druzhba-north', 'Druzhba Pipeline',
      52.6, 49.4, 52.32, 14.06)];
    const candidates = [makePipeline('druzhba-import', 'Druzhba Pipeline',
      52.601, 49.401, 52.321, 14.061)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
    assert.equal(skippedDuplicates[0].matchedExistingId, 'druzhba-north');
  });

  test('name-match only (endpoints in different ocean) → added', () => {
    const existing = [makePipeline('nord-stream-1', 'Nord Stream 1',
      60.08, 29.05, 54.14, 13.66)];
    const candidates = [makePipeline('imposter', 'Nord Stream 1',
      40.0, -100.0, 35.0, -90.0)]; // different continent
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1, 'far endpoints override a perfect name match');
    assert.equal(skippedDuplicates.length, 0);
  });

  test('endpoint-match only (different name) → added (real distinct pipelines can share endpoints)', () => {
    const existing = [makePipeline('yamal-europe', 'Yamal-Europe',
      67.0, 75.0, 52.0, 14.0)];
    const candidates = [makePipeline('different-route', 'Trans-Siberian Coal Slurry',
      67.001, 75.001, 52.001, 14.001)];
    const { toAdd } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1, 'name disambiguates: same endpoints, different infrastructure');
  });

  test('reverse-direction match: candidate endpoints flipped → still detected', () => {
    const existing = [makePipeline('druzhba', 'Druzhba',
      52.6, 49.4, 52.32, 14.06)];
    // Same pipeline, route described in reverse direction
    const candidates = [makePipeline('druzhba-flipped', 'Druzhba',
      52.32, 14.06, 52.6, 49.4)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
  });

  test('stopword-only difference: "Pipeline System" vs "Line" → matches by Jaccard', () => {
    const existing = [makePipeline('trans-sib', 'Trans-Siberian Pipeline System',
      55, 30, 60, 90)];
    const candidates = [makePipeline('trans-sib-cand', 'Trans-Siberian Line',
      55.001, 30.001, 60.001, 90.001)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 1);
    assert.ok(skippedDuplicates[0].jaccard >= 0.6);
  });
});
describe('pipeline-dedup — id collision', () => {
  test('candidate whose id collides with an existing id gets suffixed -2', () => {
    const existing = [makePipeline('foo', 'Foo Pipeline', 0, 0, 1, 1)];
    const candidates = [makePipeline('foo', 'Bar Pipeline', 50, 50, 60, 60)];
    const { toAdd } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1);
    assert.equal(toAdd[0].id, 'foo-2');
  });

  test('three candidates colliding with the same existing id get -2, -3, -4', () => {
    const existing = [makePipeline('foo', 'Foo Pipeline', 0, 0, 1, 1)];
    const candidates = [
      makePipeline('foo', 'Bar Pipeline', 50, 50, 60, 60),
      makePipeline('foo', 'Baz Pipeline', 70, 70, 80, 80),
      makePipeline('foo', 'Qux Pipeline', 30, -30, 40, -40),
    ];
    const { toAdd } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 3);
    assert.deepEqual(
      toAdd.map((p) => p.id).sort(),
      ['foo-2', 'foo-3', 'foo-4'],
    );
  });
});
describe('pipeline-dedup — determinism', () => {
  test('two invocations on identical inputs produce identical output', () => {
    const existing = [
      makePipeline('a', 'Alpha Pipeline', 10, 10, 20, 20),
      makePipeline('b', 'Beta Pipeline', 30, 30, 40, 40),
    ];
    const candidates = [
      makePipeline('a', 'Alpha Pipeline', 10.001, 10.001, 20.001, 20.001),
      makePipeline('c', 'Gamma Pipeline', 50, 50, 60, 60),
    ];
    const r1 = dedupePipelines(existing, candidates);
    const r2 = dedupePipelines(existing, candidates);
    assert.deepEqual(
      r1.toAdd.map((p) => p.id),
      r2.toAdd.map((p) => p.id),
    );
    assert.deepEqual(
      r1.skippedDuplicates.map((d) => d.matchedExistingId),
      r2.skippedDuplicates.map((d) => d.matchedExistingId),
    );
  });
});
describe('pipeline-dedup — within-batch dedup (review fix)', () => {
  test('two candidates that match each other but not any existing → only first is added', () => {
    // Regression: pre-fix, dedup compared each candidate ONLY against the
    // original `existing` array, so two GEM rows for the same pipeline (e.g.
    // a primary entry and a duplicate from a different source spreadsheet)
    // would BOTH end up in the registry.
    const candidates = [
      makePipeline('east-west-saudi', 'East-West Crude Pipeline', 25, 49, 24, 38),
      // Same pipeline, slightly different name + endpoints (within match
      // tolerance). Should be skipped as a duplicate of the first candidate.
      makePipeline('saudi-petroline', 'East-West Crude', 25.001, 49.001, 24.001, 38.001),
    ];
    const { toAdd, skippedDuplicates } = dedupePipelines([], candidates);
    assert.equal(toAdd.length, 1, 'second matching candidate must be skipped');
    assert.equal(skippedDuplicates.length, 1);
    assert.equal(toAdd[0].id, 'east-west-saudi', 'first-accepted candidate wins (deterministic)');
    assert.equal(skippedDuplicates[0].matchedExistingId, 'east-west-saudi',
      'skipped candidate matches the earlier-accepted one, not anything in `existing`');
  });

  test('three candidates with transitive matches collapse to one', () => {
    const candidates = [
      makePipeline('a', 'Druzhba', 52.6, 49.4, 52.32, 14.06),
      makePipeline('b', 'Druzhba Pipeline', 52.601, 49.401, 52.321, 14.061),
      makePipeline('c', 'Druzhba Line', 52.602, 49.402, 52.322, 14.062),
    ];
    const { toAdd } = dedupePipelines([], candidates);
    assert.equal(toAdd.length, 1, 'three matching candidates must collapse to the first one accepted');
  });

  test('existing wins over already-accepted candidate', () => {
    // If a candidate matches an existing row, it must be reported as
    // matching the existing row (existing-vs-toAdd precedence). Names
    // chosen so Jaccard exceeds 0.6 after stopword removal.
    const existing = [makePipeline('canon', 'Druzhba Northern', 52.6, 49.4, 52.32, 14.06)];
    const candidates = [
      makePipeline('cand-1', 'Druzhba Northern', 60, 30, 50, 14), // doesn't match existing (far endpoints)
      makePipeline('cand-2', 'Druzhba Northern', 52.601, 49.401, 52.321, 14.061), // matches existing (near + Jaccard=1)
    ];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
    assert.equal(toAdd.length, 1, 'cand-1 added; cand-2 skipped against existing');
    assert.equal(skippedDuplicates[0].matchedExistingId, 'canon',
      'cand-2 should be reported as matching the existing canon, not the earlier candidate');
  });
});
describe('pipeline-dedup — empty inputs', () => {
  test('empty existing + N candidates → all N added, none skipped', () => {
    const candidates = [
      makePipeline('a', 'A', 0, 0, 1, 1),
      makePipeline('b', 'B', 5, 5, 6, 6),
    ];
    const { toAdd, skippedDuplicates } = dedupePipelines([], candidates);
    assert.equal(toAdd.length, 2);
    assert.equal(skippedDuplicates.length, 0);
  });

  test('N existing + empty candidates → empty result', () => {
    const existing = [makePipeline('a', 'A', 0, 0, 1, 1)];
    const { toAdd, skippedDuplicates } = dedupePipelines(existing, []);
    assert.equal(toAdd.length, 0);
    assert.equal(skippedDuplicates.length, 0);
  });
});