worldmonitor/tests/pipelines-registry.test.mts
Elie Habib 8655bd81bc feat(energy-atlas): GEM pipeline data import — gas 297, oil 334 (#3406)
* feat(energy-atlas): GEM pipeline data import — gas 75→297, oil 75→334 (parity-push closure)

Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure
was built for. Per docs/methodology/pipelines.mdx operator runbook.

Source releases (CC-BY 4.0, attribution preserved in registry envelope):
  - GEM-GGIT-Gas-Pipelines-2025-11.xlsx
    SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6
    URL:    https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx
  - GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
    SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2
    URL:    https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx

Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) →
canonical operator-shape JSON via /tmp/gem-import/convert.py. Filter knobs:
  - status ∈ {operating, construction}
  - length ≥ 750 km (gas) / 400 km (oil) — asymmetric per-fuel trunk-class
  - capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day,
    bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors.
  - Country names → ISO 3166-1 alpha-2 via pycountry + alias table.
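
The filter knobs above can be sketched as a predicate plus one unit conversion. This is an illustrative TypeScript sketch, not the converter itself (which is a Python script): the row shape, the function names, and the MMcf/d factor (1 ft^3 = 0.028316846592 m^3, 365-day year) are assumptions, since the commit message only says "canonical conversion factors".

```typescript
// Hypothetical row shape for illustration; the real converter is a Python
// script and its field names may differ.
type Fuel = 'gas' | 'oil';

interface CandidateRow {
  fuel: Fuel;
  status: string; // e.g. 'operating', 'construction', 'proposed'
  lengthKm: number;
}

// Thresholds and statuses as stated in the commit message.
const MIN_LENGTH_KM: Record<Fuel, number> = { gas: 750, oil: 400 };
const KEPT_STATUSES = new Set<string>(['operating', 'construction']);

// A row survives only if its status is kept AND it clears its fuel's length floor.
function passesTrunkFilter(row: CandidateRow): boolean {
  return KEPT_STATUSES.has(row.status) && row.lengthKm >= MIN_LENGTH_KM[row.fuel];
}

// Assumed factors: 1 ft^3 = 0.028316846592 m^3 and a 365-day year.
const CUBIC_METRES_PER_CUBIC_FOOT = 0.028316846592;

// Converts million cubic feet per day to billion cubic metres per year.
function mmcfdToBcmYr(mmcfd: number): number {
  const cubicMetresPerYear = mmcfd * 1e6 * CUBIC_METRES_PER_CUBIC_FOOT * 365;
  return cubicMetresPerYear / 1e9;
}
```

Under these assumptions a 500 km operating gas line is dropped while a 500 km oil line under construction is kept, which is the asymmetry the per-fuel thresholds are meant to produce.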

Merge results (via scripts/import-gem-pipelines.mjs --merge):
  gas: +222 added, 15 duplicates skipped (haversine ≤ 5km AND token Jaccard ≥ 0.6)
  oil: +259 added, 16 duplicates skipped
  Final: 297 gas / 334 oil. Hand-curated 75+75 preserved with full evidence;
  GEM rows ship physicalStateSource='gem', classifierConfidence=0.4,
  operatorStatement=null, sanctionRefs=[].
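
The duplicate gate quoted above (haversine ≤ 5 km AND token Jaccard ≥ 0.6) can be sketched roughly as follows. This is a minimal sketch, not the code in scripts/import-gem-pipelines.mjs: the tokenizer, the choice to average start-to-start and end-to-end distances, and every name here are assumptions made for illustration.

```typescript
interface LatLon { lat: number; lon: number }
interface PipeLike { name: string; startPoint: LatLon; endPoint: LatLon }

// Great-circle distance in km using the haversine formula.
function haversineKm(a: LatLon, b: LatLon): number {
  const R = 6371; // mean Earth radius in km
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const sinLat = Math.sin(toRad(b.lat - a.lat) / 2);
  const sinLon = Math.sin(toRad(b.lon - a.lon) / 2);
  const h = sinLat * sinLat + Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * sinLon * sinLon;
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
function nameTokens(name: string): Set<string> {
  return new Set(name.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0));
}

// Jaccard similarity: |intersection| / |union|.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((t) => { if (b.has(t)) shared += 1; });
  return shared / (a.size + b.size - shared);
}

// Both conditions must hold; assumes both rows are digitized in the same direction.
function isLikelyDuplicate(a: PipeLike, b: PipeLike): boolean {
  const avgKm = (haversineKm(a.startPoint, b.startPoint) + haversineKm(a.endPoint, b.endPoint)) / 2;
  return avgKm <= 5 && jaccard(nameTokens(a.name), nameTokens(b.name)) >= 0.6;
}
```

Requiring both signals means a shared name in a different location, or a shared location under an unrelated name, is still added as a new row.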

Floor bump:
  scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200.
  Live counts (297/334) leave ~100 rows of jitter headroom so a partial
  re-import or coverage-narrowing release fails loud rather than halving
  the registry silently.
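
As a rough illustration of why the floor sits at 200, the check below uses the value from this commit; the `RegistryLike` shape and the helper names are hypothetical and are not taken from scripts/_pipeline-registry.mjs.

```typescript
const MIN_PIPELINES_PER_REGISTRY = 200; // value stated in the commit message

interface RegistryLike { pipelines: Record<string, unknown> }

// True when the registry has at least the minimum number of rows.
function meetsFloor(registry: RegistryLike): boolean {
  return Object.keys(registry.pipelines).length >= MIN_PIPELINES_PER_REGISTRY;
}

// Hypothetical helper: builds a placeholder registry with `count` empty rows.
function makeRegistry(count: number): RegistryLike {
  const pipelines: Record<string, unknown> = {};
  for (let i = 0; i < count; i++) pipelines[`p${i}`] = {};
  return { pipelines };
}
```

With this floor, a registry at roughly the current size passes, while one cut to about half that size (for example 146 rows) is rejected instead of shipping silently.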

Tests:
  - tests/pipelines-registry.test.mts: bumped synthetic-registry
    Array.from({length:8}) → length:210 to clear new floor; added 'gem' to
    the evidence-source whitelist for non-flowing badges (parity with the
    derivePipelinePublicBadge audit done in PR #3397 U1).
  - tests/import-gem-pipelines.test.mjs: bumped registry-conformance loop
    3 → 70 to clear new floor.
  - 51/51 pipeline tests pass; tsc --noEmit clean.

vs peer reference site (281 gas + 265 oil): we now match (gas 297) and
exceed (oil 334). Functional + visual + data parity for the energy variant
is closed; remaining gaps are editorial-cadence (weekly briefing) which
is intentionally out of scope per the parity-push plan.

* docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh

PR #3406 imported the data but didn't land the conversion script that
produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py
so future operators can reproduce the import deterministically, and rewrites
the docs/methodology/pipelines.mdx runbook to match what actually works:

- Use GeoJSON (not XLSX) — the XLSX has properties but no lat/lon columns;
  only the GIS .zip's GeoJSON has both. The original runbook said to download
  XLSX which would fail at the lat/lon validation step.
- Cadence: quarterly refresh, with concrete signals (peer-site comparison,
  90-day calendar reminder).
- Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so
  future operators don't re-request the wrong dataset (the Extraction
  Tracker = wells/fields, NOT pipelines — ours requires the Infrastructure
  Trackers).
- Last-known-good URLs documented + URL pattern explained as fallback when
  GEM rotates per release.
- Filter knob defaults documented inline (gas ≥ 750km, oil ≥ 400km, status
  ∈ {operating, construction}, capacity unit conversion table).
- Failure-mode table mapping common errors to fixes.

Converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON,
GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths so it
works for any release without code edits.

* fix(energy-atlas): close PR #3406 review findings — dedup + zero-length + test

Three Greptile findings on PR #3406:

P1 — Dedup miss (Dampier-Bunbury):
  Same physical pipeline existed in both registries — curated `dampier-bunbury`
  and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au` — because GEM
  digitized only the southern 60% of the line. The shared Bunbury terminus
  matched at 13.7 km but the average-endpoint distance was 287 km, far over
  the 5 km gate.
  Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit —
  if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint
  pairings is ≤ 25 km, treat as duplicate. The 25 km anchor preserves the
  existing "name collision in different ocean → still added" contract.
  Added regression test: identical Dampier-Bunbury inputs → 0 added, 1
  skipped, matched against `dampier-bunbury`.
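
A minimal sketch of the short-circuit described above, assuming a small stopword list and a simple tokenizer (neither is taken from scripts/_pipeline-dedup.mjs). For non-empty token sets, Jaccard == 1.0 is the same as the two sets being equal, which is what `sameTokenSet` checks. All names are hypothetical.

```typescript
interface GeoPoint { lat: number; lon: number }
interface NamedLine { name: string; startPoint: GeoPoint; endPoint: GeoPoint }

// Assumed stopword list; the real one may differ.
const NAME_STOPWORDS = new Set<string>(['to', 'the', 'natural', 'gas', 'oil', 'pipeline']);

// Great-circle distance in km (haversine formula, mean Earth radius 6371 km).
function greatCircleKm(a: GeoPoint, b: GeoPoint): number {
  const toRad = (deg: number): number => (deg * Math.PI) / 180;
  const sinLat = Math.sin(toRad(b.lat - a.lat) / 2);
  const sinLon = Math.sin(toRad(b.lon - a.lon) / 2);
  const h = sinLat * sinLat + Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * sinLon * sinLon;
  return 2 * 6371 * Math.asin(Math.sqrt(h));
}

// Lowercased name tokens with stopwords removed.
function coreTokens(name: string): Set<string> {
  const tokens = name.toLowerCase().split(/[^a-z0-9]+/);
  return new Set(tokens.filter((t) => t.length > 0 && !NAME_STOPWORDS.has(t)));
}

// Equal, non-empty token sets (equivalent to Jaccard == 1.0).
function sameTokenSet(a: Set<string>, b: Set<string>): boolean {
  if (a.size === 0 || a.size !== b.size) return false;
  let allShared = true;
  a.forEach((t) => { if (!b.has(t)) allShared = false; });
  return allShared;
}

// Duplicate if names are identical after stopword removal AND any of the
// four endpoint pairings lies within 25 km.
function isShortCircuitDuplicate(a: NamedLine, b: NamedLine): boolean {
  if (!sameTokenSet(coreTokens(a.name), coreTokens(b.name))) return false;
  const nearestKm = Math.min(
    greatCircleKm(a.startPoint, b.startPoint),
    greatCircleKm(a.startPoint, b.endPoint),
    greatCircleKm(a.endPoint, b.startPoint),
    greatCircleKm(a.endPoint, b.endPoint),
  );
  return nearestKm <= 25;
}
```

Taking the minimum over all four pairings is what lets a partially digitized line match on its one shared terminus, while the 25 km anchor still rejects an identically named line elsewhere on the map.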

P1 — Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.):
  GEM source GeoJSON occasionally has a Point geometry or single-coord
  LineString, producing pipelines where startPoint == endPoint. They render
  as map-point artifacts and skew aggregate-length stats.
  Fix (defense in depth):
    - scripts/_gem-geojson-to-canonical.py drops at conversion time
      (`zero_length` reason in drop log).
    - scripts/_pipeline-registry.mjs validateRegistry rejects defensively
      so even a hand-curated row with degenerate geometry fails loud.
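
The zero-length guard can be illustrated as below. This sketch assumes a row counts as degenerate when startPoint and endPoint are exactly equal, as the description above states; the type and function names are hypothetical, and the real checks live in the Python converter and in validateRegistry.

```typescript
interface Coord { lat: number; lon: number }
interface GeometryRow { id: string; startPoint: Coord; endPoint: Coord }
interface DropEntry { id: string; reason: 'zero_length' }

// A row is degenerate when both endpoints are the same coordinate.
function isZeroLength(row: GeometryRow): boolean {
  return row.startPoint.lat === row.endPoint.lat && row.startPoint.lon === row.endPoint.lon;
}

// Splits rows into those kept and those dropped, logging a reason for each drop.
function dropZeroLength(rows: GeometryRow[]): { kept: GeometryRow[]; dropped: DropEntry[] } {
  const kept: GeometryRow[] = [];
  const dropped: DropEntry[] = [];
  for (const row of rows) {
    if (isZeroLength(row)) dropped.push({ id: row.id, reason: 'zero_length' });
    else kept.push(row);
  }
  return { kept, dropped };
}
```

Running the same predicate at import time and again at validation time is the defense-in-depth idea: the first keeps bad rows out of the merge, the second catches any that arrive by another route.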

P2 — Test repetition coupled to fixture row count:
  Hardcoded `for (let i = 0; i < 70; i++)` × 3 fixture rows = 210 silently
  breaks if fixture is trimmed below 3.
  Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives reps from
  the floor and current fixture length.
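
The arithmetic behind that fix, as a small sketch. The formula is the one quoted above; the function name and the guard against an empty fixture are additions for illustration.

```typescript
// Number of times to repeat the fixture so the synthetic registry clears the floor,
// with 5 extra repetitions of headroom.
function repsToClearFloor(floor: number, fixtureLength: number): number {
  if (fixtureLength <= 0) throw new Error('fixture must contain at least one row');
  return Math.ceil(floor / fixtureLength) + 5;
}
```

With a floor of 200 and 3 fixture rows this gives 72 repetitions (216 rows); if the fixture is trimmed to 2 rows it gives 105 repetitions (210 rows), so the synthetic registry still clears the floor.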

Re-run --merge with all fixes applied:
  gas: 75 → 293 (+218 added, 17 deduped — was 222/15 before; +2 catches via
       name-set-identity short-circuit; -2 zero-length never imported)
  oil: 75 → 325 (+250 added, 18 deduped — was 259/16; +2 catches; -7 zero-length)

Tests: 74/74 pipeline tests pass; tsc --noEmit clean.
2026-04-25 18:59:46 +04:00

// @ts-check
import { strict as assert } from 'node:assert';
import { test, describe } from 'node:test';
import { readFileSync } from 'node:fs';
import { dirname, resolve } from 'node:path';
import { fileURLToPath } from 'node:url';
import {
  validateRegistry,
  recordCount,
  GAS_CANONICAL_KEY,
  OIL_CANONICAL_KEY,
  VALID_OIL_PRODUCT_CLASSES,
  VALID_SOURCES,
} from '../scripts/_pipeline-registry.mjs';

const __dirname = dirname(fileURLToPath(import.meta.url));
const gasRaw = readFileSync(resolve(__dirname, '../scripts/data/pipelines-gas.json'), 'utf-8');
const oilRaw = readFileSync(resolve(__dirname, '../scripts/data/pipelines-oil.json'), 'utf-8');
const gas = JSON.parse(gasRaw) as { pipelines: Record<string, any> };
const oil = JSON.parse(oilRaw) as { pipelines: Record<string, any> };

describe('pipeline registries — schema', () => {
  test('gas registry passes validateRegistry', () => {
    assert.equal(validateRegistry(gas), true);
  });
  test('oil registry passes validateRegistry', () => {
    assert.equal(validateRegistry(oil), true);
  });
  test('canonical keys are stable strings', () => {
    assert.equal(GAS_CANONICAL_KEY, 'energy:pipelines:gas:v1');
    assert.equal(OIL_CANONICAL_KEY, 'energy:pipelines:oil:v1');
  });
  test('recordCount returns non-zero for both registries', () => {
    // Loose sanity check (8 was the floor before the GEM import); the
    // validateRegistry tests above enforce the current MIN_PIPELINES_PER_REGISTRY.
    assert.ok(recordCount(gas) >= 8);
    assert.ok(recordCount(oil) >= 8);
  });
});
describe('pipeline registries — identity + geometry', () => {
  test('all ids are unique across gas + oil (no collisions)', () => {
    const gasIds = Object.keys(gas.pipelines);
    const oilIds = Object.keys(oil.pipelines);
    const overlap = gasIds.filter(id => oilIds.includes(id));
    assert.equal(overlap.length, 0, `overlapping ids: ${overlap.join(',')}`);
  });
  test('every pipeline.id matches its object key', () => {
    for (const [key, p] of Object.entries(gas.pipelines)) {
      assert.equal(p.id, key, `gas: ${key} -> id=${p.id}`);
    }
    for (const [key, p] of Object.entries(oil.pipelines)) {
      assert.equal(p.id, key, `oil: ${key} -> id=${p.id}`);
    }
  });
  test('every country code is ISO 3166-1 alpha-2', () => {
    const iso2 = /^[A-Z]{2}$/;
    const all = [...Object.values(gas.pipelines), ...Object.values(oil.pipelines)];
    for (const p of all) {
      assert.ok(iso2.test(p.fromCountry), `bad fromCountry on ${p.id}: ${p.fromCountry}`);
      assert.ok(iso2.test(p.toCountry), `bad toCountry on ${p.id}: ${p.toCountry}`);
      for (const t of p.transitCountries) {
        assert.ok(iso2.test(t), `bad transitCountry on ${p.id}: ${t}`);
      }
    }
  });
  test('endpoint coordinates are within Earth bounds', () => {
    const all = [...Object.values(gas.pipelines), ...Object.values(oil.pipelines)];
    for (const p of all) {
      assert.ok(p.startPoint.lat >= -90 && p.startPoint.lat <= 90, `${p.id} startPoint.lat OOB`);
      assert.ok(p.startPoint.lon >= -180 && p.startPoint.lon <= 180, `${p.id} startPoint.lon OOB`);
      assert.ok(p.endPoint.lat >= -90 && p.endPoint.lat <= 90, `${p.id} endPoint.lat OOB`);
      assert.ok(p.endPoint.lon >= -180 && p.endPoint.lon <= 180, `${p.id} endPoint.lon OOB`);
    }
  });
});
describe('pipeline registries — evidence', () => {
  test('non-flowing badges carry at least one evidence source', () => {
    const all = [...Object.values(gas.pipelines), ...Object.values(oil.pipelines)];
    for (const p of all) {
      if (p.evidence.physicalState === 'flowing') continue;
      const hasEvidence =
        p.evidence.operatorStatement != null ||
        p.evidence.sanctionRefs.length > 0 ||
        ['ais-relay', 'satellite', 'press', 'gem'].includes(p.evidence.physicalStateSource);
      assert.ok(hasEvidence, `${p.id} has no supporting evidence for state=${p.evidence.physicalState}`);
    }
  });
  test('classifier confidence is within 0..1', () => {
    const all = [...Object.values(gas.pipelines), ...Object.values(oil.pipelines)];
    for (const p of all) {
      const c = p.evidence.classifierConfidence;
      assert.ok(c >= 0 && c <= 1, `${p.id} bad classifierConfidence: ${c}`);
    }
  });
  test('sanctionRefs entries carry {authority, date, url}', () => {
    const all = [...Object.values(gas.pipelines), ...Object.values(oil.pipelines)];
    for (const p of all) {
      for (const ref of p.evidence.sanctionRefs) {
        assert.equal(typeof ref.authority, 'string', `${p.id} ref missing authority`);
        assert.equal(typeof ref.date, 'string', `${p.id} ref missing date`);
        assert.equal(typeof ref.url, 'string', `${p.id} ref missing url`);
        assert.ok(ref.url.startsWith('http'), `${p.id} ref url not http(s)`);
      }
    }
  });
});
describe('pipeline registries — commodity-capacity pairing', () => {
  test('gas pipelines have capacityBcmYr (not capacityMbd)', () => {
    for (const p of Object.values(gas.pipelines)) {
      assert.equal(p.commodityType, 'gas', `${p.id} should be commodityType=gas`);
      assert.equal(typeof p.capacityBcmYr, 'number', `${p.id} missing capacityBcmYr`);
      assert.ok(p.capacityBcmYr > 0, `${p.id} capacityBcmYr must be > 0`);
    }
  });
  test('oil pipelines have capacityMbd (not capacityBcmYr)', () => {
    for (const p of Object.values(oil.pipelines)) {
      assert.equal(p.commodityType, 'oil', `${p.id} should be commodityType=oil`);
      assert.equal(typeof p.capacityMbd, 'number', `${p.id} missing capacityMbd`);
      assert.ok(p.capacityMbd > 0, `${p.id} capacityMbd must be > 0`);
    }
  });
});
describe('pipeline registries — productClass', () => {
  test('every oil pipeline declares a productClass from the enum', () => {
    for (const p of Object.values(oil.pipelines)) {
      assert.ok(
        VALID_OIL_PRODUCT_CLASSES.has(p.productClass),
        `${p.id} has invalid productClass: ${p.productClass}`,
      );
    }
  });
  test('gas pipelines do not carry a productClass field', () => {
    for (const p of Object.values(gas.pipelines)) {
      assert.equal(
        p.productClass,
        undefined,
        `${p.id} should not have productClass (gas pipelines use commodity as their class)`,
      );
    }
  });
  test('validateRegistry rejects oil pipeline without productClass', () => {
    const oilSample = oil.pipelines[Object.keys(oil.pipelines)[0]!];
    const { productClass: _drop, ...stripped } = oilSample;
    const bad = {
      pipelines: Object.fromEntries(
        Array.from({ length: 210 }, (_, i) => [`p${i}`, { ...stripped, id: `p${i}` }]),
      ),
    };
    assert.equal(validateRegistry(bad), false);
  });
  test('validateRegistry rejects oil pipeline with unknown productClass', () => {
    const oilSample = oil.pipelines[Object.keys(oil.pipelines)[0]!];
    const bad = {
      pipelines: Object.fromEntries(
        Array.from({ length: 210 }, (_, i) => [
          `p${i}`,
          { ...oilSample, id: `p${i}`, productClass: 'diesel-only' },
        ]),
      ),
    };
    assert.equal(validateRegistry(bad), false);
  });
  test('validateRegistry rejects gas pipeline carrying productClass', () => {
    const gasSample = gas.pipelines[Object.keys(gas.pipelines)[0]!];
    const bad = {
      pipelines: Object.fromEntries(
        Array.from({ length: 210 }, (_, i) => [
          `p${i}`,
          { ...gasSample, id: `p${i}`, productClass: 'crude' },
        ]),
      ),
    };
    assert.equal(validateRegistry(bad), false);
  });
});
describe('pipeline registries — validateRegistry rejects bad input', () => {
  test('rejects empty object', () => {
    assert.equal(validateRegistry({}), false);
  });
  test('rejects null', () => {
    assert.equal(validateRegistry(null), false);
  });
  test('rejects a pipeline with no evidence', () => {
    const bad = {
      pipelines: Object.fromEntries(
        Array.from({ length: 210 }, (_, i) => [`p${i}`, {
          id: `p${i}`, name: 'x', operator: 'y', commodityType: 'gas',
          fromCountry: 'US', toCountry: 'CA', transitCountries: [],
          capacityBcmYr: 1, startPoint: { lat: 0, lon: 0 }, endPoint: { lat: 1, lon: 1 },
        }])
      ),
    };
    assert.equal(validateRegistry(bad), false);
  });
  test('rejects below MIN_PIPELINES_PER_REGISTRY', () => {
    const bad = { pipelines: { onlyOne: gas.pipelines[Object.keys(gas.pipelines)[0]!] } };
    assert.equal(validateRegistry(bad), false);
  });
});
describe('pipeline registries — GEM source enum', () => {
  test('VALID_SOURCES is exported and includes the existing members plus gem', () => {
    // Same source-of-truth pattern as VALID_OIL_PRODUCT_CLASSES (PR #3383):
    // export the Set so future tests can't drift from the validator.
    assert.ok(VALID_SOURCES.has('operator'));
    assert.ok(VALID_SOURCES.has('regulator'));
    assert.ok(VALID_SOURCES.has('press'));
    assert.ok(VALID_SOURCES.has('satellite'));
    assert.ok(VALID_SOURCES.has('ais-relay'));
    assert.ok(VALID_SOURCES.has('gem'));
  });
  test('validateRegistry accepts GEM-sourced minimum-viable evidence (state=unknown)', () => {
    // GEM rows ship as state=unknown until classifier promotes them.
    // physicalStateSource='gem' is sufficient evidence per the audit.
    const gasSample = gas.pipelines[Object.keys(gas.pipelines)[0]!];
    const good = {
      pipelines: Object.fromEntries(
        Array.from({ length: 210 }, (_, i) => [`p${i}`, {
          ...gasSample,
          id: `p${i}`,
          evidence: {
            physicalState: 'unknown',
            physicalStateSource: 'gem',
            commercialState: 'unknown',
            operatorStatement: null,
            sanctionRefs: [],
            classifierVersion: 'gem-import-v1',
            classifierConfidence: 0.4,
            lastEvidenceUpdate: '2026-04-25T00:00:00Z',
          },
        }])
      ),
    };
    assert.equal(validateRegistry(good), true);
  });
  test('validateRegistry accepts GEM-sourced offline row (state=offline + only source=gem)', () => {
    // Per plan U1 audit: 'gem' is evidence-bearing for non-flowing badges,
    // parity with press/satellite/ais-relay. An offline row with no operator
    // statement and no sanctionRefs but physicalStateSource='gem' should pass
    // validation (the public-badge derivation downstream will then map it
    // to "disputed" via the external-signal rule).
    const gasSample = gas.pipelines[Object.keys(gas.pipelines)[0]!];
    const good = {
      pipelines: Object.fromEntries(
        Array.from({ length: 210 }, (_, i) => [`p${i}`, {
          ...gasSample,
          id: `p${i}`,
          evidence: {
            physicalState: 'offline',
            physicalStateSource: 'gem',
            commercialState: 'unknown',
            operatorStatement: null,
            sanctionRefs: [],
            classifierVersion: 'gem-import-v1',
            classifierConfidence: 0.4,
            lastEvidenceUpdate: '2026-04-25T00:00:00Z',
          },
        }])
      ),
    };
    assert.equal(validateRegistry(good), true);
  });
  test('validateRegistry still rejects unknown physicalStateSource values', () => {
    // Adding 'gem' must not loosen the enum; unknown sources still fail.
    const gasSample = gas.pipelines[Object.keys(gas.pipelines)[0]!];
    const bad = {
      pipelines: Object.fromEntries(
        Array.from({ length: 210 }, (_, i) => [`p${i}`, {
          ...gasSample,
          id: `p${i}`,
          evidence: {
            ...gasSample.evidence,
            physicalStateSource: 'rumor',
          },
        }])
      ),
    };
    assert.equal(validateRegistry(bad), false);
  });
});