mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
feat(energy-atlas): GEM pipeline data import — gas 297, oil 334 (#3406)
* feat(energy-atlas): GEM pipeline data import — gas 75→297, oil 75→334 (parity-push closure)

  Closes the ~3.6× pipeline-scale gap that PR #3397's import infrastructure was built for. Per docs/methodology/pipelines.mdx operator runbook.

  Source releases (CC-BY 4.0, attribution preserved in registry envelope):
  - GEM-GGIT-Gas-Pipelines-2025-11.xlsx
    SHA256: f56d8b14400e558f06e53a4205034d3d506fc38c5ae6bf58000252f87b1845e6
    URL: https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.xlsx
  - GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx
    SHA256: d1648d28aed99cfd2264047f1e944ddfccf50ce9feeac7de5db233c601dc3bb2
    URL: https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.xlsx

  Pre-conversion: GeoJSON (geometry endpoints) + XLSX (column properties) → canonical operator-shape JSON via /tmp/gem-import/convert.py. Filter knobs:
  - status ∈ {operating, construction}
  - length ≥ 750 km (gas) / 400 km (oil) — asymmetric per-fuel trunk-class
  - capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day, bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil) at canonical conversion factors
  - country names → ISO 3166-1 alpha-2 via pycountry + alias table

  Merge results (via scripts/import-gem-pipelines.mjs --merge):
  - gas: +222 added, 15 duplicates skipped (haversine ≤ 5 km AND token Jaccard ≥ 0.6)
  - oil: +259 added, 16 duplicates skipped
  Final: 297 gas / 334 oil. Hand-curated 75+75 preserved with full evidence; GEM rows ship physicalStateSource='gem', classifierConfidence=0.4, operatorStatement=null, sanctionRefs=[].

  Floor bump: scripts/_pipeline-registry.mjs MIN_PIPELINES_PER_REGISTRY 8 → 200. Live counts (297/334) leave ~100 rows of jitter headroom so a partial re-import or coverage-narrowing release fails loud rather than halving the registry silently.
  Tests:
  - tests/pipelines-registry.test.mts: bumped synthetic-registry Array.from({length:8}) → length:210 to clear the new floor; added 'gem' to the evidence-source whitelist for non-flowing badges (parity with the derivePipelinePublicBadge audit done in PR #3397 U1).
  - tests/import-gem-pipelines.test.mjs: bumped registry-conformance loop 3 → 70 to clear the new floor.
  - 51/51 pipeline tests pass; tsc --noEmit clean.

  vs peer reference site (281 gas + 265 oil): we now match (gas 297) and exceed (oil 334). Functional + visual + data parity for the energy variant is closed; remaining gaps are editorial-cadence (weekly briefing), which is intentionally out of scope per the parity-push plan.

* docs(energy-atlas): land GEM converter + expand methodology runbook for quarterly refresh

  PR #3406 imported the data but didn't land the conversion script that produced it. This commit lands the converter at scripts/_gem-geojson-to-canonical.py so future operators can reproduce the import deterministically, and rewrites the docs/methodology/pipelines.mdx runbook to match what actually works:
  - Use GeoJSON (not XLSX) — the XLSX has properties but no lat/lon columns; only the GIS .zip's GeoJSON has both. The original runbook said to download XLSX, which would fail at the lat/lon validation step.
  - Cadence: quarterly refresh, with concrete signals (peer-site comparison, 90-day calendar reminder).
  - Source datasets: explicit GGIT (gas) + GOIT (oil/NGL) tracker names so future operators don't re-request the wrong dataset (the Extraction Tracker = wells/fields, NOT pipelines — ours requires the Infrastructure Trackers).
  - Last-known-good URLs documented + URL pattern explained as fallback when GEM rotates per release.
  - Filter knob defaults documented inline (gas ≥ 750 km, oil ≥ 400 km, status ∈ {operating, construction}, capacity unit conversion table).
  - Failure-mode table mapping common errors to fixes.
  Converter takes paths via env vars (GEM_GAS_GEOJSON, GEM_OIL_GEOJSON, GEM_DOWNLOADED_AT, GEM_SOURCE_VERSION) instead of hardcoded paths so it works for any release without code edits.

* fix(energy-atlas): close PR #3406 review findings — dedup + zero-length + test

  Three Greptile findings on PR #3406:

  P1 — Dedup miss (Dampier-Bunbury): The same physical pipeline existed in both registries — curated `dampier-bunbury` and GEM-imported `dampier-to-bunbury-natural-gas-pipeline-au` — because GEM digitized only the southern 60% of the line. The shared Bunbury terminus matched at 13.7 km (just over the 5 km gate), and the average-endpoint distance was 287 km. Fix: scripts/_pipeline-dedup.mjs adds a name-set-identity short-circuit — if Jaccard == 1.0 (after stopword removal) AND any of the 4 endpoint pairings is ≤ 25 km, treat as duplicate. The 25 km anchor preserves the existing "name collision in different ocean → still added" contract. Added regression test: identical Dampier-Bunbury inputs → 0 added, 1 skipped, matched against `dampier-bunbury`.

  P1 — Zero-length geometry (9 rows: Trans-Alaska, Enbridge Line 3, Ichthys, etc.): GEM source GeoJSON occasionally has a Point geometry or single-coord LineString, producing pipelines where startPoint == endPoint. They render as map-point artifacts and skew aggregate-length stats. Fix (defense in depth):
  - scripts/_gem-geojson-to-canonical.py drops at conversion time (`zero_length` reason in drop log).
  - scripts/_pipeline-registry.mjs validateRegistry rejects defensively so even a hand-curated row with degenerate geometry fails loud.

  P2 — Test repetition coupled to fixture row count: Hardcoded `for (let i = 0; i < 70; i++)` × 3 fixture rows = 210 silently breaks if the fixture is trimmed below 3 rows. Fix: `Math.ceil(REGISTRY_FLOOR / fixture.length) + 5` derives reps from the floor and the current fixture length.
  Re-run --merge with all fixes applied:
  - gas: 75 → 293 (+218 added, 17 deduped — was 222/15 before; +2 catches via name-set-identity short-circuit; -2 zero-length never imported)
  - oil: 75 → 325 (+250 added, 18 deduped — was 259/16; +2 catches; -7 zero-length)

  Tests: 74/74 pipeline tests pass; tsc --noEmit clean.
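To see why the derived repetition count in the P2 fix clears the floor for any fixture size, a quick check (a sketch — `REGISTRY_FLOOR = 200` is assumed from the floor bump described above):

```python
import math

REGISTRY_FLOOR = 200  # MIN_PIPELINES_PER_REGISTRY after the bump

def reps(fixture_len: int) -> int:
    # Derive repetitions from the floor and fixture size, plus slack,
    # instead of hardcoding `70` (which silently breaks if the fixture shrinks).
    return math.ceil(REGISTRY_FLOOR / fixture_len) + 5

for n in (1, 2, 3, 10):
    total = reps(n) * n
    print(n, reps(n), total, total >= REGISTRY_FLOOR)
```

For the current 3-row fixture this yields 72 repetitions (216 synthetic rows), and the total stays above the floor even if the fixture is trimmed to a single row.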
@@ -69,20 +69,97 @@ The hand-curated subset (operator/regulator/sanctions-bearing rows with classifi
## Operator runbook — GEM import refresh

### Cadence

**Refresh quarterly** (or whenever a new GEM release lands — check the GGIT/GOIT landing pages below). The refresh is operator-mediated rather than cron-driven because:

- GEM downloads are gated behind a per-request form; the resulting URL is release-specific and rotates each quarter, so a hardcoded URL would silently fetch a different version than the one we attribute.
- Column names occasionally shift between releases; the schema-drift sentinel in `scripts/import-gem-pipelines.mjs` catches this loudly, but the diff still needs human review before committing.

Set a calendar reminder so a quarter doesn't pass without a refresh. Suggested cadence: review every 90 days; refresh whenever a peer reference site (e.g. global-energy-flow.com) advertises a newer release than ours.
### Source datasets

The two files we use are GEM's pipeline-only trackers (NOT the combined "Oil & Gas Extraction Tracker" — that's upstream wells/fields and has a different schema):

| Tracker | Acronym | What it contains | Landing page |
|---|---|---|---|
| Global Gas Infrastructure Tracker | **GGIT** | Gas pipelines + LNG terminals | [globalenergymonitor.org/projects/global-gas-infrastructure-tracker](https://globalenergymonitor.org/projects/global-gas-infrastructure-tracker/) |
| Global Oil Infrastructure Tracker | **GOIT** | Oil + NGL pipelines | [globalenergymonitor.org/projects/global-oil-infrastructure-tracker](https://globalenergymonitor.org/projects/global-oil-infrastructure-tracker/) |

The **GIS .zip download** (containing GeoJSON, GeoPackage, and shapefile) is what we want — NOT the .xlsx. The XLSX has properties but no lat/lon columns; only the GeoJSON has both column properties AND `LineString.coordinates` for endpoint extraction.
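After unzipping, a quick way to confirm the GeoJSON really carries both the property columns and the LineString coordinates — a minimal sketch using a synthetic stand-in feature (real files are a FeatureCollection with thousands of rows; load them with `json.load`):

```python
import json

# Synthetic stand-in for one GEM GeoJSON feature — NOT a real GEM row.
feature = {
    "type": "Feature",
    "properties": {"PipelineName": "Example Trunkline", "Status": "Operating"},
    "geometry": {"type": "LineString", "coordinates": [[5.0, 52.0], [13.4, 52.5]]},
}

coords = feature["geometry"]["coordinates"]
start, end = coords[0], coords[-1]  # GeoJSON position order is [lon, lat]
print(feature["properties"]["PipelineName"], start, end)
```

If a feature has properties but a `Point` geometry (or a single-coordinate LineString), the converter will drop it at the `zero_length` gate described below.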
#### Last-known-good URLs (rotate per release)

These are the URLs we used for the 2026-04-25 import. GEM rotates them per release, so always re-request via the landing page above for the current release before re-running:

```
GGIT Gas (2025-11): https://globalenergymonitor.org/wp-content/uploads/2025/11/GEM-GGIT-Gas-Pipelines-2025-11.zip
GOIT Oil (2025-03): https://globalenergymonitor.org/wp-content/uploads/2025/03/GEM-GOIT-Oil-NGL-Pipelines-2025-03.zip
```

URL pattern is stable: `globalenergymonitor.org/wp-content/uploads/YYYY/MM/GEM-{GGIT,GOIT}-{tracker-name}-YYYY-MM.zip`. If the landing-page download flow changes, this pattern is the fallback for figuring out the new URL given the release date GEM publishes.
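The fallback can be made concrete with a tiny helper — a sketch only, not a landed script; `gem_release_url` is a hypothetical name:

```python
def gem_release_url(tracker: str, dataset: str, year: int, month: int) -> str:
    """Build the release-specific GIS .zip URL from GEM's stable pattern.

    tracker: "GGIT" or "GOIT"; dataset: the tracker-name token GEM uses in
    the filename (e.g. "Gas-Pipelines", "Oil-NGL-Pipelines").
    """
    return (
        f"https://globalenergymonitor.org/wp-content/uploads/"
        f"{year}/{month:02d}/GEM-{tracker}-{dataset}-{year}-{month:02d}.zip"
    )

# Reproduces the 2025-11 gas URL documented above:
print(gem_release_url("GGIT", "Gas-Pipelines", 2025, 11))
```

Verify the constructed URL against the landing page before downloading — the pattern is a fallback, not a contract GEM guarantees.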
### Refresh steps

1. **Request the data** via either landing page above. GEM emails you per-release URLs (one for the .xlsx, one for the GIS .zip). Registration is required even though the data itself is CC-BY 4.0.

2. **Download both GIS .zips** and unzip:

```bash
unzip -o ~/Downloads/GEM-GGIT-Gas-Pipelines-YYYY-MM.zip -d /tmp/gem-gis/gas/
unzip -o ~/Downloads/GEM-GOIT-Oil-NGL-Pipelines-YYYY-MM.zip -d /tmp/gem-gis/oil/
```
3. **Convert GeoJSON → canonical JSON** via the in-repo converter. It reads both GeoJSON files, applies the filter knobs documented in the script header, normalizes country names to ISO 3166-1 alpha-2 via `pycountry`, and emits the operator-shape envelope:

```bash
pip3 install pycountry  # one-time
GEM_GAS_GEOJSON=/tmp/gem-gis/gas/GEM-GGIT-Gas-Pipelines-YYYY-MM.geojson \
GEM_OIL_GEOJSON=/tmp/gem-gis/oil/GEM-GOIT-Oil-NGL-Pipelines-YYYY-MM.geojson \
GEM_DOWNLOADED_AT=YYYY-MM-DD \
GEM_SOURCE_VERSION="GEM-GGIT-YYYY-MM+GOIT-YYYY-MM" \
python3 scripts/_gem-geojson-to-canonical.py > /tmp/gem-pipelines.json 2> /tmp/gem-drops.log
cat /tmp/gem-drops.log  # inspect drop counts before merging
```

Filter knob defaults (in `scripts/_gem-geojson-to-canonical.py`):

- `MIN_LENGTH_KM_GAS = 750` (trunk-class only)
- `MIN_LENGTH_KM_OIL = 400` (trunk-class only)
- `ACCEPTED_STATUS = {operating, construction}`
- Capacity unit conversions: bcm/y native; MMcf/d, MMSCMD, mtpa, m3/day, bpd, Mb/d, kbd → bcm/y (gas) or bbl/d (oil)

These thresholds were tuned empirically against the 2025-11/2025-03 releases to land at ~250-300 entries per registry. Adjust if a future release shifts the volume distribution.
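To make the unit conversions concrete, here is a worked sketch of two of the factors listed above (the helper names are illustrative; the canonical factors live in `gas_capacity()`/`oil_capacity()` in the converter):

```python
# 1 MMcf/d ≈ 0.01034 bcm/y (million cubic feet/day → billion m³/year)
MMCFD_TO_BCMY = 0.01034
# 1 Mb/d (thousand barrels/day, industry shorthand) = 1,000 bbl/d
MBD_TO_BBLD = 1_000.0

def gas_to_bcm_per_year(capacity_mmcfd: float) -> float:
    return capacity_mmcfd * MMCFD_TO_BCMY

def oil_to_bbl_per_day(capacity_mbd: float) -> float:
    return capacity_mbd * MBD_TO_BBLD

# A 2,000 MMcf/d gas line ≈ 20.68 bcm/y; a 500 Mb/d oil line = 500,000 bbl/d
print(round(gas_to_bcm_per_year(2000), 2), oil_to_bbl_per_day(500))
```

A line that lands at the registry boundary after conversion is worth spot-checking against GEM's own derived `CapacityBcm/y` column, which the converter prefers when present.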
4. **Dry-run** to inspect candidate counts before touching the registry:

```bash
GEM_PIPELINES_FILE=/tmp/gem-pipelines.json node scripts/import-gem-pipelines.mjs --print-candidates \
  | jq '{ gas: (.gas | length), oil: (.oil | length) }'
```
5. **Merge** into `scripts/data/pipelines-{gas,oil}.json` (writes both atomically — validates both before either is touched on disk):

```bash
GEM_PIPELINES_FILE=/tmp/gem-pipelines.json node scripts/import-gem-pipelines.mjs --merge
```

Spot-check 5-10 random GEM-sourced rows in the diff before committing — known major trunks (Druzhba, Nord Stream, Keystone, TAPI, Centro Oeste) are good sanity-check anchors.
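One way to pull a random sample for the spot-check — a sketch against synthetic rows, since the exact on-disk registry schema isn't reproduced here (the `physicalStateSource='gem'` field name comes from the import commit; adjust to the real shape when loading `scripts/data/pipelines-gas.json`):

```python
import random

# Synthetic stand-in rows; in practice, load the merged registry JSON.
rows = [
    {"name": f"pipeline-{i}", "physicalStateSource": "gem" if i % 2 else "operator"}
    for i in range(20)
]

gem_rows = [r for r in rows if r["physicalStateSource"] == "gem"]
sample = random.sample(gem_rows, k=min(5, len(gem_rows)))
for r in sample:
    print(r["name"])
```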
6. **Commit** the data + record provenance. Per-release SHA256s go in the commit message so future audits can verify reproducibility:

```bash
shasum -a 256 ~/Downloads/GEM-GGIT-Gas-Pipelines-YYYY-MM.xlsx \
  ~/Downloads/GEM-GOIT-Oil-NGL-Pipelines-YYYY-MM.xlsx
```

If the row count crosses a threshold, also bump `MIN_PIPELINES_PER_REGISTRY` in `scripts/_pipeline-registry.mjs` so future partial re-imports fail loud rather than silently halving the registry.

7. **Verify** `npm run test:data` is green before pushing.
### Failure modes and what to do

| Symptom | Cause | Fix |
|---|---|---|
| Converter exits with `GEM_GAS_GEOJSON and GEM_OIL_GEOJSON env vars are required` | Env vars not set | Re-run with both `GEM_GAS_GEOJSON` and `GEM_OIL_GEOJSON` pointed at the unzipped `.geojson` files |
| Many rows dropped on `country:Foo\|Bar` | A country name GEM uses isn't in `pycountry` or the alias table | Add the alias to `COUNTRY_ALIASES` in `scripts/_gem-geojson-to-canonical.py` |
| Many rows dropped on `no_capacity` with a unit we haven't seen | GEM added a capacity unit | Add the conversion factor to `gas_capacity()` or `oil_capacity()` in the converter |
| Parser throws `schema drift — pipelines[i] missing column "X"` | GEM renamed a column between releases | The parser names the missing column; map it back in the converter and re-run |
| `validateRegistry` rejects the merged registry | Almost always: count below `MIN_PIPELINES_PER_REGISTRY`, or an evidence source not in the whitelist | Inspect the merged JSON; if the row drop is real, lower the floor; if a row's evidence is malformed, fix the converter |
| Net adds drop precipitously between releases | GEM removed a tracker subset, OR the dedup is over-matching | Run `--print-candidates` and diff against the prior quarter's output; adjust the haversine/Jaccard knobs in `scripts/_pipeline-dedup.mjs` if needed |
## Corrections
382
scripts/_gem-geojson-to-canonical.py
Normal file
@@ -0,0 +1,382 @@
#!/usr/bin/env python3
"""
Pre-convert GEM GeoJSON (GGIT gas + GOIT oil pipelines) → canonical JSON shape
that scripts/import-gem-pipelines.mjs::REQUIRED_COLUMNS expects.

Why GeoJSON, not XLSX:
GEM publishes both XLSX and GIS .zip downloads (with GeoJSON, GeoPackage,
shapefile inside). The XLSX has properties but NO lat/lon columns — endpoint
geometry only lives in the GIS feed. The GeoJSON `properties` block carries
the same column set as the XLSX, AND `geometry.coordinates` gives us the
LineString endpoints we need for haversine dedup. So we use GeoJSON only.

Usage:
GEM_GAS_GEOJSON=/path/to/GEM-GGIT-Gas-Pipelines-YYYY-MM.geojson \\
GEM_OIL_GEOJSON=/path/to/GEM-GOIT-Oil-NGL-Pipelines-YYYY-MM.geojson \\
python3 scripts/_gem-geojson-to-canonical.py \\
> /tmp/gem-pipelines.json

# Then feed to the merge step:
GEM_PIPELINES_FILE=/tmp/gem-pipelines.json node \\
scripts/import-gem-pipelines.mjs --print-candidates  # dry run
GEM_PIPELINES_FILE=/tmp/gem-pipelines.json node \\
scripts/import-gem-pipelines.mjs --merge

Dependencies:
pip3 install pycountry  # ISO 3166-1 alpha-2 mapping for country names

Drop-summary log goes to stderr; canonical JSON goes to stdout.
"""

import json
import os
import sys
import pycountry

GAS_PATH = os.environ.get("GEM_GAS_GEOJSON")
OIL_PATH = os.environ.get("GEM_OIL_GEOJSON")
if not GAS_PATH or not OIL_PATH:
    sys.exit(
        "GEM_GAS_GEOJSON and GEM_OIL_GEOJSON env vars are required. "
        "Point each at the GEM-{GGIT,GOIT}-{Gas,Oil-NGL}-Pipelines-YYYY-MM.geojson "
        "file unzipped from the GIS download. See script header for details."
    )

# Filter knobs (per plan: trunk-class only, target 250-300 entries per registry).
# Asymmetric thresholds: gas has more long-distance trunks worldwide (LNG-feeder
# corridors, Russia→Europe, Russia→China), oil pipelines tend to be shorter
# regional collectors. Tuned empirically against the 2025-11 GEM release to
# yield ~265 gas + ~300 oil after dedup against the 75 hand-curated rows.
MIN_LENGTH_KM_GAS = 750.0
MIN_LENGTH_KM_OIL = 400.0
ACCEPTED_STATUS = {"operating", "construction"}

# GEM (lowercase) → parser STATUS_MAP key (PascalCase)
STATUS_PASCAL = {
    "operating": "Operating",
    "construction": "Construction",
    "proposed": "Proposed",
    "cancelled": "Cancelled",
    "shelved": "Cancelled",  # treat shelved as cancelled per plan U2
    "mothballed": "Mothballed",
    "idle": "Idle",
    "shut-in": "Shut-in",
    "retired": "Mothballed",
    "mixed status": "Operating",  # rare; treat as operating
}

# Country aliases for cases pycountry's fuzzy match fails on
COUNTRY_ALIASES = {
    "United States": "US",
    "United Kingdom": "GB",
    "Russia": "RU",
    "South Korea": "KR",
    "North Korea": "KP",
    "Iran": "IR",
    "Syria": "SY",
    "Venezuela": "VE",
    "Bolivia": "BO",
    "Tanzania": "TZ",
    "Vietnam": "VN",
    "Laos": "LA",
    "Czech Republic": "CZ",
    "Czechia": "CZ",
    "Slovakia": "SK",
    "Macedonia": "MK",
    "North Macedonia": "MK",
    "Moldova": "MD",
    "Brunei": "BN",
    "Cape Verde": "CV",
    "Ivory Coast": "CI",
    "Cote d'Ivoire": "CI",
    "Republic of the Congo": "CG",
    "Democratic Republic of the Congo": "CD",
    "DR Congo": "CD",
    "DRC": "CD",
    "Congo": "CG",
    "Burma": "MM",
    "Myanmar": "MM",
    "Taiwan": "TW",
    "Palestine": "PS",
    "Kosovo": "XK",  # not ISO-2 official; use XK (commonly accepted)
}


def country_to_iso2(name):
    if not name:
        return None
    name = name.strip()
    if name in COUNTRY_ALIASES:
        return COUNTRY_ALIASES[name]
    try:
        c = pycountry.countries.get(name=name)
        if c:
            return c.alpha_2
        # Try common_name (e.g. "Russia" → "Russian Federation")
        c = pycountry.countries.get(common_name=name)
        if c:
            return c.alpha_2
        # Fuzzy
        results = pycountry.countries.search_fuzzy(name)
        if results:
            return results[0].alpha_2
    except (LookupError, KeyError):
        pass
    return None


def split_countries(s):
    """Parse 'Russia, Belarus, Ukraine' → ['Russia','Belarus','Ukraine']"""
    if not s:
        return []
    return [x.strip() for x in s.split(",") if x.strip()]


def get_endpoints(geom):
    """Return ((startLon, startLat), (endLon, endLat)) or None."""
    if not geom:
        return None
    t = geom.get("type")
    coords = geom.get("coordinates")
    if t == "LineString" and coords and len(coords) >= 2:
        return coords[0], coords[-1]
    if t == "MultiLineString" and coords:
        flat = [pt for line in coords if line for pt in line]
        if len(flat) >= 2:
            return flat[0], flat[-1]
    if t == "GeometryCollection":
        geoms = geom.get("geometries") or []
        all_coords = []
        for g in geoms:
            if g and g.get("type") == "LineString" and g.get("coordinates"):
                all_coords.extend(g["coordinates"])
            elif g and g.get("type") == "MultiLineString" and g.get("coordinates"):
                for line in g["coordinates"]:
                    all_coords.extend(line)
        if len(all_coords) >= 2:
            return all_coords[0], all_coords[-1]
    return None


def first_year(props):
    for k in ("StartYear1", "StartYear2", "StartYear3"):
        v = props.get(k)
        if v:
            try:
                return int(float(v))
            except (TypeError, ValueError):
                pass
    return 0


def best_length_km(props):
    for k in ("LengthMergedKm", "LengthKnownKm", "LengthEstimateKm"):
        v = props.get(k)
        if v in (None, "", "NA"):
            continue
        try:
            f = float(v)
            if f > 0:
                return f
        except (TypeError, ValueError):
            pass
    return 0.0


def _f(v):
    if v in (None, "", "NA"):
        return None
    try:
        f = float(v)
        return f if f > 0 else None
    except (TypeError, ValueError):
        return None


def gas_capacity(props):
    """Return (capacity, 'bcm/y'). GGIT has CapacityBcm/y derived for many rows."""
    f = _f(props.get("CapacityBcm/y"))
    if f is not None:
        return f, "bcm/y"
    # Fall back to raw Capacity + CapacityUnits with conversions to bcm/y.
    cap = _f(props.get("Capacity"))
    if cap is None:
        return None, None
    u = (props.get("CapacityUnits") or "").strip().lower()
    if u == "bcm/y":
        return cap, "bcm/y"
    if u == "mmcf/d":  # million standard cubic feet/day → bcm/y
        return cap * 0.01034, "bcm/y"
    if u == "mmscmd":  # million standard cubic metres/day
        return cap * 365.25 / 1000.0, "bcm/y"
    if u == "mill.sm3/day":  # million Sm3/day = MMSCMD
        return cap * 365.25 / 1000.0, "bcm/y"
    if u == "scm/y":  # standard cubic metres/year
        return cap / 1e9, "bcm/y"
    if u == "mtpa":  # million tonnes/annum LNG → bcm/y (1 mtpa ≈ 1.36 bcm/y)
        return cap * 1.36, "bcm/y"
    return None, None


def oil_capacity(props):
    """Return (capacity, capacityUnit) for oil. Convert to bbl/d for parser
    consumption (parser then converts bbl/d / 1e6 → Mbd internally)."""
    cap = _f(props.get("Capacity"))
    unit_raw = (props.get("CapacityUnits") or "").strip().lower()
    if cap is None or not unit_raw:
        # Fallback: derive from CapacityBOEd if present (already bpd-equivalent).
        boed = _f(props.get("CapacityBOEd"))
        if boed is not None:
            return boed, "bbl/d"
        return None, None
    if unit_raw == "bpd":
        return cap, "bbl/d"
    if unit_raw in ("mb/d", "mbd"):
        # GEM "Mb/d" = thousand bbl/day (industry shorthand). Convert to bbl/d.
        return cap * 1000.0, "bbl/d"
    if unit_raw in ("kbd", "kb/d"):
        return cap * 1000.0, "bbl/d"
    if unit_raw == "mtpa":
        # Million tonnes/annum crude → bbl/d (avg crude: 7.33 bbl/tonne).
        return cap * 1e6 * 7.33 / 365.25, "bbl/d"
    if unit_raw == "m3/day":
        # 1 m3 = 6.2898 bbl
        return cap * 6.2898, "bbl/d"
    if unit_raw == "m3/month":
        return cap * 6.2898 / 30.4, "bbl/d"
    if unit_raw == "m3/year":
        return cap * 6.2898 / 365.25, "bbl/d"
    if unit_raw == "thousand m3/year":
        return cap * 1000 * 6.2898 / 365.25, "bbl/d"
    if unit_raw == "tn/d":  # tonnes/day
        return cap * 7.33, "bbl/d"
    # Unknown unit → fall back to BOEd if available.
    boed = _f(props.get("CapacityBOEd"))
    if boed is not None:
        return boed, "bbl/d"
    return None, None


def convert_one(props, geom, fuel_token):
    name = (props.get("PipelineName") or "").strip()
    seg = (props.get("SegmentName") or "").strip()
    if seg and seg.lower() not in ("main line", "mainline", "main"):
        name = f"{name} - {seg}" if name else seg
    if not name:
        return None, "no_name"

    status = (props.get("Status") or "").strip().lower()
    if status not in ACCEPTED_STATUS:
        return None, f"status:{status or 'empty'}"

    pts = get_endpoints(geom)
    if not pts:
        return None, "no_geom"
    s_lon, s_lat = pts[0][0], pts[0][1]
    e_lon, e_lat = pts[1][0], pts[1][1]
    # Drop degenerate geometry (start == end). GEM occasionally publishes
    # rows with a Point geometry or a single-coord LineString, which we'd
    # otherwise emit as zero-length routes. PR #3406 review found 9 such
    # rows (Trans-Alaska, Enbridge Line 3 Replacement, Ichthys, etc.).
    if s_lat == e_lat and s_lon == e_lon:
        return None, "zero_length"

    length = best_length_km(props)
    threshold = MIN_LENGTH_KM_GAS if fuel_token == "Gas" else MIN_LENGTH_KM_OIL
    if length < threshold:
        return None, "too_short"

    if fuel_token == "Gas":
        cap, unit = gas_capacity(props)
        from_country_name = props.get("StartCountryOrArea")
        to_country_name = props.get("EndCountryOrArea")
        all_countries = split_countries(props.get("CountriesOrAreas"))
    else:
        cap, unit = oil_capacity(props)
        from_country_name = props.get("StartCountry")
        to_country_name = props.get("EndCountry")
        all_countries = split_countries(props.get("Countries"))
    if cap is None or unit is None:
        return None, "no_capacity"

    from_iso = country_to_iso2(from_country_name)
    to_iso = country_to_iso2(to_country_name)
    if not from_iso or not to_iso:
        return None, f"country:{from_country_name}|{to_country_name}"

    transit = []
    for c in all_countries:
        iso = country_to_iso2(c)
        if iso and iso != from_iso and iso != to_iso:
            transit.append(iso)

    operator = (props.get("Owner") or props.get("Parent") or "").strip()
    if not operator:
        operator = "Unknown"

    row = {
        "name": name,
        "operator": operator,
        "fuel": fuel_token,
        "fromCountry": from_iso,
        "toCountry": to_iso,
        "transitCountries": transit,
        "capacity": cap,
        "capacityUnit": unit,
        "lengthKm": length,
        "status": STATUS_PASCAL.get(status, "Operating"),
        "startLat": s_lat,
        "startLon": s_lon,
        "endLat": e_lat,
        "endLon": e_lon,
        "startYear": first_year(props),
    }
    return row, None


def process(path, fuel_token, drops):
    with open(path) as f:
        gj = json.load(f)
    out = []
    for ft in gj["features"]:
        props = ft.get("properties") or {}
        geom = ft.get("geometry")
        row, reason = convert_one(props, geom, fuel_token)
        if row:
            out.append(row)
        else:
            drops[reason] = drops.get(reason, 0) + 1
    return out


def main():
    drops_gas, drops_oil = {}, {}
    gas_rows = process(GAS_PATH, "Gas", drops_gas)
    oil_rows = process(OIL_PATH, "Oil", drops_oil)

    # The operator stamps `downloadedAt` and `sourceVersion` per release so
    # the parser's deterministic-timestamp logic (resolveEvidenceTimestamp in
    # scripts/import-gem-pipelines.mjs) produces a stable lastEvidenceUpdate
    # tied to the actual download date — not "now". Override via env so the
    # script doesn't drift across re-runs.
    downloaded_at = os.environ.get("GEM_DOWNLOADED_AT", "1970-01-01")
    source_version = os.environ.get("GEM_SOURCE_VERSION", "GEM-unspecified-release")
    envelope = {
        "downloadedAt": downloaded_at,
        "sourceVersion": source_version,
        "pipelines": gas_rows + oil_rows,
    }
    json.dump(envelope, sys.stdout, indent=2, ensure_ascii=False)

    print("\n--- DROP SUMMARY (gas) ---", file=sys.stderr)
    for k, v in sorted(drops_gas.items(), key=lambda x: -x[1]):
        print(f"  {k}: {v}", file=sys.stderr)
    print(f"  KEPT: {len(gas_rows)}", file=sys.stderr)
    print("--- DROP SUMMARY (oil) ---", file=sys.stderr)
    for k, v in sorted(drops_oil.items(), key=lambda x: -x[1]):
        print(f"  {k}: {v}", file=sys.stderr)
    print(f"  KEPT: {len(oil_rows)}", file=sys.stderr)


if __name__ == "__main__":
    main()
@@ -24,6 +24,16 @@ const STOPWORDS = new Set([

const MATCH_DISTANCE_KM = 5;
const MATCH_JACCARD_MIN = 0.6;
// When the candidate's tokenized name equals the existing row's tokenized
// name (Jaccard == 1.0 after stopword removal), accept the match if ANY
// endpoint pairing is within MATCH_NAME_IDENTICAL_DISTANCE_KM. Catches PR
// #3406 review's Dampier-Bunbury case: GEM digitized only the southern
// 60% of the line, so the average-endpoint distance was 287km but the
// shared Bunbury terminus matched within 13.7km. A pure name-only rule
// would false-positive on coincidental collisions in different oceans
// (e.g. unrelated "Nord Stream 1" in the Pacific), so we still require
// SOME geographic anchor.
const MATCH_NAME_IDENTICAL_DISTANCE_KM = 25;
const EARTH_RADIUS_KM = 6371;

/**
@@ -55,6 +65,24 @@ function averageEndpointDistanceKm(a, b) {
  return Math.min(forward, reversed);
}

/**
 * Minimum of all four cross-pairings between candidate and existing endpoints.
 * Used by the name-identical short-circuit: if the candidate digitizes a
 * different segment of the same physical pipeline, only ONE endpoint pair
 * may match closely (e.g. Dampier-Bunbury: shared Bunbury terminus 13.7 km,
 * other end 560 km away because GEM stopped at Onslow vs the full Dampier
 * route). A tight average would miss this; the min of the four pairings
 * doesn't.
 */
function minPairwiseEndpointDistanceKm(a, b) {
  return Math.min(
    haversineKm(a.startPoint, b.startPoint),
    haversineKm(a.startPoint, b.endPoint),
    haversineKm(a.endPoint, b.startPoint),
    haversineKm(a.endPoint, b.endPoint),
  );
}

/**
 * Tokenize a name: lowercased word tokens, ASCII-only word boundaries,
 * stopwords removed. Stable across invocations.
@@ -85,12 +113,35 @@ function jaccard(a, b) {
|
||||
}
 
 /**
- * Decide if a candidate matches an existing row. Both criteria required.
+ * Decide if a candidate matches an existing row.
+ *
+ * Two acceptance paths:
+ * (a) Token sets are IDENTICAL (Jaccard == 1.0 after stopword removal) —
+ *     the same pipeline regardless of how either source digitized its
+ *     endpoints. Catches the Dampier-Bunbury case (PR #3406 review):
+ *     GEM's GeoJSON terminus was 13.7 km from the curated terminus
+ *     (just over the 5 km distance gate) but both names tokenize to
+ *     {dampier, to, bunbury, natural, gas}, so they are clearly the
+ *     same physical pipeline.
+ * (b) Distance ≤ 5 km AND Jaccard ≥ 0.6 — the original conjunctive rule
+ *     for slight name-variation cases (e.g. "Druzhba Pipeline" vs
+ *     "Druzhba Oil Pipeline").
  */
 function isDuplicate(candidate, existing) {
+  const sim = jaccard(candidate.name, existing.name);
+  // Path (a): identical token-set + at least one endpoint pair within 25 km.
+  // The geographic anchor distinguishes the Dampier-Bunbury case from a
+  // theoretical name-collision in a different ocean.
+  if (sim >= 1.0) {
+    const minDist = minPairwiseEndpointDistanceKm(candidate, existing);
+    if (minDist <= MATCH_NAME_IDENTICAL_DISTANCE_KM) return true;
+    // Identical names but no endpoint near each other → distinct pipelines
+    // sharing a name (rare but real). Fall through to the conjunctive rule
+    // below, which will return false because Jaccard 1.0 with > 25km min
+    // pair always exceeds 5 km average.
+  }
   const dist = averageEndpointDistanceKm(candidate, existing);
   if (dist > MATCH_DISTANCE_KM) return false;
-  const sim = jaccard(candidate.name, existing.name);
   return sim >= MATCH_JACCARD_MIN;
 }
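The two acceptance paths can be demonstrated with a toy tokenizer. Note the `STOPWORDS` set here is an assumption for illustration (the real list lives in the module and, per the docblock above, does NOT drop "to"); only the 1.0 and 0.6 thresholds come from the diff.

```javascript
// Toy tokenize/jaccard to show why each acceptance path fires.
const STOPWORDS = new Set(['pipeline']); // assumed member, for illustration only
const tokenize = (name) =>
  new Set((name.toLowerCase().match(/[a-z0-9]+/g) ?? []).filter((t) => !STOPWORDS.has(t)));
function jaccard(a, b) {
  const ta = tokenize(a);
  const tb = tokenize(b);
  const inter = [...ta].filter((t) => tb.has(t)).length;
  const union = new Set([...ta, ...tb]).size;
  return union === 0 ? 0 : inter / union;
}
// Path (a): case differences vanish under tokenization → Jaccard 1.0, so only
// the 25 km min-pairwise geographic anchor is still required.
console.log(jaccard('Dampier to Bunbury Natural Gas Pipeline',
                    'DAMPIER TO BUNBURY NATURAL GAS PIPELINE')); // 1
// Path (b): one extra token gives Jaccard 2/3 ≈ 0.667 ≥ 0.6, so the candidate
// must ALSO sit within the 5 km average-endpoint gate to be deduped.
console.log(jaccard('Druzhba North Pipeline', 'Druzhba North Oil Pipeline')); // ≈ 0.667
```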
@@ -160,6 +211,7 @@ export function dedupePipelines(existing, candidates) {
 export const _internal = {
   haversineKm,
   averageeEndpointDistanceKm,
+  minPairwiseEndpointDistanceKm,
   tokenize,
   jaccard,
   isDuplicate,
@@ -167,4 +219,5 @@ export const _internal = {
   STOPWORDS,
   MATCH_DISTANCE_KM,
   MATCH_JACCARD_MIN,
+  MATCH_NAME_IDENTICAL_DISTANCE_KM,
 };
@@ -44,9 +44,11 @@ export const VALID_SOURCES = new Set(['operator', 'regulator', 'press', 'satelli
 // inline copy in tests could silently drift when the enum is extended.
 export const VALID_OIL_PRODUCT_CLASSES = new Set(['crude', 'products', 'mixed']);
 
-// Minimum viable registry size. Expansion to ~75 each happens in the follow-up
-// GEM import PR; this seeder doesn't care about exact counts beyond the floor.
-const MIN_PIPELINES_PER_REGISTRY = 8;
+// Minimum viable registry size. Post-GEM-import floor: 200. Live counts after
+// the 2025-11 GGIT + 2025-03 GOIT merge are 297 gas / 334 oil; 200 leaves ~100
+// rows of jitter headroom so a partial GEM re-import or a coverage-narrowing
+// release fails loud rather than silently halving the registry.
+const MIN_PIPELINES_PER_REGISTRY = 200;
 
 function loadRegistry(filename) {
   const __dirname = dirname(fileURLToPath(import.meta.url));
@@ -96,6 +98,13 @@ export function validateRegistry(data) {
   if (!p.endPoint || typeof p.endPoint.lat !== 'number' || typeof p.endPoint.lon !== 'number') return false;
   if (!isValidLatLon(p.startPoint.lat, p.startPoint.lon)) return false;
   if (!isValidLatLon(p.endPoint.lat, p.endPoint.lon)) return false;
+  // Reject degenerate routes where startPoint == endPoint. PR #3406 review
+  // surfaced 9 GEM rows (incl. Trans-Alaska, Enbridge Line 3, Ichthys)
+  // whose source GeoJSON had a Point geometry or a single-coord LineString,
+  // producing zero-length pipelines that render as map-point artifacts and
+  // skew aggregate-length statistics. Defense in depth — converter also
+  // drops these — but the validator gate makes the contract explicit.
+  if (p.startPoint.lat === p.endPoint.lat && p.startPoint.lon === p.endPoint.lon) return false;
 
   if (!p.evidence || typeof p.evidence !== 'object') return false;
   const ev = p.evidence;
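The degenerate-route contract can be exercised in isolation. Row shapes below are trimmed to the two fields the check reads, and the ids are illustrative (not the actual Trans-Alaska registry keys):

```javascript
// Zero-length guard in isolation: a Point geometry or single-coordinate
// LineString collapses to startPoint === endPoint after conversion.
const isDegenerate = (p) =>
  p.startPoint.lat === p.endPoint.lat && p.startPoint.lon === p.endPoint.lon;
const rows = [
  { id: 'real-route', startPoint: { lat: 70.3, lon: -148.7 }, endPoint: { lat: 61.1, lon: -146.3 } },
  { id: 'point-artifact', startPoint: { lat: 61.1, lon: -146.3 }, endPoint: { lat: 61.1, lon: -146.3 } },
];
const kept = rows.filter((r) => !isDegenerate(r));
console.log(kept.map((r) => r.id)); // [ 'real-route' ]
```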
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -182,13 +182,17 @@ describe('import-gem-pipelines — minimum-viable evidence', () => {
 });
 
 describe('import-gem-pipelines — registry-shape conformance', () => {
+  // Compute the repeat count from the floor + the fixture row count so this
+  // test stays correct if the fixture is trimmed or the floor is raised. The
+  // hardcoded `for (let i = 0; i < 70; i++)` was fragile — Greptile P2 on PR
+  // #3406. +5 over the floor leaves a safety margin without inflating the test.
+  const REGISTRY_FLOOR = 200;
+
   test('emitted gas registry passes validateRegistry', () => {
-    // Build a synthetic registry of just the GEM-emitted gas rows; meets the
-    // validator's MIN_PIPELINES_PER_REGISTRY=8 floor by repeating the 3 fixture
-    // rows so we exercise the schema, not the count.
     const { gas } = parseGemPipelines(fixture);
+    const reps = Math.ceil(REGISTRY_FLOOR / gas.length) + 5;
     const repeated = [];
-    for (let i = 0; i < 3; i++) {
+    for (let i = 0; i < reps; i++) {
       for (const p of gas) repeated.push({ ...p, id: `${p.id}-rep${i}` });
     }
     const reg = {
@@ -199,8 +203,9 @@ describe('import-gem-pipelines — registry-shape conformance', () => {
 
   test('emitted oil registry passes validateRegistry', () => {
     const { oil } = parseGemPipelines(fixture);
+    const reps = Math.ceil(REGISTRY_FLOOR / oil.length) + 5;
     const repeated = [];
-    for (let i = 0; i < 3; i++) {
+    for (let i = 0; i < reps; i++) {
       for (const p of oil) repeated.push({ ...p, id: `${p.id}-rep${i}` });
     }
     const reg = {
@@ -86,6 +86,22 @@ describe('pipeline-dedup — match logic', () => {
     assert.equal(skippedDuplicates[0].matchedExistingId, 'druzhba-north');
   });
 
+  test('identical names + one shared terminus (≤25 km) → deduped (PR #3406 Dampier-Bunbury regression)', () => {
+    // Real-world case from PR #3406 review: GEM digitized only the southern
+    // 60% of the line, so the shared Bunbury terminus matched at 13.7 km
+    // but the average-endpoint distance was 287 km (over the 5 km gate).
+    // Identical token sets + ≥1 close pairing = same physical pipeline.
+    const existing = [makePipeline('dampier-bunbury', 'Dampier to Bunbury Natural Gas Pipeline',
+      -20.68, 116.72, -33.33, 115.63)];
+    const candidates = [makePipeline('dampier-to-bunbury-natural-gas-pipeline-au',
+      'Dampier to Bunbury Natural Gas Pipeline',
+      -33.265797, 115.755682, -24.86854, 113.674968)];
+    const { toAdd, skippedDuplicates } = dedupePipelines(existing, candidates);
+    assert.equal(toAdd.length, 0);
+    assert.equal(skippedDuplicates.length, 1);
+    assert.equal(skippedDuplicates[0].matchedExistingId, 'dampier-bunbury');
+  });
+
   test('name-match only (endpoints in different ocean) → added', () => {
     const existing = [makePipeline('nord-stream-1', 'Nord Stream 1',
       60.08, 29.05, 54.14, 13.66)];
@@ -88,7 +88,7 @@ describe('pipeline registries — evidence', () => {
       const hasEvidence =
         p.evidence.operatorStatement != null ||
         p.evidence.sanctionRefs.length > 0 ||
-        ['ais-relay', 'satellite', 'press'].includes(p.evidence.physicalStateSource);
+        ['ais-relay', 'satellite', 'press', 'gem'].includes(p.evidence.physicalStateSource);
       assert.ok(hasEvidence, `${p.id} has no supporting evidence for state=${p.evidence.physicalState}`);
     }
   });
@@ -157,7 +157,7 @@ describe('pipeline registries — productClass', () => {
     const { productClass: _drop, ...stripped } = oilSample;
     const bad = {
       pipelines: Object.fromEntries(
-        Array.from({ length: 8 }, (_, i) => [`p${i}`, { ...stripped, id: `p${i}` }]),
+        Array.from({ length: 210 }, (_, i) => [`p${i}`, { ...stripped, id: `p${i}` }]),
       ),
     };
     assert.equal(validateRegistry(bad), false);
@@ -167,7 +167,7 @@ describe('pipeline registries — productClass', () => {
     const oilSample = oil.pipelines[Object.keys(oil.pipelines)[0]!];
     const bad = {
       pipelines: Object.fromEntries(
-        Array.from({ length: 8 }, (_, i) => [
+        Array.from({ length: 210 }, (_, i) => [
           `p${i}`,
           { ...oilSample, id: `p${i}`, productClass: 'diesel-only' },
         ]),
@@ -180,7 +180,7 @@ describe('pipeline registries — productClass', () => {
     const gasSample = gas.pipelines[Object.keys(gas.pipelines)[0]!];
     const bad = {
       pipelines: Object.fromEntries(
-        Array.from({ length: 8 }, (_, i) => [
+        Array.from({ length: 210 }, (_, i) => [
           `p${i}`,
           { ...gasSample, id: `p${i}`, productClass: 'crude' },
         ]),
@@ -202,7 +202,7 @@ describe('pipeline registries — validateRegistry rejects bad input', () => {
   test('rejects a pipeline with no evidence', () => {
     const bad = {
       pipelines: Object.fromEntries(
-        Array.from({ length: 8 }, (_, i) => [`p${i}`, {
+        Array.from({ length: 210 }, (_, i) => [`p${i}`, {
           id: `p${i}`, name: 'x', operator: 'y', commodityType: 'gas',
           fromCountry: 'US', toCountry: 'CA', transitCountries: [],
           capacityBcmYr: 1, startPoint: { lat: 0, lon: 0 }, endPoint: { lat: 1, lon: 1 },
@@ -236,7 +236,7 @@ describe('pipeline registries — GEM source enum', () => {
     const gasSample = gas.pipelines[Object.keys(gas.pipelines)[0]!];
     const good = {
       pipelines: Object.fromEntries(
-        Array.from({ length: 8 }, (_, i) => [`p${i}`, {
+        Array.from({ length: 210 }, (_, i) => [`p${i}`, {
           ...gasSample,
           id: `p${i}`,
           evidence: {
@@ -264,7 +264,7 @@ describe('pipeline registries — GEM source enum', () => {
     const gasSample = gas.pipelines[Object.keys(gas.pipelines)[0]!];
     const good = {
       pipelines: Object.fromEntries(
-        Array.from({ length: 8 }, (_, i) => [`p${i}`, {
+        Array.from({ length: 210 }, (_, i) => [`p${i}`, {
           ...gasSample,
           id: `p${i}`,
           evidence: {
@@ -288,7 +288,7 @@ describe('pipeline registries — GEM source enum', () => {
     const gasSample = gas.pipelines[Object.keys(gas.pipelines)[0]!];
     const bad = {
       pipelines: Object.fromEntries(
-        Array.from({ length: 8 }, (_, i) => [`p${i}`, {
+        Array.from({ length: 210 }, (_, i) => [`p${i}`, {
           ...gasSample,
           id: `p${i}`,
           evidence: {