mirror of https://github.com/koala73/worldmonitor.git synced 2026-04-25 17:14:57 +02:00


Elie Habib d521924253 fix(resilience): fail closed on missing v2 energy seeds + health CRIT on absent inputs (#3363)

* fix(resilience): fail closed on missing v2 energy seeds + health CRIT on absent inputs

PR #3289 shipped the v2 energy construct behind RESILIENCE_ENERGY_V2_ENABLED
(default false). Audit on 2026-04-24 after the user flagged "AE only moved
1.49 points — we added nuclear credit, we should see more" revealed two
safety gaps that made a future flag flip unsafe:

1. scoreEnergyV2 silently fell back to IMPUTE when any of its three
required Redis seeds (low-carbon-generation, fossil-electricity-share,
power-losses) was null. A future operator flipping the flag with
seeds absent would produce fabricated-looking numbers for every
country with zero operator signal.

2. api/health.js had those three seed labels in BOTH SEED_META (CRIT on
missing) AND ON_DEMAND_KEYS (which demotes CRIT to WARN). The demotion
won. Health has been reporting WARNING on a scorer dependency that has
been 100% missing since PR #3289 merged — no paging trail existed.

Changes:

server/worldmonitor/resilience/v1/_dimension-scorers.ts
- Add ResilienceConfigurationError with missingKeys[] payload.
- scoreEnergy: preflight the three v2 seeds when flag=true. Throw
ResilienceConfigurationError listing the specific absent keys.
- scoreAllDimensions: wrap per-dimension dispatch in try/catch so a
thrown ResilienceConfigurationError routes to the source-failure
shape (imputationClass='source-failure', coverage=0) for that ONE
dimension — country keeps scoring other dims normally. Log once
per country-dimension pair so the gap is audit-traceable.

api/health.js
- Remove lowCarbonGeneration / fossilElectricityShare / powerLosses
from ON_DEMAND_KEYS. They stay in BOOTSTRAP_KEYS + SEED_META.
- Replace the transitional comment with a hard "do NOT add these
back" note pointing at the scorer's fail-closed gate.

tests/resilience-energy-v2.test.mts
- New test: flag on + ALL three seeds missing → throws
ResilienceConfigurationError naming all three keys.
- New test: flag on + only one seed missing → throws naming ONLY
the missing key (operator-clarity guard).
- New test: flag on + all seeds present → v2 runs normally.
- Update the file-level invariant comment to reflect the new
fail-closed contract (replacing the prior "degrade gracefully"
wording that codified the silent-IMPUTE bug).
- Note: fixture's `??` fallbacks coerce null-overrides into real
data, so the preflight tests use a direct-reader helper.

docs/methodology/country-resilience-index.mdx
- New "Fail-closed semantics" paragraph in the v2 Energy section
documenting the throw + source-failure + health-CRIT contract.

Non-goals (intentional):
- This PR does NOT flip RESILIENCE_ENERGY_V2_ENABLED.
- This PR does NOT provision seed-bundle-resilience-energy-v2 on Railway.
- This PR does NOT touch RESILIENCE_PILLAR_COMBINE_ENABLED.

Operational effect post-merge:
- /api/health flips from WARNING → CRITICAL on the three v2 seed-meta
entries. That is the intended alarm; it reveals that the Railway
bundle was never provisioned.
- scoreEnergy behavior with flag=false is unchanged (legacy path).
- scoreEnergy behavior with flag=true + seeds present is unchanged.
- scoreEnergy behavior with flag=true + seeds absent changes from
"silently IMPUTE all 217 countries" to "source-failure on the
energy dim for every country, visible in widget + API response".

Tests: 511/511 resilience-* pass. Biome clean. Lint:md clean.

Related plan: docs/plans/2026-04-24-001-fix-resilience-v2-fail-closed-on-missing-seeds-plan.md

* docs(resilience): scrub stale ON_DEMAND_KEYS references for v2 energy seeds

Greptile P2 on PR #3363: four stale references implied the three v2
energy seeds were still gated as ON_DEMAND_KEYS (WARN-on-missing) even
though this PR's api/health.js change removed them (now strict
SEED_META = CRIT on missing). Scrubbing each:

- api/health.js:196 (BOOTSTRAP_KEYS comment) — was "ON_DEMAND_KEYS
until Railway cron provisions; see below." Updated to cite plan
2026-04-24-001 and the strict-SEED_META posture.
- api/health.js:398 (SEED_META comment) — was "Listed in ON_DEMAND_KEYS
below until Railway cron provisions..." Updated for same reason.
- docs/methodology/country-resilience-index.mdx:635 — v2.1 changelog
entry said seed keys were ON_DEMAND_KEYS until graduation. Replaced
with the fail-closed contract description.
- docs/methodology/energy-v2-flag-flip-runbook.md:25 — step 3 said
"ON_DEMAND_KEYS graduation" was required at flag-flip time.
Rewrote to explain no graduation step is needed because the
posture was removed pre-activation.

No code change. Tests still 14/14 on the energy-v2 suite, lint:md clean.

* fix(docs): escape MDX-unsafe `<=` in energy-v2 runbook to unblock Mintlify

Mintlify deploy on PR #3363 failed with
`Unexpected character '=' (U+003D) before name` at
`docs/methodology/energy-v2-flag-flip-runbook.md`. Two lines had
`<=` in plain prose, which MDX tries to parse as a JSX-tag-start.

Replaced both with `≤` (U+2264) — and promoted the two existing `>=`
on adjacent lines to `≥` for consistency. Prose is clearer and MDX
safe.

Same pattern as `mdx-unsafe-patterns-in-md` skill; also adjacent to
PR #3344's `(<137 countries)` fix.

2026-04-24 09:37:18 +04:00

5.7 KiB


PR 1 energy-v2 flag-flip runbook

Operational procedure for graduating the v2 energy construct from flag-off (default shipped in PR #3289) to flag-on. Follow this runbook in order; each step is gated by the previous step's success.

Pre-flip checklist

All must be green before flipping RESILIENCE_ENERGY_V2_ENABLED=true:

Seeders provisioned and green. Railway cron service seed-bundle-resilience-energy-v2 deployed, cron schedule 0 6 * * 1 (Monday 06:00 UTC, weekly). First clean run has landed for all three keys:

redis-cli -u "$REDIS_URL" GET seed-meta:resilience:low-carbon-generation
redis-cli -u "$REDIS_URL" GET seed-meta:resilience:fossil-electricity-share
redis-cli -u "$REDIS_URL" GET seed-meta:resilience:power-losses
# fetchedAt within the last 8 days, recordCount >= 150 for each
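
The freshness criteria in the comment above can also be checked programmatically. The following is a minimal sketch, not the project's own tooling: the `SeedMeta` shape and the `checkSeedMeta` helper are hypothetical, and it assumes `fetchedAt` is a Unix epoch in milliseconds, which the runbook does not state.

```typescript
// Hypothetical shape: the runbook only names `fetchedAt` and `recordCount`.
interface SeedMeta {
  fetchedAt: number; // ASSUMPTION: Unix epoch in milliseconds
  recordCount: number;
}

const MAX_AGE_MS = 8 * 24 * 60 * 60 * 1000; // "within the last 8 days"
const MIN_RECORDS = 150; // "recordCount >= 150"

// Returns a list of problems; an empty list means the seed passes the check.
function checkSeedMeta(raw: string | null, nowMs: number): string[] {
  if (raw === null) return ["seed-meta key is absent"];
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    return ["seed-meta value is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return ["seed-meta value is not a JSON object"];
  }
  const meta = parsed as Partial<SeedMeta>;
  const problems: string[] = [];
  if (typeof meta.fetchedAt !== "number" || nowMs - meta.fetchedAt > MAX_AGE_MS) {
    problems.push("fetchedAt is missing or older than 8 days");
  }
  if (typeof meta.recordCount !== "number" || meta.recordCount < MIN_RECORDS) {
    problems.push("recordCount is missing or below 150");
  }
  return problems;
}
```

Passing the raw string returned by each `GET` (or `null` when the key is absent) yields one problem list per seed; all three lists should be empty before moving on.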

Health endpoint green for all three keys. /api/health reports HEALTHY with the three keys in the lowCarbonGeneration, fossilElectricityShare, powerLosses slots. If any shows EMPTY_DATA or STALE_SEED, the flag cannot flip.

Health-registry state (no code change needed at flip time). Per plan 2026-04-24-001 the three v2 seed labels are already STRICT SEED_META entries — NOT in ON_DEMAND_KEYS. /api/health reports CRIT on absent/stale data from the moment the Railway bundle is provisioned. No "graduation" step is required at flag-flip time; the transitional ON_DEMAND_KEYS posture was removed before the flag-flip activation path to keep the scorer and health layers in fail-closed lockstep (scorer throws ResilienceConfigurationError → source-failure; health reports CRIT; both surface the gap independently).
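
The scorer half of that lockstep can be illustrated with a small sketch. This is not the code in `_dimension-scorers.ts`: only the error name, its `missingKeys` payload, the three seed names, and the `source-failure` / `coverage=0` shape come from the source. The `SeedReader` type, the `DimensionResult` fields, the function names, and the placeholder score are assumptions made for illustration.

```typescript
// The three seed names come from the runbook; everything else is hypothetical.
const REQUIRED_V2_SEEDS = [
  "low-carbon-generation",
  "fossil-electricity-share",
  "power-losses",
];

class ResilienceConfigurationError extends Error {
  readonly missingKeys: string[];
  constructor(missingKeys: string[]) {
    super("Missing required v2 energy seeds: " + missingKeys.join(", "));
    // Keep `instanceof` working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "ResilienceConfigurationError";
    this.missingKeys = missingKeys;
  }
}

// Hypothetical reader: returns the seed payload, or null/undefined when absent.
type SeedReader = (key: string) => unknown;

interface DimensionResult {
  score: number | null;
  coverage: number;
  imputationClass: "observed" | "source-failure";
}

// Preflight: throw and name exactly the absent seeds instead of imputing.
function scoreEnergyV2(read: SeedReader): DimensionResult {
  const missing = REQUIRED_V2_SEEDS.filter((key) => read(key) == null);
  if (missing.length > 0) throw new ResilienceConfigurationError(missing);
  return { score: 50, coverage: 1, imputationClass: "observed" }; // placeholder score
}

// Per-dimension wrapper: a configuration error degrades this one dimension only.
function scoreEnergyFailClosed(read: SeedReader): DimensionResult {
  try {
    return scoreEnergyV2(read);
  } catch (err) {
    if (err instanceof ResilienceConfigurationError) {
      return { score: null, coverage: 0, imputationClass: "source-failure" };
    }
    throw err; // unrelated errors still propagate
  }
}
```

The design point is that an absent seed never produces a plausible-looking number: the throw names the missing keys for the operator, and the wrapper confines the failure to the energy dimension.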

Acceptance-gate rerun with flag-off. Baseline Spearman vs the PR 0 freeze must remain 1.0000:

node --import tsx/esm scripts/compare-resilience-current-vs-proposed.mjs \
  > /tmp/pre-flip-flag-off.json
jq '.acceptanceGates.verdict' /tmp/pre-flip-flag-off.json
# Expected: "PASS" (or "CONDITIONAL" if baseline is missing; confirm
# baseline file exists in docs/snapshots/ and re-run).
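
For readers unfamiliar with the Spearman figure used in the checklist, here is a minimal sketch of the standard rank-correlation definition (Pearson correlation of average ranks). It is not taken from `compare-resilience-current-vs-proposed.mjs`, whose implementation is not shown here; the function names are hypothetical.

```typescript
// Average ranks (1-based); tied values share the mean of the positions they span.
function averageRanks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const ranks = new Array<number>(values.length);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const shared = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) ranks[order[k].i] = shared;
    start = end + 1;
  }
  return ranks;
}

// Spearman rho between two score series aligned by country.
// Returns NaN when either series is constant (zero rank variance).
function spearman(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length < 2) {
    throw new Error("need two equal-length series with at least 2 values");
  }
  const ra = averageRanks(a);
  const rb = averageRanks(b);
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const ma = mean(ra);
  const mb = mean(rb);
  let cov = 0;
  let va = 0;
  let vb = 0;
  for (let i = 0; i < ra.length; i++) {
    cov += (ra[i] - ma) * (rb[i] - mb);
    va += (ra[i] - ma) * (ra[i] - ma);
    vb += (rb[i] - mb) * (rb[i] - mb);
  }
  return cov / Math.sqrt(va * vb);
}
```

With the flag off, the current scores and the frozen baseline should rank every country identically, which is why the expected value is exactly 1.0000; any reordering lowers it.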

Flip procedure

Capture a pre-flip snapshot.

RESILIENCE_ENERGY_V2_ENABLED=false \
  node --import tsx/esm scripts/freeze-resilience-ranking.mjs \
  --label "live-pre-pr1-flip-$(date +%Y-%m-%d)" \
  --output docs/snapshots/
git add docs/snapshots/resilience-ranking-live-pre-pr1-flip-*.json
git commit -m "chore(resilience): pre-PR-1-flip baseline snapshot"

Dry-run the flag flip locally.
```
RESILIENCE_ENERGY_V2_ENABLED=true \
  node --import tsx/esm scripts/compare-resilience-current-vs-proposed.mjs \
  > /tmp/flag-on-dry-run.json
jq '.acceptanceGates' /tmp/flag-on-dry-run.json
```
Every gate must report pass. If any gate reports fail, STOP and debug before proceeding. Check the gates in order:
- gate-1-spearman: Spearman vs baseline ≥ 0.85
- gate-2-country-drift: max country drift ≤ 15 points
- gate-6-cohort-median: cohort median shift ≤ 10 points
- gate-7-matched-pair: every matched pair holds expected direction
- gate-9-effective-influence-baseline: ≥ 80% Core indicators measurable
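
The thresholds in the list above can be expressed as a small check. This is a sketch under stated assumptions: the gate names and thresholds come from the runbook, while the `FlipMetrics` field names, the 0 to 1 share scale, and both helper functions are hypothetical and do not reflect the compare script's actual output schema.

```typescript
// Hypothetical metric names; only the gate names and thresholds come from the runbook.
interface FlipMetrics {
  spearmanVsBaseline: number;
  maxCountryDrift: number; // points
  cohortMedianShift: number; // points
  matchedPairsHolding: number;
  matchedPairsTotal: number;
  coreIndicatorsMeasurableShare: number; // ASSUMPTION: expressed as 0..1
}

type GateStatus = "pass" | "fail";

const GATE_ORDER = [
  "gate-1-spearman",
  "gate-2-country-drift",
  "gate-6-cohort-median",
  "gate-7-matched-pair",
  "gate-9-effective-influence-baseline",
];

function evaluateGates(m: FlipMetrics): { [gate: string]: GateStatus } {
  const status = (ok: boolean): GateStatus => (ok ? "pass" : "fail");
  return {
    "gate-1-spearman": status(m.spearmanVsBaseline >= 0.85),
    "gate-2-country-drift": status(m.maxCountryDrift <= 15),
    "gate-6-cohort-median": status(m.cohortMedianShift <= 10),
    "gate-7-matched-pair": status(m.matchedPairsHolding === m.matchedPairsTotal),
    "gate-9-effective-influence-baseline": status(m.coreIndicatorsMeasurableShare >= 0.8),
  };
}

// Walks the gates in runbook order and returns the first one that is not "pass".
function firstFailingGate(gates: { [gate: string]: GateStatus }): string | null {
  for (let i = 0; i < GATE_ORDER.length; i++) {
    if (gates[GATE_ORDER[i]] !== "pass") return GATE_ORDER[i];
  }
  return null;
}
```

Returning the first failing gate mirrors the "check in order" instruction: debugging starts at the earliest gate that did not pass.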

Bump the score-cache prefix. Add a new commit to this branch bumping RESILIENCE_SCORE_CACHE_PREFIX from v10 to v11 in server/worldmonitor/resilience/v1/_shared.ts. This guarantees the flag flip does not serve pre-flip cached scores from the 6h TTL window. Without this bump, the next 6h of readers would see stale d6-formula scores even with the flag on.
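
Why a prefix bump invalidates the cache can be shown with a short sketch. The key layout, the `scoreCacheKey` and `readScore` helpers, and the in-memory object standing in for Redis are all hypothetical; the runbook only states that the prefix moves from v10 to v11.

```typescript
// ASSUMPTION: keys embed the prefix; the real layout in _shared.ts is not shown here.
function scoreCacheKey(prefix: string, countryCode: string): string {
  return "resilience:score:" + prefix + ":" + countryCode;
}

// A plain object standing in for Redis (no TTL handling in this sketch).
type ScoreCache = { [key: string]: number };

function readScore(cache: ScoreCache, prefix: string, countryCode: string): number | undefined {
  return cache[scoreCacheKey(prefix, countryCode)];
}

// A score cached before the flip, under the old prefix (illustrative value).
const cache: ScoreCache = {};
cache[scoreCacheKey("v10", "AE")] = 42;

// After the bump, readers look up v11 keys, miss, and trigger a fresh computation.
const staleHit = readScore(cache, "v10", "AE"); // 42
const afterBump = readScore(cache, "v11", "AE"); // undefined
```

Old entries are never deleted in this scheme; they simply stop being read and expire with their TTL.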

Flip the flag in production.

vercel env add RESILIENCE_ENERGY_V2_ENABLED production
# Enter: true
# (or via Vercel dashboard → Settings → Environment Variables)
vercel deploy --prod

Capture the post-flip snapshot immediately after the first post-deploy ranking refresh completes (check via GET resilience:ranking:v11 in Redis):

node --import tsx/esm scripts/freeze-resilience-ranking.mjs \
  --label "live-post-pr1-$(date +%Y-%m-%d)" \
  --output docs/snapshots/
git add docs/snapshots/resilience-ranking-live-post-pr1-*.json
git commit -m "chore(resilience): post-PR-1 snapshot"

Update construct-contract language. In docs/methodology/country-resilience-index.mdx, move items 1, 2, and 3 of the "Known construct limitations" list from "landing in PR 1" to "landed in PR 1 vYYYY-MM-DD." Flip the energy domain section to describe v2 as the default construct, with the legacy construct recast as the emergency-rollback path.

Rollback procedure

If any acceptance gate fails post-flip or a reviewer flags a regression:

Flip the flag back.

vercel env rm RESILIENCE_ENERGY_V2_ENABLED production
# OR
vercel env add RESILIENCE_ENERGY_V2_ENABLED production  # enter: false
vercel deploy --prod

Do NOT bump the cache prefix back to v10. Let the v11 prefix accumulate flag-off scores. The legacy scorer produces d6-formula scores regardless of the prefix version, so rolling the prefix backward is unnecessary and creates a second cache-key migration.

Capture a rollback snapshot for post-mortem.

Acceptance-gate verdict reference

Generated by scripts/compare-resilience-current-vs-proposed.mjs:

| Verdict | Meaning | Action |
| --- | --- | --- |
| `PASS` | All gates pass | Proceed with flag flip |
| `CONDITIONAL` | Some gates skipped (baseline missing, etc.) | Fix missing inputs before flipping |
| `BLOCK` | At least one gate failed | Do NOT flip; investigate failure |

The verdict is computed on every invocation of the compare script. Stash the full acceptanceGates block in PR comments when the flip happens.
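
The table's logic can be sketched as a reduction over per-gate statuses. This is an illustration, not the compare script's code: the `skipped` status name and the precedence (a failure outranks a skip) are assumptions inferred from the table's wording.

```typescript
type GateStatus = "pass" | "fail" | "skipped"; // "skipped" is a hypothetical status name
type Verdict = "PASS" | "CONDITIONAL" | "BLOCK";

// ASSUMPTION: any failure dominates; skips only matter when nothing failed.
function deriveVerdict(statuses: GateStatus[]): Verdict {
  if (statuses.some((s) => s === "fail")) return "BLOCK";
  if (statuses.some((s) => s === "skipped")) return "CONDITIONAL";
  return "PASS";
}

// Maps each verdict to the action column of the table above.
function actionFor(verdict: Verdict): string {
  switch (verdict) {
    case "PASS":
      return "Proceed with flag flip";
    case "CONDITIONAL":
      return "Fix missing inputs before flipping";
    default:
      return "Do NOT flip; investigate failure";
  }
}
```

Only `PASS` permits the flip; both other verdicts stop the procedure, for different reasons.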
