* fix(resilience): fail closed on missing v2 energy seeds + health CRIT on absent inputs

PR #3289 shipped the v2 energy construct behind RESILIENCE_ENERGY_V2_ENABLED (default false). Audit on 2026-04-24, after the user flagged "AE only moved 1.49 points; we added nuclear credit, we should see more", revealed two safety gaps that made a future flag flip unsafe:

1. scoreEnergyV2 silently fell back to IMPUTE when any of its three required Redis seeds (low-carbon-generation, fossil-electricity-share, power-losses) was null. A future operator flipping the flag with seeds absent would produce fabricated-looking numbers for every country with zero operator signal.
2. api/health.js had those three seed labels in BOTH SEED_META (CRIT on missing) AND ON_DEMAND_KEYS (which demotes CRIT to WARN). The demotion won. Health has been reporting WARNING on a scorer dependency that has been 100% missing since PR #3289 merged; no paging trail existed.

Changes:

server/worldmonitor/resilience/v1/_dimension-scorers.ts
- Add ResilienceConfigurationError with missingKeys[] payload.
- scoreEnergy: preflight the three v2 seeds when flag=true. Throw ResilienceConfigurationError listing the specific absent keys.
- scoreAllDimensions: wrap per-dimension dispatch in try/catch so a thrown ResilienceConfigurationError routes to the source-failure shape (imputationClass='source-failure', coverage=0) for that ONE dimension; the country keeps scoring other dims normally. Log once per country-dimension pair so the gap is audit-traceable.

api/health.js
- Remove lowCarbonGeneration / fossilElectricityShare / powerLosses from ON_DEMAND_KEYS. They stay in BOOTSTRAP_KEYS + SEED_META.
- Replace the transitional comment with a hard "do NOT add these back" note pointing at the scorer's fail-closed gate.

tests/resilience-energy-v2.test.mts
- New test: flag on + ALL three seeds missing → throws ResilienceConfigurationError naming all three keys.
- New test: flag on + only one seed missing → throws naming ONLY the missing key (operator-clarity guard).
- New test: flag on + all seeds present → v2 runs normally.
- Update the file-level invariant comment to reflect the new fail-closed contract (replacing the prior "degrade gracefully" wording that codified the silent-IMPUTE bug).
- Note: fixture's `??` fallbacks coerce null-overrides into real data, so the preflight tests use a direct-reader helper.

docs/methodology/country-resilience-index.mdx
- New "Fail-closed semantics" paragraph in the v2 Energy section documenting the throw + source-failure + health-CRIT contract.

Non-goals (intentional):
- This PR does NOT flip RESILIENCE_ENERGY_V2_ENABLED.
- This PR does NOT provision seed-bundle-resilience-energy-v2 on Railway.
- This PR does NOT touch RESILIENCE_PILLAR_COMBINE_ENABLED.

Operational effect post-merge:
- /api/health flips from WARNING → CRITICAL on the three v2 seed-meta entries. That is the intended alarm; it reveals that the Railway bundle was never provisioned.
- scoreEnergy behavior with flag=false is unchanged (legacy path).
- scoreEnergy behavior with flag=true + seeds present is unchanged.
- scoreEnergy behavior with flag=true + seeds absent changes from "silently IMPUTE all 217 countries" to "source-failure on the energy dim for every country, visible in widget + API response".

Tests: 511/511 resilience-* pass. Biome clean. Lint:md clean.

Related plan: docs/plans/2026-04-24-001-fix-resilience-v2-fail-closed-on-missing-seeds-plan.md

* docs(resilience): scrub stale ON_DEMAND_KEYS references for v2 energy seeds

Greptile P2 on PR #3363: four stale references implied the three v2 energy seeds were still gated as ON_DEMAND_KEYS (WARN-on-missing) even though this PR's api/health.js change removed them (now strict SEED_META = CRIT on missing). Scrubbing each:

- api/health.js:196 (BOOTSTRAP_KEYS comment): was "ON_DEMAND_KEYS until Railway cron provisions; see below." Updated to cite plan 2026-04-24-001 and the strict-SEED_META posture.
- api/health.js:398 (SEED_META comment): was "Listed in ON_DEMAND_KEYS below until Railway cron provisions..." Updated for same reason.
- docs/methodology/country-resilience-index.mdx:635: v2.1 changelog entry said seed keys were ON_DEMAND_KEYS until graduation. Replaced with the fail-closed contract description.
- docs/methodology/energy-v2-flag-flip-runbook.md:25: step 3 said "ON_DEMAND_KEYS graduation" was required at flag-flip time. Rewrote to explain no graduation step is needed because the posture was removed pre-activation.

No code change. Tests still 14/14 on the energy-v2 suite, lint:md clean.

* fix(docs): escape MDX-unsafe `<=` in energy-v2 runbook to unblock Mintlify

Mintlify deploy on PR #3363 failed with `Unexpected character '=' (U+003D) before name` at `docs/methodology/energy-v2-flag-flip-runbook.md`. Two lines had `<=` in plain prose, which MDX tries to parse as a JSX-tag-start. Replaced both with `≤` (U+2264), and promoted the two existing `>=` on adjacent lines to `≥` for consistency. Prose is clearer and MDX safe.

Same pattern as `mdx-unsafe-patterns-in-md` skill; also adjacent to PR #3344's `(<137 countries)` fix.
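
The fail-closed contract described in the first commit above (preflight the three seeds, throw with the absent keys, route only that dimension to source-failure, log once per country-dimension pair) might look roughly like the sketch below. Only `ResilienceConfigurationError`, `missingKeys`, `imputationClass` and `coverage` are named in the commit message; the helper names, the seed labels used as keys, the result shape and the `score: 0` value are assumptions made for illustration, not the repository's code.

```typescript
// Sketch only. ResilienceConfigurationError, missingKeys, imputationClass and
// coverage come from the commit message; every other name is hypothetical.
class ResilienceConfigurationError extends Error {
  readonly missingKeys: string[];

  constructor(missingKeys: string[]) {
    super(`Missing required v2 energy seeds: ${missingKeys.join(", ")}`);
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, ResilienceConfigurationError.prototype);
    this.name = "ResilienceConfigurationError";
    this.missingKeys = missingKeys;
  }
}

// Placeholder labels; the real Redis key names may differ.
const V2_ENERGY_SEEDS = [
  "low-carbon-generation",
  "fossil-electricity-share",
  "power-losses",
];

type SeedReader = (key: string) => unknown;

interface DimensionResult {
  score: number;
  coverage: number;
  imputationClass: string | null;
}

// Preflight: throw, naming exactly the seeds that are absent.
function preflightEnergyV2(readSeed: SeedReader): void {
  const missing = V2_ENERGY_SEEDS.filter((key) => readSeed(key) == null);
  if (missing.length > 0) {
    throw new ResilienceConfigurationError(missing);
  }
}

const alreadyLogged = new Set<string>();

// Wrap one dimension's scorer so a configuration error degrades only that
// dimension instead of aborting the whole country.
function scoreDimensionFailClosed(
  country: string,
  dimension: string,
  scorer: () => DimensionResult,
  log: (message: string) => void = console.warn,
): DimensionResult {
  try {
    return scorer();
  } catch (err) {
    if (!(err instanceof ResilienceConfigurationError)) throw err;
    const pair = `${country}:${dimension}`;
    if (!alreadyLogged.has(pair)) {
      alreadyLogged.add(pair);
      log(`${pair} source-failure, missing: ${err.missingKeys.join(", ")}`);
    }
    // score: 0 is an assumption; the source only fixes the other two fields.
    return { score: 0, coverage: 0, imputationClass: "source-failure" };
  }
}
```

Any error other than `ResilienceConfigurationError` is rethrown, so unrelated bugs are not masked as a seed gap.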
# PR 1 energy-v2 flag-flip runbook
Operational procedure for graduating the v2 energy construct from flag-off (default shipped in PR #3289) to flag-on. Follow this runbook in order; each step is gated by the previous step's success.
## Pre-flip checklist

All must be green before flipping `RESILIENCE_ENERGY_V2_ENABLED=true`:

- Seeders provisioned and green. Railway cron service `seed-bundle-resilience-energy-v2` deployed, cron schedule `0 6 * * 1` (Monday 06:00 UTC, weekly). First clean run has landed for all three keys:

  ```bash
  redis-cli --url $REDIS_URL GET seed-meta:resilience:low-carbon-generation
  redis-cli --url $REDIS_URL GET seed-meta:resilience:fossil-electricity-share
  redis-cli --url $REDIS_URL GET seed-meta:resilience:power-losses
  # fetchedAt within the last 8 days, recordCount >= 150 for each
  ```

- Health endpoint green for all three keys. `/api/health` reports `HEALTHY` with the three keys in the `lowCarbonGeneration`, `fossilElectricityShare`, `powerLosses` slots. If any shows `EMPTY_DATA` or `STALE_SEED`, the flag cannot flip.
- Health-registry state (no code change needed at flip time). Per plan `2026-04-24-001` the three v2 seed labels are already STRICT `SEED_META` entries, NOT in `ON_DEMAND_KEYS`. `/api/health` reports CRIT on absent/stale data from the moment the Railway bundle is provisioned. No "graduation" step is required at flag-flip time; this transitional posture was removed before the flag-flip activation path to keep the scorer and health layers in fail-closed lockstep (scorer throws `ResilienceConfigurationError` → source-failure; health reports CRIT; both surface the gap independently).
- Acceptance-gate rerun with flag-off. Baseline Spearman vs the PR 0 freeze must remain 1.0000:

  ```bash
  node --import tsx/esm scripts/compare-resilience-current-vs-proposed.mjs \
    > /tmp/pre-flip-flag-off.json
  jq '.acceptanceGates.verdict' /tmp/pre-flip-flag-off.json
  # Expected: "PASS" (or "CONDITIONAL" if baseline is missing; confirm
  # baseline file exists in docs/snapshots/ and re-run).
  ```
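
The seed-meta criteria in the first checklist item (`fetchedAt` within the last 8 days, `recordCount` of at least 150) can be checked mechanically. The sketch below assumes the `GET` result is a JSON string and that `fetchedAt` is an epoch-milliseconds number; both are assumptions about the payload, and the function name is hypothetical.

```typescript
// Sketch only: assumes seed-meta values are JSON with a numeric fetchedAt
// (epoch milliseconds) and a numeric recordCount. The real payload may differ.
const MAX_SEED_AGE_MS = 8 * 24 * 60 * 60 * 1000; // 8 days, from the checklist
const MIN_RECORD_COUNT = 150; // from the checklist

// Returns a list of human-readable problems; an empty list means "green".
function seedMetaProblems(raw: string | null, nowMs: number): string[] {
  if (raw === null) return ["key is absent"];

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["value is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) {
    return ["value is not a JSON object"];
  }

  const meta = parsed as { fetchedAt?: unknown; recordCount?: unknown };
  const problems: string[] = [];

  if (typeof meta.fetchedAt !== "number") {
    problems.push("fetchedAt is missing");
  } else if (nowMs - meta.fetchedAt > MAX_SEED_AGE_MS) {
    problems.push("fetchedAt is older than 8 days");
  }

  if (typeof meta.recordCount !== "number") {
    problems.push("recordCount is missing");
  } else if (meta.recordCount < MIN_RECORD_COUNT) {
    problems.push("recordCount is below 150");
  }

  return problems;
}
```

Running this once per key and requiring an empty result for all three mirrors the manual `redis-cli` inspection.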
## Flip procedure

1. Capture a pre-flip snapshot.

   ```bash
   RESILIENCE_ENERGY_V2_ENABLED=false \
     node --import tsx/esm scripts/freeze-resilience-ranking.mjs \
     --label "live-pre-pr1-flip-$(date +%Y-%m-%d)" \
     --output docs/snapshots/
   git add docs/snapshots/resilience-ranking-live-pre-pr1-flip-*.json
   git commit -m "chore(resilience): pre-PR-1-flip baseline snapshot"
   ```

2. Dry-run the flag flip locally.

   ```bash
   RESILIENCE_ENERGY_V2_ENABLED=true \
     node --import tsx/esm scripts/compare-resilience-current-vs-proposed.mjs \
     > /tmp/flag-on-dry-run.json
   jq '.acceptanceGates' /tmp/flag-on-dry-run.json
   ```

   Every gate must be `pass`. If any is `fail`, STOP and debug before proceeding. Check in order:

   - `gate-1-spearman`: Spearman vs baseline ≥ 0.85
   - `gate-2-country-drift`: max country drift ≤ 15 points
   - `gate-6-cohort-median`: cohort median shift ≤ 10 points
   - `gate-7-matched-pair`: every matched pair holds expected direction
   - `gate-9-effective-influence-baseline`: ≥ 80% Core indicators measurable

3. Bump the score-cache prefix. Add a new commit to this branch bumping `RESILIENCE_SCORE_CACHE_PREFIX` from `v10` to `v11` in `server/worldmonitor/resilience/v1/_shared.ts`. This guarantees the flag flip does not serve pre-flip cached scores from the 6h TTL window. Without this bump, the next 6h of readers would see stale d6-formula scores even with the flag on.

4. Flip the flag in production.

   ```bash
   vercel env add RESILIENCE_ENERGY_V2_ENABLED production
   # Enter: true
   # (or via Vercel dashboard → Settings → Environment Variables)
   vercel deploy --prod
   ```

5. Capture the post-flip snapshot immediately after the first post-deploy ranking refresh completes (check via `GET resilience:ranking:v11` in Redis):

   ```bash
   node --import tsx/esm scripts/freeze-resilience-ranking.mjs \
     --label "live-post-pr1-$(date +%Y-%m-%d)" \
     --output docs/snapshots/
   git add docs/snapshots/resilience-ranking-live-post-pr1-*.json
   git commit -m "chore(resilience): post-PR-1 snapshot"
   ```

6. Update construct-contract language. In `docs/methodology/country-resilience-index.mdx`, move items 1, 2, and 3 of the "Known construct limitations" list from "landing in PR 1" to "landed in PR 1 vYYYY-MM-DD." Flip the energy domain section to describe v2 as the default construct, with the legacy construct recast as the emergency-rollback path.
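
The five gate thresholds listed in the dry-run step can be expressed as a single check that returns the failing gate ids in the order the runbook says to inspect them. The gate ids and thresholds below are taken from that step; the `DryRunMetrics` field names are hypothetical, since the actual shape of the compare script's output is not shown here.

```typescript
// Sketch only: gate ids and thresholds come from the runbook; the metric
// field names are hypothetical stand-ins for the compare script's output.
interface DryRunMetrics {
  spearmanVsBaseline: number; // rank correlation, -1..1
  maxCountryDriftPoints: number; // largest per-country score change
  cohortMedianShiftPoints: number; // largest cohort median change
  matchedPairsHolding: number; // pairs keeping the expected direction
  matchedPairsTotal: number;
  coreIndicatorsMeasurableShare: number; // fraction, 0..1
}

// Returns the ids of failing gates, in the runbook's check order.
function failingGates(m: DryRunMetrics): string[] {
  const checks: Array<[string, boolean]> = [
    ["gate-1-spearman", m.spearmanVsBaseline >= 0.85],
    ["gate-2-country-drift", m.maxCountryDriftPoints <= 15],
    ["gate-6-cohort-median", m.cohortMedianShiftPoints <= 10],
    ["gate-7-matched-pair", m.matchedPairsHolding === m.matchedPairsTotal],
    ["gate-9-effective-influence-baseline", m.coreIndicatorsMeasurableShare >= 0.8],
  ];
  return checks.filter(([, ok]) => !ok).map(([id]) => id);
}
```

An empty result corresponds to "every gate is `pass`"; the first element of a non-empty result is the gate to debug first.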
## Rollback procedure

If any acceptance gate fails post-flip or a reviewer flags a regression:

1. Flip the flag back.

   ```bash
   vercel env rm RESILIENCE_ENERGY_V2_ENABLED production
   # OR
   vercel env add RESILIENCE_ENERGY_V2_ENABLED production  # enter: false
   vercel deploy --prod
   ```

2. Do NOT bump the cache prefix back to `v10`. Let the `v11` prefix accumulate flag-off scores. The legacy scorer produces d6-formula scores regardless of the prefix version, so rolling the prefix backward is unnecessary and creates a second cache-key migration.
3. Capture a rollback snapshot for post-mortem.
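
The cache-prefix reasoning used in both the flip and rollback steps comes down to the prefix being part of the cache key and nothing else: bumping it makes old entries unreachable, and it never changes what a scorer computes. A minimal in-memory sketch follows; the key layout `resilience:score:<prefix>:<country>` and the function names are assumptions, and TTL expiry is deliberately not modeled.

```typescript
// Sketch only: an in-memory stand-in for the score cache. The key layout is
// an assumption; only the v10 -> v11 prefix bump comes from the runbook.
const scoreCache = new Map<string, number>();

function scoreCacheKey(prefix: string, countryCode: string): string {
  return `resilience:score:${prefix}:${countryCode}`;
}

// Read-through lookup: return the cached score for this prefix if present,
// otherwise compute it with the supplied scorer and cache the result.
function getCachedScore(
  prefix: string,
  countryCode: string,
  computeScore: () => number,
): number {
  const key = scoreCacheKey(prefix, countryCode);
  const hit = scoreCache.get(key);
  if (hit !== undefined) return hit;
  const fresh = computeScore();
  scoreCache.set(key, fresh);
  return fresh;
}
```

With an entry cached under `v10`, a reader that still uses `v10` gets the old value even if the scorer has changed, while a reader using `v11` misses and recomputes. The scorer passed in is what decides between legacy and v2 output, which is why rolling the prefix back adds nothing on rollback.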
## Acceptance-gate verdict reference

Generated by `scripts/compare-resilience-current-vs-proposed.mjs`:

| Verdict | Meaning | Action |
|---|---|---|
| `PASS` | All gates pass | Proceed with flag flip |
| `CONDITIONAL` | Some gates skipped (baseline missing, etc.) | Fix missing inputs before flipping |
| `BLOCK` | At least one gate failed | Do NOT flip; investigate failure |

The verdict is computed on every invocation of the compare script. Stash the full `acceptanceGates` block in PR comments when the flip happens.
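
The three verdicts in the table can be read as a reduction over per-gate statuses. The sketch below is one plausible reading, not the compare script's actual logic: the `"skipped"` status name, the rule that a failure outranks a skip, and the treatment of an empty gate list as `PASS` are all assumptions.

```typescript
// Sketch only: "pass" and "fail" appear in the runbook; "skipped" and the
// precedence rule (fail beats skipped) are assumptions.
type GateStatus = "pass" | "fail" | "skipped";
type Verdict = "PASS" | "CONDITIONAL" | "BLOCK";

function summarizeVerdict(statuses: GateStatus[]): Verdict {
  // Any failure blocks the flip outright.
  if (statuses.some((s) => s === "fail")) return "BLOCK";
  // No failures, but at least one gate could not be evaluated.
  if (statuses.some((s) => s === "skipped")) return "CONDITIONAL";
  return "PASS";
}
```

Checking failures first means a run with both a failed and a skipped gate reports `BLOCK`, the more conservative of the two outcomes.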