worldmonitor

eliott/worldmonitor

mirror of https://github.com/koala73/worldmonitor.git synced 2026-04-25 17:14:57 +02:00

Commit Graph

Author	SHA1	Message	Date
Elie Habib	7cf37c604c	feat(resilience): PR 3 — dead-signal cleanup (plan §3.5, §3.6) (#3297 ) * feat(resilience): PR 3 §3.5 — retire fuelStockDays from core score permanently First commit in PR 3 of the resilience repair plan. Retires `fuelStockDays` from the core score with no replacement. Why permanent, not replaced: IEA emergency-stockholding rules are defined in days of NET IMPORTS and do not bind net exporters by design. Norway/Canada/US measured in days-of-imports are incomparable to Germany/Japan measured the same way — the construct is fundamentally different across the two country classes. No globally-comparable recovery-fuel signal can be built from this source; the pre-repair probe showed 100% imputed at 50 for every country in the April 2026 freeze. scoreFuelStockDays: - Rewritten to return coverage=0 + observedWeight=0 + imputationClass='source-failure' for every country regardless of seed content. - Drops the dimension from the `recovery` domain's coverage- weighted mean automatically; remaining recovery dimensions pick up the share via re-normalisation in `_shared.ts#coverageWeightedMean`. - No explicit weight transfer needed — the coverage-weighted blend handles redistribution. Registry: - recoveryFuelStockDays re-tagged from tier='enrichment' to tier='experimental' so the Core coverage gate treats it as out-of-score. - Description updated to make the retirement explicit; entry stays in the registry for structural continuity (the dimension `fuelStockDays` remains in RESILIENCE_DIMENSION_ORDER for the 19-dimension tests; removing the dimension entirely is a PR 4 structural-audit concern). Housekeeping: - Removed `RESILIENCE_RECOVERY_FUEL_STOCKS_KEY` constant (no longer read; noUnusedLocals would reject it). - Removed `RecoveryFuelStocksCountry` interface for the same reason. Comment at the removed declaration instructs future maintainers not to re-add the type as a reservation; when a new recovery-fuel concept lands, introduce a fresh interface. 
Plan reference: §3.5 point 1 of `docs/plans/2026-04-22-001-fix-resilience-scorer-structural-bias-plan.md`. 51 resilience tests pass, typecheck + biome clean. The `recovery` domain's published score will shift slightly for every country because the 0.10 slot that fuelStockDays was imputing to now redistributes; the compare-harness acceptance-gate rerun at merge time will quantify the shift per plan §6 gates. * feat(resilience): PR 3 §3.5 — retire BIS-backed currencyExternal; rebuild on IMF inflation + WB reserves BIS REER/DSR feeds were load-bearing in currencyExternal (weights 0.35 fxVolatility + 0.35 fxDeviation, ~70% of dimension). They cover ~60 countries max — so every non-BIS country fell through to curated_list_absent (coverage 0.3) or a thin IMF proxy (coverage 0.45). Combined with reserveMarginPct already removed in PR 1, currencyExternal was the clearest "construct absent for most of the world" carrier left in the scorer. Changes: _dimension-scorers.ts - scoreCurrencyExternal now reads IMF macro (inflationPct) + WB FX reserves only. Coverage ladder: inflation + reserves → 0.85 (observed primary + secondary) inflation only → 0.55 reserves only → 0.40 neither → 0.30 (IMPUTE.bisEer retained for snapshot continuity; semantics read as "no IMF + no WB reserves" now) - Removed dead symbols: RESILIENCE_BIS_EXCHANGE_KEY constant (reserved via comment only, flagged by noUnusedLocals), stddev() helper, getCountryBisExchangeRates() loader, BisExchangeRate interface, dateToSortableNumber() — all were exclusive callers of the retired BIS path. _indicator-registry.ts - New core entry inflationStability (weight 0.60, tier=core, sourceKey=economic:imf:macro:v2). - fxReservesAdequacy weight 0.15 → 0.40 (secondary reliability anchor). - fxVolatility + fxDeviation demoted tier=enrichment → tier=experimental (BIS ~60-country coverage; off the core weight sum). - Non-experimental weights now sum to 1.0 (0.60 + 0.40). 
scripts/compare-resilience-current-vs-proposed.mjs - EXTRACTION_RULES: added inflationStability → imf-macro-country-field field=inflationPct so the registry-parity test passes and the correlation harness sees the new construct. tests/resilience-dimension-scorers.test.mts - Dropped BIS-era wording ("non-BIS country") and test 266 (BIS-outage coverage 0.35 branch) which collapsed to the inflation- only path post-retirement. - Updated coverage assertions: inflation-only 0.45 → 0.55; inflation+ reserves 0.55 → 0.85. tests/resilience-scorers.test.mts - domainAverages.economic 68.33 → 66.33 (US currencyExternal score shifts slightly under IMF+reserves vs old BIS composite). - stressScore 67.85 → 67.21; stressFactor 0.3215 → 0.3279. - overallScore 65.82 → 65.52. - baselineScore unchanged (currencyExternal is stress-only). All 6324 data-tier tests pass. typecheck:api clean. No change to seeders or Redis keys; this is a pure scorer + registry rebuild. * feat(resilience): PR 3 §3.5 point 3 — re-goalpost externalDebtCoverage (0..5 → 0..2) Plan §2.1 diagnosis table showed externalDebtCoverage saturating at score=100 across all 9 probe countries — including stressed states. Signal was collapsed. Root cause: (worst=5, best=0) gave every country with ratio < 0.5 a score above 90, and mapped Greenspan-Guidotti's reserve-adequacy threshold (ratio=1.0) to score 80 — well into "no worry" territory instead of the "mild warning" it should be. Re-anchored on Greenspan-Guidotti directly: ratio=1.0 now maps to score 50 (mild warning), ratio=2.0 to score 0 (acute rollover-shock exposure). Ratios above 2.0 clamp to 0, consistent with "beyond this point the country is already in crisis; exact value stops mattering." Files changed: - _indicator-registry.ts: recoveryDebtToReserves goalposts {worst: 5, best: 0} → {worst: 2, best: 0}. Description updated to cite Greenspan-Guidotti; inline comment documents anchor + rationale. 
- _dimension-scorers.ts: scoreExternalDebtCoverage normalizer bound changed from (0..5) to (0..2), with inline comment. - docs/methodology/country-resilience-index.mdx: goalpost table row 5-0 → 2-0, description cites Greenspan-Guidotti. - docs/methodology/indicator-sources.yaml: * constructStatus: dead-signal → observed-mechanism (signal is now discriminating). * reviewNotes updated to describe the new anchor. * mechanismTestRationale names the Greenspan-Guidotti rule. - tests/resilience-dimension-monotonicity.test.mts: updated the comment + picked values inside the (0..2) discriminating band (0.3 and 1.5). Old values (1 vs 4) had 4 clamping to 0. - tests/resilience-dimension-scorers.test.mts: NO score threshold relaxed >90 → >=85 (NO ratio=0.2 now scores 90, was 96). - tests/resilience-scorers.test.mts: fixture drift: * domainAverages.recovery 54.83 → 47.33 (US extDebt 70 → 25). * baselineScore 63.63 → 60.12 (extDebt is baseline type). * overallScore 65.52 → 63.27. * stressScore / stressFactor unchanged (extDebt is baseline-only). All 6324 data-tier tests pass. typecheck:api clean. * feat(resilience): PR 3 §3.6 — CI gate on indicator coverage and nominal weight Plan §3.6 adds a new acceptance criterion (also §5 item 5): > No indicator with observed coverage below 70% may exceed 5% nominal > weight OR 5% effective influence in the post-change sensitivity run. This commit enforces the NOMINAL-WEIGHT half as a unit test that runs on every CI build. The EFFECTIVE-INFLUENCE half is produced by scripts/validate-resilience-sensitivity.mjs as a committed artifact; the gate file only asserts that script still exists so a refactor that removes it breaks the build loudly. Why the gate exists (plan §3.6): "A dimension at 30% observed coverage carries the same effective weight as one at 95%. This contradicts the OECD/JRC handbook on uncertainty analysis." Implementation: tests/resilience-coverage-influence-gate.test.mts — three tests: 1. 
Nominal-weight gate: for every core indicator with coverage < 137 countries (70% of the ~195-country universe), computes its nominal overall weight as indicator.weight × (1/dimensions-in-domain) × domain-weight and asserts it does not exceed 5%. Equal-share-per-dimension is the upper bound on runtime weight (coverage-weighted mean gives a lower share when a dimension drops out), so this is a strict bound: if the nominal number passes, the runtime number also passes for every country. 2. Effective-influence contract: asserts the sensitivity script exists at its expected path. Removing it (intentionally or by refactor) breaks the build. 3. Audit visibility: prints the top 10 core indicators by nominal overall weight. No assertion beyond "ran" — the list lets reviewers spot outliers that pass the gate but are near the cap. Current state (observed from audit output): recoveryReserveMonths: nominal=4.17% coverage=188 recoveryDebtToReserves: nominal=4.17% coverage=185 recoveryImportHhi: nominal=4.17% coverage=190 inflationStability: nominal=3.40% coverage=185 electricityConsumption: nominal=3.30% coverage=217 ucdpConflict: nominal=3.09% coverage=193 Every core indicator has coverage ≥ 180 (already enforced by the pre-existing indicator-tiering test), so the nominal-weight gate has no current violators — its purpose is catching future drift, not flagging today's state. All 6327 data-tier tests pass. typecheck:api clean. * docs(resilience): PR 3 methodology doc — document §3.5 dead-signal retirements + §3.6 coverage gate Methodology-doc update capturing the three §3.5 landings and the §3.6 CI gate. Five edits: 1. Known construct limitations section (#5 and #6): strikethrough the original "dead signals" and "no coverage-based weight cap" items, annotate them with "Landed in PR 3 §3.5"/"Landed in PR 3 §3.6" + specifics of what shipped. 2. Currency & External H4 section: completely rewritten. 
Old table (fxVolatility / fxDeviation / fxReservesAdequacy on BIS primary) is replaced by the two-indicator post-PR-3 table (inflationStability at 0.60 + fxReservesAdequacy at 0.40). Coverage ladder spelled out (0.85 / 0.55 / 0.40 / 0.30). Legacy BIS indicators named as experimental-tier drill-downs only. 3. Fuel Stock Days H4 section: H4 heading text kept verbatim so the methodology-lint H4-to-dimension mapping does not break; body rewritten to explain that the dimension is retired from core but the seeder still runs for IEA-member drill-downs. 4. External Debt Coverage table row: goalpost 5-0 → 2-0, description cites Greenspan-Guidotti reserve-adequacy rule. 5. New v2.2 changelog entry — PR 3 dead-signal cleanup, covering §3.5 points 1/2/3 + §3.6 + acceptance gates + construct-audit updates. No scoring or code changes in this commit. Methodology-lint test passes (H4 mapping intact). All 6327 data-tier tests pass. * fix(resilience): PR 3 §3.6 gate — correct share-denominator for coverage-weighted aggregation Reviewer catch (thanks). The previous gate computed each indicator's nominal overall weight as indicator.weight × (1 / N_total_dimensions_in_domain) × domain_weight and claimed this was an upper bound ("actual runtime weight is ≤ this when some dimensions drop out on coverage"). That is BACKWARDS for this scorer. The domain aggregation is coverage-weighted (server/worldmonitor/resilience/v1/_shared.ts coverageWeightedMean), so when a dimension pins at coverage=0 it is EXCLUDED from the denominator and the surviving dimensions' shares go UP, not down. 
PR 3 commit 1 retires fuelStockDays by hard-coding its scorer to coverage=0 for every country — so in the current live state the recovery domain has 5 contributing dimensions (not 6), and each core recovery indicator's nominal share is 1.0 × 1/5 × 0.25 = 5.00% (was mis-reported as 4.17%) The old gate therefore under-estimated nominal influence and could silently pass exactly the kind of low-coverage overweight regression it is meant to block. Fix: - Added `coreBearingDimensions(domainId)` helper that counts only dimensions that have ≥1 core indicator in the registry. A dimension with only experimental/enrichment entries (post-retirement fuelStockDays) has no core contribution → does not dilute shares. - Updated `nominalOverallWeight` to divide by the core-bearing count, not the raw dimension count. - Rewrote the helper's doc comment to stop claiming this is a strict upper bound — explicitly calls out the dynamic case (source failure raising surviving dim shares further) as the sensitivity script's responsibility. - Added a new regression test: asserts (a) at least one recovery dimension is all-non-core (fuelStockDays post-retirement), (b) fuelStockDays has zero core indicators, and (c) recoveryDebt ToReserves nominal = 0.05 exactly (not 0.0417) — any reversion of the retirement or regression to N_total-denominator will fail loudly. Top-10 audit output now correctly shows: recoveryReserveMonths: nominal=5% coverage=188 recoveryDebtToReserves: nominal=5% coverage=185 recoveryImportHhi: nominal=5% coverage=190 (was 4.17% each under the old math) All 486 resilience tests pass. typecheck:api clean. Note: the 5% figure is exactly AT the cap, not over it. "exceed" means strictly > 5%, so it still passes. But now the reviewer / audit log reflects reality. * fix(resilience): PR 3 review — retired-dim confidence drag + false source-failure label Addresses the Codex review P1 + P2 on PR #3297. 
P1 — retired-dim drag on confidence averages -------------------------------------------- scoreFuelStockDays returns coverage=0 by design (retired construct), but computeLowConfidence, computeOverallCoverage, and the widget's formatResilienceConfidence averaged across all 19 dimensions. That dragged every country's reported averageCoverage down — US went from 0.8556 (active dims only) to 0.8105 (all dims) — enough drift to misclassify edge countries as lowConfidence and to shift the ranking widget's overallCoverage pill for every country. Fix: introduce an authoritative RESILIENCE_RETIRED_DIMENSIONS set in _dimension-scorers.ts and filter it out of all three averages. The filter is keyed on the retired-dim REGISTRY, not on coverage === 0, because a non-retired dim can legitimately emit coverage=0 on a genuinely sparse-data country via weightedBlend fall-through — those entries MUST keep dragging confidence down (that is the sparse-data signal lowConfidence exists to surface). Verified: sparse-country release-gate test (marks sparse WHO/FAO countries as low confidence) still passes with the registry-keyed filter; would have failed with a naive coverage=0 filter. Server-client parity: widget-utils cannot import server code, so RESILIENCE_RETIRED_DIMENSION_IDS is a hand-mirrored constant, kept in lockstep by tests/resilience-retired-dimensions-parity.test.mts (parses the widget file as text, same pattern as existing widget-util tests that can't import the widget module directly). P2 — false "Source down" label on retired dim --------------------------------------------- scoreFuelStockDays hard-coded imputationClass: 'source-failure', which the widget maps to "Source down: upstream seeder failed" with a `!` icon for every country. That is semantically wrong for an intentional retirement. Flipped to null so the widget's absent-path renders a neutral cell without a false outage label. 
null is already a legal value of ResilienceDimensionScore.imputationClass; no type change needed. Tests ----- - tests/resilience-confidence-averaging.test.mts (new): pins the registry-keyed filter semantic for computeOverallCoverage + computeLowConfidence. Includes a negative-control test proving non-retired coverage=0 dims still flip lowConfidence. - tests/resilience-retired-dimensions-parity.test.mts (new): lockstep gate between server and client retired-dim lists. - Widget test adds a registry-keyed exclusion test with a non-retired coverage=0 dim in the fixture to lock in the correct semantic. - Existing tests asserting imputationClass: 'source-failure' for fuelStockDays flipped to null. All 494 resilience tests + full 6336/6336 data-tier suite pass. Typecheck clean for both tsconfig.json and tsconfig.api.json. * docs(resilience): align methodology + registry metadata with shipped imputationClass=null Follow-up to the previous PR 3 review commit that flipped scoreFuelStockDays's imputationClass from 'source-failure' to null to avoid a false "Source down" widget label on every country. The code changed; the doc and registry metadata did not, leaving three sites in the methodology mdx and two comment/description sites in the registry still claiming imputationClass='source-failure'. Any future reviewer (or tooling that treats the registry description as authoritative) would be misled. 
This commit rewrites those sites to describe the shipped behavior: - imputationClass=null (not 'source-failure'), with the rationale - exclusion from confidence/coverage averages via the RESILIENCE_RETIRED_DIMENSIONS registry filter - the distinction between structural retirement (filtered) and runtime coverage=0 (kept so sparse-data countries still flag lowConfidence) Touched: - docs/methodology/country-resilience-index.mdx (lines ~33, ~268, ~590) - server/worldmonitor/resilience/v1/_indicator-registry.ts (recoveryFuelStockDays comment block + description field) No code-behavior change. Docs-only. Tests: 157 targeted resilience tests pass (incl. methodology-lint + widget + release-gate + confidence-averaging). Typecheck clean on both tsconfig.json and tsconfig.api.json.	2026-04-22 23:57:28 +04:00
Elie Habib	da0f26a3cf	feat(resilience): PR 0 diagnostic freeze + fairness-audit harness (no scoring changes) (#3284 ) * feat(resilience): PR 0 diagnostic freeze + fairness-audit harness Lands the before-state and measurement apparatus every subsequent resilience-scorer PR validates against. Zero scoring changes. Per the v3 plan at docs/plans/2026-04-22-001-fix-resilience-scorer-structural- bias-plan.md this is tranche 0 of five. What lands: - Construct contract published in the methodology doc: absolute resilience not development-adjusted, mechanism test for every indicator, peer-relative views published separately from the core. - Known construct limitations section: six construct errors scheduled for PR 1-3 repair with explicit mapping to plan tranches. - Indicator-source manifest at docs/methodology/indicator-sources.yaml with source, seriesId, seriesUrl, coveragePct, lastObservedYear, license, mechanismTestRationale, and a constructStatus classification. - Pre-repair ranking snapshot at docs/snapshots/resilience-ranking-live-pre-repair-2026-04-22.json (217 items + 5 greyedOut, captured 2026-04-22 08:38 UTC at commit `425507d15`). - Cohort configuration at tests/helpers/resilience-cohorts.mts: six cohorts covering 87 countries (net-fuel-exporters, net-energy- importers-oecd, nuclear-heavy-generation, coal-heavy-domestic, small-island-importers, fragile-states). - Matched-pair sanity panel at tests/helpers/resilience-matched-pairs.mts: six pairs (FR/DE, NO/CA, UAE/BH, JP/KR, IN/ZA, SG/CH) with expected- direction rationale and minGap for acceptance gate 7. - scripts/compare-resilience-current-vs-proposed.mjs extended to emit cohortSummary and matchedPairSummary alongside the existing output shape (backward compatible). - tests/resilience-cohort-config.test.mts: 11 validations ensuring the cohort + matched-pair configs stay well-formed. 
Deferred to PR 0.5 (before PR 1 lands): - Monotonicity test harness for all 19 dimension scorers pinning the sign of every indicator. - Pearson-derivative variable-influence baseline inside the sensitivity script producing the nominal-weight-vs-effective-influence table that plan acceptance gate 8 requires. Verification: typecheck:all clean, 430/430 resilience tests pass, 11/11 new cohort-config tests pass, snapshot auto-discovered and validated by the existing snapshot-test harness. * feat(resilience): PR 0 follow-ups — monotonicity harness, variable-influence baseline, cross-consumer formula gate Completes the PR 0 scope per the v3 plan §5 deliverables. Three adds: 1. Monotonicity test harness tests/resilience-dimension-monotonicity.test.mts pins the direction of movement for 14 indicators across 7 dimensions (reserve adequacy, fiscal space 3x, external debt coverage, import concentration, governance WGI, food/water 2x, energy 5x). Each test builds two synthetic ResilienceSeedReader fixtures differing only in the target indicator and asserts the dimension score moves in the documented direction. The scoreEnergy tests explicitly flag three indicators (gasShare, coalShare, electricityConsumption) that PR 1 §3.1-3.2 overturns so future readers understand which directional claims the plan intentionally replaces. 2. Variable-influence baseline scripts/compare-resilience-current-vs-proposed.mjs now computes per-dimension Pearson correlation against the current overallScore scaled by the dimension's nominal domain weight (a Pearson-derivative approximation of Sobol indices). The output carries a variableInfluence[] array sorted by abs(effectiveInfluence) desc. Acceptance gate 8 from the plan compares post-change effective influence against assigned nominal weight; divergences flag a wealth-proxy or saturated-signal construct problem. 3. 
Cross-consumer formula gate Five external consumers of resilience:score:v10:* now filter stale- formula entries so a flag flip does not serve mixed-formula data downstream: - server/worldmonitor/supply-chain/v1/get-route-impact.ts — readResilienceScore() checks _formula via the new getCurrentCacheFormula export and returns 0 on mismatch. - scripts/validate-resilience-correlation.mjs, scripts/validate-resilience-backtest.mjs, scripts/backtest-resilience-outcomes.mjs, scripts/benchmark-resilience-external.mjs — each inlines a currentCacheFormulaLocal() helper that mirrors the server's formula derivation from env, skips parsed entries whose _formula disagrees, and logs the skip count so operators can notice a mismatch during the flip window. A mixed-formula cohort (some countries d6-tagged, others pc-tagged) would confound every correlation, AUC, and Spearman this repair plan depends on for its acceptance gates. These guards close that gap. Verification: typecheck:all clean, 444/444 resilience tests pass (+14 from the new monotonicity harness). * fix(resilience): PR 0 review follow-ups — sample-union + doc tense Two review-driven fixes on top of PR 0. 1. scripts/compare-resilience-current-vs-proposed.mjs — the cohort and matched-pair summaries were computed against the historical 52-country sensitivity seed, which silently excluded the small-island-importers cohort (zero members in the seed) and the sg-vs-ch matched pair (Singapore not in the seed). With the current script those acceptance gates are partially measured at best. SAMPLE now = union(historical 52 seed, every cohort member, every matched-pair endpoint). The imports for RESILIENCE_COHORTS and MATCHED_PAIRS moved from inside main() to module scope so the union can be computed before the script runs. Net sample size grows from 52 to ~95 countries. Still fast enough for an interactive pass; makes the acceptance gates honest. 2. 
docs/methodology/country-resilience-index.mdx — the construct contract wording read as present-tense compliance ("Every indicator in the scorer passes a single mechanism test"), which contradicted the immediately-following passage about indicators that currently fail the test. Reworded to "is being evaluated against" and added an explicit PR-0-does-not-change-scoring paragraph that names the known-failing indicators (electricityConsumption, gas/coal flat penalties, WHO per-capita health spend) and points at the repair plan for the replacement schedule. Verification: typecheck:all clean, 444/444 resilience tests pass. * fix(resilience): compare-script loads frozen baseline + emits per-indicator influence Addresses two P1 review findings on PR #3284: 1. Script previously compared current-6d vs proposed-pillar-combined from the SAME checkout; never loaded the frozen pre-PR-0 baseline, so acceptance gates 2/6/7 ("no country moved >15pts vs baseline", cohort median shift vs baseline, matched-pair gap change vs baseline) could not be enforced for later scorer PRs. Now auto-discovers the most recent resilience-ranking-live-pre-repair-<date>.json (or post-<pr>-<date>) in docs/snapshots/ and emits a baselineComparison block with: spearmanVsBaseline, maxCountryAbsDelta, biggestDriftsVsBaseline, cohortShiftVsBaseline, matchedPairGapChange. If no baseline is found, the block is emitted with status 'unavailable' so callers distinguish missing-baseline from passed-baseline. 2. variableInfluence was emitted only at the dimension level, which hid the exact sub-indicators the repair plan targets (electricityConsumption, gasShare, coalShare, etc.) inside their parent dimension. Added extractIndicatorValues() which pulls twelve construct-risk indicators per country from the shared memoized reader, then computes per-indicator Pearson correlation against the current overall score. Emitted as perIndicatorInfluence[], sorted by absolute effective influence. 
Acceptance gate 8 ("effective influence agrees in sign and rank-order with assigned nominal weights") is now computable at the indicator level, not only at the dimension level. No production code touched; diagnostic-harness only. * fix(resilience): baseline-snapshot selection by structured parse, not filename sort Addresses P1 review on compare-resilience-current-vs-proposed.mjs:118-130. Plain filename sort breaks the "immediate-prior state" contract two ways: 1. Lexical ordering: `pre-repair` sorts after `post-` (`pr...` to 'r' > 'o'), so the PR-0 freeze would keep winning even after post-PR snapshots exist. Later scorer PRs would then report acceptance-gate deltas against the original pre-repair freeze instead of the immediately-prior post-PR-(N-1) snapshot — the gate would appear valid while measuring against the wrong baseline. 2. Lexical ordering: `pr10` < `pr9` (digit-by-digit), so PR-10 would lose the selection to PR-9. Fix: parseBaselineSnapshotMeta() extracts (kind, prNumber, date) from the filename. Sort keys are (kindRank desc, prNumber desc, date desc): - post always beats pre-repair (kindRank 1 vs 0) - among posts, prNumber compared numerically (10 beats 9) - date breaks ties (same-PR re-snapshots, later capture wins) - unlabeled post tags get prNumber 0 so they sort between pre-repair and any numbered PR snapshot Surfaced in output: baselineKind / baselinePrNumber / baselineDate alongside baselineFile so the operator can verify which snapshot was selected without having to reopen the file. Module now isMain-guarded per feedback_seed_isMain_guard memory so tests can import parseBaselineSnapshotMeta without firing the scoring run. Added tests/resilience-baseline-snapshot-ordering.test.mjs (9 tests) pinning the ordering contract for every known failure mode. Diagnostic-harness change only. No production code touched. 
fix(resilience): full scorable universe + registry-driven per-indicator influence Addresses two fresh P1 review findings on the PR 0 compare harness. Finding 1 — acceptance math ran on a curated ~95-country sample, so plan gate 2 could miss large regressions in excluded countries. - Main scoring loop now iterates the FULL scorable universe (listScorableCountries()), not the 52-country seed + cohort union. - Removed SAMPLE / HISTORICAL_SENSITIVITY_SEED constants. - Added scorableUniverseSize + cohortMissingFromScorable to output so operators see universe size and any cohort/pair endpoint that listScorable refuses to score (fail-loud, not silent drop). Finding 3 — per-indicator influence was a hand-picked 12-indicator subset, hiding most registry indicators from the baseline that later scorer PRs need. - Extraction is now driven by INDICATOR_REGISTRY. Every Core + Enrichment indicator gets a row with explicit extractionStatus: implemented \| not-implemented (with reason) \| unregistered-in-harness - EXTRACTION_RULES covers 40/59 indicators across 11 shape families (static-path, static-wb-infrastructure, static-wgi, static-wgi-mean, static-who, energy-mix-field, gas-storage-field, recovery-country- field, imf-macro/labor-country-field, national-debt, sanctions-count). - Remaining 19 indicators need either a scorer trace hook (PR 0.5) or a safe aggregation duplicate; each carries a reason string. - extractionCoverage summary (totalIndicators / implemented / notImplemented / unregisteredInHarness / coreImplemented / coreTotal) exposed in output so PR 0.5 progress is measurable. Added tests/resilience-indicator-extraction-plan.test.mjs (11 tests) pinning: every registry entry has an extraction row; not-implemented rows carry a reason; all 12 plan-named construct-risk indicators stay extractable; Core-tier coverage floor of 45%; shape-family unit tests. Diagnostic-harness change only. No production code touched. 
* fix(resilience): wire event-aggregate per-indicator influence via exported scorer helpers Addresses P1 review on PR 0 compare harness. Previous commit marked 16 Core-tier indicators as 'not-implemented' because they needed scorer event-window/severity-weighting math; that left the gate-9 acceptance apparatus incomplete for a large part of the shipped score. Fix: export the scorer-internal aggregation helpers so the harness calls them directly. Zero aggregation math duplicated in the harness, harness and scorer cannot drift. Exported from _dimension-scorers.ts (purely additive): summarizeCyber, summarizeOutages, summarizeGps, summarizeUcdp, summarizeUnrest, summarizeSocialVelocity, getCountryDisplacement, getThreatSummaryScore, countTradeRestrictions, countTradeBarriers. 13 extraction rules moved from not-implemented to implemented: cyberThreats, internetOutages, infraOutages, gpsJamming, ucdpConflict, unrestEvents, socialVelocity, newsThreatScore, displacementTotal, displacementHosted, tradeRestrictions, tradeBarriers, recoveryConflictPressure, recoveryDisplacementVelocity. Coverage: 52/59 total (88%), 46/50 Core-tier (92%). Four Core indicators remain not-implemented for STRUCTURAL reasons, NOT missing code. Scorer inputs are genuinely global scalars with zero per-country variance, so Pearson(indicator, overall) is 0 or NaN by construction: shippingStress, transitDisruption, energyPriceStress — scorer reads a global scalar applied to every country; a per-country effective signal would need re-expression as (global x per-country exposure), which is a derived signal in a different entry. aquastatWaterAvailability — needs a distinct sub-indicator path resolver; enrichment follow-up. New test asserts the three no-per-country-variance indicators STAY not-implemented with a matching reason, so any future extraction that appears to cover them without fixing the underlying construct fails. 
Dispatcher split into STATIC / SIMPLE / AGGREGATE extractor tables to stay under biome complexity limit. Core-tier floor test raised from 45% to 80%. 89 resilience tests pass, typecheck clean, biome clean. No production behaviour changes. * fix(resilience): tag-gated AQUASTAT extractor closes the last fixable Core gap Reviewer flagged aquastatWaterAvailability as the only remaining Core indicator where the not-implemented status was structurally fixable rather than conceptually impossible. Both aquastatWaterStress and aquastatWaterAvailability share a single .aquastat.value field; the scorer's scoreAquastatValue splits them by the sibling .aquastat.indicator tag keyword (stress/withdrawal/ dependency to stress family; availability/renewable/access to availability family). The harness now mirrors this branching: - classifyAquastatFamily implements the scorer's priority order (stress-family match wins even if the tag also contains an availability keyword, matching the sequential if-check at _dimension-scorers.ts L770-776). - static-aquastat-stress / static-aquastat-availability extractors return the value only when the family matches, so stress-family readings never corrupt the availability Pearson and vice versa. Core-tier coverage: 46/50 to 47/50 (94%). The 3 remaining Core not-implemented indicators (shippingStress, transitDisruption, energyPriceStress) are all structural impossibilities: scorer inputs are global scalars with zero per-country variance. New contract test pins both directions of the tag gate plus the priority-order edge case (a tag containing both families' keywords routes to stress). 90 resilience tests pass, typecheck clean, biome clean.	2026-04-22 16:44:12 +04:00

Elie Habib

7cf37c604c

feat(resilience): PR 3 — dead-signal cleanup (plan §3.5, §3.6) (#3297)

* feat(resilience): PR 3 §3.5 — retire fuelStockDays from core score permanently

First commit in PR 3 of the resilience repair plan. Retires
`fuelStockDays` from the core score with no replacement.

Why permanent, not replaced:
IEA emergency-stockholding rules are defined in days of NET IMPORTS
and do not bind net exporters by design. Norway/Canada/US measured
in days-of-imports are incomparable to Germany/Japan measured the
same way — the construct is fundamentally different across the two
country classes. No globally-comparable recovery-fuel signal can
be built from this source; the pre-repair probe showed 100% imputed
at 50 for every country in the April 2026 freeze.

scoreFuelStockDays:
- Rewritten to return coverage=0 + observedWeight=0 +
imputationClass='source-failure' for every country regardless
of seed content.
- Drops the dimension from the `recovery` domain's coverage-
weighted mean automatically; remaining recovery dimensions
pick up the share via re-normalisation in
`_shared.ts#coverageWeightedMean`.
- No explicit weight transfer needed — the coverage-weighted
blend handles redistribution.
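
The redistribution described above can be illustrated with a minimal sketch. It assumes `coverageWeightedMean` weights each dimension score by its coverage; the real helper in `_shared.ts` may differ in detail. The `DimensionScore` shape, the non-fuel dimension ids, and all score values are invented for illustration.

```typescript
// Hypothetical shape; the real dimension-score type carries more fields.
interface DimensionScore {
  id: string;
  score: number;    // 0..100
  coverage: number; // 0..1, share of observed (non-imputed) input
}

// Assumed behaviour: each score is weighted by its coverage, so a
// dimension with coverage 0 adds nothing to numerator or denominator.
function coverageWeightedMean(dims: DimensionScore[]): number {
  const totalCoverage = dims.reduce((sum, d) => sum + d.coverage, 0);
  if (totalCoverage === 0) return 0;
  const weighted = dims.reduce((sum, d) => sum + d.score * d.coverage, 0);
  return weighted / totalCoverage;
}

const recovery: DimensionScore[] = [
  { id: "reserveAdequacy", score: 80, coverage: 1 },
  { id: "externalDebtCoverage", score: 60, coverage: 1 },
  { id: "fuelStockDays", score: 50, coverage: 0 }, // retired: coverage pinned at 0
];

const withRetired = coverageWeightedMean(recovery);
const withoutRetired = coverageWeightedMean(recovery.slice(0, 2));
console.log(withRetired, withoutRetired);
```

Under this assumption the retired dimension's imputed score of 50 has no effect on the domain mean, which is why no explicit weight transfer is required.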

Registry:
- recoveryFuelStockDays re-tagged from tier='enrichment' to
tier='experimental' so the Core coverage gate treats it as
out-of-score.
- Description updated to make the retirement explicit; entry
stays in the registry for structural continuity (the
dimension `fuelStockDays` remains in RESILIENCE_DIMENSION_ORDER
for the 19-dimension tests; removing the dimension entirely is
a PR 4 structural-audit concern).

Housekeeping:
- Removed `RESILIENCE_RECOVERY_FUEL_STOCKS_KEY` constant (no
longer read; noUnusedLocals would reject it).
- Removed `RecoveryFuelStocksCountry` interface for the same
reason. Comment at the removed declaration instructs future
maintainers not to re-add the type as a reservation; when a
new recovery-fuel concept lands, introduce a fresh interface.

Plan reference: §3.5 point 1 of
`docs/plans/2026-04-22-001-fix-resilience-scorer-structural-bias-plan.md`.

51 resilience tests pass, typecheck + biome clean. The
`recovery` domain's published score will shift slightly for every
country because the 0.10 slot that fuelStockDays was imputing to
now redistributes; the compare-harness acceptance-gate rerun at
merge time will quantify the shift per plan §6 gates.

* feat(resilience): PR 3 §3.5 — retire BIS-backed currencyExternal; rebuild on IMF inflation + WB reserves

BIS REER/DSR feeds were load-bearing in currencyExternal (weights 0.35
fxVolatility + 0.35 fxDeviation, ~70% of dimension). They cover ~60
countries max — so every non-BIS country fell through to
curated_list_absent (coverage 0.3) or a thin IMF proxy (coverage 0.45).
Combined with reserveMarginPct already removed in PR 1, currencyExternal
was the clearest "construct absent for most of the world" carrier left
in the scorer.

Changes:

_dimension-scorers.ts
- scoreCurrencyExternal now reads IMF macro (inflationPct) + WB FX
reserves only. Coverage ladder:
inflation + reserves → 0.85 (observed primary + secondary)
inflation only → 0.55
reserves only → 0.40
neither → 0.30 (IMPUTE.bisEer retained for snapshot
continuity; semantics read as
"no IMF + no WB reserves" now)
- Removed dead symbols: RESILIENCE_BIS_EXCHANGE_KEY constant (reserved
via comment only, flagged by noUnusedLocals), stddev() helper,
getCountryBisExchangeRates() loader, BisExchangeRate interface,
dateToSortableNumber() — all were exclusive callers of the retired
BIS path.
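
As a rough illustration of the coverage ladder above, the four rungs can be written as a small lookup. The coverage values come from this commit message; the function name and its boolean inputs are hypothetical and do not reflect the actual signature of `scoreCurrencyExternal`.

```typescript
// Coverage values are from the ladder in the commit message.
// The function name and inputs are illustrative only.
function currencyExternalCoverage(hasInflation: boolean, hasReserves: boolean): number {
  if (hasInflation && hasReserves) return 0.85; // observed primary + secondary
  if (hasInflation) return 0.55;                // IMF inflation only
  if (hasReserves) return 0.40;                 // WB reserves only
  return 0.30;                                  // neither source available
}

console.log(currencyExternalCoverage(true, true));
console.log(currencyExternalCoverage(false, false));
```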

_indicator-registry.ts
- New core entry inflationStability (weight 0.60, tier=core,
sourceKey=economic:imf:macro:v2).
- fxReservesAdequacy weight 0.15 → 0.40 (secondary reliability
anchor).
- fxVolatility + fxDeviation demoted tier=enrichment → tier=experimental
(BIS ~60-country coverage; off the core weight sum).
- Non-experimental weights now sum to 1.0 (0.60 + 0.40).

scripts/compare-resilience-current-vs-proposed.mjs
- EXTRACTION_RULES: added inflationStability →
imf-macro-country-field field=inflationPct so the registry-parity
test passes and the correlation harness sees the new construct.

tests/resilience-dimension-scorers.test.mts
- Dropped BIS-era wording ("non-BIS country") and test 266
(BIS-outage coverage 0.35 branch) which collapsed to the inflation-
only path post-retirement.
- Updated coverage assertions: inflation-only 0.45 → 0.55; inflation+
reserves 0.55 → 0.85.

tests/resilience-scorers.test.mts
- domainAverages.economic 68.33 → 66.33 (US currencyExternal score
shifts slightly under IMF+reserves vs old BIS composite).
- stressScore 67.85 → 67.21; stressFactor 0.3215 → 0.3279.
- overallScore 65.82 → 65.52.
- baselineScore unchanged (currencyExternal is stress-only).

All 6324 data-tier tests pass. typecheck:api clean. No change to
seeders or Redis keys; this is a pure scorer + registry rebuild.

* feat(resilience): PR 3 §3.5 point 3 — re-goalpost externalDebtCoverage (0..5 → 0..2)

Plan §2.1 diagnosis table showed externalDebtCoverage saturating at
score=100 across all 9 probe countries — including stressed states.
Signal was collapsed. Root cause: (worst=5, best=0) gave every country
with ratio < 0.5 a score above 90, and mapped Greenspan-Guidotti's
reserve-adequacy threshold (ratio=1.0) to score 80 — well into "no
worry" territory instead of the "mild warning" it should be.

Re-anchored on Greenspan-Guidotti directly: ratio=1.0 now maps to score
50 (mild warning), ratio=2.0 to score 0 (acute rollover-shock exposure).
Ratios above 2.0 clamp to 0, consistent with "beyond this point the
country is already in crisis; exact value stops mattering."
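
The effect of the re-anchoring can be checked with a small worked example. This sketch assumes a linear goalpost normalizer clamped to 0..100; `normalizeGoalpost` is an illustrative name, not the scorer's actual helper.

```typescript
// Linear goalpost normalizer clamped to 0..100.
// Name and signature are assumptions made for this example.
function normalizeGoalpost(value: number, worst: number, best: number): number {
  const raw = ((value - worst) / (best - worst)) * 100;
  return Math.min(100, Math.max(0, raw));
}

const OLD_GOALPOSTS = { worst: 5, best: 0 };
const NEW_GOALPOSTS = { worst: 2, best: 0 };

// Debt-to-reserves ratios spanning both goalpost ranges.
for (const ratio of [0.2, 0.5, 1.0, 1.5, 2.0, 4.0]) {
  const before = normalizeGoalpost(ratio, OLD_GOALPOSTS.worst, OLD_GOALPOSTS.best);
  const after = normalizeGoalpost(ratio, NEW_GOALPOSTS.worst, NEW_GOALPOSTS.best);
  console.log(`ratio=${ratio} old=${before.toFixed(1)} new=${after.toFixed(1)}`);
}
```

With the old bounds the Greenspan-Guidotti threshold (ratio 1.0) lands at 80; with the new bounds it lands at 50, and any ratio at or above 2.0 clamps to 0.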

Files changed:

- _indicator-registry.ts: recoveryDebtToReserves goalposts
{worst: 5, best: 0} → {worst: 2, best: 0}. Description updated to
cite Greenspan-Guidotti; inline comment documents anchor + rationale.

- _dimension-scorers.ts: scoreExternalDebtCoverage normalizer bound
changed from (0..5) to (0..2), with inline comment.

- docs/methodology/country-resilience-index.mdx: goalpost table row
5-0 → 2-0, description cites Greenspan-Guidotti.

- docs/methodology/indicator-sources.yaml:
* constructStatus: dead-signal → observed-mechanism (signal is now
discriminating).
* reviewNotes updated to describe the new anchor.
* mechanismTestRationale names the Greenspan-Guidotti rule.

- tests/resilience-dimension-monotonicity.test.mts: updated the
comment + picked values inside the (0..2) discriminating band (0.3
and 1.5). Old values (1 vs 4) had 4 clamping to 0.

- tests/resilience-dimension-scorers.test.mts: Norway (NO) score threshold
relaxed from >90 to >=85 (NO ratio=0.2 now scores 90, was 96).

- tests/resilience-scorers.test.mts: fixture drift:
* domainAverages.recovery 54.83 → 47.33 (US extDebt 70 → 25).
* baselineScore 63.63 → 60.12 (extDebt is baseline type).
* overallScore 65.52 → 63.27.
* stressScore / stressFactor unchanged (extDebt is baseline-only).

All 6324 data-tier tests pass. typecheck:api clean.

* feat(resilience): PR 3 §3.6 — CI gate on indicator coverage and nominal weight

Plan §3.6 adds a new acceptance criterion (also §5 item 5):

> No indicator with observed coverage below 70% may exceed 5% nominal
> weight OR 5% effective influence in the post-change sensitivity run.

This commit enforces the NOMINAL-WEIGHT half as a unit test that runs
on every CI build. The EFFECTIVE-INFLUENCE half is produced by
scripts/validate-resilience-sensitivity.mjs as a committed artifact;
the gate file only asserts that script still exists so a refactor that
removes it breaks the build loudly.

Why the gate exists (plan §3.6):

"A dimension at 30% observed coverage carries the same effective
weight as one at 95%. This contradicts the OECD/JRC handbook on
uncertainty analysis."

Implementation:

tests/resilience-coverage-influence-gate.test.mts — three tests:
1. Nominal-weight gate: for every core indicator with coverage < 137
countries (70% of the ~195-country universe), computes its nominal
overall weight as
indicator.weight × (1/dimensions-in-domain) × domain-weight
and asserts it does not exceed 5%. Equal-share-per-dimension is
the *upper bound* on runtime weight (coverage-weighted mean gives
a lower share when a dimension drops out), so this is a strict
bound: if the nominal number passes, the runtime number also
passes for every country.
2. Effective-influence contract: asserts the sensitivity script
exists at its expected path. Removing it (intentionally or by
refactor) breaks the build.
3. Audit visibility: prints the top 10 core indicators by nominal
overall weight. No assertion beyond "ran" — the list lets
reviewers spot outliers that pass the gate but are near the cap.

Current state (observed from audit output):

recoveryReserveMonths: nominal=4.17% coverage=188
recoveryDebtToReserves: nominal=4.17% coverage=185
recoveryImportHhi: nominal=4.17% coverage=190
inflationStability: nominal=3.40% coverage=185
electricityConsumption: nominal=3.30% coverage=217
ucdpConflict: nominal=3.09% coverage=193

Every core indicator has coverage ≥ 180 (already enforced by the
pre-existing indicator-tiering test), so the nominal-weight gate has
no current violators — its purpose is catching future drift, not
flagging today's state.

All 6327 data-tier tests pass. typecheck:api clean.

* docs(resilience): PR 3 methodology doc — document §3.5 dead-signal retirements + §3.6 coverage gate

Methodology-doc update capturing the three §3.5 landings and the §3.6 CI
gate. Five edits:

1. **Known construct limitations section (#5 and #6):** strikethrough the
original "dead signals" and "no coverage-based weight cap" items,
annotate them with "Landed in PR 3 §3.5"/"Landed in PR 3 §3.6" +
specifics of what shipped.

2. **Currency & External H4 section:** completely rewritten. Old table
(fxVolatility / fxDeviation / fxReservesAdequacy on BIS primary) is
replaced by the two-indicator post-PR-3 table (inflationStability at
0.60 + fxReservesAdequacy at 0.40). Coverage ladder spelled out
(0.85 / 0.55 / 0.40 / 0.30). Legacy BIS indicators named as
experimental-tier drill-downs only.

3. **Fuel Stock Days H4 section:** H4 heading text kept verbatim so the
methodology-lint H4-to-dimension mapping does not break; body
rewritten to explain that the dimension is retired from core but the
seeder still runs for IEA-member drill-downs.

4. **External Debt Coverage table row:** goalpost 5-0 → 2-0, description
cites Greenspan-Guidotti reserve-adequacy rule.

5. **New v2.2 changelog entry** — PR 3 dead-signal cleanup, covering
§3.5 points 1/2/3 + §3.6 + acceptance gates + construct-audit
updates.

No scoring or code changes in this commit. Methodology-lint test passes
(H4 mapping intact). All 6327 data-tier tests pass.

* fix(resilience): PR 3 §3.6 gate — correct share-denominator for coverage-weighted aggregation

Reviewer catch (thanks). The previous gate computed each indicator's
nominal overall weight as

indicator.weight × (1 / N_total_dimensions_in_domain) × domain_weight

and claimed this was an upper bound ("actual runtime weight is ≤ this
when some dimensions drop out on coverage"). That is BACKWARDS for
this scorer.

The domain aggregation is coverage-weighted
(server/worldmonitor/resilience/v1/_shared.ts coverageWeightedMean),
so when a dimension pins at coverage=0 it is EXCLUDED from the
denominator and the surviving dimensions' shares go UP, not down.

PR 3 commit 1 retires fuelStockDays by hard-coding its scorer to
coverage=0 for every country — so in the current live state the
recovery domain has 5 contributing dimensions (not 6), and each core
recovery indicator's nominal share is

1.0 × 1/5 × 0.25 = 5.00% (was mis-reported as 4.17%)

The old gate therefore under-estimated nominal influence and could
silently pass exactly the kind of low-coverage overweight regression
it is meant to block.

Fix:

- Added `coreBearingDimensions(domainId)` helper that counts only
dimensions that have ≥1 core indicator in the registry. A dimension
with only experimental/enrichment entries (post-retirement
fuelStockDays) has no core contribution → does not dilute shares.
- Updated `nominalOverallWeight` to divide by the core-bearing count,
not the raw dimension count.
- Rewrote the helper's doc comment to stop claiming this is a strict
upper bound — explicitly calls out the dynamic case (source failure
raising surviving dim shares further) as the sensitivity script's
responsibility.
- Added a new regression test: asserts (a) at least one recovery
dimension is all-non-core (fuelStockDays post-retirement),
(b) fuelStockDays has zero core indicators, and (c)
recoveryDebtToReserves nominal = 0.05 exactly (not 0.0417); any reversion
of the retirement or regression to N_total-denominator will fail
loudly.

Top-10 audit output now correctly shows:

recoveryReserveMonths: nominal=5% coverage=188
recoveryDebtToReserves: nominal=5% coverage=185
recoveryImportHhi: nominal=5% coverage=190
(was 4.17% each under the old math)

All 486 resilience tests pass. typecheck:api clean.

Note: the 5% figure is exactly AT the cap, not over it. "exceed" means
strictly > 5%, so it still passes. But now the reviewer / audit log
reflects reality.
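
The corrected arithmetic can be sketched as follows. This is a simplified illustration, not the gate's code: the real `coreBearingDimensions` takes a `domainId`, whereas this version takes the indicator list directly, and every dimension or indicator name prefixed `example` is a placeholder.

```typescript
type Tier = "core" | "enrichment" | "experimental";
interface Indicator { id: string; dimension: string; tier: Tier; weight: number }

const RECOVERY_DOMAIN_WEIGHT = 0.25; // from the arithmetic in the commit message
const NOMINAL_WEIGHT_CAP = 0.05;

const debtToReserves: Indicator = {
  id: "recoveryDebtToReserves", dimension: "externalDebtCoverage", tier: "core", weight: 1.0,
};

// Six recovery dimensions; fuelStockDays holds only an experimental entry.
const recoveryIndicators: Indicator[] = [
  debtToReserves,
  { id: "recoveryFuelStockDays", dimension: "fuelStockDays", tier: "experimental", weight: 1.0 },
  { id: "exampleA", dimension: "exampleDimA", tier: "core", weight: 1.0 },
  { id: "exampleB", dimension: "exampleDimB", tier: "core", weight: 1.0 },
  { id: "exampleC", dimension: "exampleDimC", tier: "core", weight: 1.0 },
  { id: "exampleD", dimension: "exampleDimD", tier: "core", weight: 1.0 },
];

// Counts only dimensions that carry at least one core indicator.
function coreBearingDimensions(indicators: Indicator[]): number {
  const dims = new Set(indicators.filter((i) => i.tier === "core").map((i) => i.dimension));
  return dims.size;
}

function nominalOverallWeight(ind: Indicator, all: Indicator[], domainWeight: number): number {
  return ind.weight * (1 / coreBearingDimensions(all)) * domainWeight;
}

const nominal = nominalOverallWeight(debtToReserves, recoveryIndicators, RECOVERY_DOMAIN_WEIGHT);
const oldNominal = debtToReserves.weight * (1 / 6) * RECOVERY_DOMAIN_WEIGHT; // raw dimension count
const violatesGate = nominal > NOMINAL_WEIGHT_CAP; // strict: exactly 5% still passes
console.log(nominal, oldNominal.toFixed(4), violatesGate);
```

Dividing by five core-bearing dimensions rather than six raw dimensions moves the nominal share from roughly 4.17% to 5%, which sits at the cap without exceeding it.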

* fix(resilience): PR 3 review — retired-dim confidence drag + false source-failure label

Addresses the Codex review P1 + P2 on PR #3297.

P1 — retired-dim drag on confidence averages
--------------------------------------------
scoreFuelStockDays returns coverage=0 by design (retired construct),
but computeLowConfidence, computeOverallCoverage, and the widget's
formatResilienceConfidence averaged across all 19 dimensions. That
dragged every country's reported averageCoverage down — US went from
0.8556 (active dims only) to 0.8105 (all dims) — enough drift to
misclassify edge countries as lowConfidence and to shift the ranking
widget's overallCoverage pill for every country.

Fix: introduce an authoritative RESILIENCE_RETIRED_DIMENSIONS set in
_dimension-scorers.ts and filter it out of all three averages. The
filter is keyed on the retired-dim REGISTRY, not on coverage === 0,
because a non-retired dim can legitimately emit coverage=0 on a
genuinely sparse-data country via weightedBlend fall-through — those
entries MUST keep dragging confidence down (that is the sparse-data
signal lowConfidence exists to surface). Verified: sparse-country
release-gate test (marks sparse WHO/FAO countries as low confidence)
still passes with the registry-keyed filter; would have failed with
a naive coverage=0 filter.
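
The difference between the two filters can be shown with a minimal sketch. `RESILIENCE_RETIRED_DIMENSIONS` is named in this commit; its representation as a `Set`, the `averageCoverage` helper, and the fixture dimensions are assumptions for illustration.

```typescript
// Assumed representation of the retired-dimension registry.
const RESILIENCE_RETIRED_DIMENSIONS: ReadonlySet<string> = new Set(["fuelStockDays"]);

interface DimensionCoverage { id: string; coverage: number }

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((s, v) => s + v, 0) / values.length;
}

// Registry-keyed: drops only dimensions that were retired on purpose.
function averageCoverage(dims: DimensionCoverage[]): number {
  return mean(dims.filter((d) => !RESILIENCE_RETIRED_DIMENSIONS.has(d.id)).map((d) => d.coverage));
}

// Naive alternative: drops every coverage=0 dimension, hiding real data gaps.
function averageCoverageNaive(dims: DimensionCoverage[]): number {
  return mean(dims.filter((d) => d.coverage > 0).map((d) => d.coverage));
}

// Hypothetical sparse-data country: one well-covered dimension, one dimension
// with no data at all, plus the retired fuelStockDays dimension.
const sparseCountry: DimensionCoverage[] = [
  { id: "exampleCovered", coverage: 0.9 },
  { id: "exampleSparse", coverage: 0 },
  { id: "fuelStockDays", coverage: 0 },
];

console.log(averageCoverage(sparseCountry), averageCoverageNaive(sparseCountry));
```

The registry-keyed average still reflects the genuinely missing dimension, while the naive filter would report the country as well covered.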

Server-client parity: widget-utils cannot import server code, so
RESILIENCE_RETIRED_DIMENSION_IDS is a hand-mirrored constant, kept
in lockstep by tests/resilience-retired-dimensions-parity.test.mts
(parses the widget file as text, same pattern as existing widget-util
tests that can't import the widget module directly).
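
A text-based parity check of this kind might look like the sketch below. It is not the contents of the actual parity test: the inline source string stands in for the widget file, the declaration syntax is assumed, and `extractStringArray` is a hypothetical helper.

```typescript
// Pulls the string literals out of the first array literal that follows
// `constName = ...` in a source file read as plain text.
function extractStringArray(source: string, constName: string): string[] {
  const decl = new RegExp(constName + "[^=]*=[^\\[]*\\[([^\\]]*)\\]").exec(source);
  if (decl === null) throw new Error(`${constName} not found in source`);
  const ids: string[] = [];
  const literal = /['"]([^'"]+)['"]/g;
  let match: RegExpExecArray | null;
  while ((match = literal.exec(decl[1])) !== null) ids.push(match[1]);
  return ids.sort();
}

// Stand-in for the server-side registry (assumed to be a Set of ids).
const serverRetired = new Set(["fuelStockDays"]);

// Stand-in for the widget file contents; the real declaration may differ.
const widgetSource =
  "export const RESILIENCE_RETIRED_DIMENSION_IDS: ReadonlySet<string> = new Set(['fuelStockDays']);";

const widgetIds = extractStringArray(widgetSource, "RESILIENCE_RETIRED_DIMENSION_IDS");
const inLockstep =
  widgetIds.length === serverRetired.size && widgetIds.every((id) => serverRetired.has(id));
console.log(widgetIds, inLockstep);
```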

P2 — false "Source down" label on retired dim
---------------------------------------------
scoreFuelStockDays hard-coded imputationClass: 'source-failure',
which the widget maps to "Source down: upstream seeder failed" with
a `!` icon for every country. That is semantically wrong for an
intentional retirement. Flipped to null so the widget's absent-path
renders a neutral cell without a false outage label. null is already
a legal value of ResilienceDimensionScore.imputationClass; no type
change needed.

Tests
-----
- tests/resilience-confidence-averaging.test.mts (new): pins the
registry-keyed filter semantic for computeOverallCoverage +
computeLowConfidence. Includes a negative-control test proving
non-retired coverage=0 dims still flip lowConfidence.
- tests/resilience-retired-dimensions-parity.test.mts (new):
lockstep gate between server and client retired-dim lists.
- Widget test adds a registry-keyed exclusion test with a non-retired
coverage=0 dim in the fixture to lock in the correct semantic.
- Existing tests asserting imputationClass: 'source-failure' for
fuelStockDays flipped to null.

All 494 resilience tests + full 6336/6336 data-tier suite pass.
Typecheck clean for both tsconfig.json and tsconfig.api.json.

* docs(resilience): align methodology + registry metadata with shipped imputationClass=null

Follow-up to the previous PR 3 review commit that flipped
scoreFuelStockDays's imputationClass from 'source-failure' to null to
avoid a false "Source down" widget label on every country. The code
changed; the doc and registry metadata did not, leaving three sites
in the methodology mdx and two comment/description sites in the
registry still claiming imputationClass='source-failure'. Any future
reviewer (or tooling that treats the registry description as
authoritative) would be misled.

This commit rewrites those sites to describe the shipped behavior:
- imputationClass=null (not 'source-failure'), with the rationale
- exclusion from confidence/coverage averages via the
RESILIENCE_RETIRED_DIMENSIONS registry filter
- the distinction between structural retirement (filtered) and
runtime coverage=0 (kept so sparse-data countries still flag
lowConfidence)

Touched:
- docs/methodology/country-resilience-index.mdx (lines ~33, ~268, ~590)
- server/worldmonitor/resilience/v1/_indicator-registry.ts
(recoveryFuelStockDays comment block + description field)

No code-behavior change. Docs-only.

Tests: 157 targeted resilience tests pass (incl. methodology-lint +
widget + release-gate + confidence-averaging). Typecheck clean on
both tsconfig.json and tsconfig.api.json.

2026-04-22 23:57:28 +04:00

Elie Habib

da0f26a3cf

feat(resilience): PR 0 diagnostic freeze + fairness-audit harness (no scoring changes) (#3284)

* feat(resilience): PR 0 diagnostic freeze + fairness-audit harness

Lands the before-state and measurement apparatus every subsequent
resilience-scorer PR validates against. Zero scoring changes. Per the
v3 plan at
docs/plans/2026-04-22-001-fix-resilience-scorer-structural-bias-plan.md,
this is tranche 0 of five.

What lands:
- Construct contract published in the methodology doc: absolute
  resilience not development-adjusted, mechanism test for every
  indicator, peer-relative views published separately from the core.
- Known construct limitations section: six construct errors scheduled
  for PR 1-3 repair with explicit mapping to plan tranches.
- Indicator-source manifest at docs/methodology/indicator-sources.yaml
  with source, seriesId, seriesUrl, coveragePct, lastObservedYear,
  license, mechanismTestRationale, and a constructStatus classification.
- Pre-repair ranking snapshot at
  docs/snapshots/resilience-ranking-live-pre-repair-2026-04-22.json
  (217 items + 5 greyedOut, captured 2026-04-22 08:38 UTC at commit
  425507d15).
- Cohort configuration at tests/helpers/resilience-cohorts.mts: six
  cohorts covering 87 countries (net-fuel-exporters, net-energy-
  importers-oecd, nuclear-heavy-generation, coal-heavy-domestic,
  small-island-importers, fragile-states).
- Matched-pair sanity panel at tests/helpers/resilience-matched-pairs.mts:
  six pairs (FR/DE, NO/CA, UAE/BH, JP/KR, IN/ZA, SG/CH) with expected-
  direction rationale and minGap for acceptance gate 7.
- scripts/compare-resilience-current-vs-proposed.mjs extended to emit
  cohortSummary and matchedPairSummary alongside the existing output
  shape (backward compatible).
- tests/resilience-cohort-config.test.mts: 11 validations ensuring the
  cohort + matched-pair configs stay well-formed.

Deferred to PR 0.5 (before PR 1 lands):
- Monotonicity test harness for all 19 dimension scorers pinning the
  sign of every indicator.
- Pearson-derivative variable-influence baseline inside the sensitivity
  script producing the nominal-weight-vs-effective-influence table that
  plan acceptance gate 8 requires.

Verification: typecheck:all clean, 430/430 resilience tests pass,
11/11 new cohort-config tests pass, snapshot auto-discovered and
validated by the existing snapshot-test harness.

* feat(resilience): PR 0 follow-ups — monotonicity harness, variable-influence baseline, cross-consumer formula gate

Completes the PR 0 scope per the v3 plan §5 deliverables. Three adds:

1. Monotonicity test harness
   tests/resilience-dimension-monotonicity.test.mts pins the direction
   of movement for 14 indicators across 7 dimensions (reserve adequacy,
   fiscal space 3x, external debt coverage, import concentration,
   governance WGI, food/water 2x, energy 5x). Each test builds two
   synthetic ResilienceSeedReader fixtures differing only in the target
   indicator and asserts the dimension score moves in the documented
   direction. The scoreEnergy tests explicitly flag three indicators
   (gasShare, coalShare, electricityConsumption) that PR 1 §3.1-3.2
   overturns so future readers understand which directional claims the
   plan intentionally replaces.

2. Variable-influence baseline
   scripts/compare-resilience-current-vs-proposed.mjs now computes
   per-dimension Pearson correlation against the current overallScore
   scaled by the dimension's nominal domain weight (a Pearson-derivative
   approximation of Sobol indices). The output carries a
   variableInfluence[] array sorted by abs(effectiveInfluence) desc.
   Acceptance gate 8 from the plan compares post-change effective
   influence against assigned nominal weight; divergences flag a
   wealth-proxy or saturated-signal construct problem.

3. Cross-consumer formula gate
   Five external consumers of resilience:score:v10:* now filter stale-
   formula entries so a flag flip does not serve mixed-formula data
   downstream:
     - server/worldmonitor/supply-chain/v1/get-route-impact.ts —
       readResilienceScore() checks _formula via the new
       getCurrentCacheFormula export and returns 0 on mismatch.
     - scripts/validate-resilience-correlation.mjs,
       scripts/validate-resilience-backtest.mjs,
       scripts/backtest-resilience-outcomes.mjs,
       scripts/benchmark-resilience-external.mjs — each inlines a
       currentCacheFormulaLocal() helper that mirrors the server's
       formula derivation from env, skips parsed entries whose
       _formula disagrees, and logs the skip count so operators can
       notice a mismatch during the flip window.

A mixed-formula cohort (some countries d6-tagged, others pc-tagged)
would confound every correlation, AUC, and Spearman this repair plan
depends on for its acceptance gates. These guards close that gap.
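
The stale-formula guard described for the validation scripts can be sketched roughly as below. The `d6` and `pc` tags and the `_formula` field come from this commit message; the environment variable name, the entry shape, and `filterCurrentFormula` are hypothetical.

```typescript
type CacheFormula = "d6" | "pc";

// Hypothetical cached-entry shape; only _formula is taken from the commit text.
interface CachedScore { countryCode: string; overallScore: number; _formula?: string }

// Assumed derivation from env; the real flag name is not stated in the commit.
function currentCacheFormulaLocal(env: Record<string, string | undefined>): CacheFormula {
  return env["EXAMPLE_PILLAR_COMBINE_ENABLED"] === "true" ? "pc" : "d6";
}

// Keeps entries tagged with the current formula and counts the rest.
function filterCurrentFormula(entries: CachedScore[], current: CacheFormula) {
  const kept = entries.filter((e) => e._formula === current);
  return { kept, skipped: entries.length - kept.length };
}

const cached: CachedScore[] = [
  { countryCode: "NO", overallScore: 80, _formula: "d6" },
  { countryCode: "DE", overallScore: 75, _formula: "pc" },
  { countryCode: "JP", overallScore: 78 }, // untagged entries are treated as stale
];

const current = currentCacheFormulaLocal({ EXAMPLE_PILLAR_COMBINE_ENABLED: "true" });
const result = filterCurrentFormula(cached, current);
console.log(`formula=${current} kept=${result.kept.length} skipped=${result.skipped}`);
```

Logging the skip count, as the scripts are described as doing, lets an operator notice a partially migrated cache during the flip window.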

Verification: typecheck:all clean, 444/444 resilience tests pass
(+14 from the new monotonicity harness).

* fix(resilience): PR 0 review follow-ups — sample-union + doc tense

Two review-driven fixes on top of PR 0.

1. scripts/compare-resilience-current-vs-proposed.mjs — the cohort and
   matched-pair summaries were computed against the historical
   52-country sensitivity seed, which silently excluded the
   small-island-importers cohort (zero members in the seed) and the
   sg-vs-ch matched pair (Singapore not in the seed). With the current
   script those acceptance gates are partially measured at best.

   SAMPLE now = union(historical 52 seed, every cohort member, every
   matched-pair endpoint). The imports for RESILIENCE_COHORTS and
   MATCHED_PAIRS moved from inside main() to module scope so the union
   can be computed before the script runs.

   Net sample size grows from 52 to ~95 countries. Still fast enough
   for an interactive pass; makes the acceptance gates honest.

2. docs/methodology/country-resilience-index.mdx — the construct
   contract wording read as present-tense compliance ("Every indicator
   in the scorer passes a single mechanism test"), which contradicted
   the immediately-following passage about indicators that currently
   fail the test. Reworded to "is being evaluated against" and added
   an explicit PR-0-does-not-change-scoring paragraph that names the
   known-failing indicators (electricityConsumption, gas/coal flat
   penalties, WHO per-capita health spend) and points at the repair
   plan for the replacement schedule.

Verification: typecheck:all clean, 444/444 resilience tests pass.

* fix(resilience): compare-script loads frozen baseline + emits per-indicator influence

Addresses two P1 review findings on PR #3284:

1. Script previously compared current-6d vs proposed-pillar-combined
   from the SAME checkout; never loaded the frozen pre-PR-0 baseline,
   so acceptance gates 2/6/7 ("no country moved >15pts vs baseline",
   cohort median shift vs baseline, matched-pair gap change vs
   baseline) could not be enforced for later scorer PRs.

   Now auto-discovers the most recent
   resilience-ranking-live-pre-repair-<date>.json (or post-<pr>-<date>)
   in docs/snapshots/ and emits a baselineComparison block with:
   spearmanVsBaseline, maxCountryAbsDelta, biggestDriftsVsBaseline,
   cohortShiftVsBaseline, matchedPairGapChange. If no baseline is
   found, the block is emitted with status 'unavailable' so callers
   distinguish missing-baseline from passed-baseline.

2. variableInfluence was emitted only at the dimension level, which
   hid the exact sub-indicators the repair plan targets
   (electricityConsumption, gasShare, coalShare, etc.) inside their
   parent dimension. Added extractIndicatorValues() which pulls twelve
   construct-risk indicators per country from the shared memoized
   reader, then computes per-indicator Pearson correlation against
   the current overall score. Emitted as perIndicatorInfluence[],
   sorted by absolute effective influence.

Acceptance gate 8 ("effective influence agrees in sign and rank-order
with assigned nominal weights") is now computable at the indicator
level, not only at the dimension level.
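
A minimal sketch of the per-indicator influence calculation follows. It assumes effective influence is the Pearson correlation scaled by the nominal weight, as described for the dimension-level baseline; the sample values and weights are invented, and the harness may compute this differently.

```typescript
// Pearson correlation; returns NaN when either series has zero variance.
function pearson(xs: number[], ys: number[]): number {
  const n = Math.min(xs.length, ys.length);
  if (n < 2) return NaN;
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < n; i++) { sumX += xs[i]; sumY += ys[i]; }
  const meanX = sumX / n;
  const meanY = sumY / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  const denom = Math.sqrt(varX * varY);
  return denom === 0 ? NaN : cov / denom;
}

interface InfluenceRow { indicator: string; nominalWeight: number; effectiveInfluence: number }

const overallScores = [55, 60, 70, 80]; // invented per-country overall scores
const samples = [
  { indicator: "electricityConsumption", nominalWeight: 0.033, values: [10, 20, 30, 40] },
  { indicator: "gasShare", nominalWeight: 0.02, values: [40, 35, 20, 10] },
  { indicator: "shippingStress", nominalWeight: 0.02, values: [5, 5, 5, 5] }, // global scalar
];

const perIndicatorInfluence: InfluenceRow[] = samples
  .map((s) => ({
    indicator: s.indicator,
    nominalWeight: s.nominalWeight,
    effectiveInfluence: pearson(s.values, overallScores) * s.nominalWeight,
  }))
  .filter((row) => !Number.isNaN(row.effectiveInfluence)) // zero-variance inputs drop out
  .sort((a, b) => Math.abs(b.effectiveInfluence) - Math.abs(a.effectiveInfluence));

console.log(perIndicatorInfluence);
```

An indicator whose value is identical for every country has zero variance, so its correlation is undefined and it cannot appear in the ranking.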

No production code touched; diagnostic-harness only.

* fix(resilience): baseline-snapshot selection by structured parse, not filename sort

Addresses P1 review on compare-resilience-current-vs-proposed.mjs:118-130.

Plain filename sort breaks the "immediate-prior state" contract two ways:

1. Lexical ordering: `pre-repair` sorts after `post-*`
   (second character: 'r' > 'o'), so the PR-0 freeze would keep winning even
   after post-PR snapshots exist. Later scorer PRs would then report
   acceptance-gate deltas against the original pre-repair freeze
   instead of the immediately-prior post-PR-(N-1) snapshot — the gate
   would appear valid while measuring against the wrong baseline.

2. Lexical ordering: `pr10` < `pr9` (digit-by-digit), so PR-10 would
   lose the selection to PR-9.

Fix: parseBaselineSnapshotMeta() extracts (kind, prNumber, date) from
the filename. Sort keys are (kindRank desc, prNumber desc, date desc):
  - post always beats pre-repair (kindRank 1 vs 0)
  - among posts, prNumber compared numerically (10 beats 9)
  - date breaks ties (same-PR re-snapshots, later capture wins)
  - unlabeled post tags get prNumber 0 so they sort between
    pre-repair and any numbered PR snapshot

Surfaced in output: baselineKind / baselinePrNumber / baselineDate
alongside baselineFile so the operator can verify which snapshot was
selected without having to reopen the file.
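
The ordering contract can be sketched as follows. This is an illustration rather than the harness code: the exact filename grammar for post snapshots (a `pr<N>` tag between `post-` and the date) and the example filenames are assumptions.

```typescript
interface SnapshotMeta {
  file: string;
  kind: "pre-repair" | "post";
  prNumber: number; // 0 for pre-repair and for post tags without a PR number
  date: string;     // ISO yyyy-mm-dd, so plain string comparison orders correctly
}

// Assumed filename grammar; only the pre-repair form is confirmed by the source.
const SNAPSHOT_PATTERN =
  /^resilience-ranking-live-(pre-repair|post-(.+))-(\d{4}-\d{2}-\d{2})\.json$/;

function parseBaselineSnapshotMeta(file: string): SnapshotMeta | null {
  const m = SNAPSHOT_PATTERN.exec(file);
  if (m === null) return null;
  const date = m[3];
  if (m[1] === "pre-repair") return { file, kind: "pre-repair", prNumber: 0, date };
  const pr = /^pr(\d+)$/.exec(m[2]);
  return { file, kind: "post", prNumber: pr === null ? 0 : Number(pr[1]), date };
}

// Sort keys: kindRank desc, prNumber desc (numeric), date desc.
function compareSnapshots(a: SnapshotMeta, b: SnapshotMeta): number {
  const rank = (m: SnapshotMeta) => (m.kind === "post" ? 1 : 0);
  if (rank(a) !== rank(b)) return rank(b) - rank(a);
  if (a.prNumber !== b.prNumber) return b.prNumber - a.prNumber;
  return a.date < b.date ? 1 : a.date > b.date ? -1 : 0;
}

const files = [
  "resilience-ranking-live-pre-repair-2026-04-22.json",
  "resilience-ranking-live-post-pr9-2026-05-01.json",
  "resilience-ranking-live-post-pr10-2026-05-02.json",
  "resilience-ranking-live-post-hotfix-2026-06-01.json", // unlabeled tag: prNumber 0
];

const parsed: SnapshotMeta[] = [];
for (const f of files) {
  const meta = parseBaselineSnapshotMeta(f);
  if (meta !== null) parsed.push(meta);
}
parsed.sort(compareSnapshots);

const selected = parsed.length > 0 ? parsed[0].file : null;
const naiveSelected = files.slice().sort()[files.length - 1]; // plain filename sort
console.log(selected, naiveSelected);
```

In this example the plain filename sort still selects the pre-repair freeze, while the structured sort selects the PR-10 snapshot.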

Module now isMain-guarded per feedback_seed_isMain_guard memory so
tests can import parseBaselineSnapshotMeta without firing the
scoring run.

Added tests/resilience-baseline-snapshot-ordering.test.mjs (9 tests)
pinning the ordering contract for every known failure mode.

Diagnostic-harness change only. No production code touched.

* fix(resilience): full scorable universe + registry-driven per-indicator influence

Addresses two fresh P1 review findings on the PR 0 compare harness.

Finding 1 — acceptance math ran on a curated ~95-country sample,
so plan gate 2 could miss large regressions in excluded countries.

  - Main scoring loop now iterates the FULL scorable universe
    (listScorableCountries()), not the 52-country seed + cohort union.
  - Removed SAMPLE / HISTORICAL_SENSITIVITY_SEED constants.
  - Added scorableUniverseSize + cohortMissingFromScorable to output
    so operators see universe size and any cohort/pair endpoint that
    listScorable refuses to score (fail-loud, not silent drop).

Finding 3 — per-indicator influence was a hand-picked 12-indicator
subset, hiding most registry indicators from the baseline that
later scorer PRs need.

  - Extraction is now driven by INDICATOR_REGISTRY. Every Core +
    Enrichment indicator gets a row with explicit extractionStatus:
      implemented | not-implemented (with reason) | unregistered-in-harness
  - EXTRACTION_RULES covers 40/59 indicators across 11 shape families
    (static-path, static-wb-infrastructure, static-wgi, static-wgi-mean,
    static-who, energy-mix-field, gas-storage-field, recovery-country-
    field, imf-macro/labor-country-field, national-debt, sanctions-count).
  - Remaining 19 indicators need either a scorer trace hook (PR 0.5)
    or a safe aggregation duplicate; each carries a reason string.
  - extractionCoverage summary (totalIndicators / implemented /
    notImplemented / unregisteredInHarness / coreImplemented / coreTotal)
    exposed in output so PR 0.5 progress is measurable.

Added tests/resilience-indicator-extraction-plan.test.mjs (11 tests)
pinning: every registry entry has an extraction row; not-implemented
rows carry a reason; all 12 plan-named construct-risk indicators stay
extractable; Core-tier coverage floor of 45%; shape-family unit tests.

Diagnostic-harness change only. No production code touched.

* fix(resilience): wire event-aggregate per-indicator influence via exported scorer helpers

Addresses P1 review on PR 0 compare harness. Previous commit marked 16
Core-tier indicators as 'not-implemented' because they needed scorer
event-window/severity-weighting math; that left the gate-9 acceptance
apparatus incomplete for a large part of the shipped score.

Fix: export the scorer-internal aggregation helpers so the harness
calls them directly. Zero aggregation math duplicated in the harness,
harness and scorer cannot drift.

Exported from _dimension-scorers.ts (purely additive):
  summarizeCyber, summarizeOutages, summarizeGps,
  summarizeUcdp, summarizeUnrest, summarizeSocialVelocity,
  getCountryDisplacement, getThreatSummaryScore,
  countTradeRestrictions, countTradeBarriers.

13 extraction rules moved from not-implemented to implemented:
  cyberThreats, internetOutages, infraOutages, gpsJamming,
  ucdpConflict, unrestEvents, socialVelocity, newsThreatScore,
  displacementTotal, displacementHosted, tradeRestrictions,
  tradeBarriers, recoveryConflictPressure, recoveryDisplacementVelocity.

Coverage:
  52/59 total (88%), 46/50 Core-tier (92%).

Four Core indicators remain not-implemented for STRUCTURAL reasons,
NOT missing code. Scorer inputs are genuinely global scalars with
zero per-country variance, so Pearson(indicator, overall) is 0 or
NaN by construction:
  shippingStress, transitDisruption, energyPriceStress — scorer
  reads a global scalar applied to every country; a per-country
  effective signal would need re-expression as (global x per-country
  exposure), which is a derived signal in a different entry.
  aquastatWaterAvailability — needs a distinct sub-indicator path
  resolver; enrichment follow-up.

New test asserts the three no-per-country-variance indicators STAY
not-implemented with a matching reason, so any future extraction
that appears to cover them without fixing the underlying construct
fails.

Dispatcher split into STATIC / SIMPLE / AGGREGATE extractor tables
to stay under biome complexity limit. Core-tier floor test raised
from 45% to 80%.

89 resilience tests pass, typecheck clean, biome clean. No production
behaviour changes.

* fix(resilience): tag-gated AQUASTAT extractor closes the last fixable Core gap

Reviewer flagged aquastatWaterAvailability as the only remaining Core
indicator where the not-implemented status was structurally fixable
rather than conceptually impossible.

Both aquastatWaterStress and aquastatWaterAvailability share a single
.aquastat.value field; the scorer's scoreAquastatValue splits them
by the sibling .aquastat.indicator tag keyword (stress/withdrawal/
dependency to stress family; availability/renewable/access to
availability family). The harness now mirrors this branching:

  - classifyAquastatFamily implements the scorer's priority order
    (stress-family match wins even if the tag also contains an
    availability keyword, matching the sequential if-check at
    _dimension-scorers.ts L770-776).
  - static-aquastat-stress / static-aquastat-availability extractors
    return the value only when the family matches, so stress-family
    readings never corrupt the availability Pearson and vice versa.

Core-tier coverage: 46/50 to 47/50 (94%). The 3 remaining Core
not-implemented indicators (shippingStress, transitDisruption,
energyPriceStress) are all structural impossibilities: scorer inputs
are global scalars with zero per-country variance.

New contract test pins both directions of the tag gate plus the
priority-order edge case (a tag containing both families' keywords
routes to stress).
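
The tag gate can be sketched as below. The keyword lists and the stress-first priority come from this commit message; the case-insensitive match, the `unknown` fallback, the entry shape, and `extractAquastat` are assumptions made for the example.

```typescript
type AquastatFamily = "stress" | "availability" | "unknown";

const STRESS_KEYWORDS = ["stress", "withdrawal", "dependency"];
const AVAILABILITY_KEYWORDS = ["availability", "renewable", "access"];

// Stress is checked first, so a tag that matches both families routes to stress.
function classifyAquastatFamily(tag: string | undefined): AquastatFamily {
  const t = (tag ?? "").toLowerCase();
  if (STRESS_KEYWORDS.some((k) => t.indexOf(k) !== -1)) return "stress";
  if (AVAILABILITY_KEYWORDS.some((k) => t.indexOf(k) !== -1)) return "availability";
  return "unknown";
}

// Hypothetical entry shape mirroring the .aquastat.value / .aquastat.indicator fields.
interface AquastatEntry { value: number; indicator?: string }

// Returns the value only when the tag belongs to the requested family.
function extractAquastat(entry: AquastatEntry, family: "stress" | "availability"): number | null {
  return classifyAquastatFamily(entry.indicator) === family ? entry.value : null;
}

const stressReading: AquastatEntry = { value: 42, indicator: "Water stress" };
const mixedReading: AquastatEntry = { value: 17, indicator: "Dependency on renewable resources" };

console.log(extractAquastat(stressReading, "stress"));
console.log(extractAquastat(stressReading, "availability"));
console.log(classifyAquastatFamily(mixedReading.indicator));
```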

90 resilience tests pass, typecheck clean, biome clean.

2026-04-22 16:44:12 +04:00

2 Commits