Files
worldmonitor/scripts/validate-resilience-correlation.mjs
Elie Habib d3d406448a feat(resilience): PR 2 §3.4 recovery-domain weight rebalance (#3328)
* feat(resilience): PR 2 §3.4 recovery-domain weight rebalance

Dials the two PR 2 §3.4 recovery dims (liquidReserveAdequacy,
sovereignFiscalBuffer) to ~10% share each of the recovery-domain
score via a new per-dimension weight channel in the coverage-weighted
mean. Matches the plan's direction that the sovereign-wealth signal
complement — rather than dominate — the classical liquid-reserves
and fiscal-space signals.

Implementation

- RESILIENCE_DIMENSION_WEIGHTS: new Record<ResilienceDimensionId, number>
  alongside RESILIENCE_DOMAIN_WEIGHTS. Every dim has an explicit entry
  (default 1.0) so rebalance decisions stay auditable; the two new
  recovery dims carry 0.5 each.

  Share math at full coverage (6 active recovery dims):
    weight sum                  = 4 × 1.0 + 2 × 0.5 = 5.0
    each new-dim share          = 0.5 / 5.0 = 0.10  ✓
    each core-dim share         = 1.0 / 5.0 = 0.20

  Retired dims (reserveAdequacy, fuelStockDays) keep weight 1.0 in
  the map; their coverage=0 neutralizes them at the coverage channel
  regardless. Explicit entries guard against a future scorer bug
  accidentally returning coverage>0 for a retired dim and falling
  through the `?? 1.0` default — every retirement decision is now
  tied to a single explicit source of truth.

- coverageWeightedMean (_shared.ts): refactored to apply
  `coverage × dimWeight` per dim instead of `coverage` alone. Backward-
  compatible when all weights default to 1.0 (reduces to the original
  mean). All three aggregation callers — buildDomainList, baseline-
  Score, stressScore — pick up the weighting transparently.

Test coverage

1. New `tests/resilience-recovery-weight-rebalance.test.mts`:
   pins the per-dim weight values, asserts the share math
   (0.10 new / 0.20 core), verifies completeness of the weight map,
   and documents why retired dims stay in the map at 1.0.
2. New `tests/resilience-recovery-ordering.test.mts`: fixture-based
   Spearman-proxy sensitivity check. Asserts NO > US > YE ordering
   preserved on both the overall score and the recovery-domain
   subscore after the rebalance. (Live post-merge Spearman rerun
   against the PR 0 snapshot is tracked as a follow-up commit.)
3. resilience-scorers.test.mts fixture anchors updated in lockstep:
     baselineScore: 60.35 → 62.17 (low-scoring liquidReserveAdequacy
       + partial-coverage SWF now contribute ~half the weight)
     overallScore:  63.60 → 64.39 (recovery subscore lifts by ~3 pts
       from the rebalance, overall by ~0.79)
     recovery flat mean: 48.75 (unchanged — flat mean doesn't apply
       weights by design; documents the coverage-weighted diff)
   Local coverageWeightedMean helper in the test mirrors the
   production implementation (weights applied per dim).

Methodology doc

- New "Per-dimension weights in the recovery domain" subsection with
  the weight table and a sentence explaining the cap. Cross-references
  the source of truth (RESILIENCE_DIMENSION_WEIGHTS).

Deliberate non-goals

- Live post-merge Spearman ≥0.85 check against the PR 0 baseline
  snapshot. Fixture ordering is preserved (new ordering test); the
  live-data check runs after Railway cron refreshes the rankings on
  the new weights and commits docs/snapshots/resilience-ranking-live-
  post-pr2-<date>.json. Tracked as the final piece of PR 2 §3.4
  alongside the health.js / bootstrap graduation (waiting on the
  7-day Railway cron bake-in window).

Tests: 6588/6588 data-tier tests pass. Typecheck clean on both
tsconfig configs. Biome clean on touched files. NO > US > YE
fixture ordering preserved.

* fix(resilience): PR 2 review — thread RESILIENCE_DIMENSION_WEIGHTS through the comparison harness

Greptile P2: the operator comparison harness
(scripts/compare-resilience-current-vs-proposed.mjs) claims its domain
scores "mirror the production scorer's coverage-weighted mean" and is
the artifact generator for Spearman / rank-delta acceptance decisions.
After PR 2 §3.4's weight rebalance, the production mirror diverged —
production now applies RESILIENCE_DIMENSION_WEIGHTS (liquidReserveAdequacy
= 0.5, sovereignFiscalBuffer = 0.5) inside coverageWeightedMean, but
the harness still used equal-weight aggregation.

Left unfixed, post-merge Spearman / rank-delta diagnostics would
compare live API scores (with the 0.5 recovery weights) against
harness predictions that assume equal-share dims — silently biasing
every acceptance decision until someone noticed a country's rank-
delta didn't track.

Fix

- Mirrored coverageWeightedMean now accepts dimensionWeights and
  applies `coverage × weight` per dim, matching _shared.ts exactly.
- Mirrored buildDomainList accepts + forwards dimensionWeights.
- main() imports RESILIENCE_DIMENSION_WEIGHTS from the scorer module
  and passes it through to buildDomainList at the single call site.
- Missing-entry default = 1.0 (same contract as production) — makes
  the harness forward-compatible with any future weight refactor
  (adds a new dim without an explicit entry, old production fallback
  path still produces the correct number).

Verification

- Harness syntax-check clean (node -c).
- RESILIENCE_DIMENSION_WEIGHTS import resolves correctly from the
  harness's import path.
- 509/509 resilience tests still pass (harness isn't in the test
  suite; the invariant is that production ↔ harness use the same
  math, and the production side is covered by tests/resilience-
  recovery-weight-rebalance.test.mts).

* fix(resilience): PR 2 review — bump cache prefixes v10→v11 + document coverage-vs-weight asymmetry

Greptile P1 + P2 on PR #3328.

P1 — cache prefix not bumped after formula change
--------------------------------------------------
The per-dim weight rebalance changes the score formula, but the
`_formula` tag only distinguishes 'd6' vs 'pc' (pillar-combined vs
legacy 6-domain) — it does NOT detect intra-'d6' weight changes. Left
unfixed, scores cached before deploy would be served with the old
equal-weight math for up to the full 6h TTL, and the ranking key for
up to its 12h TTL. Matches the established v9→v10 pattern for every
prior formula-changing deploy.

Bumped in lockstep:
 - RESILIENCE_SCORE_CACHE_PREFIX:     v10  → v11
 - RESILIENCE_RANKING_CACHE_KEY:      v10  → v11
 - RESILIENCE_HISTORY_KEY_PREFIX:      v5  → v6
 - scripts/seed-resilience-scores.mjs local mirrors
 - api/health.js resilienceRanking literal
 - 4 analysis/backtest scripts that read the cached keys directly
 - Test fixtures in resilience-{ranking, handlers, scores-seed,
   pillar-aggregation}.test.* that assert on literal key values

The v5→v6 history bump is the critical one: without it, pre-rebalance
history points would mix with post-rebalance points inside the 30-day
window, and change30d / trend math would diff values from different
formulas against each other, producing false-negative "falling" trends
for every country across the deploy window.

P2 — coverage-vs-weight asymmetry in computeLowConfidence / computeOverallCoverage
----------------------------------------------------------------------------------
Reviewer flagged that these two functions still average coverage
equally across all non-retired dims, even after the scoring aggregation
started applying RESILIENCE_DIMENSION_WEIGHTS. The asymmetry is
INTENTIONAL — these signals answer a different question from scoring:

  scoring aggregation: "how much does each dim matter to the score?"
  coverage signal:     "how much real data do we have on this country?"

A dim at weight 0.5 still has the same data-availability footprint as
a weight=1.0 dim: its coverage value reflects whether we successfully
fetched the upstream source, not whether the scorer cares about it.
Applying scoring weights to the coverage signal would let a
half-weight dim hide half its sparsity from the overallCoverage pill,
misleading users reading coverage as a data-quality indicator.

Added explicit comments to both functions noting the asymmetry is
deliberate and pointing at the other site for matching rationale.
No code change — just documentation.

Tests: 6588/6588 data-tier tests pass (+511 resilience-specific
including the prefix-literal assertions). Typecheck clean on both
tsconfig configs. Biome clean on touched files.

* docs(resilience): bump methodology doc cache-prefix references to v11/v6

Greptile P2 on PR #3328: Redis keys table in the reproducibility
appendix still published `score:v10` / `ranking:v10` / `history:v5`,
and the rollback instructions told operators to flush those keys.
After the recovery-domain weight rebalance, live cache runs at
`score:v11` / `ranking:v11` / `history:v6`.

- Updated the Redis keys table (line 490-492) to match `_shared.ts`.
- Updated the rollback block to name the current keys.
- Left the historical "Activation sequence" narrative intact (it
  accurately describes the pillar-combine PR's v9→v10 / v4→v5 bump)
  but added a parenthetical pointing at the current v11/v6 values.

No code change — doc-only correction for operator accuracy.

* fix(docs): escape MDX-unsafe `<137` pattern to unblock Mintlify deploy

Line 643 had `(<137 countries)` — MDX parses `<137` as a JSX tag
starting with digit `1`, which is illegal and breaks the deploy with
"Unexpected character \`1\` (U+0031) before name". Surfaced after the
prior cache-prefix commit forced Mintlify to re-parse this file.

Replaced with "fewer than 137 countries" for unambiguous rendering.
Other `<` occurrences in this doc (lines 34, 642) are followed by
whitespace and don't trip MDX's tag parser.
2026-04-23 10:25:18 +04:00

223 lines
8.4 KiB
JavaScript

#!/usr/bin/env node
import { loadEnvFile, getRedisCredentials } from './_seed-utils.mjs';
// Source of truth: server/worldmonitor/resilience/v1/_shared.ts → RESILIENCE_SCORE_CACHE_PREFIX
const RESILIENCE_SCORE_CACHE_PREFIX = 'resilience:score:v11:';
// Mirror of server/worldmonitor/resilience/v1/_shared.ts#currentCacheFormula.
// Must stay in lockstep with the server-side definition so this script
// skips cross-formula cache entries for the same reasons the server
// does — correlations benchmarked against a mixed-formula cohort of
// d6 + pc entries would be meaningless.
function currentCacheFormulaLocal() {
const combine = (process.env.RESILIENCE_PILLAR_COMBINE_ENABLED ?? 'false').toLowerCase() === 'true';
const v2 = (process.env.RESILIENCE_SCHEMA_V2_ENABLED ?? 'true').toLowerCase() === 'true';
return combine && v2 ? 'pc' : 'd6';
}
const REFERENCE_INDICES = {
ndgain: {
NO: 0.76, IS: 0.72, NZ: 0.71, DK: 0.74, SE: 0.73, FI: 0.72, CH: 0.73, AU: 0.70,
CA: 0.70, US: 0.67, DE: 0.68, GB: 0.67, FR: 0.65, JP: 0.66, KR: 0.63, IT: 0.59,
ES: 0.60, PL: 0.57, BR: 0.45, MX: 0.44, TR: 0.47, TH: 0.44, MY: 0.50, CN: 0.47,
IN: 0.37, ZA: 0.41, EG: 0.38, PK: 0.30, NG: 0.26, KE: 0.33, BD: 0.31, VN: 0.40,
PH: 0.38, ID: 0.42, UA: 0.43, RU: 0.44, AF: 0.20, YE: 0.17, SO: 0.15, HT: 0.22,
SS: 0.14, CF: 0.18, SD: 0.19, ML: 0.25, NE: 0.22, TD: 0.20, SY: 0.21, IQ: 0.30,
MM: 0.28, VE: 0.30, IR: 0.35, ET: 0.26,
},
inform: {
NO: 1.8, IS: 1.5, NZ: 2.1, DK: 1.7, SE: 1.9, FI: 1.6, CH: 1.4, AU: 2.3,
CA: 2.0, US: 3.1, DE: 2.2, GB: 2.4, FR: 2.5, JP: 3.0, KR: 2.3, IT: 2.6,
ES: 2.4, PL: 2.1, BR: 4.1, MX: 4.5, TR: 4.0, TH: 3.5, MY: 3.0, CN: 4.2,
IN: 5.5, ZA: 4.3, EG: 4.8, PK: 6.2, NG: 6.5, KE: 5.0, BD: 5.8, VN: 3.8,
PH: 5.2, ID: 4.8, UA: 5.5, RU: 4.5, AF: 8.0, YE: 8.5, SO: 8.8, HT: 7.2,
SS: 8.3, CF: 8.1, SD: 8.4, ML: 6.8, NE: 7.0, TD: 7.5, SY: 7.8, IQ: 6.8,
MM: 7.0, VE: 5.8, IR: 5.0, ET: 7.2,
},
};
const SAMPLE_COUNTRIES = Object.keys(REFERENCE_INDICES.ndgain);
async function redisPipeline(url, token, commands) {
const resp = await fetch(`${url}/pipeline`, {
method: 'POST',
headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
body: JSON.stringify(commands),
signal: AbortSignal.timeout(30_000),
});
if (!resp.ok) throw new Error(`Redis pipeline HTTP ${resp.status}`);
return resp.json();
}
function toRanks(values) {
const indexed = values.map((v, i) => ({ v, i }));
indexed.sort((a, b) => a.v - b.v);
const ranks = new Array(values.length);
let pos = 0;
while (pos < indexed.length) {
let end = pos + 1;
while (end < indexed.length && indexed[end].v === indexed[pos].v) end++;
const avgRank = (pos + end + 1) / 2;
for (let k = pos; k < end; k++) ranks[indexed[k].i] = avgRank;
pos = end;
}
return ranks;
}
function pearson(x, y) {
const n = x.length;
if (n < 3) return NaN;
let sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
for (let i = 0; i < n; i++) {
sumX += x[i];
sumY += y[i];
sumXY += x[i] * y[i];
sumX2 += x[i] * x[i];
sumY2 += y[i] * y[i];
}
const numerator = n * sumXY - sumX * sumY;
const denominator = Math.sqrt((n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY));
if (denominator === 0) return 0;
return numerator / denominator;
}
function spearmanRho(x, y) {
return pearson(toRanks(x), toRanks(y));
}
async function fetchWorldMonitorScores(url, token, countryCodes) {
const commands = countryCodes.map((c) => ['GET', `${RESILIENCE_SCORE_CACHE_PREFIX}${c}`]);
const results = await redisPipeline(url, token, commands);
const current = currentCacheFormulaLocal();
const skipped = { staleFormula: 0, noOverallScore: 0, malformed: 0 };
const scores = new Map();
for (let i = 0; i < countryCodes.length; i++) {
const raw = results[i]?.result;
if (typeof raw !== 'string') continue;
try {
const parsed = JSON.parse(raw);
// Cross-formula gate: the benchmark/validation scripts run off
// live cache entries. A mixed-formula cohort (some countries
// scored under d6, others under pc because their cache entries
// landed on either side of a flag flip) would produce a
// meaningless Spearman. Skip stale-formula entries so the
// correlation runs only against same-formula peers.
if (parsed?._formula !== current) {
skipped.staleFormula++;
continue;
}
if (typeof parsed?.overallScore === 'number' && parsed.overallScore > 0) {
scores.set(countryCodes[i], parsed.overallScore);
} else {
skipped.noOverallScore++;
}
} catch {
skipped.malformed++;
}
}
if (skipped.staleFormula > 0) {
console.warn(`[validate-resilience-correlation] skipped ${skipped.staleFormula} stale-formula entries (current=${current})`);
}
return scores;
}
function computeCorrelation(wmScores, referenceScores, invert = false) {
const paired = [];
for (const [iso2, wmScore] of wmScores.entries()) {
const refScore = referenceScores[iso2];
if (refScore == null) continue;
paired.push({
iso2,
wm: wmScore,
ref: invert ? -refScore : refScore,
});
}
if (paired.length < 10) {
return { rho: NaN, n: paired.length, divergences: [] };
}
const wmValues = paired.map((p) => p.wm);
const refValues = paired.map((p) => p.ref);
const rho = spearmanRho(wmValues, refValues);
const wmRanks = toRanks(wmValues);
const refRanks = toRanks(refValues);
const divergences = paired.map((p, i) => ({
iso2: p.iso2,
wmRank: Math.round(wmRanks[i]),
refRank: Math.round(refRanks[i]),
delta: Math.abs(Math.round(wmRanks[i]) - Math.round(refRanks[i])),
}));
divergences.sort((a, b) => b.delta - a.delta);
return { rho, n: paired.length, divergences };
}
function padRight(str, len) {
return str + ' '.repeat(Math.max(0, len - str.length));
}
async function run() {
loadEnvFile(import.meta.url);
const { url, token } = getRedisCredentials();
console.log(`Fetching WorldMonitor resilience scores for ${SAMPLE_COUNTRIES.length} countries...`);
const wmScores = await fetchWorldMonitorScores(url, token, SAMPLE_COUNTRIES);
console.log(`Retrieved scores for ${wmScores.size}/${SAMPLE_COUNTRIES.length} countries\n`);
if (wmScores.size < 20) {
console.error('Too few scores available. Ensure resilience scores are cached in Redis.');
process.exit(1);
}
const ndgainResult = computeCorrelation(wmScores, REFERENCE_INDICES.ndgain, false);
const informResult = computeCorrelation(wmScores, REFERENCE_INDICES.inform, true);
console.log('=== EXTERNAL INDEX CORRELATION ===\n');
const ndgainPass = ndgainResult.rho > 0.65;
const informPass = informResult.rho > 0.60;
console.log(`WorldMonitor vs ND-GAIN Readiness: rho = ${ndgainResult.rho.toFixed(3)} (n=${ndgainResult.n}, target > 0.65) ${ndgainPass ? 'PASS' : 'FAIL'}`);
console.log(`WorldMonitor vs INFORM Risk: rho = ${informResult.rho.toFixed(3)} (n=${informResult.n}, target > 0.60, inverted) ${informPass ? 'PASS' : 'FAIL'}`);
const passingCount = [ndgainPass, informPass].filter(Boolean).length;
const gatePass = passingCount >= 2;
console.log(`\nGATE CHECK: rho > 0.6 for at least 2 indices? ${gatePass ? 'YES' : 'NO'} (${passingCount}/2 passing)\n`);
for (const [label, result] of [['ND-GAIN', ndgainResult], ['INFORM', informResult]]) {
console.log(`Top divergences vs ${label} (countries that rank very differently):`);
const top5 = result.divergences.slice(0, 5);
for (const d of top5) {
console.log(` ${padRight(d.iso2 + ':', 5)} WM rank ${padRight(String(d.wmRank), 3)}, ${label} rank ${padRight(String(d.refRank), 3)} (delta ${d.delta})`);
}
console.log('');
}
const allCountriesSorted = [...wmScores.entries()]
.sort((a, b) => b[1] - a[1])
.map(([iso2, score], i) => ({ iso2, score, rank: i + 1 }));
console.log('WorldMonitor score ranking (sample):');
console.log(' Rank ISO2 Score');
for (const entry of allCountriesSorted) {
console.log(` ${padRight(String(entry.rank), 6)}${padRight(entry.iso2, 6)}${entry.score.toFixed(1)}`);
}
return { ndgainRho: ndgainResult.rho, informRho: informResult.rho, gatePass };
}
const isMain = process.argv[1]?.endsWith('validate-resilience-correlation.mjs');
if (isMain) {
run().catch((err) => {
console.error(`FATAL: ${err instanceof Error ? err.message : String(err)}`);
process.exit(1);
});
}
export { run, spearmanRho, toRanks, pearson, computeCorrelation, REFERENCE_INDICES, SAMPLE_COUNTRIES };