mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
* feat(resilience): PR 0 diagnostic freeze + fairness-audit harness
Lands the before-state and measurement apparatus every subsequent
resilience-scorer PR validates against. Zero scoring changes. Per the
v3 plan at docs/plans/2026-04-22-001-fix-resilience-scorer-structural-
bias-plan.md this is tranche 0 of five.
What lands:
- Construct contract published in the methodology doc: absolute
resilience not development-adjusted, mechanism test for every
indicator, peer-relative views published separately from the core.
- Known construct limitations section: six construct errors scheduled
for PR 1-3 repair with explicit mapping to plan tranches.
- Indicator-source manifest at docs/methodology/indicator-sources.yaml
with source, seriesId, seriesUrl, coveragePct, lastObservedYear,
license, mechanismTestRationale, and a constructStatus classification.
- Pre-repair ranking snapshot at
docs/snapshots/resilience-ranking-live-pre-repair-2026-04-22.json
(217 items + 5 greyedOut, captured 2026-04-22 08:38 UTC at commit
425507d15).
- Cohort configuration at tests/helpers/resilience-cohorts.mts: six
cohorts covering 87 countries (net-fuel-exporters, net-energy-
importers-oecd, nuclear-heavy-generation, coal-heavy-domestic,
small-island-importers, fragile-states).
- Matched-pair sanity panel at tests/helpers/resilience-matched-pairs.mts:
six pairs (FR/DE, NO/CA, UAE/BH, JP/KR, IN/ZA, SG/CH) with expected-
direction rationale and minGap for acceptance gate 7.
- scripts/compare-resilience-current-vs-proposed.mjs extended to emit
cohortSummary and matchedPairSummary alongside the existing output
shape (backward compatible).
- tests/resilience-cohort-config.test.mts: 11 validations ensuring the
cohort + matched-pair configs stay well-formed.
Deferred to PR 0.5 (before PR 1 lands):
- Monotonicity test harness for all 19 dimension scorers pinning the
sign of every indicator.
- Pearson-derivative variable-influence baseline inside the sensitivity
script producing the nominal-weight-vs-effective-influence table that
plan acceptance gate 8 requires.
Verification: typecheck:all clean, 430/430 resilience tests pass,
11/11 new cohort-config tests pass, snapshot auto-discovered and
validated by the existing snapshot-test harness.
* feat(resilience): PR 0 follow-ups — monotonicity harness, variable-influence baseline, cross-consumer formula gate
Completes the PR 0 scope per the v3 plan §5 deliverables. Three adds:
1. Monotonicity test harness
tests/resilience-dimension-monotonicity.test.mts pins the direction
of movement for 14 indicators across 7 dimensions (reserve adequacy,
fiscal space 3x, external debt coverage, import concentration,
governance WGI, food/water 2x, energy 5x). Each test builds two
synthetic ResilienceSeedReader fixtures differing only in the target
indicator and asserts the dimension score moves in the documented
direction. The scoreEnergy tests explicitly flag three indicators
(gasShare, coalShare, electricityConsumption) that PR 1 §3.1-3.2
overturns so future readers understand which directional claims the
plan intentionally replaces.
2. Variable-influence baseline
scripts/compare-resilience-current-vs-proposed.mjs now computes
per-dimension Pearson correlation against the current overallScore
scaled by the dimension's nominal domain weight (a Pearson-derivative
approximation of Sobol indices). The output carries a
variableInfluence[] array sorted by abs(effectiveInfluence) desc.
Acceptance gate 8 from the plan compares post-change effective
influence against assigned nominal weight; divergences flag a
wealth-proxy or saturated-signal construct problem.
3. Cross-consumer formula gate
Five external consumers of resilience:score:v10:* now filter stale-
formula entries so a flag flip does not serve mixed-formula data
downstream:
- server/worldmonitor/supply-chain/v1/get-route-impact.ts —
readResilienceScore() checks _formula via the new
getCurrentCacheFormula export and returns 0 on mismatch.
- scripts/validate-resilience-correlation.mjs,
scripts/validate-resilience-backtest.mjs,
scripts/backtest-resilience-outcomes.mjs,
scripts/benchmark-resilience-external.mjs — each inlines a
currentCacheFormulaLocal() helper that mirrors the server's
formula derivation from env, skips parsed entries whose
_formula disagrees, and logs the skip count so operators can
notice a mismatch during the flip window.
A mixed-formula cohort (some countries d6-tagged, others pc-tagged)
would confound every correlation, AUC, and Spearman this repair plan
depends on for its acceptance gates. These guards close that gap.
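The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the env flag name `EXAMPLE_PILLAR_COMBINE_ENABLED`, the entry shape, and the choice to drop untagged entries are all assumptions made for the example. Only the `_formula` field and the `d6` / `pc` tags come from the text above.

```javascript
// Hypothetical env flag name, used only for this sketch; the real derivation
// lives in the server module behind getCurrentCacheFormula.
function currentCacheFormulaLocal(env) {
  return env.EXAMPLE_PILLAR_COMBINE_ENABLED === 'true' ? 'pc' : 'd6';
}

// Keep only cache entries whose _formula tag matches the active formula and
// count the rest so the caller can log the skip count.
// Assumption: entries with no _formula tag are treated as stale.
function filterByCurrentFormula(entries, currentFormula) {
  const kept = [];
  let skipped = 0;
  for (const entry of entries) {
    if (entry && entry._formula === currentFormula) kept.push(entry);
    else skipped += 1;
  }
  return { kept, skipped };
}

const formula = currentCacheFormulaLocal({ EXAMPLE_PILLAR_COMBINE_ENABLED: 'false' });
const { kept, skipped } = filterByCurrentFormula(
  [
    { countryCode: 'FR', overallScore: 71, _formula: 'd6' },
    { countryCode: 'DE', overallScore: 73, _formula: 'pc' }, // stale tag, dropped
    { countryCode: 'NO', overallScore: 80 }, // untagged, dropped in this sketch
  ],
  formula,
);
if (skipped > 0) {
  console.warn(`skipped ${skipped} entries with a stale _formula (current=${formula})`);
}
```

Logging the skip count rather than failing keeps the validation scripts usable during the flip window while still making a mixed-formula cache visible.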
Verification: typecheck:all clean, 444/444 resilience tests pass
(+14 from the new monotonicity harness).
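For readers unfamiliar with the variable-influence baseline in item 2, the calculation can be approximated as below. It is a sketch under assumptions: the input row shape (`dimensionId`, `nominalWeight`, `scores`) is hypothetical, and only the idea of scaling a Pearson correlation by the nominal weight and sorting by absolute effective influence is taken from the description above.

```javascript
// Pearson correlation between two equal-length numeric arrays.
// Returns NaN when the inputs are empty, mismatched in length, or when either
// series has exactly zero variance.
function pearson(xs, ys) {
  const n = xs.length;
  if (n === 0 || n !== ys.length) return NaN;
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) return NaN;
  return cov / Math.sqrt(varX * varY);
}

// dimensions: array of { dimensionId, nominalWeight, scores } (hypothetical
// shape), where scores[i] and overallScores[i] belong to the same country.
function variableInfluence(dimensions, overallScores) {
  return dimensions
    .map((d) => {
      const r = pearson(d.scores, overallScores);
      return {
        dimensionId: d.dimensionId,
        nominalWeight: d.nominalWeight,
        pearson: r,
        effectiveInfluence: r * d.nominalWeight,
      };
    })
    .sort((a, b) => Math.abs(b.effectiveInfluence) - Math.abs(a.effectiveInfluence));
}
```

A row whose effective influence is far from its nominal weight, or has the opposite sign, is the kind of divergence gate 8 is meant to surface.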
* fix(resilience): PR 0 review follow-ups — sample-union + doc tense
Two review-driven fixes on top of PR 0.
1. scripts/compare-resilience-current-vs-proposed.mjs — the cohort and
matched-pair summaries were computed against the historical
52-country sensitivity seed, which silently excluded the
small-island-importers cohort (zero members in the seed) and the
sg-vs-ch matched pair (Singapore not in the seed). Against that
sample, those acceptance gates were partially measured at best.
SAMPLE now = union(historical 52 seed, every cohort member, every
matched-pair endpoint). The imports for RESILIENCE_COHORTS and
MATCHED_PAIRS moved from inside main() to module scope so the union
can be computed before the script runs.
Net sample size grows from 52 to ~95 countries. Still fast enough
for an interactive pass; makes the acceptance gates honest.
2. docs/methodology/country-resilience-index.mdx — the construct
contract wording read as present-tense compliance ("Every indicator
in the scorer passes a single mechanism test"), which contradicted
the immediately-following passage about indicators that currently
fail the test. Reworded to "is being evaluated against" and added
an explicit PR-0-does-not-change-scoring paragraph that names the
known-failing indicators (electricityConsumption, gas/coal flat
penalties, WHO per-capita health spend) and points at the repair
plan for the replacement schedule.
Verification: typecheck:all clean, 444/444 resilience tests pass.
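The sample union from fix 1 amounts to a set union over three sources. A minimal sketch, assuming hypothetical field names (`countryCodes` on a cohort, `endpoints` on a matched pair) that may not match the real config modules:

```javascript
// Build the scoring sample as the union of the historical seed, every cohort
// member, and both endpoints of every matched pair. Field names are
// illustrative assumptions.
function buildSample(historicalSeed, cohorts, matchedPairs) {
  const sample = new Set(historicalSeed);
  for (const cohort of cohorts) {
    for (const code of cohort.countryCodes) sample.add(code);
  }
  for (const pair of matchedPairs) {
    for (const code of pair.endpoints) sample.add(code);
  }
  return [...sample].sort();
}

const sample = buildSample(
  ['FR', 'DE'],
  [{ id: 'small-island-importers', countryCodes: ['MV', 'FJ'] }],
  [{ id: 'sg-vs-ch', endpoints: ['SG', 'CH'] }],
);
```

Using a Set keeps countries that appear in several cohorts from being scored twice.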
* fix(resilience): compare-script loads frozen baseline + emits per-indicator influence
Addresses two P1 review findings on PR #3284:
1. Script previously compared current-6d vs proposed-pillar-combined
from the SAME checkout; never loaded the frozen pre-PR-0 baseline,
so acceptance gates 2/6/7 ("no country moved >15pts vs baseline",
cohort median shift vs baseline, matched-pair gap change vs
baseline) could not be enforced for later scorer PRs.
Now auto-discovers the most recent
resilience-ranking-live-pre-repair-<date>.json (or post-<pr>-<date>)
in docs/snapshots/ and emits a baselineComparison block with:
spearmanVsBaseline, maxCountryAbsDelta, biggestDriftsVsBaseline,
cohortShiftVsBaseline, matchedPairGapChange. If no baseline is
found, the block is emitted with status 'unavailable' so callers
distinguish missing-baseline from passed-baseline.
2. variableInfluence was emitted only at the dimension level, which
hid the exact sub-indicators the repair plan targets
(electricityConsumption, gasShare, coalShare, etc.) inside their
parent dimension. Added extractIndicatorValues() which pulls twelve
construct-risk indicators per country from the shared memoized
reader, then computes per-indicator Pearson correlation against
the current overall score. Emitted as perIndicatorInfluence[],
sorted by absolute effective influence.
Acceptance gate 8 ("effective influence agrees in sign and rank-order
with assigned nominal weights") is now computable at the indicator
level, not only at the dimension level.
No production code touched; diagnostic-harness only.
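A rough sketch of the per-country part of the baselineComparison block follows. The input shapes and the `'ok'` status value are assumptions for illustration; the text above only names the `'unavailable'` status and the maxCountryAbsDelta / biggestDriftsVsBaseline fields. Spearman, cohort shift, and matched-pair gap change are omitted.

```javascript
// currentScores: { [countryCode]: number } from the current run.
// baselineItems: array of { countryCode, overallScore } loaded from the
// snapshot, or null when no snapshot was found (hypothetical shapes).
function compareToBaseline(currentScores, baselineItems, topN = 3) {
  if (!Array.isArray(baselineItems) || baselineItems.length === 0) {
    // Emit the block anyway so callers can tell missing from passed.
    return { status: 'unavailable' };
  }
  const drifts = [];
  for (const item of baselineItems) {
    const current = currentScores[item.countryCode];
    if (typeof current !== 'number') continue; // not scored in this run
    drifts.push({
      countryCode: item.countryCode,
      baseline: item.overallScore,
      current,
      delta: current - item.overallScore,
    });
  }
  drifts.sort((a, b) => Math.abs(b.delta) - Math.abs(a.delta));
  return {
    status: 'ok', // assumed label for the "baseline found" case
    comparedCountries: drifts.length,
    maxCountryAbsDelta: drifts.length > 0 ? Math.abs(drifts[0].delta) : 0,
    biggestDriftsVsBaseline: drifts.slice(0, topN),
  };
}
```

A later scorer PR could then check gate 2 by asserting that `maxCountryAbsDelta` stays at or below 15 whenever the status is not `'unavailable'`.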
* fix(resilience): baseline-snapshot selection by structured parse, not filename sort
Addresses P1 review on compare-resilience-current-vs-proposed.mjs:118-130.
Plain filename sort breaks the "immediate-prior state" contract two ways:
1. Lexical ordering: `pre-repair` sorts after `post-*`
(`pr...` vs `po...`: 'r' > 'o'), so the PR-0 freeze would keep winning even
after post-PR snapshots exist. Later scorer PRs would then report
acceptance-gate deltas against the original pre-repair freeze
instead of the immediately-prior post-PR-(N-1) snapshot — the gate
would appear valid while measuring against the wrong baseline.
2. Lexical ordering: `pr10` < `pr9` (digit-by-digit), so PR-10 would
lose the selection to PR-9.
Fix: parseBaselineSnapshotMeta() extracts (kind, prNumber, date) from
the filename. Sort keys are (kindRank desc, prNumber desc, date desc):
- post always beats pre-repair (kindRank 1 vs 0)
- among posts, prNumber compared numerically (10 beats 9)
- date breaks ties (same-PR re-snapshots, later capture wins)
- unlabeled post tags get prNumber 0 so they sort between
pre-repair and any numbered PR snapshot
Surfaced in output: baselineKind / baselinePrNumber / baselineDate
alongside baselineFile so the operator can verify which snapshot was
selected without having to reopen the file.
Module now isMain-guarded per feedback_seed_isMain_guard memory so
tests can import parseBaselineSnapshotMeta without firing the
scoring run.
Added tests/resilience-baseline-snapshot-ordering.test.mjs (9 tests)
pinning the ordering contract for every known failure mode.
Diagnostic-harness change only. No production code touched.
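The ordering contract above can be illustrated with the sketch below. The filename regex and the `pr<N>` tag format are assumptions inferred from the names quoted in this message; the real parseBaselineSnapshotMeta may accept other shapes.

```javascript
// Assumed filename shapes:
//   resilience-ranking-live-pre-repair-<YYYY-MM-DD>.json
//   resilience-ranking-live-post-<tag>-<YYYY-MM-DD>.json
const SNAPSHOT_RE =
  /^resilience-ranking-live-(pre-repair|post-(.+?))-(\d{4}-\d{2}-\d{2})\.json$/;

function parseBaselineSnapshotMeta(filename) {
  const m = SNAPSHOT_RE.exec(filename);
  if (!m) return null;
  const [, kindPart, postTag, date] = m;
  if (kindPart === 'pre-repair') {
    return { filename, kind: 'pre-repair', kindRank: 0, prNumber: 0, date };
  }
  // Unlabeled post tags (anything that is not pr<N>) get prNumber 0.
  const pr = /^pr(\d+)$/.exec(postTag);
  return { filename, kind: 'post', kindRank: 1, prNumber: pr ? Number(pr[1]) : 0, date };
}

// Descending by kindRank, then numeric prNumber, then ISO date string.
function compareSnapshotsDesc(a, b) {
  if (a.kindRank !== b.kindRank) return b.kindRank - a.kindRank;
  if (a.prNumber !== b.prNumber) return b.prNumber - a.prNumber;
  if (a.date === b.date) return 0;
  return a.date < b.date ? 1 : -1;
}

function selectBaseline(filenames) {
  const parsed = filenames.map(parseBaselineSnapshotMeta).filter(Boolean);
  parsed.sort(compareSnapshotsDesc);
  return parsed[0] ?? null;
}
```

Comparing prNumber as a number is what lets PR 10 beat PR 9, and ISO dates can be compared as plain strings because they are zero-padded.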
* fix(resilience): full scorable universe + registry-driven per-indicator influence
Addresses two fresh P1 review findings on the PR 0 compare harness.
Finding 1 — acceptance math ran on a curated ~95-country sample,
so plan gate 2 could miss large regressions in excluded countries.
- Main scoring loop now iterates the FULL scorable universe
(listScorableCountries()), not the 52-country seed + cohort union.
- Removed SAMPLE / HISTORICAL_SENSITIVITY_SEED constants.
- Added scorableUniverseSize + cohortMissingFromScorable to output
so operators see universe size and any cohort/pair endpoint that
listScorableCountries() refuses to score (fail-loud, not silent drop).
Finding 2 — per-indicator influence was a hand-picked 12-indicator
subset, hiding most registry indicators from the baseline that
later scorer PRs need.
- Extraction is now driven by INDICATOR_REGISTRY. Every Core +
Enrichment indicator gets a row with explicit extractionStatus:
implemented | not-implemented (with reason) | unregistered-in-harness
- EXTRACTION_RULES covers 40/59 indicators across 11 shape families
(static-path, static-wb-infrastructure, static-wgi, static-wgi-mean,
static-who, energy-mix-field, gas-storage-field, recovery-country-
field, imf-macro/labor-country-field, national-debt, sanctions-count).
- Remaining 19 indicators need either a scorer trace hook (PR 0.5)
or a safe aggregation duplicate; each carries a reason string.
- extractionCoverage summary (totalIndicators / implemented /
notImplemented / unregisteredInHarness / coreImplemented / coreTotal)
exposed in output so PR 0.5 progress is measurable.
Added tests/resilience-indicator-extraction-plan.test.mjs (11 tests)
pinning: every registry entry has an extraction row; not-implemented
rows carry a reason; all 12 plan-named construct-risk indicators stay
extractable; Core-tier coverage floor of 45%; shape-family unit tests.
Diagnostic-harness change only. No production code touched.
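The registry-driven plan and its coverage summary can be sketched as follows. The registry and rule shapes (`id`, `tier`, `type`, `reason`) and the `shapeFamily` output field are assumptions for the example; the three extractionStatus values and the extractionCoverage field names are taken from the description above.

```javascript
// registry: array of { id, tier } (hypothetical shape).
// rules: { [id]: { type: '<shape-family>' } | { type: 'not-implemented', reason } }.
function buildExtractionPlan(registry, rules) {
  return registry.map(({ id, tier }) => {
    const rule = rules[id];
    if (!rule) return { id, tier, extractionStatus: 'unregistered-in-harness' };
    if (rule.type === 'not-implemented') {
      return { id, tier, extractionStatus: 'not-implemented', reason: rule.reason };
    }
    return { id, tier, extractionStatus: 'implemented', shapeFamily: rule.type };
  });
}

function summarizeCoverage(plan) {
  const count = (pred) => plan.filter(pred).length;
  const isImplemented = (row) => row.extractionStatus === 'implemented';
  return {
    totalIndicators: plan.length,
    implemented: count(isImplemented),
    notImplemented: count((r) => r.extractionStatus === 'not-implemented'),
    unregisteredInHarness: count((r) => r.extractionStatus === 'unregistered-in-harness'),
    coreImplemented: count((r) => r.tier === 'core' && isImplemented(r)),
    coreTotal: count((r) => r.tier === 'core'),
  };
}
```

Because every registry entry produces a row, an indicator added to the registry without a harness rule shows up as `unregistered-in-harness` instead of silently disappearing from the baseline.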
* fix(resilience): wire event-aggregate per-indicator influence via exported scorer helpers
Addresses P1 review on PR 0 compare harness. Previous commit marked 16
Core-tier indicators as 'not-implemented' because they needed scorer
event-window/severity-weighting math; that left the gate-8 acceptance
apparatus incomplete for a large part of the shipped score.
Fix: export the scorer-internal aggregation helpers so the harness
calls them directly. Zero aggregation math is duplicated in the harness,
so harness and scorer cannot drift.
Exported from _dimension-scorers.ts (purely additive):
summarizeCyber, summarizeOutages, summarizeGps,
summarizeUcdp, summarizeUnrest, summarizeSocialVelocity,
getCountryDisplacement, getThreatSummaryScore,
countTradeRestrictions, countTradeBarriers.
13 extraction rules moved from not-implemented to implemented:
cyberThreats, internetOutages, infraOutages, gpsJamming,
ucdpConflict, unrestEvents, socialVelocity, newsThreatScore,
displacementTotal, displacementHosted, tradeRestrictions,
tradeBarriers, recoveryConflictPressure, recoveryDisplacementVelocity.
Coverage:
52/59 total (88%), 46/50 Core-tier (92%).
Four Core indicators remain not-implemented for STRUCTURAL reasons,
NOT missing code. For three of them, scorer inputs are genuinely
global scalars with zero per-country variance, so
Pearson(indicator, overall) is 0 or NaN by construction:
shippingStress, transitDisruption, energyPriceStress — scorer
reads a global scalar applied to every country; a per-country
effective signal would need re-expression as (global x per-country
exposure), which is a derived signal in a different entry.
aquastatWaterAvailability — needs a distinct sub-indicator path
resolver; enrichment follow-up.
New test asserts the three no-per-country-variance indicators STAY
not-implemented with a matching reason, so any future extraction
that appears to cover them without fixing the underlying construct
fails.
Dispatcher split into STATIC / SIMPLE / AGGREGATE extractor tables
to stay under biome complexity limit. Core-tier floor test raised
from 45% to 80%.
89 resilience tests pass, typecheck clean, biome clean. No production
behaviour changes.
* fix(resilience): tag-gated AQUASTAT extractor closes the last fixable Core gap
Reviewer flagged aquastatWaterAvailability as the only remaining Core
indicator where the not-implemented status was structurally fixable
rather than conceptually impossible.
Both aquastatWaterStress and aquastatWaterAvailability share a single
.aquastat.value field; the scorer's scoreAquastatValue splits them
by the sibling .aquastat.indicator tag keyword (stress/withdrawal/
dependency map to the stress family; availability/renewable/access map
to the availability family). The harness now mirrors this branching:
- classifyAquastatFamily implements the scorer's priority order
(stress-family match wins even if the tag also contains an
availability keyword, matching the sequential if-check at
_dimension-scorers.ts L770-776).
- static-aquastat-stress / static-aquastat-availability extractors
return the value only when the family matches, so stress-family
readings never corrupt the availability Pearson and vice versa.
Core-tier coverage rises from 46/50 to 47/50 (94%). The 3 remaining Core
not-implemented indicators (shippingStress, transitDisruption,
energyPriceStress) are all structural impossibilities: scorer inputs
are global scalars with zero per-country variance.
New contract test pins both directions of the tag gate plus the
priority-order edge case (a tag containing both families' keywords
routes to stress).
90 resilience tests pass, typecheck clean, biome clean.
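
The tag gate can be sketched as below. The keyword lists and the stress-first priority come from the description above; the case-insensitive match, the `'unknown'` fallback, and the `extractAquastat` helper name are assumptions for illustration and may differ from the scorer.

```javascript
const STRESS_KEYWORDS = ['stress', 'withdrawal', 'dependency'];
const AVAILABILITY_KEYWORDS = ['availability', 'renewable', 'access'];

// Stress-family keywords are checked first, so a tag containing keywords
// from both families routes to 'stress'.
function classifyAquastatFamily(tag) {
  const t = String(tag ?? '').toLowerCase();
  if (STRESS_KEYWORDS.some((k) => t.includes(k))) return 'stress';
  if (AVAILABILITY_KEYWORDS.some((k) => t.includes(k))) return 'availability';
  return 'unknown';
}

// Return the shared .aquastat.value only when the sibling tag belongs to the
// requested family; otherwise null, so the other family's readings never
// enter this indicator's correlation.
function extractAquastat(staticRecord, family) {
  const aq = staticRecord?.aquastat;
  if (!aq || typeof aq.value !== 'number') return null;
  return classifyAquastatFamily(aq.indicator) === family ? aq.value : null;
}
```

Returning null instead of a default value lets the correlation step drop the country for that indicator rather than mixing in a reading from the wrong family.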