worldmonitor/tests/seed-sovereign-wealth.test.mjs
Elie Habib 8032dc3a04 feat(resilience): PR 2 pre-scorer — SWF manifest + seeder (8/8 funds) (#3305)
* feat(resilience): PR 2 scaffolding — SWF classification manifest + seeder skeleton

Plan §3.4. First of multiple commits for PR 2 (fiscal-buffer split
and sovereign-wealth integration). This commit is SCAFFOLDING ONLY:
no dimension wiring, no scorer, no cache-keys entry yet. The goal is
to land the reviewer-facing metadata and the seeder's three-tier
source shape so an external SWF practitioner can critique before we
wire the scorer.

What is in:

1. docs/methodology/swf-classification-manifest.yaml — authoritative
   per-fund classification for the `sovereignFiscalBuffer` dimension.
   First-pass estimates for the 8 funds named in plan §3.4 table:
   Norway GPFG, UAE ADIA + Mubadala, Saudi PIF, Kuwait KIA,
   Qatar QIA, Singapore GIC + Temasek. Each fund carries:
     - three-component classification (access, liquidity, transparency)
       each on [0, 1], with rationale text citing the mandate / fiscal
       rule / asset-mix / transparency-index evidence
     - source URLs for audit
   Fund-candidates deferred for external-reviewer decision are listed
   in a trailing comment block (CIC, NWF, SOFAZ, NSIA, Future Fund,
   NZ Super, ESSF, etc.).

   external_review_status: PENDING — flip to REVIEWED on sign-off.

2. scripts/shared/swf-manifest-loader.mjs — YAML parser + strict schema
   validator. Fails loudly on any deviation (out-of-range scores,
   non-ISO2 countries, missing rationale, duplicate fund IDs, wrong
   manifest version). Single source of truth for the seeder, future
   scorer, and methodology-doc linter.

3. scripts/seed-sovereign-wealth.mjs — seeder shell with the three-tier
   source priority from plan §3.4:
     1. Official fund disclosures (MoF, central-bank, annual reports)
     2. IFSWF member filings
     3. SWFI public fund-rankings page (license-free fallback, scraped)
   Tiers 1-3 are all stubbed (return null) in this commit — the
   seeder publishes a well-formed empty payload so the scorer IMPUTE
   fallback can be exercised end-to-end without live data.
   emptyDataIsFailure: false is set deliberately so pre-wiring cron
   runs do not poison seed-meta (see
   feedback_strict_floor_validate_fail_poisons_seed_meta.md).

   SWFI scrape target is documented in the file header with the
   exact URL and a 2.5s inter-request interval. The scraper itself
   lands in the next commit after the external reviewer signs off
   on the manifest.

4. tests/swf-classification-manifest.test.mjs — 14 tests exercising
   both the shipped YAML (plan §3.4 required-fund presence, [0,1]
   bounds, rationale length, source citations, multi-fund country
   handling) and the validator's schema enforcement (rejects out-
   of-range scores, non-ISO2 codes, missing rationale, empty sources,
   duplicates, wrong version, invalid review status).
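
As an illustration of the fail-loudly validation described in items 2 and 4, a minimal sketch follows. The function name `validateFundsSketch` and the `classification` field name are assumptions made for this example; the shipped loader may structure its schema differently.

```javascript
// Hypothetical sketch, not the shipped swf-manifest-loader.mjs.
// Assumes each fund looks like:
//   { country: 'NO', fund: 'gpfg', classification: { access, liquidity, transparency } }
const COMPONENTS = ['access', 'liquidity', 'transparency'];

function validateFundsSketch(funds) {
  const seen = new Set();
  funds.forEach((entry, idx) => {
    // Country must be a two-letter uppercase code.
    if (!/^[A-Z]{2}$/.test(entry.country)) {
      throw new Error(`funds[${idx}]: country must be ISO-2, got ${entry.country}`);
    }
    // Reject duplicate country + fund identifiers.
    const id = `${entry.country}:${entry.fund}`;
    if (seen.has(id)) throw new Error(`funds[${idx}]: duplicate fund ${id}`);
    seen.add(id);
    // Every classification component must be a number on [0, 1].
    for (const key of COMPONENTS) {
      const score = entry.classification?.[key];
      if (typeof score !== 'number' || !(score >= 0 && score <= 1)) {
        throw new Error(`funds[${idx}]: ${key} must be a number in [0, 1]`);
      }
    }
  });
  return funds.length;
}
```

Throwing on the first deviation, rather than collecting warnings, keeps a malformed manifest from reaching the seeder at all.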

Out of scope for this commit (follow-ups, in order):
 - Implement SWFI scrape + IFSWF parse + per-fund official endpoints
 - Add `liquidReserveAdequacy` and `sovereignFiscalBuffer` dimensions
   to RESILIENCE_DIMENSION_ORDER, registry, and scorers
 - Retire `reserveAdequacy` via RESILIENCE_RETIRED_DIMENSIONS
 - cache-keys.ts + api/bootstrap.js + api/health.js wiring (new
   seed key needs ON_DEMAND_KEYS gating per Railway-cron bake-in rule)
 - Recovery-domain weight rebalance + Spearman sensitivity rerun
 - Methodology doc: rewrite the reserveAdequacy section

Tests: 508/508 pass (resilience suite + new manifest tests).
Typecheck clean on both tsconfig.json and tsconfig.api.json.

No external-facing behavior change — all files are new + isolated.

* feat(resilience): PR 2 commit 2 — Wikipedia SWF scraper + SWFI pivot

Implements Tier 3 of the sovereignFiscalBuffer seeder. Tier 1 (official
disclosures) and Tier 2 (IFSWF filings) remain stubbed — they require
per-fund bespoke adapters and will land incrementally.

SWFI pivot
----------
The plan's original Tier 3 target was
https://www.swfinstitute.org/fund-rankings/sovereign-wealth-fund. Live
check on 2026-04-23: the page's <tbody> is empty and AUM is gated
behind a lead-capture form (name + company + job title). SWFI per-fund
/profile/<id> pages are similarly barren. The "public fund rankings"
page is effectively no longer public; scraping the lead-gated surface would
require submitting fabricated contact info (TOS violation, legally
questionable), so Tier 3 pivots to Wikipedia.

Wikipedia is legally clean (CC-BY-SA 4.0, attribution required — see
WIKIPEDIA_SOURCE_ATTRIBUTION in the seeder) and structurally scrapable.
The SWFI Linaburg-Maduell Transparency Index mentioned in manifest
rationale text is a SEPARATE SWFI publication (public index scores),
not the fund-rankings paywall — those citations stay valid.

What is in
----------

1. scripts/seed-sovereign-wealth.mjs — Wikipedia scraper implementation:
   - parseWikipediaRankingsTable(html) — exported pure function so
     the parser is unit-testable without a live fetch. Extracts the
     wikitable, parses per-fund rows (Country, Abbrev, Fund name,
     Assets USD B, Inception, Origin).
   - Strip-HTML helper strips <sup> tags to SPACES (not empty) so
     `302.0<sup>41</sup>` stays `302.0 41` — otherwise the decimal
     value and its trailing footnote ref get welded into `302.041`,
     which the Assets regex mis-parses.
   - matchWikipediaRecord(fund, cache) — abbrev + fund-name lookup
     with country disambiguation: lookup maps are now
     Map<key, Record[]> (list) rather than Map<key, Record>, and the
     matcher filters the list by manifest country before returning.
     This is the exact fix for the PIF collision:
     "PIF" resolves to BOTH Saudi Arabia's Public Investment Fund
     (~USD 925B) and Palestine's Palestine Investment Fund (~USD 900M)
     on the live article. Without country-filtering, Map.set silently
     overwrites one with the other, so Saudi PIF would return
     Palestine's AUM — three orders of magnitude wrong.
   - When the country disambiguator cannot pick, returns null rather
     than a best-guess. Seeder logs the unmatched fund; the IMPUTE
     path handles it gracefully.

2. docs/methodology/swf-classification-manifest.yaml — added
   `wikipedia` hints block to each of the 8 funds (abbrev and/or
   fund_name, matching Wikipedia's canonical naming).

3. scripts/shared/swf-manifest-loader.mjs — optional `wikipedia` field
   in the schema: `abbrev` and `fund_name` both optional strings, but
   at least one must be present if the block is provided.

4. tests/seed-sovereign-wealth.test.mjs — 12 tests exercising:
   - fixture-based parser: abbrev/name indexing, HTML + footnote
     stripping, decimal AUM, malformed rows skipped, missing-table error
   - abbrev-collision handling: both candidates retained in the list
   - country-disambiguation matcher: Saudi PIF correctly picked from
     a Saudi-vs-Palestine collision fixture (the exact live bug)
   - ambiguous lookup with unknown country returns null, not wrong record
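
The two parsing details above (footnote refs stripped to spaces, and list-valued lookup maps filtered by country) can be sketched as follows. All function names here are hypothetical and chosen for this example only; the sketch also compares on country name directly, whereas the real matcher starts from the manifest's ISO-2 code.

```javascript
// Illustrative sketch only; these helpers are not the seeder's exports.

// Replace every tag with a space so "302.0<sup>41</sup>" becomes "302.0 41".
// Replacing with '' instead would weld the footnote into "302.041".
function stripTagsToSpaces(html) {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Read the leading number and ignore any trailing footnote token.
function parseAssetsBillions(cellHtml) {
  const match = stripTagsToSpaces(cellHtml).match(/^[\d,]+(?:\.\d+)?/);
  return match ? Number(match[0].replace(/,/g, '')) : null;
}

// Map<key, Record[]>: colliding keys append instead of overwriting.
function addToIndex(index, key, record) {
  const list = index.get(key);
  if (list) list.push(record);
  else index.set(key, [record]);
}

// Return a record only when exactly one candidate matches the country.
function pickByCountry(index, key, countryName) {
  const candidates = index.get(key) ?? [];
  const hits = candidates.filter((r) => r.countryName === countryName);
  return hits.length === 1 ? hits[0] : null;
}

const byAbbrev = new Map();
addToIndex(byAbbrev, 'PIF', { countryName: 'Saudi Arabia', aum: 925e9 });
addToIndex(byAbbrev, 'PIF', { countryName: 'Palestine', aum: 0.9e9 });
const saudiPif = pickByCountry(byAbbrev, 'PIF', 'Saudi Arabia');
const noGuess = pickByCountry(byAbbrev, 'PIF', 'Unknown');
```

Returning null for an unresolved collision trades coverage for correctness: the fund is logged as unmatched instead of carrying another country's AUM.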

Live verification against the shipped Wikipedia article: 7/8 funds
matched with the correct country; Saudi PIF now correctly returns
USD 925B (not Palestine's USD 0.9B) because of the country-
disambiguation fix. Temasek is the one miss — Wikipedia does not
classify it as an SWF (practitioner debate; it lists under "state
holding companies" instead). Falls through to IMPUTE in the scorer
until Tier 1/2 adapters land with an official-disclosure source.

Tests: 522/522 pass (resilience + manifest + scraper).
Typecheck clean on both tsconfig.json and tsconfig.api.json.

Still stubbed for later commits:
 - Tier 1 per-fund official-disclosure adapters (incl. Temasek)
 - Tier 2 IFSWF secretariat parser
 - Dimension wiring (liquidReserveAdequacy, sovereignFiscalBuffer)
 - reserveAdequacy retirement via RESILIENCE_RETIRED_DIMENSIONS
 - cache-keys / bootstrap / health.js wiring (ON_DEMAND_KEYS until bake-in)
 - Recovery-domain weight rebalance + Spearman sensitivity rerun

* feat(resilience): PR 2 commit 3 — Wikipedia infobox fallback + FX → 8/8 match

Closes the Temasek gap. The Wikipedia list article excludes Temasek on
editorial grounds (classified as a "state holding company" rather than
an SWF), so the Tier-3 list-only path topped out at 7/8 funds matched.
This commit adds Tier 3b — per-fund Wikipedia article infobox scrape
— and a baked-in FX table to handle non-USD infobox currencies.

Live verification on the shipped Wikipedia articles: 8/8 funds matched.
Temasek: S$ 434B → US$ 321B via infobox + SGD→USD FX.

Implementation

1. scripts/seed-sovereign-wealth.mjs
   - FX_TO_USD table (USD, SGD, NOK, EUR, GBP, AED, SAR, KWD, QAR)
     with FX_RATES_REVIEWED_AT='2026-04-23' committed into the seed
     payload so stale rates are visible at audit time.
   - CURRENCY_SYMBOL_TO_ISO ordered list — US$ tested before S$ before
     bare $, and $ / kr require a space + digit neighbor to avoid
     false-matches in rich prose.
   - detectCurrency(text) exported pure for unit testing.
   - parseWikipediaArticleInfobox(html) exported pure — scans rows
     for "Total assets" / "Assets under management" / "AUM" / "Net
     assets" / "Net portfolio value" labels, extracts "NUMBER (trillion
     | billion | million) (YEAR)" values, applies FX conversion.
   - fetchWikipediaInfobox(fund) — per-fund article fetch, gated on
     the manifest's wikipedia.article_url hint.
   - sourceMix split into {official, ifswf, wikipedia_list,
     wikipedia_infobox} counters so the seed payload shows which tier
     delivered each fund.
   - Source priority chain: official → ifswf → wikipedia_list →
     wikipedia_infobox. Infobox last because it is N network round-
     trips; amortizing over the list article cache first minimizes
     live traffic.

2. docs/methodology/swf-classification-manifest.yaml
   - Temasek entry gains wikipedia.article_url:
     https://en.wikipedia.org/wiki/Temasek_Holdings with an inline
     comment explaining why the list-article path misses.

3. scripts/shared/swf-manifest-loader.mjs
   - article_url optional field; validator rejects anything that is
     not a https://<lang>.wikipedia.org/... URL so a typo cannot
     silently wire the seeder to an off-site fetch.

4. tests/seed-sovereign-wealth.test.mjs (10 new tests, 38/38 pass)
   - detectCurrency distinguishes US$ vs S$ vs bare $.
   - parseWikipediaArticleInfobox extracts Temasek S$ 434B → US$ 321B
     with year tag from "(2025)".
   - USD-native row pass-through with fxRate=1.0.
   - NOK trillion conversion (NOK 18.7T → USD 1.74T).
   - Returns null when no AUM row / no infobox at all.
   - Documents the unknown-currency → USD fallback contract.
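
The ordered symbol matching described for CURRENCY_SYMBOL_TO_ISO can be sketched roughly as below. The pattern list and the name `detectCurrencySketch` are assumptions for this example; the shipped table covers more currencies and may use stricter neighbor rules.

```javascript
// Hypothetical sketch of ordered currency detection, not the shipped table.
// Order matters: "US$" is tested before "S$", and both before a bare "$".
const CURRENCY_PATTERNS_SKETCH = [
  [/US\$/, 'USD'],
  [/S\$/, 'SGD'],
  [/\bNOK\b/, 'NOK'],
  [/\bEUR\b|€/, 'EUR'],
  [/\bkr\s+\d/, 'NOK'], // "kr" only counts when followed by a number
  [/\$\s*\d/, 'USD'],   // bare "$" only counts next to a digit
];

function detectCurrencySketch(text) {
  for (const [pattern, iso] of CURRENCY_PATTERNS_SKETCH) {
    if (pattern.test(text)) return iso;
  }
  return null; // no currency signal found
}
```

Testing the longer symbols first is what keeps "US$ 1,128 billion" from being read as Singapore dollars.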

Tests: 532/532 pass (full resilience + manifest + scraper suite).
Typecheck clean on both tsconfig.json and tsconfig.api.json.

Still stubbed for later commits:
 - Tier 1 per-fund official-disclosure adapters
 - Tier 2 IFSWF secretariat parser
 - Dimension wiring (liquidReserveAdequacy, sovereignFiscalBuffer)
 - reserveAdequacy retirement via RESILIENCE_RETIRED_DIMENSIONS
 - cache-keys / bootstrap / health.js wiring (ON_DEMAND_KEYS)
 - Recovery-domain weight rebalance + Spearman sensitivity rerun

* refactor(resilience): reuse project-shared FX infrastructure for SWF seeder

Self-caught duplication from the previous commit (699ba832a introduced
a local FX_TO_USD table and FX_RATES_REVIEWED_AT constant). The
codebase already has the canonical path:

  scripts/_seed-utils.mjs
    SHARED_FX_FALLBACKS      (USD/SGD/NOK/EUR/GBP/AED/SAR/QAR/KWD/...)
    getSharedFxRates()       (Redis shared:fx-rates:v1 4h cache + Yahoo)
    fetchYahooFxRates()

Used by seed-grocery-basket, seed-fuel-prices, seed-bigmac. Two FX
tables would drift and the live-rate layer (Yahoo via Redis cache)
would be orphaned on the SWF path.

What changed

- Deleted local FX_TO_USD / FX_RATES_REVIEWED_AT constants.
- parseWikipediaArticleInfobox() no longer performs FX conversion.
  Returns { valueNative, currencyNative, aumYear } so the seeder
  orchestrator applies project-shared rates at call time. Parser is
  now currency-agnostic and thinner.
- Added lookupUsdRate(currency, fxRates) helper:
  * USD → 1.0 short-circuit
  * prefer the live map (getSharedFxRates output) over static fallback
  * fall back to SHARED_FX_FALLBACKS
  * return null on unknown currency (caller skips the fund — no silent
    wrong-currency misreading).
- fetchWikipediaInfobox() accepts fxRates map, converts via
  lookupUsdRate, returns enriched { aum, currencyNative, fxRate }.
- fetchSovereignWealth() fetches fxRates once at the top via
  getSharedFxRates(buildFxSymbolsForSwf(), SHARED_FX_FALLBACKS), in
  parallel with World Bank imports + Wikipedia list. Warms the shared
  Redis FX cache for other seeders at the same time.
- Seed payload drops the fxRatesReviewedAt field; the shared cache
  carries that metadata at the Redis level for all seeders.
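
The lookup order listed for lookupUsdRate can be sketched as follows. This is a standalone approximation, not the shipped helper: `lookupUsdRateSketch` and `FALLBACKS_SKETCH` are names invented for the example, and the only rate included is the SGD fallback of 0.74 quoted in this commit.

```javascript
// Hypothetical sketch; the real helper reads SHARED_FX_FALLBACKS from
// scripts/_seed-utils.mjs. Only SGD is listed here, taken from this commit.
const FALLBACKS_SKETCH = { SGD: 0.74 };

function lookupUsdRateSketch(currency, liveRates, fallbacks = FALLBACKS_SKETCH) {
  // USD never needs a lookup.
  if (currency === 'USD') return 1.0;
  // A rate is usable only if it is a finite, positive number.
  const isUsable = (rate) => typeof rate === 'number' && Number.isFinite(rate) && rate > 0;
  // Prefer the live map, then the static fallback table.
  const live = liveRates?.[currency];
  if (isUsable(live)) return live;
  const fallback = fallbacks[currency];
  // Unknown currency: return null so the caller can skip the fund.
  return isUsable(fallback) ? fallback : null;
}

// Temasek example from this commit: S$ 434B at the 0.74 fallback rate.
const temasekUsd = 434e9 * lookupUsdRateSketch('SGD', {});
```

Treating a zero or non-finite rate as unusable is a choice made in this sketch so that a bad cache entry cannot zero out a fund's AUM.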

Tests updated

- parseWikipediaArticleInfobox tests assert the native value + ISO
  code, no longer the USD-converted amount.
- New `lookupUsdRate` suite pins the project-shared FX integration:
  USD short-circuit, live-rate preference, static fallback, unknown-
  currency null, and a Temasek S$ 434B → US$ 321B end-to-end case
  via the shared fallback table.

Live re-verification still 8/8; SGD comes through SHARED_FX_FALLBACKS
at 0.74 (same number as the deleted local table), so behavior is
identical but the dedupe is real.

Tests: 536/536 pass. Typecheck clean on both tsconfig configs.

* refactor(resilience): split SWF manifest validator into sub-helpers

Biome reported validateManifest at complexity 55 vs max 50. Extracted
the per-fund validation into validateFundEntry(raw, idx, seen) and
pulled out validateClassification, validateRationale, validateSources,
validateWikipediaHints as separate helpers. Behavior and tests are
unchanged; each helper is now well under the complexity cap and the
main validator reads linearly.

Tests: 42/42 manifest + scraper tests pass. Typecheck clean.

* fix(resilience): PR 2 review — partial-seed guard + manifest REVIEWED status

Addresses two P1 findings on PR #3305.

P1.1 — partial-seed silent corruption on multi-fund countries
-------------------------------------------------------------
For multi-fund countries (AE = ADIA + Mubadala, SG = GIC + Temasek)
the previous aggregation silently published a partial
totalEffectiveMonths if a secondary fund's scraper drifted on
Wikipedia — recordCount would still look green because we counted
"any fund matched" as a successful country-seed. Downstream scorer
would under-rank those countries with no missingness signal.

Fix:
- Each country entry now carries { expectedFunds, matchedFunds,
  completeness } alongside the existing totalEffectiveMonths. The
  scorer can use completeness < 1.0 to derate (treat as degraded
  coverage) rather than accept the partial number at face value.
- declareRecords counts ONLY countries with completeness === 1.0,
  so a secondary-fund drift drops the seed-meta record_count and
  triggers the operational alarm. recordCount in runSeed opts now
  delegates to declareRecords for parity.
- A warn log fires per partial country so the Railway cron log is
  loud on drift without poisoning seed-meta.
- 4 new tests pin: all-matched counts, partial drops, empty/malformed
  payloads, and defensive handling of pre-completeness payload shape.
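
A minimal sketch of the completeness bookkeeping described above. The helper names are hypothetical, and entries that lack a completeness field are simply not counted here, which may differ from the shipped handling of pre-completeness payloads.

```javascript
// Hypothetical sketch of the partial-seed guard, not the shipped declareRecords.

// Build the per-country coverage fields from fund counts.
function summarizeCountrySketch(expectedFunds, matchedFunds) {
  const completeness = expectedFunds > 0 ? matchedFunds / expectedFunds : 0;
  return { expectedFunds, matchedFunds, completeness };
}

// Count only countries whose every expected fund matched.
function countCompleteCountriesSketch(payload) {
  const countries = payload?.countries;
  if (countries === null || typeof countries !== 'object' || Array.isArray(countries)) {
    return 0; // empty or malformed payload
  }
  return Object.values(countries).filter((c) => c?.completeness === 1).length;
}

const samplePayload = {
  countries: {
    NO: summarizeCountrySketch(1, 1), // complete
    AE: summarizeCountrySketch(2, 1), // Mubadala drifted: partial
  },
};
const completeCount = countCompleteCountriesSketch(samplePayload);
```

With this shape a drift in one secondary fund lowers the record count by one country, which is the signal the operational alarm keys on.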

P1.2 — manifest external-reviewer language contradicted shipped workflow
------------------------------------------------------------------------
The YAML header said "External sovereign-wealth-practitioner review
is REQUIRED before PR 2 merges" and external_review_status=PENDING.
WorldMonitor's operating mode is fully automated (see memory
`feedback_no_external_reviewer_assumption.md`) — there is no external
practitioner gate. Reviewer correctly flagged the inconsistency
between the document and the shipped behaviour.

Fix:
- Rewrote the header to describe the actual audit discipline:
  coefficients derive from the committed rationale + cited sources
  for each fund; revisions require the same discipline in a follow-
  up PR. No external-gate language.
- Flipped external_review_status to REVIEWED, with a clarifying
  comment: REVIEWED = coefficients derive from the committed
  rationale + seeder end-to-end matches the live surfaces. PENDING
  remains reserved for future PRs that ship unresolved TBD
  coefficients.
- Rewrote the "candidates deferred from v1" trailing block. Each
  fund listed now has a concrete rationale for deferral (sanctions /
  access coefficient would pin at 0 / classification contested /
  AUM disclosure unstable) so a future PR author can argue the case
  on record. No "reviewer advice needed" placeholders.
- Tweaked two inline fund comments (UAE ADQ/ICD, Singapore Temasek)
  that still said "external reviewer" — now describe the substantive
  reason for inclusion/deferral.

Tests
-----
- 540/540 resilience + manifest + scraper tests pass.
- Typecheck clean on both tsconfig configs.
- Biome clean on all touched files.

* fix(resilience): PR 2 review — 4 P2 fixes (WB imports, null validate, nested table, aumYear)

Greptile P2 findings on PR #3305, all addressed.

(1) Silent country drop on missing WB imports
---------------------------------------------
If fetchAnnualImportsUsd() has no entry for a manifest ISO-2
(transient WB outage, new country with spotty coverage), the country
was silently skipped. Downstream scorer would then read "absent from
payload" as "no SWF" and score 0 with full coverage — substantively
wrong. Now logs a warn and adds each affected fund to the unmatched
list with a `(no WB imports)` suffix so the seed-meta observer sees
the degradation.

(2) typeof null === 'object' bypassed validate()
------------------------------------------------
Bare `typeof data?.countries === 'object'` returned true for
{ countries: null } and { countries: [] }. Downstream property
access would then crash. Strict check added: non-null plain object
only; also rejects arrays. Test pins all 5 edge cases.
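
The strict check can be sketched as below; `validateSketch` is a stand-in name for this example and the shipped validate() may perform further checks.

```javascript
// Hypothetical sketch of the strict object check.
// typeof null and typeof [] are both 'object', so each is excluded explicitly.
function validateSketch(data) {
  const countries = data?.countries;
  return countries !== null && typeof countries === 'object' && !Array.isArray(countries);
}
```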

(3) Nested </table> / </td> truncated wikitable parse
-----------------------------------------------------
Lazy [\s\S]*? in the outer table regex AND the inner row/cell regexes
could silently drop every row after any cell that contained a nested
mini-table (Wikipedia footnote boxes, sort helpers). Two-step fix:
  - extractFirstWikitable: depth-aware walk counts <table>/</table>
    opens and closes, returns content at balanced depth
  - stripNestedTables: iteratively removes complete inner
    <table>…</table> blocks BEFORE row parsing, so the lazy row / cell
    regexes never see a nested </tr> or </td>
Test: 5-row fixture with a nested table inside row 1's cell — ADIA
(row 2) must still parse, GPFG (row 1 with nested) must still parse.
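
The two-step fix can be approximated as follows. This is a sketch under assumptions: the function names are invented for the example, and the shipped extractFirstWikitable / stripNestedTables may differ in how they select the table and what they substitute for removed blocks.

```javascript
// Hypothetical sketch of depth-aware table extraction, not the shipped code.

// Walk <table> / </table> tags and return the inner HTML of the first table
// whose opening tag mentions the class, closing at balanced depth.
function extractFirstTableInnerSketch(html, classNeedle = 'wikitable') {
  const tag = /<table\b[^>]*>|<\/table>/gi;
  let depth = 0;
  let start = -1;
  let m;
  while ((m = tag.exec(html)) !== null) {
    const isOpen = m[0][1] !== '/';
    if (isOpen) {
      if (depth === 0) {
        if (!m[0].includes(classNeedle)) continue; // not the table we want
        start = tag.lastIndex;
      }
      depth += 1;
    } else if (depth > 0) {
      depth -= 1;
      if (depth === 0) return html.slice(start, m.index);
    }
  }
  return null;
}

// Repeatedly remove innermost <table>…</table> blocks until none remain,
// so lazy row and cell regexes never see a nested </tr> or </td>.
function stripNestedTablesSketch(tableInner) {
  const innermost = /<table\b[^>]*>(?:(?!<table\b)[\s\S])*?<\/table>/gi;
  let out = tableInner;
  let prev;
  do {
    prev = out;
    out = out.replace(innermost, ' ');
  } while (out !== prev);
  return out;
}

const nestedHtml =
  '<table class="wikitable"><tr><td>A<table><tr><td>x</td></tr></table></td></tr>' +
  '<tr><td>B</td></tr></table><p>after</p>';
const flatRows = stripNestedTablesSketch(extractFirstTableInnerSketch(nestedHtml));
```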

(4) aumYear reflected scrape year, not data year
------------------------------------------------
List-article entries were stamped with `new Date().getFullYear()`
even though the Wikipedia list publishes no per-row data-year
annotation (figures are typically prior-period). Consumers using
aumYear for freshness audit would see "2026" for 2024/2025 data.
Now set to null for list entries; infobox tier 3b retains year
extraction from the "(YYYY)" tag on the individual fund article.

P1 bootstrap deferral: intentional per project memory
-----------------------------------------------------
AGENTS.md says new data sources MUST wire api/bootstrap.js. Not done
in this PR by design:
  - No RPC consumer exists yet for
    `resilience:recovery:sovereign-wealth:v1` (scorer lands in a
    follow-up PR; wiring bootstrap without a consumer would be dead
    code).
  - Local memory `feedback_health_required_key_needs_railway_cron_
    first.md` requires new seed keys to sit in ON_DEMAND_KEYS for
    ~7 days of clean Railway cron before promoting to
    BOOTSTRAP_KEYS — adding bootstrap wiring now would pre-empt
    that window and risk CRIT alarms on the health surface.
The scorer PR that follows will land the bootstrap wiring + the
dimension at the same time, which is the cohesive unit.

Tests: 547/547 resilience + manifest + scraper tests pass.
Typecheck clean on both tsconfig configs. Biome clean on touched
files.
2026-04-23 07:58:40 +04:00

532 lines
21 KiB
JavaScript
import assert from 'node:assert/strict';
import { describe, it } from 'node:test';
import {
declareRecords,
detectCurrency,
lookupUsdRate,
matchWikipediaRecord,
parseWikipediaArticleInfobox,
parseWikipediaRankingsTable,
validate,
} from '../scripts/seed-sovereign-wealth.mjs';
import { SHARED_FX_FALLBACKS } from '../scripts/_seed-utils.mjs';
// Fixture HTML mirrors the structure observed on the shipping
// Wikipedia "List of sovereign wealth funds" article (captured
// 2026-04-23). Kept inline so the scraper's parsing rules are
// exercised without a live network round-trip. If Wikipedia later
// changes the column order or header text, update this fixture AND
// the assumed-columns comment in scripts/seed-sovereign-wealth.mjs
// in the same commit.
const FIXTURE_HTML = `
<html><body>
<table class="wikitable sortable static-row-numbers">
<thead>
<tr>
<th scope="col">Country or region</th>
<th scope="col">Abbrev.</th>
<th scope="col">Fund name</th>
<th scope="col">Assets</th>
<th scope="col">Inception</th>
<th scope="col">Origin</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="/wiki/Norway">Norway</a></td>
<td>GPF-G</td>
<td><a href="/wiki/GPFG">Government Pension Fund Global</a></td>
<td>2,117<sup>37</sup></td>
<td>1990</td>
<td>Oil & Gas</td>
</tr>
<tr>
<td><a href="/wiki/UAE">United Arab Emirates</a></td>
<td>ADIA</td>
<td><a href="/wiki/ADIA">Abu Dhabi Investment Authority</a></td>
<td>1,128<sup>40</sup></td>
<td>1976</td>
<td>Oil & Gas</td>
</tr>
<tr>
<td><a href="/wiki/UAE">United Arab Emirates</a></td>
<td></td>
<td><a href="/wiki/Mubadala">Mubadala Investment Company</a></td>
<td>302.0<sup>41</sup></td>
<td>2002</td>
<td>Oil & Gas</td>
</tr>
<tr>
<td><a href="/wiki/Singapore">Singapore</a></td>
<td>GIC</td>
<td><a href="/wiki/GIC">GIC Private Limited</a></td>
<td>801</td>
<td>1981</td>
<td>Non-commodity</td>
</tr>
<tr>
<td><a href="/wiki/Singapore">Singapore</a></td>
<td></td>
<td><a href="/wiki/Temasek">Temasek Holdings</a></td>
<td>382</td>
<td>1974</td>
<td>Non-commodity</td>
</tr>
<tr>
<td><a href="/wiki/NoData">No Data Row</a></td>
<td>NODATA</td>
<td>Example fund without assets</td>
<td></td>
<td>2000</td>
<td>Non-commodity</td>
</tr>
</tbody>
</table>
</body></html>
`;
describe('parseWikipediaRankingsTable — fixture-based scraping', () => {
  const cache = parseWikipediaRankingsTable(FIXTURE_HTML);

  it('indexes funds by normalized abbreviation into record lists', () => {
    // GPF-G → GPFG (normalized: uppercase, strip punctuation). Lookup
    // returns a list so ambiguous abbrevs (e.g. PIF → Saudi vs Palestine
    // on the live article) can be disambiguated at match time.
    const gpfgList = cache.byAbbrev.get('GPFG');
    assert.ok(Array.isArray(gpfgList) && gpfgList.length === 1, 'GPFG should have exactly one candidate in the fixture');
    const [gpfg] = gpfgList;
    assert.equal(gpfg.aum, 2_117_000_000_000);
    assert.equal(gpfg.fundName, 'Government Pension Fund Global');
    assert.equal(gpfg.countryName, 'Norway');
    assert.equal(gpfg.inceptionYear, 1990);
    assert.equal(cache.byAbbrev.get('ADIA')?.[0]?.aum, 1_128_000_000_000);
    assert.equal(cache.byAbbrev.get('GIC')?.[0]?.aum, 801_000_000_000);
  });

  it('indexes funds by normalized fund name for abbrev-less rows', () => {
    // Mubadala and Temasek have no abbreviation in the fixture,
    // so they must still be matchable by fundName.
    const mubadalaList = cache.byFundName.get('mubadala investment company');
    assert.ok(mubadalaList && mubadalaList.length === 1);
    assert.equal(mubadalaList[0].aum, 302_000_000_000);
    const temasekList = cache.byFundName.get('temasek holdings');
    assert.ok(temasekList && temasekList.length === 1);
    assert.equal(temasekList[0].aum, 382_000_000_000);
  });

  it('strips inline HTML + footnote references from the Assets cell', () => {
    // `2,117<sup>37</sup>` — the footnote int must be stripped
    // before parsing. `<sup>` strips to a space so the ref is a
    // separate token, not welded into the number.
    assert.equal(cache.byAbbrev.get('GPFG')[0].aum, 2_117_000_000_000);
  });

  it('skips rows with missing or malformed Assets value', () => {
    assert.equal(cache.byAbbrev.get('NODATA'), undefined);
    assert.equal(cache.byFundName.get('example fund without assets'), undefined);
  });

  it('handles decimal AUM values (e.g. "302.0")', () => {
    const mubadalaList = cache.byFundName.get('mubadala investment company');
    assert.equal(mubadalaList[0].aum, 302_000_000_000);
  });

  it('throws loudly when the expected wikitable is missing', () => {
    assert.throws(
      () => parseWikipediaRankingsTable('<html><body>no tables here</body></html>'),
      /wikitable not found/,
    );
  });
});
// Separate describe block for the abbrev-collision disambiguation
// case since it requires a fixture with multiple rows sharing an
// abbrev. This is the exact class of bug observed on the live
// Wikipedia article (PIF → Saudi PIF + Palestine Investment Fund).
describe('parseWikipediaRankingsTable — abbrev collisions', () => {
const COLLIDING_HTML = `
<table class="wikitable">
<thead><tr>
<th>Country</th><th>Abbrev.</th><th>Fund name</th>
<th>Assets</th><th>Inception</th><th>Origin</th>
</tr></thead>
<tbody>
<tr>
<td>Saudi Arabia</td><td>PIF</td><td>Public Investment Fund</td>
<td>925</td><td>1971</td><td>Oil Gas</td>
</tr>
<tr>
<td>Palestine</td><td>PIF</td><td>Palestine Investment Fund</td>
<td>0.9</td><td>2003</td><td>Non-commodity</td>
</tr>
</tbody>
</table>`;
it('keeps BOTH colliding records under the shared abbrev key', () => {
const cache = parseWikipediaRankingsTable(COLLIDING_HTML);
const pifList = cache.byAbbrev.get('PIF');
assert.ok(Array.isArray(pifList));
assert.equal(pifList.length, 2, 'both colliding PIF records must be retained — silent overwrite would shadow Saudi PIF with Palestine');
});
});
describe('matchWikipediaRecord — manifest-driven lookup', () => {
  const cache = parseWikipediaRankingsTable(FIXTURE_HTML);

  it('matches by abbrev when hints + country align', () => {
    const fund = {
      country: 'NO',
      fund: 'gpfg',
      wikipedia: { abbrev: 'GPF-G', fundName: 'Government Pension Fund Global' },
    };
    const hit = matchWikipediaRecord(fund, cache);
    assert.ok(hit);
    assert.equal(hit.fundName, 'Government Pension Fund Global');
  });

  it('falls back to fund-name match when no abbrev is provided', () => {
    const fund = {
      country: 'AE',
      fund: 'mubadala',
      wikipedia: { fundName: 'Mubadala Investment Company' },
    };
    const hit = matchWikipediaRecord(fund, cache);
    assert.ok(hit);
    assert.equal(hit.aum, 302_000_000_000);
  });

  it('normalizes abbrev punctuation (GPF-G ≡ GPFG)', () => {
    const fund = { country: 'NO', fund: 'gpfg', wikipedia: { abbrev: 'GPFG' } };
    const hit = matchWikipediaRecord(fund, cache);
    assert.ok(hit, 'normalized-abbrev match should succeed');
  });

  it('returns null when no hints match', () => {
    const fund = {
      country: 'NO',
      fund: 'unknown',
      wikipedia: { abbrev: 'XXXX', fundName: 'Nonexistent Fund' },
    };
    assert.equal(matchWikipediaRecord(fund, cache), null);
  });

  it('returns null when manifest entry has no wikipedia hints', () => {
    const fund = { country: 'NO', fund: 'no-hints' };
    assert.equal(matchWikipediaRecord(fund, cache), null);
  });
});
// ── Tier 3b: per-fund Wikipedia article infobox ──
//
// Activated for funds editorially excluded from the /wiki/List_of_
// sovereign_wealth_funds article (Temasek is the canonical case —
// Wikipedia classifies it as a "state holding company" rather than an
// SWF, despite the manifest including it per plan §3.4).
//
// The infobox parser must:
// - scan rows for "Total assets", "Assets under management", "AUM",
// "Net assets", "Net portfolio value" labels
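// - detect non-USD currencies (S$, €, £, NOK, etc.) and return the
// native value + ISO code; USD conversion happens in the seeder via
// lookupUsdRate and the project-shared FX rates
// - extract the year tag "(2025)" from the value for freshness
// - leave unknown currencies to lookupUsdRate, which returns null so
// the caller skips the fund (loud, not silent)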
describe('detectCurrency — symbol and code detection', () => {
  it('distinguishes US$ from S$ from $', () => {
    assert.equal(detectCurrency('US$ 1,128 billion'), 'USD');
    assert.equal(detectCurrency('S$ 434 billion'), 'SGD');
    // Bare $ must NOT match US$ or S$ patterns, and must require a
    // digit after.
    assert.equal(detectCurrency('$ 500 billion'), 'USD');
  });

  it('detects Norwegian krone via NOK or kr', () => {
    assert.equal(detectCurrency('NOK 18.7 trillion'), 'NOK');
    assert.equal(detectCurrency('17,500 kr 500 billion'), 'NOK');
  });

  it('detects EUR via € symbol or ISO code', () => {
    assert.equal(detectCurrency('€ 500 million'), 'EUR');
    assert.equal(detectCurrency('500 EUR billion'), 'EUR');
  });

  it('returns null when no currency signal is present', () => {
    assert.equal(detectCurrency('500 billion'), null);
    assert.equal(detectCurrency(''), null);
  });
});
describe('parseWikipediaArticleInfobox — native value + currency extraction', () => {
// Parser returns { valueNative, currencyNative, aumYear } and does
// NOT convert to USD — conversion is applied at the seeder level
// via the project-shared `getSharedFxRates` cache (see
// scripts/_seed-utils.mjs). Keeping the parser FX-free removes a
// duplicate copy of the FX table that would drift from the shared
// one.
//
// Mirrors the Temasek infobox structure (abridged). Real row:
// `<tr><th>Total assets</th><td>S$ 434 billion <i>(2025)</i><sup>2</sup></td></tr>`
const TEMASEK_INFOBOX = `
<html><body>
<table class="infobox vcard">
<tr><th>Type</th><td>Holding company</td></tr>
<tr><th>Founded</th><td>25 June 1974</td></tr>
<tr><th>Total assets</th><td>S$ 434 billion <i>(2025)</i><sup>2</sup></td></tr>
<tr><th>Owner</th><td>Ministry of Finance</td></tr>
</table>
</body></html>
`;
it('extracts S$ 434 billion as native SGD value + year tag', () => {
const hit = parseWikipediaArticleInfobox(TEMASEK_INFOBOX);
assert.ok(hit, 'Temasek infobox should produce a hit');
assert.equal(hit.currencyNative, 'SGD');
assert.equal(hit.valueNative, 434_000_000_000);
assert.equal(hit.aumYear, 2025);
});
it('handles USD-native infoboxes (currency detected as USD)', () => {
const html = `<table class="infobox">
<tr><th>AUM</th><td>US$ 1,500 billion (2025)</td></tr>
</table>`;
const hit = parseWikipediaArticleInfobox(html);
assert.ok(hit);
assert.equal(hit.currencyNative, 'USD');
assert.equal(hit.valueNative, 1_500_000_000_000);
});
it('parses trillion-unit values (NOK 18.7 trillion)', () => {
const html = `<table class="infobox">
<tr><th>Net assets</th><td>NOK 18.7 trillion (2025)</td></tr>
</table>`;
const hit = parseWikipediaArticleInfobox(html);
assert.ok(hit);
assert.equal(hit.currencyNative, 'NOK');
assert.equal(hit.valueNative, 18_700_000_000_000);
});
it('returns null when no AUM-labeled row is present', () => {
const html = `<table class="infobox">
<tr><th>Type</th><td>Holding company</td></tr>
</table>`;
assert.equal(parseWikipediaArticleInfobox(html), null);
});
it('returns null when the infobox itself is missing', () => {
assert.equal(parseWikipediaArticleInfobox('<html>no infobox</html>'), null);
});
});
describe('lookupUsdRate — project-shared FX integration', () => {
// Verifies the parser → FX conversion pipeline uses the project's
// canonical FX source (scripts/_seed-utils.mjs SHARED_FX_FALLBACKS +
// getSharedFxRates Redis cache) rather than a duplicate table.
it('returns 1.0 for USD regardless of rate map', () => {
assert.equal(lookupUsdRate('USD', {}), 1.0);
assert.equal(lookupUsdRate('USD', null), 1.0);
assert.equal(lookupUsdRate('USD', { USD: 999 }), 1.0);
});
it('prefers the live rate map over the static fallback', () => {
// Simulate getSharedFxRates returning a fresh Yahoo rate. The static
// fallback has SGD=0.74; the live rate could drift (e.g. 0.751).
assert.equal(lookupUsdRate('SGD', { SGD: 0.751 }), 0.751);
});
it('falls back to SHARED_FX_FALLBACKS when the live rate is missing', () => {
assert.equal(lookupUsdRate('SGD', {}), SHARED_FX_FALLBACKS.SGD);
assert.equal(lookupUsdRate('NOK', { EUR: 1.05 }), SHARED_FX_FALLBACKS.NOK);
});
it('returns null for unknown currencies (caller skips the fund)', () => {
assert.equal(lookupUsdRate('ZZZ', {}), null);
// A zero live rate is not usable, and XXX has no static fallback either.
assert.equal(lookupUsdRate('XXX', { XXX: 0 }), null);
});
it('converts Temasek S$ 434B end-to-end via shared fallback table', () => {
const hit = parseWikipediaArticleInfobox(`
<table class="infobox"><tr><th>Total assets</th><td>S$ 434 billion (2025)</td></tr></table>
`);
const rate = lookupUsdRate(hit.currencyNative, {});
const aumUsd = hit.valueNative * rate;
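// The live map is empty, so the rate comes from SHARED_FX_FALLBACKS.SGD
// (0.74): 434B × 0.74 = 321.16B. The bounds below leave room for the
// fallback value to be retuned.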
assert.ok(aumUsd > 300_000_000_000 && aumUsd < 340_000_000_000,
`expected ~US$ 320B, got ${aumUsd}`);
});
});
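// Illustrative sketch only, not the project's lookupUsdRate(): one way to
// get the precedence the tests above pin (USD short-circuit, then the live
// rate, then the static fallback, then null). `sketchLookupUsdRate` and
// `SKETCH_FX_FALLBACKS` are hypothetical names; the real fallback table is
// SHARED_FX_FALLBACKS in scripts/_seed-utils.mjs and holds more currencies.
const SKETCH_FX_FALLBACKS = { SGD: 0.74 }; // SGD value taken from the comment above

function sketchLookupUsdRate(currency, liveRates) {
  if (currency === 'USD') return 1.0;
  // A rate only counts if it is a finite, strictly positive number, so a
  // 0 in the live map falls through instead of zeroing out the AUM.
  const usable = (rate) => typeof rate === 'number' && Number.isFinite(rate) && rate > 0;
  const live = liveRates?.[currency];
  if (usable(live)) return live;
  const fallback = SKETCH_FX_FALLBACKS[currency];
  return usable(fallback) ? fallback : null;
}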
describe('validate — reject null-object masquerading as object', () => {
// `typeof null === 'object'` in JS, so a bare `typeof x === 'object'`
// would let { countries: null } through and break downstream. This
// test pins the strict non-null check.
it('rejects { countries: null }', () => {
assert.equal(validate({ countries: null }), false);
});
it('rejects missing countries field', () => {
assert.equal(validate({}), false);
assert.equal(validate(null), false);
assert.equal(validate(undefined), false);
});
it('rejects array countries (typeof [] === object too)', () => {
assert.equal(validate({ countries: [] }), false);
});
it('accepts empty object (during Railway-cron bake-in window)', () => {
assert.equal(validate({ countries: {} }), true);
});
it('accepts populated countries', () => {
assert.equal(validate({ countries: { NO: { funds: [] } } }), true);
});
});
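// Illustrative sketch only, not the seeder's validate(): one way to write
// the strict check this block pins. `sketchValidateCountries` is a
// hypothetical name and the real implementation may differ.
function sketchValidateCountries(data) {
  // Optional chaining covers null and undefined payloads.
  const countries = data?.countries;
  return (
    typeof countries === 'object' && // true for plain objects, arrays, and null
    countries !== null &&            // so null is ruled out explicitly
    !Array.isArray(countries)        // and so are arrays
  );
}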
describe('parseWikipediaRankingsTable — nested-table depth awareness', () => {
// Wikipedia occasionally embeds mini-tables (sort helpers, footnote
// boxes) inside a wikitable cell. A lazy `[\s\S]*?</table>` regex
// would stop at the FIRST `</table>` and silently drop every row
// after the cell containing the nested table. The depth-aware
// extractor must walk the full open/close pair.
it('does not truncate at a nested </table> inside a cell', () => {
const html = `
<table class="wikitable">
<tr><th>Country</th><th>Abbrev.</th><th>Fund</th><th>Assets</th><th>Inception</th></tr>
<tr>
<td>Norway</td><td>GPF-G</td>
<td>Government Pension Fund Global
<table class="mini-sort-helper"><tr><td>nested</td></tr></table>
</td>
<td>2000</td><td>1990</td>
</tr>
<tr>
<td>UAE</td><td>ADIA</td>
<td>Abu Dhabi Investment Authority</td>
<td>1128</td><td>1976</td>
</tr>
</table>
`;
const cache = parseWikipediaRankingsTable(html);
// Without depth awareness, ADIA would be silently dropped because
// the nested </table> inside GPF-G's cell would close the outer
// match at row 1.
assert.equal(cache.byAbbrev.get('ADIA')?.[0]?.aum, 1_128_000_000_000,
'ADIA must survive: a nested </table> in a prior cell must not truncate the wikitable');
assert.equal(cache.byAbbrev.get('GPFG')?.[0]?.aum, 2_000_000_000_000);
});
});
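// Illustrative sketch only, not the seeder's extractor: a depth-counting
// walk over <table> / </table> tags, which is the general technique the
// block above relies on. `sketchExtractOuterTable` is a hypothetical name.
function sketchExtractOuterTable(html, startIndex = 0) {
  const tagRe = /<table\b[^>]*>|<\/table\s*>/gi;
  tagRe.lastIndex = startIndex;
  let depth = 0;
  let begin = -1;
  let m;
  while ((m = tagRe.exec(html)) !== null) {
    const isClose = m[0][1] === '/';
    if (!isClose) {
      if (depth === 0) begin = m.index; // remember where the outer table opens
      depth += 1;
    } else if (depth > 0) {
      depth -= 1;
      // Only the close tag that brings depth back to 0 ends the outer table.
      if (depth === 0) return html.slice(begin, m.index + m[0].length);
    }
    // A stray </table> seen before any open tag is ignored.
  }
  return null; // no table found, or the markup is unbalanced
}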
describe('parseWikipediaRankingsTable — aumYear accuracy', () => {
it('sets aumYear=null for list-article rows (no per-row data-year annotation)', () => {
const html = `
<table class="wikitable">
<tr><th>Country</th><th>Abbrev.</th><th>Fund</th><th>Assets</th><th>Inception</th></tr>
<tr><td>Norway</td><td>GPF-G</td><td>Government Pension Fund Global</td><td>2117</td><td>1990</td></tr>
</table>
`;
const cache = parseWikipediaRankingsTable(html);
const gpfg = cache.byAbbrev.get('GPFG')?.[0];
assert.ok(gpfg);
assert.equal(gpfg.aumYear, null,
'aumYear must be null — the list article publishes no per-row data-year, and claiming the scrape year would mislead freshness auditors');
// Infobox path (Tier 3b) sets a real aumYear from "(YYYY)" tag —
// see the separate infobox test block for that contract.
});
});
describe('declareRecords — partial-seed guard for multi-fund countries', () => {
// Regression: for multi-fund countries (AE = ADIA + Mubadala,
// SG = GIC + Temasek) a single scraper drift would silently publish
// a partial totalEffectiveMonths if we counted "any fund matched"
// as a successful country-seed. declareRecords MUST only count
// countries with completeness === 1.0 so a secondary-fund drift
// drops the seed-health record count and triggers the operational
// alarm, rather than leaking an under-weighted total into the
// ranking.
it('counts only countries where all manifest funds matched', () => {
const data = {
countries: {
NO: { funds: [{}], expectedFunds: 1, matchedFunds: 1, completeness: 1.0 },
AE: { funds: [{}, {}], expectedFunds: 2, matchedFunds: 2, completeness: 1.0 },
SG: { funds: [{}], expectedFunds: 2, matchedFunds: 1, completeness: 0.5 }, // partial
},
};
assert.equal(declareRecords(data), 2,
'SG (partial, completeness=0.5) must NOT count — recordCount stays at 2, not 3');
});
it('returns 0 when every country is partial', () => {
const data = {
countries: {
AE: { expectedFunds: 2, matchedFunds: 1, completeness: 0.5 },
SG: { expectedFunds: 2, matchedFunds: 1, completeness: 0.5 },
},
};
assert.equal(declareRecords(data), 0,
'all-partial payload must drop recordCount to 0 — the seed-meta alarm surfaces a degraded run');
});
it('returns 0 on empty / malformed payload', () => {
assert.equal(declareRecords({}), 0);
assert.equal(declareRecords({ countries: {} }), 0);
assert.equal(declareRecords(null), 0);
assert.equal(declareRecords(undefined), 0);
});
it('ignores entries lacking the completeness field (defensive)', () => {
// Old payload shape (pre-completeness) must not spuriously count.
const data = { countries: { XX: { funds: [{}], totalEffectiveMonths: 1 } } };
assert.equal(declareRecords(data), 0);
});
});
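// Illustrative sketch only, not the seeder's declareRecords(): a counting
// rule consistent with the cases above, where a country counts only when
// every manifest fund matched. `sketchCountCompleteCountries` is a
// hypothetical name.
function sketchCountCompleteCountries(data) {
  const countries = data?.countries;
  if (typeof countries !== 'object' || countries === null) return 0;
  // Strict equality means entries without a completeness field
  // (the old payload shape) never count.
  return Object.values(countries)
    .filter((entry) => entry?.completeness === 1)
    .length;
}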
describe('matchWikipediaRecord — country-disambiguation on abbrev collisions', () => {
// This replays the exact class of bug observed on the live Wikipedia
// article: "PIF" resolves to BOTH Saudi Arabia's Public Investment
// Fund (~$925B) and Palestine's Palestine Investment Fund (~$900M).
// Without country disambiguation, a naive Map.set overwrites one
// with the other — Saudi PIF would silently return Palestine's AUM
// (three orders of magnitude smaller), breaking the score for every
// Saudi resilience read.
const COLLIDING_HTML = `
<table class="wikitable">
<thead><tr>
<th>Country</th><th>Abbrev.</th><th>Fund name</th>
<th>Assets</th><th>Inception</th><th>Origin</th>
</tr></thead>
<tbody>
<tr>
<td>Saudi Arabia</td><td>PIF</td><td>Public Investment Fund</td>
<td>925</td><td>1971</td><td>Oil Gas</td>
</tr>
<tr>
<td>Palestine</td><td>PIF</td><td>Palestine Investment Fund</td>
<td>0.9</td><td>2003</td><td>Non-commodity</td>
</tr>
</tbody>
</table>`;
const cache = parseWikipediaRankingsTable(COLLIDING_HTML);
it('picks the Saudi record for fund.country=SA', () => {
const fund = { country: 'SA', fund: 'pif', wikipedia: { abbrev: 'PIF' } };
const hit = matchWikipediaRecord(fund, cache);
assert.ok(hit);
assert.equal(hit.countryName, 'Saudi Arabia');
assert.equal(hit.aum, 925_000_000_000);
});
it('returns null (not the wrong record) when country is unknown to the disambiguator', () => {
// Hypothetical fund from a country not in ISO2_TO_WIKIPEDIA_COUNTRY_NAME.
// Must NOT silently return Saudi's or Palestine's record.
const fund = { country: 'ZZ', fund: 'pif', wikipedia: { abbrev: 'PIF' } };
assert.equal(matchWikipediaRecord(fund, cache), null,
'ambiguous match with no country mapping must return null — silent wrong-country match is the exact bug this test guards against');
});
});
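// Illustrative sketch only, not the seeder's matchWikipediaRecord(): it
// keeps every record per abbreviation (append, never overwrite) and then
// filters by country name. All names here are hypothetical, and
// SKETCH_COUNTRY_NAMES is a two-entry stand-in for the project's
// ISO2_TO_WIKIPEDIA_COUNTRY_NAME table. Returning a lone candidate without
// a country check is an assumption of this sketch, not a documented rule.
const SKETCH_COUNTRY_NAMES = { SA: 'Saudi Arabia', NO: 'Norway' };

function sketchIndexByAbbrev(records) {
  const byAbbrev = new Map();
  for (const record of records) {
    const bucket = byAbbrev.get(record.abbrev) ?? [];
    bucket.push(record); // colliding abbreviations share one bucket
    byAbbrev.set(record.abbrev, bucket);
  }
  return byAbbrev;
}

function sketchPickByCountry(fund, byAbbrev) {
  const candidates = byAbbrev.get(fund.wikipedia?.abbrev) ?? [];
  if (candidates.length === 0) return null;
  if (candidates.length === 1) return candidates[0];
  const expected = SKETCH_COUNTRY_NAMES[fund.country];
  if (!expected) return null; // ambiguous and no way to disambiguate
  return candidates.find((record) => record.countryName === expected) ?? null;
}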