mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
* feat(resilience): PR 2 scaffolding — SWF classification manifest + seeder skeleton
Plan §3.4. First of multiple commits for PR 2 (fiscal-buffer split
and sovereign-wealth integration). This commit is SCAFFOLDING ONLY:
no dimension wiring, no scorer, no cache-keys entry yet. The goal is
to land the reviewer-facing metadata and the seeder's three-tier
source shape so an external SWF practitioner can critique before we
wire the scorer.
What is in:
1. docs/methodology/swf-classification-manifest.yaml — authoritative
per-fund classification for the `sovereignFiscalBuffer` dimension.
First-pass estimates for the 8 funds named in plan §3.4 table:
Norway GPFG, UAE ADIA + Mubadala, Saudi PIF, Kuwait KIA,
Qatar QIA, Singapore GIC + Temasek. Each fund carries:
- three-component classification (access, liquidity, transparency)
each on [0, 1], with rationale text citing the mandate / fiscal
rule / asset-mix / transparency-index evidence
- source URLs for audit
Fund-candidates deferred for external-reviewer decision are listed
in a trailing comment block (CIC, NWF, SOFAZ, NSIA, Future Fund,
NZ Super, ESSF, etc.).
external_review_status: PENDING — flip to REVIEWED on sign-off.
2. scripts/shared/swf-manifest-loader.mjs — YAML parser + strict schema
validator. Fails loudly on any deviation (out-of-range scores,
non-ISO2 countries, missing rationale, duplicate fund IDs, wrong
manifest version). Single source of truth for the seeder, future
scorer, and methodology-doc linter.
3. scripts/seed-sovereign-wealth.mjs — seeder shell with the three-tier
source priority from plan §3.4:
1. Official fund disclosures (MoF, central-bank, annual reports)
2. IFSWF member filings
3. SWFI public fund-rankings page (license-free fallback, scraped)
Tiers 1-3 are all stubbed (return null) in this commit — the
seeder publishes a well-formed empty payload so the scorer IMPUTE
fallback can be exercised end-to-end without live data.
emptyDataIsFailure: false is set deliberately so pre-wiring cron
runs do not poison seed-meta (see
feedback_strict_floor_validate_fail_poisons_seed_meta.md).
SWFI scrape target is documented in the file header with the
exact URL and a 2.5s inter-request interval. The scraper itself
lands in the next commit after the external reviewer signs off
on the manifest.
4. tests/swf-classification-manifest.test.mjs — 14 tests exercising
both the shipped YAML (plan §3.4 required-fund presence, [0,1]
bounds, rationale length, source citations, multi-fund country
handling) and the validator's schema enforcement (rejects out-
of-range scores, non-ISO2 codes, missing rationale, empty sources,
duplicates, wrong version, invalid review status).
Out of scope for this commit (follow-ups, in order):
- Implement SWFI scrape + IFSWF parse + per-fund official endpoints
- Add `liquidReserveAdequacy` and `sovereignFiscalBuffer` dimensions
to RESILIENCE_DIMENSION_ORDER, registry, and scorers
- Retire `reserveAdequacy` via RESILIENCE_RETIRED_DIMENSIONS
- cache-keys.ts + api/bootstrap.js + api/health.js wiring (new
seed key needs ON_DEMAND_KEYS gating per Railway-cron bake-in rule)
- Recovery-domain weight rebalance + Spearman sensitivity rerun
- Methodology doc: rewrite the reserveAdequacy section
Tests: 508/508 pass (resilience suite + new manifest tests).
Typecheck clean on both tsconfig.json and tsconfig.api.json.
No external-facing behavior change — all files are new + isolated.
* feat(resilience): PR 2 commit 2 — Wikipedia SWF scraper + SWFI pivot
Implements Tier 3 of the sovereignFiscalBuffer seeder. Tier 1 (official
disclosures) and Tier 2 (IFSWF filings) remain stubbed — they require
per-fund bespoke adapters and will land incrementally.
SWFI pivot
----------
The plan's original Tier 3 target was
https://www.swfinstitute.org/fund-rankings/sovereign-wealth-fund. Live
check on 2026-04-23: the page's <tbody> is empty and AUM is gated
behind a lead-capture form (name + company + job title). SWFI per-fund
/profile/<id> pages are similarly barren. The "public fund rankings"
is effectively no longer public; scraping the lead-gated surface would
require submitting fabricated contact info (TOS violation, legally
questionable), so Tier 3 pivots to Wikipedia.
Wikipedia is legally clean (CC-BY-SA 4.0, attribution required — see
WIKIPEDIA_SOURCE_ATTRIBUTION in the seeder) and structurally scrapable.
The SWFI Linaburg-Maduell Transparency Index mentioned in manifest
rationale text is a SEPARATE SWFI publication (public index scores),
not the fund-rankings paywall — those citations stay valid.
What is in
----------
1. scripts/seed-sovereign-wealth.mjs — Wikipedia scraper implementation:
- parseWikipediaRankingsTable(html) — exported pure function so
the parser is unit-testable without a live fetch. Extracts the
wikitable, parses per-fund rows (Country, Abbrev, Fund name,
Assets USD B, Inception, Origin).
- Strip-HTML helper strips <sup> tags to SPACES (not empty) so
`302.0<sup>41</sup>` stays `302.0 41` — otherwise the decimal
value and its trailing footnote ref get welded into `302.041`,
which the Assets regex mis-parses.
- matchWikipediaRecord(fund, cache) — abbrev + fund-name lookup
with country disambiguation: lookup maps are now
Map<key, Record[]> (list) rather than Map<key, Record>, and the
matcher filters the list by manifest country before returning.
This is the exact fix for the PIF collision:
"PIF" resolves to BOTH Saudi Arabia's Public Investment Fund
(~USD 925B) and Palestine's Palestine Investment Fund (~USD 900M)
on the live article. Without country-filtering, Map.set silently
overwrites one with the other, so Saudi PIF would return
Palestine's AUM — three orders of magnitude wrong.
- When the country disambiguator cannot pick, returns null rather
than a best-guess. Seeder logs the unmatched fund; the IMPUTE
path handles it gracefully.
2. docs/methodology/swf-classification-manifest.yaml — added
`wikipedia` hints block to each of the 8 funds (abbrev and/or
fund_name, matching Wikipedia's canonical naming).
3. scripts/shared/swf-manifest-loader.mjs — optional `wikipedia` field
in the schema: `abbrev` and `fund_name` both optional strings, but
at least one must be present if the block is provided.
4. tests/seed-sovereign-wealth.test.mjs — 12 tests exercising:
- fixture-based parser: abbrev/name indexing, HTML + footnote
stripping, decimal AUM, malformed rows skipped, missing-table error
- abbrev-collision handling: both candidates retained in the list
- country-disambiguation matcher: Saudi PIF correctly picked from
a Saudi-vs-Palestine collision fixture (the exact live bug)
- ambiguous lookup with unknown country returns null, not wrong record
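
A minimal sketch of the `<sup>`-to-space rule from item 1, not the seeder's actual code. The helper names `stripHtmlSketch` and `parseAssetsBillionsSketch` are hypothetical, and the sketch assumes the Assets column is denominated in USD billions, as the column list above suggests.

```javascript
// Hypothetical helper names; illustrates the <sup>-to-space rule only.
function stripHtmlSketch(cellHtml) {
  return cellHtml
    .replace(/<\/?sup\b[^>]*>/gi, ' ') // footnote tags become separators
    .replace(/<[^>]+>/g, '')           // every other tag is dropped
    .replace(/\s+/g, ' ')
    .trim();
}

// Assumption: the Assets cell is in USD billions. Only the first numeric
// token is read, so a trailing footnote ref cannot leak into the value.
function parseAssetsBillionsSketch(cellHtml) {
  const match = /^([\d,]+(?:\.\d+)?)/.exec(stripHtmlSketch(cellHtml));
  if (!match) return null;
  return Math.round(Number(match[1].replace(/,/g, '')) * 1e9);
}

const stripped = stripHtmlSketch('302.0<sup>41</sup>');
const mubadalaAum = parseAssetsBillionsSketch('302.0<sup>41</sup>');
const gpfgAum = parseAssetsBillionsSketch('2,117<sup>37</sup>');
const emptyCell = parseAssetsBillionsSketch('');
```

Replacing the tags with an empty string instead would yield `302.041`, which is the mis-parse described above.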
Live verification against the shipped Wikipedia article: 7/8 funds
matched with the correct country; Saudi PIF now correctly returns
USD 925B (not Palestine's USD 0.9B) because of the country-
disambiguation fix. Temasek is the one miss — Wikipedia does not
classify it as an SWF (practitioner debate; it lists under "state
holding companies" instead). Falls through to IMPUTE in the scorer
until Tier 1/2 adapters land with an official-disclosure source.
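
The list-valued index and country filter can be sketched roughly as follows. This is an illustration under assumptions, not the shipped matcher: `addToIndex`, `pickByCountry`, and the `COUNTRY_NAMES` ISO-2 to country-name table are hypothetical names introduced here.

```javascript
// Hypothetical sketch of a Map<key, Record[]> index with country filtering.
function addToIndex(index, key, record) {
  const list = index.get(key);
  if (list) list.push(record); // keep every candidate, never overwrite
  else index.set(key, [record]);
}

function normalizeAbbrev(abbrev) {
  return abbrev.toUpperCase().replace(/[^A-Z0-9]/g, '');
}

// Assumed mapping from manifest ISO-2 codes to Wikipedia country names.
const COUNTRY_NAMES = { SA: 'Saudi Arabia', PS: 'Palestine' };

// Returns the single candidate for the country, or null when the
// disambiguator cannot pick (unknown country, zero or several hits).
function pickByCountry(candidates, iso2) {
  const name = COUNTRY_NAMES[iso2];
  if (!name) return null;
  const hits = candidates.filter((r) => r.countryName === name);
  return hits.length === 1 ? hits[0] : null;
}

const byAbbrev = new Map();
addToIndex(byAbbrev, normalizeAbbrev('PIF'), {
  countryName: 'Saudi Arabia', fundName: 'Public Investment Fund', aum: 925e9,
});
addToIndex(byAbbrev, normalizeAbbrev('PIF'), {
  countryName: 'Palestine', fundName: 'Palestine Investment Fund', aum: 0.9e9,
});

const saudi = pickByCountry(byAbbrev.get('PIF'), 'SA');
const unknown = pickByCountry(byAbbrev.get('PIF'), 'ZZ');
```

Returning null for the unknown country mirrors the "no best-guess" rule: an unmatched fund is cheaper than a wrong AUM.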
Tests: 522/522 pass (resilience + manifest + scraper).
Typecheck clean on both tsconfig.json and tsconfig.api.json.
Still stubbed for later commits:
- Tier 1 per-fund official-disclosure adapters (incl. Temasek)
- Tier 2 IFSWF secretariat parser
- Dimension wiring (liquidReserveAdequacy, sovereignFiscalBuffer)
- reserveAdequacy retirement via RESILIENCE_RETIRED_DIMENSIONS
- cache-keys / bootstrap / health.js wiring (ON_DEMAND_KEYS until bake-in)
- Recovery-domain weight rebalance + Spearman sensitivity rerun
* feat(resilience): PR 2 commit 3 — Wikipedia infobox fallback + FX → 8/8 match
Closes the Temasek gap. The Wikipedia list article excludes Temasek on
editorial grounds (classified as a "state holding company" rather than
an SWF), so the Tier-3 list-only path topped out at 7/8 funds matched.
This commit adds Tier 3b — per-fund Wikipedia article infobox scrape
— and a baked-in FX table to handle non-USD infobox currencies.
Live verification on the shipped Wikipedia articles: 8/8 funds matched.
Temasek: S$ 434B → US$ 321B via infobox + SGD→USD FX.
Implementation
1. scripts/seed-sovereign-wealth.mjs
- FX_TO_USD table (USD, SGD, NOK, EUR, GBP, AED, SAR, KWD, QAR)
with FX_RATES_REVIEWED_AT='2026-04-23' committed into the seed
payload so stale rates are visible at audit time.
- CURRENCY_SYMBOL_TO_ISO ordered list — US$ tested before S$ before
bare $, and $ / kr require a space + digit neighbor to avoid
false-matches in rich prose.
- detectCurrency(text) exported pure for unit testing.
- parseWikipediaArticleInfobox(html) exported pure — scans rows
for "Total assets" / "Assets under management" / "AUM" / "Net
assets" / "Net portfolio value" labels, extracts "NUMBER (trillion
| billion | million) (YEAR)" values, applies FX conversion.
- fetchWikipediaInfobox(fund) — per-fund article fetch, gated on
the manifest's wikipedia.article_url hint.
- sourceMix split into {official, ifswf, wikipedia_list,
wikipedia_infobox} counters so the seed payload shows which tier
delivered each fund.
- Source priority chain: official → ifswf → wikipedia_list →
wikipedia_infobox. Infobox last because it is N network round-
trips; amortizing over the list article cache first minimizes
live traffic.
2. docs/methodology/swf-classification-manifest.yaml
- Temasek entry gains wikipedia.article_url:
https://en.wikipedia.org/wiki/Temasek_Holdings with an inline
comment explaining why the list-article path misses.
3. scripts/shared/swf-manifest-loader.mjs
- article_url optional field; validator rejects anything that is
not a https://<lang>.wikipedia.org/... URL so a typo cannot
silently wire the seeder to an off-site fetch.
4. tests/seed-sovereign-wealth.test.mjs (10 new tests, 38/38 pass)
- detectCurrency distinguishes US$ vs S$ vs bare $.
- parseWikipediaArticleInfobox extracts Temasek S$ 434B → US$ 321B
with year tag from "(2025)".
- USD-native row pass-through with fxRate=1.0.
- NOK trillion conversion (NOK 18.7T → USD 1.74T).
- Returns null when no AUM row / no infobox at all.
- Documents the unknown-currency → USD fallback contract.
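
For orientation, a rough sketch of ordered currency detection and "NUMBER unit (YEAR)" extraction. The pattern list and the names `detectCurrencySketch` / `parseAmountSketch` are assumptions made for illustration; the real CURRENCY_SYMBOL_TO_ISO list may differ, and FX conversion is left out here.

```javascript
// Hypothetical pattern order: specific symbols first, bare "$" / "kr" last.
const CURRENCY_PATTERNS_SKETCH = [
  [/US\$/, 'USD'],
  [/S\$/, 'SGD'],
  [/\bNOK\b/, 'NOK'],
  [/\bEUR\b|€/, 'EUR'],
  [/\$\s?\d/, 'USD'],               // bare $ needs a digit neighbor
  [/\d\s?kr\b|\bkr\s?\d/, 'NOK'],   // bare kr needs a digit neighbor
];

function detectCurrencySketch(text) {
  for (const [pattern, iso] of CURRENCY_PATTERNS_SKETCH) {
    if (pattern.test(text)) return iso;
  }
  return null;
}

const UNIT_MULTIPLIER = { trillion: 1e12, billion: 1e9, million: 1e6 };

// Extracts the first "NUMBER unit" pair and an optional "(YYYY)" tag.
function parseAmountSketch(text) {
  const re = /([\d,]+(?:\.\d+)?)\s*(trillion|billion|million)\b(?:[^()]*\((\d{4})\))?/i;
  const m = re.exec(text);
  if (!m) return null;
  const scaled = Number(m[1].replace(/,/g, '')) * UNIT_MULTIPLIER[m[2].toLowerCase()];
  // Math.round guards against floating-point noise on values like 18.7e12.
  return { valueNative: Math.round(scaled), aumYear: m[3] ? Number(m[3]) : null };
}

const cell = 'S$ 434 billion (2025)';
const parsed = { currencyNative: detectCurrencySketch(cell), ...parseAmountSketch(cell) };
const nok = parseAmountSketch('NOK 18.7 trillion');
```

Testing `US$` before `S$` matters because `S$` is a substring of `US$`; with the order reversed, every USD infobox would be read as SGD.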
Tests: 532/532 pass (full resilience + manifest + scraper suite).
Typecheck clean on both tsconfig.json and tsconfig.api.json.
Still stubbed for later commits:
- Tier 1 per-fund official-disclosure adapters
- Tier 2 IFSWF secretariat parser
- Dimension wiring (liquidReserveAdequacy, sovereignFiscalBuffer)
- reserveAdequacy retirement via RESILIENCE_RETIRED_DIMENSIONS
- cache-keys / bootstrap / health.js wiring (ON_DEMAND_KEYS)
- Recovery-domain weight rebalance + Spearman sensitivity rerun
* refactor(resilience): reuse project-shared FX infrastructure for SWF seeder
Self-caught duplication from the previous commit (699ba832a introduced
a local FX_TO_USD table and FX_RATES_REVIEWED_AT constant). The
codebase already has the canonical path:
scripts/_seed-utils.mjs
SHARED_FX_FALLBACKS (USD/SGD/NOK/EUR/GBP/AED/SAR/QAR/KWD/...)
getSharedFxRates() (Redis shared:fx-rates:v1 4h cache + Yahoo)
fetchYahooFxRates()
Used by seed-grocery-basket, seed-fuel-prices, seed-bigmac. Two FX
tables would drift and the live-rate layer (Yahoo via Redis cache)
would be orphaned on the SWF path.
What changed
- Deleted local FX_TO_USD / FX_RATES_REVIEWED_AT constants.
- parseWikipediaArticleInfobox() no longer performs FX conversion.
Returns { valueNative, currencyNative, aumYear } so the seeder
orchestrator applies project-shared rates at call time. Parser is
now currency-agnostic and thinner.
- Added lookupUsdRate(currency, fxRates) helper:
* USD → 1.0 short-circuit
* prefer the live map (getSharedFxRates output) over static fallback
* fall back to SHARED_FX_FALLBACKS
* return null on unknown currency (caller skips the fund — no silent
wrong-currency misreading).
- fetchWikipediaInfobox() accepts fxRates map, converts via
lookupUsdRate, returns enriched { aum, currencyNative, fxRate }.
- fetchSovereignWealth() fetches fxRates once at the top via
getSharedFxRates(buildFxSymbolsForSwf(), SHARED_FX_FALLBACKS), in
parallel with World Bank imports + Wikipedia list. Warms the shared
Redis FX cache for other seeders at the same time.
- Seed payload drops the fxRatesReviewedAt field; the shared cache
carries that metadata at the Redis level for all seeders.
Tests updated
- parseWikipediaArticleInfobox tests assert the native value + ISO
code, no longer the USD-converted amount.
- New `lookupUsdRate` suite pins the project-shared FX integration:
USD short-circuit, live-rate preference, static fallback, unknown-
currency null, and a Temasek S$ 434B → US$ 321B end-to-end case
via the shared fallback table.
Live re-verification still 8/8; SGD comes through SHARED_FX_FALLBACKS
at 0.74 (same number as the deleted local table), so behavior is
identical but the dedupe is real.
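
A sketch of the lookup order described under "What changed", assuming rates are plain numbers keyed by ISO code. `lookupUsdRateSketch`, `isUsableRate`, and `FALLBACKS_SKETCH` are hypothetical stand-ins; only the SGD 0.74 figure comes from this commit message.

```javascript
// Stand-in for SHARED_FX_FALLBACKS; only SGD=0.74 is taken from the text above.
const FALLBACKS_SKETCH = { SGD: 0.74 };

// Assumption: zero, negative, and non-numeric rates count as "no rate".
function isUsableRate(rate) {
  return typeof rate === 'number' && Number.isFinite(rate) && rate > 0;
}

function lookupUsdRateSketch(currency, liveRates, fallbacks = FALLBACKS_SKETCH) {
  if (currency === 'USD') return 1.0;          // short-circuit
  const live = liveRates?.[currency];          // tolerates a null live map
  if (isUsableRate(live)) return live;         // live rate wins
  const fallback = fallbacks[currency];
  return isUsableRate(fallback) ? fallback : null; // null: caller skips the fund
}

const liveRate = lookupUsdRateSketch('SGD', { SGD: 0.751 });
const fallbackRate = lookupUsdRateSketch('SGD', {});
const unknownRate = lookupUsdRateSketch('ZZZ', {});
// Temasek S$ 434B at the fallback rate is roughly US$ 321B.
const temasekUsd = fallbackRate === null ? null : 434e9 * fallbackRate;
```

Returning null rather than defaulting to 1.0 keeps an unrecognized currency from being silently read as USD.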
Tests: 536/536 pass. Typecheck clean on both tsconfig configs.
* refactor(resilience): split SWF manifest validator into sub-helpers
Biome reported validateManifest at complexity 55 vs max 50. Extracted
the per-fund validation into validateFundEntry(raw, idx, seen) and
pulled out validateClassification, validateRationale, validateSources,
validateWikipediaHints as separate helpers. Behavior and tests are
unchanged; each helper is now well under the complexity cap and the
main validator reads linearly.
Tests: 42/42 manifest + scraper tests pass. Typecheck clean.
* fix(resilience): PR 2 review — partial-seed guard + manifest REVIEWED status
Addresses two P1 findings on PR #3305.
P1.1 — partial-seed silent corruption on multi-fund countries
-------------------------------------------------------------
For multi-fund countries (AE = ADIA + Mubadala, SG = GIC + Temasek)
the previous aggregation silently published a partial
totalEffectiveMonths if a secondary fund's scraper drifted on
Wikipedia — recordCount would still look green because we counted
"any fund matched" as a successful country-seed. Downstream scorer
would under-rank those countries with no missingness signal.
Fix:
- Each country entry now carries { expectedFunds, matchedFunds,
completeness } alongside the existing totalEffectiveMonths. The
scorer can use completeness < 1.0 to derate (treat as degraded
coverage) rather than accept the partial number at face value.
- declareRecords counts ONLY countries with completeness === 1.0,
so a secondary-fund drift drops the seed-meta record_count and
triggers the operational alarm. recordCount in runSeed opts now
delegates to declareRecords for parity.
- A warn log fires per partial country so the Railway cron log is
loud on drift without poisoning seed-meta.
- 4 new tests pin: all-matched counts, partial drops, empty/malformed
payloads, and defensive handling of pre-completeness payload shape.
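
The completeness bookkeeping could look roughly like the sketch below. The function names, the per-fund `effectiveMonths` field, and the numbers are hypothetical and purely illustrative; treating entries without a `completeness` field as incomplete is an assumption of this sketch.

```javascript
// Hypothetical: builds one country entry from the funds that matched.
function summarizeCountrySketch(expectedFundIds, matchedFunds) {
  const matched = expectedFundIds.filter((id) => matchedFunds.has(id));
  const total = matched.reduce((sum, id) => sum + matchedFunds.get(id).effectiveMonths, 0);
  return {
    expectedFunds: expectedFundIds.length,
    matchedFunds: matched.length,
    completeness: expectedFundIds.length === 0 ? 0 : matched.length / expectedFundIds.length,
    totalEffectiveMonths: total,
  };
}

// Counts only fully matched countries, so a partial country lowers the count.
function declareRecordsSketch(payload) {
  const countries = payload?.countries;
  if (countries === null || typeof countries !== 'object' || Array.isArray(countries)) return 0;
  return Object.values(countries).filter((c) => c?.completeness === 1).length;
}

// Illustrative numbers only: Mubadala is "missing" from the matched set.
const matched = new Map([
  ['gpfg', { effectiveMonths: 10 }],
  ['adia', { effectiveMonths: 4 }],
]);
const payload = {
  countries: {
    NO: summarizeCountrySketch(['gpfg'], matched),
    AE: summarizeCountrySketch(['adia', 'mubadala'], matched),
  },
};
const recordCount = declareRecordsSketch(payload);
```

Here AE still carries a partial total, but its completeness of 0.5 keeps it out of the record count.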
P1.2 — manifest external-reviewer language contradicted shipped workflow
------------------------------------------------------------------------
The YAML header said "External sovereign-wealth-practitioner review
is REQUIRED before PR 2 merges" and external_review_status=PENDING.
WorldMonitor's operating mode is fully automated (see memory
`feedback_no_external_reviewer_assumption.md`) — there is no external
practitioner gate. Reviewer correctly flagged the inconsistency
between the document and the shipped behaviour.
Fix:
- Rewrote the header to describe the actual audit discipline:
coefficients derive from the committed rationale + cited sources
for each fund; revisions require the same discipline in a follow-
up PR. No external-gate language.
- Flipped external_review_status to REVIEWED, with a clarifying
comment: REVIEWED = coefficients derive from the committed
rationale + seeder end-to-end matches the live surfaces. PENDING
remains reserved for future PRs that ship unresolved TBD
coefficients.
- Rewrote the "candidates deferred from v1" trailing block. Each
fund listed now has a concrete rationale for deferral (sanctions /
access coefficient would pin at 0 / classification contested /
AUM disclosure unstable) so a future PR author can argue the case
on record. No "reviewer advice needed" placeholders.
- Tweaked two inline fund comments (UAE ADQ/ICD, Singapore Temasek)
that still said "external reviewer" — now describe the substantive
reason for inclusion/deferral.
Tests
-----
- 540/540 resilience + manifest + scraper tests pass.
- Typecheck clean on both tsconfig configs.
- Biome clean on all touched files.
* fix(resilience): PR 2 review — 4 P2 fixes (WB imports, null validate, nested table, aumYear)
Greptile P2 findings on PR #3305, all addressed.
(1) Silent country drop on missing WB imports
---------------------------------------------
If fetchAnnualImportsUsd() has no entry for a manifest ISO-2
(transient WB outage, new country with spotty coverage), the country
was silently skipped. Downstream scorer would then read "absent from
payload" as "no SWF" and score 0 with full coverage — substantively
wrong. Now logs a warn and adds each affected fund to the unmatched
list with a `(no WB imports)` suffix so the seed-meta observer sees
the degradation.
(2) typeof null === 'object' bypassed validate()
------------------------------------------------
Bare `typeof data?.countries === 'object'` returned true for
{ countries: null } and { countries: [] }. Downstream property
access would then crash. Strict check added: non-null plain object
only; also rejects arrays. Test pins all 5 edge cases.
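
A small sketch of the stricter guard, assuming validate() only needs to confirm the shape of `countries`. `isPlainObject` and `validateSketch` are hypothetical names, and the real validate() may check more than this.

```javascript
// typeof null and typeof [] are both 'object', so each is excluded explicitly.
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Hypothetical stand-in for validate(): shape check only.
function validateSketch(data) {
  return isPlainObject(data?.countries);
}

const results = {
  nullCountries: validateSketch({ countries: null }),
  arrayCountries: validateSketch({ countries: [] }),
  missingCountries: validateSketch({}),
  undefinedPayload: validateSketch(undefined),
  objectCountries: validateSketch({ countries: { NO: {} } }),
};
```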
(3) Nested </table> / </td> truncated wikitable parse
-----------------------------------------------------
Lazy [\s\S]*? in the outer table regex AND the inner row/cell regexes
could silently drop every row after any cell that contained a nested
mini-table (Wikipedia footnote boxes, sort helpers). Two-step fix:
- extractFirstWikitable: depth-aware walk counts <table>/</table>
opens and closes, returns content at balanced depth
- stripNestedTables: iteratively removes complete inner
<table>…</table> blocks BEFORE row parsing, so the lazy row / cell
regexes never see a nested </tr> or </td>
Test: 5-row fixture with a nested table inside row 1's cell — ADIA
(row 2) must still parse, GPFG (row 1 with nested) must still parse.
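
One way the two-step fix could be sketched, under the assumption that tags are well-formed and attribute values contain no `>`. The function names carry a `Sketch` suffix because they are illustrative, not the shipped extractFirstWikitable / stripNestedTables.

```javascript
// Depth-aware walk: returns the first wikitable including its balanced close.
function extractFirstWikitableSketch(html) {
  const open = html.search(/<table\b[^>]*class="[^"]*\bwikitable\b/i);
  if (open === -1) return null;
  const tagRe = /<table\b[^>]*>|<\/table\s*>/gi;
  tagRe.lastIndex = open;
  let depth = 0;
  let match;
  while ((match = tagRe.exec(html)) !== null) {
    depth += match[0][1] === '/' ? -1 : 1; // second char is '/' for a close tag
    if (depth === 0) return html.slice(open, match.index + match[0].length);
  }
  return null; // unbalanced markup
}

// Removes innermost <table>…</table> blocks until none remain inside.
// Assumes the input is the extractor's output, so it ends with </table>.
function stripNestedTablesSketch(tableHtml) {
  const start = tableHtml.indexOf('>') + 1;
  const end = tableHtml.lastIndexOf('</');
  let inner = tableHtml.slice(start, end);
  const innermost = /<table\b[^>]*>(?:(?!<table\b)[\s\S])*?<\/table\s*>/i;
  while (innermost.test(inner)) inner = inner.replace(innermost, ' ');
  return tableHtml.slice(0, start) + inner + tableHtml.slice(end);
}

const html = '<p>intro</p><table class="wikitable">'
  + '<tr><td>GPFG<table><tr><td>note</td></tr></table></td><td>2,117</td></tr>'
  + '<tr><td>ADIA</td><td>1,128</td></tr>'
  + '</table><p>after</p>';
const table = extractFirstWikitableSketch(html);
const flat = stripNestedTablesSketch(table);
const rows = flat.match(/<tr\b[^>]*>[\s\S]*?<\/tr>/gi) ?? [];
```

With the nested table removed first, the lazy row regex sees exactly one `</tr>` per row, so the GPFG row keeps its Assets cell.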
(4) aumYear reflected scrape year, not data year
------------------------------------------------
List-article entries were stamped with `new Date().getFullYear()`
even though the Wikipedia list publishes no per-row data-year
annotation (figures are typically prior-period). Consumers using
aumYear for freshness audit would see "2026" for 2024/2025 data.
Now set to null for list entries; infobox tier 3b retains year
extraction from the "(YYYY)" tag on the individual fund article.
P1 bootstrap deferral: intentional per project memory
-----------------------------------------------------
AGENTS.md says new data sources MUST wire api/bootstrap.js. Not done
in this PR by design:
- No RPC consumer exists yet for
`resilience:recovery:sovereign-wealth:v1` (scorer lands in a
follow-up PR; wiring bootstrap without a consumer would be dead
code).
- Local memory `feedback_health_required_key_needs_railway_cron_
first.md` requires new seed keys to sit in ON_DEMAND_KEYS for
~7 days of clean Railway cron before promoting to
BOOTSTRAP_KEYS — adding bootstrap wiring now would pre-empt
that window and risk CRIT alarms on the health surface.
The scorer PR that follows will land the bootstrap wiring + the
dimension at the same time, which is the cohesive unit.
Tests: 547/547 resilience + manifest + scraper tests pass.
Typecheck clean on both tsconfig configs. Biome clean on touched
files.
532 lines
21 KiB
JavaScript
import assert from 'node:assert/strict';
import { describe, it } from 'node:test';

import {
  declareRecords,
  detectCurrency,
  lookupUsdRate,
  matchWikipediaRecord,
  parseWikipediaArticleInfobox,
  parseWikipediaRankingsTable,
  validate,
} from '../scripts/seed-sovereign-wealth.mjs';
import { SHARED_FX_FALLBACKS } from '../scripts/_seed-utils.mjs';

// Fixture HTML mirrors the structure observed on the shipping
// Wikipedia "List of sovereign wealth funds" article (captured
// 2026-04-23). Kept inline so the scraper's parsing rules are
// exercised without a live network round-trip. If Wikipedia later
// changes the column order or header text, update this fixture AND
// the assumed-columns comment in scripts/seed-sovereign-wealth.mjs
// in the same commit.

const FIXTURE_HTML = `
<html><body>
<table class="wikitable sortable static-row-numbers">
  <thead>
    <tr>
      <th scope="col">Country or region</th>
      <th scope="col">Abbrev.</th>
      <th scope="col">Fund name</th>
      <th scope="col">Assets</th>
      <th scope="col">Inception</th>
      <th scope="col">Origin</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="/wiki/Norway">Norway</a></td>
      <td>GPF-G</td>
      <td><a href="/wiki/GPFG">Government Pension Fund Global</a></td>
      <td>2,117<sup>37</sup></td>
      <td>1990</td>
      <td>Oil & Gas</td>
    </tr>
    <tr>
      <td><a href="/wiki/UAE">United Arab Emirates</a></td>
      <td>ADIA</td>
      <td><a href="/wiki/ADIA">Abu Dhabi Investment Authority</a></td>
      <td>1,128<sup>40</sup></td>
      <td>1976</td>
      <td>Oil & Gas</td>
    </tr>
    <tr>
      <td><a href="/wiki/UAE">United Arab Emirates</a></td>
      <td></td>
      <td><a href="/wiki/Mubadala">Mubadala Investment Company</a></td>
      <td>302.0<sup>41</sup></td>
      <td>2002</td>
      <td>Oil & Gas</td>
    </tr>
    <tr>
      <td><a href="/wiki/Singapore">Singapore</a></td>
      <td>GIC</td>
      <td><a href="/wiki/GIC">GIC Private Limited</a></td>
      <td>801</td>
      <td>1981</td>
      <td>Non-commodity</td>
    </tr>
    <tr>
      <td><a href="/wiki/Singapore">Singapore</a></td>
      <td></td>
      <td><a href="/wiki/Temasek">Temasek Holdings</a></td>
      <td>382</td>
      <td>1974</td>
      <td>Non-commodity</td>
    </tr>
    <tr>
      <td><a href="/wiki/NoData">No Data Row</a></td>
      <td>NODATA</td>
      <td>Example fund without assets</td>
      <td></td>
      <td>2000</td>
      <td>Non-commodity</td>
    </tr>
  </tbody>
</table>
</body></html>
`;

describe('parseWikipediaRankingsTable — fixture-based scraping', () => {
  const cache = parseWikipediaRankingsTable(FIXTURE_HTML);

  it('indexes funds by normalized abbreviation into record lists', () => {
    // GPF-G → GPFG (normalized: uppercase, strip punctuation). Lookup
    // returns a list so ambiguous abbrevs (e.g. PIF → Saudi vs Palestine
    // on the live article) can be disambiguated at match time.
    const gpfgList = cache.byAbbrev.get('GPFG');
    assert.ok(Array.isArray(gpfgList) && gpfgList.length === 1, 'GPFG should have exactly one candidate in the fixture');
    const [gpfg] = gpfgList;
    assert.equal(gpfg.aum, 2_117_000_000_000);
    assert.equal(gpfg.fundName, 'Government Pension Fund Global');
    assert.equal(gpfg.countryName, 'Norway');
    assert.equal(gpfg.inceptionYear, 1990);

    assert.equal(cache.byAbbrev.get('ADIA')?.[0]?.aum, 1_128_000_000_000);
    assert.equal(cache.byAbbrev.get('GIC')?.[0]?.aum, 801_000_000_000);
  });

  it('indexes funds by normalized fund name for abbrev-less rows', () => {
    // Mubadala and Temasek have no abbreviation in the fixture,
    // so they must still be matchable by fundName.
    const mubadalaList = cache.byFundName.get('mubadala investment company');
    assert.ok(mubadalaList && mubadalaList.length === 1);
    assert.equal(mubadalaList[0].aum, 302_000_000_000);

    const temasekList = cache.byFundName.get('temasek holdings');
    assert.ok(temasekList && temasekList.length === 1);
    assert.equal(temasekList[0].aum, 382_000_000_000);
  });

  it('strips inline HTML + footnote references from the Assets cell', () => {
    // `2,117<sup>37</sup>` — the footnote int must be stripped
    // before parsing. `<sup>` strips to a space so the ref is a
    // separate token, not welded into the number.
    assert.equal(cache.byAbbrev.get('GPFG')[0].aum, 2_117_000_000_000);
  });

  it('skips rows with missing or malformed Assets value', () => {
    assert.equal(cache.byAbbrev.get('NODATA'), undefined);
    assert.equal(cache.byFundName.get('example fund without assets'), undefined);
  });

  it('handles decimal AUM values (e.g. "302.0")', () => {
    const mubadalaList = cache.byFundName.get('mubadala investment company');
    assert.equal(mubadalaList[0].aum, 302_000_000_000);
  });

  it('throws loudly when the expected wikitable is missing', () => {
    assert.throws(() => parseWikipediaRankingsTable('<html><body>no tables here</body></html>'),
      /wikitable not found/);
  });
});

// Separate describe block for the abbrev-collision disambiguation
// case since it requires a fixture with multiple rows sharing an
// abbrev. This is the exact class of bug observed on the live
// Wikipedia article (PIF → Saudi PIF + Palestine Investment Fund).
describe('parseWikipediaRankingsTable — abbrev collisions', () => {
  const COLLIDING_HTML = `
    <table class="wikitable">
      <thead><tr>
        <th>Country</th><th>Abbrev.</th><th>Fund name</th>
        <th>Assets</th><th>Inception</th><th>Origin</th>
      </tr></thead>
      <tbody>
        <tr>
          <td>Saudi Arabia</td><td>PIF</td><td>Public Investment Fund</td>
          <td>925</td><td>1971</td><td>Oil Gas</td>
        </tr>
        <tr>
          <td>Palestine</td><td>PIF</td><td>Palestine Investment Fund</td>
          <td>0.9</td><td>2003</td><td>Non-commodity</td>
        </tr>
      </tbody>
    </table>`;

  it('keeps BOTH colliding records under the shared abbrev key', () => {
    const cache = parseWikipediaRankingsTable(COLLIDING_HTML);
    const pifList = cache.byAbbrev.get('PIF');
    assert.ok(Array.isArray(pifList));
    assert.equal(pifList.length, 2, 'both colliding PIF records must be retained — silent overwrite would shadow Saudi PIF with Palestine');
  });
});

describe('matchWikipediaRecord — manifest-driven lookup', () => {
  const cache = parseWikipediaRankingsTable(FIXTURE_HTML);

  it('matches by abbrev when hints + country align', () => {
    const fund = {
      country: 'NO',
      fund: 'gpfg',
      wikipedia: { abbrev: 'GPF-G', fundName: 'Government Pension Fund Global' },
    };
    const hit = matchWikipediaRecord(fund, cache);
    assert.ok(hit);
    assert.equal(hit.fundName, 'Government Pension Fund Global');
  });

  it('falls back to fund-name match when no abbrev is provided', () => {
    const fund = {
      country: 'AE',
      fund: 'mubadala',
      wikipedia: { fundName: 'Mubadala Investment Company' },
    };
    const hit = matchWikipediaRecord(fund, cache);
    assert.ok(hit);
    assert.equal(hit.aum, 302_000_000_000);
  });

  it('normalizes abbrev punctuation (GPF-G ≡ GPFG)', () => {
    const fund = { country: 'NO', fund: 'gpfg', wikipedia: { abbrev: 'GPFG' } };
    const hit = matchWikipediaRecord(fund, cache);
    assert.ok(hit, 'normalized-abbrev match should succeed');
  });

  it('returns null when no hints match', () => {
    const fund = {
      country: 'NO',
      fund: 'unknown',
      wikipedia: { abbrev: 'XXXX', fundName: 'Nonexistent Fund' },
    };
    assert.equal(matchWikipediaRecord(fund, cache), null);
  });

  it('returns null when manifest entry has no wikipedia hints', () => {
    const fund = { country: 'NO', fund: 'no-hints' };
    assert.equal(matchWikipediaRecord(fund, cache), null);
  });
});

// ── Tier 3b: per-fund Wikipedia article infobox ──
//
// Activated for funds editorially excluded from the /wiki/List_of_
// sovereign_wealth_funds article (Temasek is the canonical case —
// Wikipedia classifies it as a "state holding company" rather than an
// SWF, despite the manifest including it per plan §3.4).
//
// The infobox parser must:
//   - scan rows for "Total assets", "Assets under management", "AUM",
//     "Net assets", "Net portfolio value" labels
//   - detect non-USD currencies (S$, €, £, NOK, etc.) and return the
//     native value + ISO code; USD conversion is applied by the seeder
//     via lookupUsdRate, not by the parser
//   - extract the year tag "(2025)" from the value for freshness
// Funds whose currency has no known FX rate are skipped by the caller
// (loud, not silent).

describe('detectCurrency — symbol and code detection', () => {
  it('distinguishes US$ from S$ from $', () => {
    assert.equal(detectCurrency('US$ 1,128 billion'), 'USD');
    assert.equal(detectCurrency('S$ 434 billion'), 'SGD');
    // Bare $ must NOT match US$ or S$ patterns, and must require a
    // digit after.
    assert.equal(detectCurrency('$ 500 billion'), 'USD');
  });

  it('detects Norwegian krone via NOK or kr', () => {
    assert.equal(detectCurrency('NOK 18.7 trillion'), 'NOK');
    assert.equal(detectCurrency('17,500 kr 500 billion'), 'NOK');
  });

  it('detects EUR via € symbol or ISO code', () => {
    assert.equal(detectCurrency('€ 500 million'), 'EUR');
    assert.equal(detectCurrency('500 EUR billion'), 'EUR');
  });

  it('returns null when no currency signal is present', () => {
    assert.equal(detectCurrency('500 billion'), null);
    assert.equal(detectCurrency(''), null);
  });
});

describe('parseWikipediaArticleInfobox — native value + currency extraction', () => {
  // Parser returns { valueNative, currencyNative, aumYear } and does
  // NOT convert to USD — conversion is applied at the seeder level
  // via the project-shared `getSharedFxRates` cache (see
  // scripts/_seed-utils.mjs). Keeping the parser FX-free removes a
  // duplicate copy of the FX table that would drift from the shared
  // one.
  //
  // Mirrors the Temasek infobox structure (abridged). Real row:
  // `<tr><th>Total assets</th><td>S$ 434 billion <i>(2025)</i><sup>2</sup></td></tr>`
  const TEMASEK_INFOBOX = `
    <html><body>
    <table class="infobox vcard">
      <tr><th>Type</th><td>Holding company</td></tr>
      <tr><th>Founded</th><td>25 June 1974</td></tr>
      <tr><th>Total assets</th><td>S$ 434 billion <i>(2025)</i><sup>2</sup></td></tr>
      <tr><th>Owner</th><td>Ministry of Finance</td></tr>
    </table>
    </body></html>
  `;

  it('extracts S$ 434 billion as native SGD value + year tag', () => {
    const hit = parseWikipediaArticleInfobox(TEMASEK_INFOBOX);
    assert.ok(hit, 'Temasek infobox should produce a hit');
    assert.equal(hit.currencyNative, 'SGD');
    assert.equal(hit.valueNative, 434_000_000_000);
    assert.equal(hit.aumYear, 2025);
  });

  it('handles USD-native infoboxes (currency detected as USD)', () => {
    const html = `<table class="infobox">
      <tr><th>AUM</th><td>US$ 1,500 billion (2025)</td></tr>
    </table>`;
    const hit = parseWikipediaArticleInfobox(html);
    assert.ok(hit);
    assert.equal(hit.currencyNative, 'USD');
    assert.equal(hit.valueNative, 1_500_000_000_000);
  });

  it('parses trillion-unit values (NOK 18.7 trillion)', () => {
    const html = `<table class="infobox">
      <tr><th>Net assets</th><td>NOK 18.7 trillion (2025)</td></tr>
    </table>`;
    const hit = parseWikipediaArticleInfobox(html);
    assert.ok(hit);
    assert.equal(hit.currencyNative, 'NOK');
    assert.equal(hit.valueNative, 18_700_000_000_000);
  });

  it('returns null when no AUM-labeled row is present', () => {
    const html = `<table class="infobox">
      <tr><th>Type</th><td>Holding company</td></tr>
    </table>`;
    assert.equal(parseWikipediaArticleInfobox(html), null);
  });

  it('returns null when the infobox itself is missing', () => {
    assert.equal(parseWikipediaArticleInfobox('<html>no infobox</html>'), null);
  });
});

describe('lookupUsdRate — project-shared FX integration', () => {
  // Verifies the parser → FX conversion pipeline uses the project's
  // canonical FX source (scripts/_seed-utils.mjs SHARED_FX_FALLBACKS +
  // getSharedFxRates Redis cache) rather than a duplicate table.

  it('returns 1.0 for USD regardless of rate map', () => {
    assert.equal(lookupUsdRate('USD', {}), 1.0);
    assert.equal(lookupUsdRate('USD', null), 1.0);
    assert.equal(lookupUsdRate('USD', { USD: 999 }), 1.0);
  });

  it('prefers the live rate map over the static fallback', () => {
    // Simulate getSharedFxRates returning a fresh Yahoo rate. The static
    // fallback has SGD=0.74; the live rate could drift (e.g. 0.751).
    assert.equal(lookupUsdRate('SGD', { SGD: 0.751 }), 0.751);
  });

  it('falls back to SHARED_FX_FALLBACKS when the live rate is missing', () => {
    assert.equal(lookupUsdRate('SGD', {}), SHARED_FX_FALLBACKS.SGD);
    assert.equal(lookupUsdRate('NOK', { EUR: 1.05 }), SHARED_FX_FALLBACKS.NOK);
  });

  it('returns null for unknown currencies (caller skips the fund)', () => {
    assert.equal(lookupUsdRate('ZZZ', {}), null);
    assert.equal(lookupUsdRate('XXX', { XXX: 0 }), null);
  });

  it('converts Temasek S$ 434B end-to-end via shared fallback table', () => {
    const hit = parseWikipediaArticleInfobox(`
      <table class="infobox"><tr><th>Total assets</th><td>S$ 434 billion (2025)</td></tr></table>
    `);
    const rate = lookupUsdRate(hit.currencyNative, {});
    const aumUsd = hit.valueNative * rate;
    // 434B × 0.74 = 321.16B. Matches SHARED_FX_FALLBACKS.SGD.
    assert.ok(aumUsd > 300_000_000_000 && aumUsd < 340_000_000_000,
      `expected ~US$ 320B, got ${aumUsd}`);
  });
});

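// Illustrative sketch only, not the project's implementation: one way a
// lookup satisfying the contract pinned by the tests above could be written.
// The name `sketchLookupUsdRate` and the explicit `fallbacks` parameter are
// assumptions made for this example; the real lookupUsdRate presumably reads
// SHARED_FX_FALLBACKS directly.
function sketchLookupUsdRate(currency, liveRates, fallbacks = {}) {
  // USD is the base currency, so any map entry for it is ignored.
  if (currency === 'USD') return 1.0;
  // Prefer a usable live rate (finite and strictly positive).
  const live = liveRates?.[currency];
  if (Number.isFinite(live) && live > 0) return live;
  // Otherwise fall back to the static table.
  const fallback = fallbacks[currency];
  if (Number.isFinite(fallback) && fallback > 0) return fallback;
  // Unknown currency or unusable rate: the caller is expected to skip the fund.
  return null;
}
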
describe('validate — reject null-object masquerading as object', () => {
  // `typeof null === 'object'` in JS, so a bare `typeof x === 'object'`
  // would let { countries: null } through and break downstream. This
  // test pins the strict non-null check.

  it('rejects { countries: null }', () => {
    assert.equal(validate({ countries: null }), false);
  });

  it('rejects missing countries field', () => {
    assert.equal(validate({}), false);
    assert.equal(validate(null), false);
    assert.equal(validate(undefined), false);
  });

  it('rejects array countries (typeof [] === object too)', () => {
    assert.equal(validate({ countries: [] }), false);
  });

  it('accepts empty object (during Railway-cron bake-in window)', () => {
    assert.equal(validate({ countries: {} }), true);
  });

  it('accepts populated countries', () => {
    assert.equal(validate({ countries: { NO: { funds: [] } } }), true);
  });
});

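// Illustrative sketch only: a strict "non-null, non-array object" check of
// the kind the tests above pin down. `sketchValidate` is a hypothetical name;
// the real validate may check more than this.
function sketchValidate(data) {
  // Optional chaining makes null / undefined payloads fall through to false.
  const countries = data?.countries;
  return typeof countries === 'object'
    && countries !== null          // typeof null === 'object'
    && !Array.isArray(countries);  // typeof [] === 'object'
}
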
describe('parseWikipediaRankingsTable — nested-table depth awareness', () => {
  // Wikipedia occasionally embeds mini-tables (sort helpers, footnote
  // boxes) inside a wikitable cell. A lazy `[\s\S]*?</table>` regex
  // would stop at the FIRST `</table>` and silently drop every row
  // after the cell containing the nested table. The depth-aware
  // extractor must walk the full open/close pair.

  it('does not truncate at a nested </table> inside a cell', () => {
    const html = `
      <table class="wikitable">
        <tr><th>Country</th><th>Abbrev.</th><th>Fund</th><th>Assets</th><th>Inception</th></tr>
        <tr>
          <td>Norway</td><td>GPF-G</td>
          <td>Government Pension Fund Global
            <table class="mini-sort-helper"><tr><td>nested</td></tr></table>
          </td>
          <td>2000</td><td>1990</td>
        </tr>
        <tr>
          <td>UAE</td><td>ADIA</td>
          <td>Abu Dhabi Investment Authority</td>
          <td>1128</td><td>1976</td>
        </tr>
      </table>
    `;
    const cache = parseWikipediaRankingsTable(html);
    // Without depth awareness, ADIA would be silently dropped because
    // the nested </table> inside GPF-G's cell would close the outer
    // match at row 1.
    assert.ok(cache.byAbbrev.get('ADIA')?.[0]?.aum === 1_128_000_000_000,
      'ADIA must survive — nested </table> in a prior cell should not truncate the wikitable');
    assert.ok(cache.byAbbrev.get('GPFG')?.[0]?.aum === 2_000_000_000_000);
  });
});

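// Illustrative sketch only: a depth-counting walk over <table> / </table>
// tags, shown to make the "depth-aware extractor" idea above concrete.
// `sketchExtractFirstTable` is a hypothetical helper, not the parser's
// actual code, and it ignores edge cases such as tags inside HTML comments.
function sketchExtractFirstTable(html) {
  const tagRe = /<table\b|<\/table\s*>/gi;
  let depth = 0;
  let start = -1;
  let match;
  while ((match = tagRe.exec(html)) !== null) {
    const isClose = match[0].startsWith('</');
    if (!isClose) {
      // Remember where the outermost table opens.
      if (depth === 0) start = match.index;
      depth += 1;
    } else if (depth > 0) {
      depth -= 1;
      // Only the close that balances the outermost open ends the span.
      if (depth === 0) return html.slice(start, match.index + match[0].length);
    }
  }
  return null; // no table, or unbalanced markup
}
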
describe('parseWikipediaRankingsTable — aumYear accuracy', () => {
  it('sets aumYear=null for list-article rows (no per-row data-year annotation)', () => {
    const html = `
      <table class="wikitable">
        <tr><th>Country</th><th>Abbrev.</th><th>Fund</th><th>Assets</th><th>Inception</th></tr>
        <tr><td>Norway</td><td>GPF-G</td><td>Government Pension Fund Global</td><td>2117</td><td>1990</td></tr>
      </table>
    `;
    const cache = parseWikipediaRankingsTable(html);
    const gpfg = cache.byAbbrev.get('GPFG')?.[0];
    assert.ok(gpfg);
    assert.equal(gpfg.aumYear, null,
      'aumYear must be null — the list article publishes no per-row data-year, and claiming the scrape year would mislead freshness auditors');
    // Infobox path (Tier 3b) sets a real aumYear from "(YYYY)" tag —
    // see the separate infobox test block for that contract.
  });
});

describe('declareRecords — partial-seed guard for multi-fund countries', () => {
  // Regression: for multi-fund countries (AE = ADIA + Mubadala,
  // SG = GIC + Temasek) a single scraper drift would silently publish
  // a partial totalEffectiveMonths if we counted "any fund matched"
  // as a successful country-seed. declareRecords MUST only count
  // countries with completeness === 1.0 so a secondary-fund drift
  // drops the seed-health record count and triggers the operational
  // alarm, rather than leaking an under-weighted total into the
  // ranking.

  it('counts only countries where all manifest funds matched', () => {
    const data = {
      countries: {
        NO: { funds: [{}], expectedFunds: 1, matchedFunds: 1, completeness: 1.0 },
        AE: { funds: [{}, {}], expectedFunds: 2, matchedFunds: 2, completeness: 1.0 },
        SG: { funds: [{}], expectedFunds: 2, matchedFunds: 1, completeness: 0.5 }, // partial
      },
    };
    assert.equal(declareRecords(data), 2,
      'SG (partial, completeness=0.5) must NOT count — recordCount stays at 2, not 3');
  });

  it('returns 0 when every country is partial', () => {
    const data = {
      countries: {
        AE: { expectedFunds: 2, matchedFunds: 1, completeness: 0.5 },
        SG: { expectedFunds: 2, matchedFunds: 1, completeness: 0.5 },
      },
    };
    assert.equal(declareRecords(data), 0,
      'all-partial payload must drop recordCount to 0 — the seed-meta alarm surfaces a degraded run');
  });

  it('returns 0 on empty / malformed payload', () => {
    assert.equal(declareRecords({}), 0);
    assert.equal(declareRecords({ countries: {} }), 0);
    assert.equal(declareRecords(null), 0);
    assert.equal(declareRecords(undefined), 0);
  });

  it('ignores entries lacking the completeness field (defensive)', () => {
    // Old payload shape (pre-completeness) must not spuriously count.
    const data = { countries: { XX: { funds: [{}], totalEffectiveMonths: 1 } } };
    assert.equal(declareRecords(data), 0);
  });
});

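// Illustrative sketch only: counting just the fully matched countries, as
// the regression comment above requires. `sketchDeclareRecords` is a
// hypothetical name; the real declareRecords may differ in detail.
function sketchDeclareRecords(data) {
  const countries = data?.countries;
  // Missing, null, or non-object payloads declare zero records.
  if (typeof countries !== 'object' || countries === null) return 0;
  // Strict equality means entries without a completeness field never count.
  return Object.values(countries)
    .filter((entry) => entry?.completeness === 1)
    .length;
}
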
describe('matchWikipediaRecord — country-disambiguation on abbrev collisions', () => {
  // This replays the exact class of bug observed on the live Wikipedia
  // article: "PIF" resolves to BOTH Saudi Arabia's Public Investment
  // Fund (~$925B) and Palestine's Palestine Investment Fund (~$900M).
  // Without country disambiguation, a naive Map.set overwrites one
  // with the other — Saudi PIF would silently return Palestine's AUM
  // (three orders of magnitude smaller), breaking the score for every
  // Saudi resilience read.
  const COLLIDING_HTML = `
    <table class="wikitable">
      <thead><tr>
        <th>Country</th><th>Abbrev.</th><th>Fund name</th>
        <th>Assets</th><th>Inception</th><th>Origin</th>
      </tr></thead>
      <tbody>
        <tr>
          <td>Saudi Arabia</td><td>PIF</td><td>Public Investment Fund</td>
          <td>925</td><td>1971</td><td>Oil Gas</td>
        </tr>
        <tr>
          <td>Palestine</td><td>PIF</td><td>Palestine Investment Fund</td>
          <td>0.9</td><td>2003</td><td>Non-commodity</td>
        </tr>
      </tbody>
    </table>`;
  const cache = parseWikipediaRankingsTable(COLLIDING_HTML);

  it('picks the Saudi record for fund.country=SA', () => {
    const fund = { country: 'SA', fund: 'pif', wikipedia: { abbrev: 'PIF' } };
    const hit = matchWikipediaRecord(fund, cache);
    assert.ok(hit);
    assert.equal(hit.countryName, 'Saudi Arabia');
    assert.equal(hit.aum, 925_000_000_000);
  });

  it('returns null (not the wrong record) when country is unknown to the disambiguator', () => {
    // Hypothetical fund from a country not in ISO2_TO_WIKIPEDIA_COUNTRY_NAME.
    // Must NOT silently return Saudi's or Palestine's record.
    const fund = { country: 'ZZ', fund: 'pif', wikipedia: { abbrev: 'PIF' } };
    assert.equal(matchWikipediaRecord(fund, cache), null,
      'ambiguous match with no country mapping must return null — silent wrong-country match is the exact bug this test guards against');
  });
});
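// Illustrative sketch only: disambiguating colliding abbreviations by
// country name. Everything here is an assumption made for the example: the
// names `SKETCH_ISO2_TO_NAME` and `sketchMatchByAbbrev`, the two-entry
// mapping, and the choice to accept a lone candidate without a country
// check. The real matchWikipediaRecord may behave differently.
const SKETCH_ISO2_TO_NAME = { SA: 'Saudi Arabia', PS: 'Palestine' };

function sketchMatchByAbbrev(fund, byAbbrev) {
  // Each abbreviation maps to an array, so colliding rows are all retained.
  const candidates = byAbbrev.get(fund.wikipedia?.abbrev) ?? [];
  if (candidates.length === 0) return null;
  if (candidates.length === 1) return candidates[0];
  // Ambiguous: require a known country name and exactly one matching row.
  const wanted = SKETCH_ISO2_TO_NAME[fund.country];
  if (!wanted) return null;
  const hits = candidates.filter((rec) => rec.countryName === wanted);
  return hits.length === 1 ? hits[0] : null;
}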