worldmonitor/scripts/seed-grocery-basket.mjs
Elie Habib 044598346e feat(seed-contract): PR 2a — runSeed envelope dual-write + 91 seeders migrated (#3097)
* feat(seed-contract): PR 2a — runSeed envelope dual-write + 91 seeders migrated

Opt-in contract path in runSeed: when opts.declareRecords is provided, write
{_seed, data} envelope to the canonical key alongside legacy seed-meta:*
(dual-write). State machine: OK / OK_ZERO / RETRY with zeroIsValid opt.
declareRecords throws or returns non-integer → hard fail (contract violation).
extraKeys[*] support per-key declareRecords; each extra key writes its own
envelope. Legacy seeders (no declareRecords) entirely unchanged.
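
The contract path described above can be sketched roughly as follows. This is a minimal illustration, not the actual runSeed code: `resolveState` and `buildEnvelope` are hypothetical helpers, the `resolveRecordCount` signature is assumed, and the `_seed` field layout (including a millisecond `fetchedAt`) is a guess based on the fields named in this commit series.

```javascript
// Hypothetical sketch of the contract-mode state machine (assumed names and shapes).

// declareRecords must return an integer; anything else is a contract violation.
// A throw inside declareRecords propagates to the caller as a hard fail.
function resolveRecordCount(declareRecords, data) {
  const count = declareRecords(data);
  if (!Number.isInteger(count)) {
    throw new Error(`contract violation: declareRecords returned ${String(count)}`);
  }
  return count;
}

// OK when records exist; zero records is OK_ZERO only if the seeder opts in.
function resolveState(recordCount, zeroIsValid = false) {
  if (recordCount > 0) return 'OK';
  return zeroIsValid ? 'OK_ZERO' : 'RETRY';
}

// Assumed envelope layout: metadata under _seed, payload under data.
function buildEnvelope(data, { recordCount, state, sourceVersion, schemaVersion }) {
  return {
    _seed: { fetchedAt: Date.now(), recordCount, state, sourceVersion, schemaVersion },
    data,
  };
}

// Usage with a grocery-basket-like payload.
const payload = { countries: [{ code: 'US' }, { code: 'GB' }] };
const declareRecords = (d) => (Array.isArray(d?.countries) ? d.countries.length : 0);
const recordCount = resolveRecordCount(declareRecords, payload);
const envelope = buildEnvelope(payload, {
  recordCount,
  state: resolveState(recordCount),
  sourceVersion: 'example-v1',
  schemaVersion: 1,
});
```

In this sketch a RETRY state would mean the canonical key is not overwritten, which matches the behaviour the later token-panels fix describes.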

Migrated all 91 scripts/seed-*.mjs to contract mode. Each exports
declareRecords returning the canonical record count, and passes
schemaVersion: 1 + maxStaleMin (matched to api/health.js SEED_META, or 2.5x
interval where no registry entry exists). Contract conformance reports 84/86
seeders with full descriptor (2 pre-existing warnings).

Legacy seed-meta keys still written so unmigrated readers keep working;
follow-up slices flip health.js + readers to envelope-first.

Tests: 61/61 PR 1 tests still pass.

Next slices for PR 2:
- api/health.js registry collapse + 15 seed-bundle-*.mjs canonicalKey wiring
- reader migration (mcp, resilience, aviation, displacement, regional-snapshot)
- direct writers — ais-relay.cjs, consumer-prices-core publish.ts
- public-boundary stripSeedEnvelope + test migration

Plan: docs/plans/2026-04-14-002-fix-runseed-zero-record-lockout-plan.md

* fix(seed-contract): unwrap envelopes in internal cross-seed readers

After PR 2a enveloped 91 canonical keys as {_seed, data}, every script-side
reader that returned the raw parsed JSON started silently handing callers the
envelope instead of the bare payload. WoW baselines (bigmac, grocery-basket,
fear-greed) saw undefined .countries / .composite; seed-climate-anomalies saw
undefined .normals from climate:zone-normals:v1; seed-thermal-escalation saw
undefined .fireDetections from wildfire:fires:v1; seed-forecasts' ~40-key
pipeline batch returned envelopes for every input.

Fix: route every script-side reader through unwrapEnvelope(...).data. Legacy
bare-shape values pass through unchanged (unwrapEnvelope returns
{_seed: null, data: raw} for any non-envelope shape).
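
A rough sketch of the unwrap behaviour described above, assuming an envelope is detected by an object-valued `_seed` plus a `data` key. The detection rule and the `readPayload` helper are assumptions for illustration; only the `{_seed: null, data: raw}` pass-through for non-envelope shapes is taken from this commit message.

```javascript
// Illustrative sketch, not the project's seed-envelope module.
function unwrapEnvelope(raw) {
  const isEnvelope =
    raw !== null &&
    typeof raw === 'object' &&
    !Array.isArray(raw) &&
    raw._seed !== null &&
    typeof raw._seed === 'object' &&
    'data' in raw;
  // Legacy bare shapes (and null) pass through untouched under `data`.
  return isEnvelope ? { _seed: raw._seed, data: raw.data } : { _seed: null, data: raw };
}

// Hypothetical script-side reader: always hands callers the bare payload.
function readPayload(jsonText) {
  if (jsonText == null) return null;
  return unwrapEnvelope(JSON.parse(jsonText)).data;
}

const enveloped = JSON.stringify({ _seed: { recordCount: 1 }, data: { countries: [{ code: 'US' }] } });
const legacy = JSON.stringify({ countries: [{ code: 'US' }] });
const fromEnvelope = readPayload(enveloped); // { countries: [...] }
const fromLegacy = readPayload(legacy);      // { countries: [...] }
```

Both reads yield the same bare shape, which is why a WoW baseline reader such as this file's `prevSnapshot?.countries` check keeps working during dual-write.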

Changed:
- scripts/_seed-utils.mjs: import unwrapEnvelope; redisGet, readSeedSnapshot,
  verifySeedKey all unwrap. Exported new readCanonicalValue() helper for
  cross-seed consumers.
- 18 seed-*.mjs scripts with local redisGet-style helpers or inline fetch
  patched to unwrap via the envelope source module (subagent sweep).
- scripts/seed-forecasts.mjs pipeline batch: parse() unwraps each result.
- scripts/seed-energy-spine.mjs redisMget: unwraps each result.

Tests:
- tests/seed-utils-envelope-reads.test.mjs: 7 new cases covering envelope
  + legacy + null paths for readSeedSnapshot and verifySeedKey.
- Full seed suite: 67/67 pass (was 61, +6 new).

Addresses both P1 findings the user raised on PR #3097.

* feat(seed-contract): envelope-aware reads in server + api helpers

Every RPC and public-boundary reader now automatically strips _seed from
contract-mode canonical keys. Legacy bare-shape values pass through unchanged
(unwrapEnvelope no-ops on non-envelope shapes).

Changed helpers (one-place fix — unblocks ~60 call sites):
- server/_shared/redis.ts: getRawJson, getCachedJson, getCachedJsonBatch
  unwrap by default. cachedFetchJson inherits via getCachedJson.
- api/_upstash-json.js: readJsonFromUpstash unwraps (covers api/mcp.ts
  tool responses + all its canonical-key reads).
- api/bootstrap.js: getCachedJsonBatch unwraps (public-boundary —
  clients never see envelope metadata).

Left intentionally unchanged:
- api/health.js / api/seed-health.js: read only seed-meta:* keys which
  remain bare-shape during dual-write. unwrapEnvelope already imported at
  the meta-read boundary (PR 1) as a defensive no-op.

Tests: 67/67 seed tests pass. typecheck + typecheck:api clean.

This is the blast-radius fix the PR #3097 review called out — external
readers that would otherwise see {_seed, data} after the writer side
migrated.

* fix(test): strip export keyword in vm.runInContext'd seed source

cross-source-signals-regulatory.test.mjs loads scripts/seed-cross-source-signals.mjs
via vm.runInContext, which cannot parse ESM `export` syntax. PR 2a added
`export function declareRecords` to every seeder, which broke this test's
static-analysis approach.

Fix: strip the `export` keyword from the declareRecords line in the
preprocessed source string so the function body still evaluates as a plain
declaration.

Full test:data suite: 5307/5307 pass. typecheck + typecheck:api clean.

* feat(seed-contract): consumer-prices publish.ts writes envelopes

Wrap the 5 canonical keys written by consumer-prices-core/src/jobs/publish.ts
(overview, movers:7d/30d, freshness, categories:7d/30d/90d, retailer-spread,
basket-series) in {_seed, data} envelopes. Legacy seed-meta:<key> writes
preserved for dual-write.

Inlined a buildEnvelope helper (10 lines) rather than taking a cross-package
dependency — consumer-prices-core is a standalone npm package. Documented the
four-file parity contract (mjs source, ts mirror, js edge mirror, this copy).

Contract fields: sourceVersion='consumer-prices-core-publish-v1', schemaVersion=1,
state='OK' (recordCount>0) or 'OK_ZERO' (legitimate zero).

Typecheck: no new errors in publish.ts.

* fix(seed-contract): 3 more server-side readers unwrap envelopes

Found during final audit:

- server/worldmonitor/resilience/v1/_shared.ts: resilience score reader
  parsed cached GetResilienceScoreResponse raw. Contract-mode seed-resilience-scores
  now envelopes those keys.
- server/worldmonitor/resilience/v1/get-resilience-ranking.ts: p05/p95
  interval lookup parsed raw from seed-resilience-scores' extra-key path.
- server/worldmonitor/infrastructure/v1/_shared.ts: mgetJson() used for
  count-source keys (wildfire:fires:v1, news:insights:v1) which are both
  contract-mode now.

All three now unwrap via server/_shared/seed-envelope. Legacy shapes pass
through unchanged.

Typecheck clean.

* feat(seed-contract): ais-relay.cjs direct writes produce envelopes

32 canonical-key write sites in scripts/ais-relay.cjs now produce {_seed, data}
envelopes. Inlined buildEnvelope() (CJS module can't require ESM source) +
envelopeWrite(key, data, ttlSeconds, meta) wrapper. Enveloped keys span market
bootstrap, aviation, cyber-threats, theater-posture, weather-alerts, economic
spending/fred/worldbank, tech-events, corridor-risk, usni-fleet, shipping-stress,
social:reddit, wsb-tickers, pizzint, product-catalog, chokepoint transits,
ucdp-events, satellites, oref.

Left bare (not seeded data keys): seed-meta:* (dual-write legacy),
classifyCacheKey LLM cache, notam:prev-closed-state internal state,
wm:notif:scan-dedup flags.

Updated tests/ucdp-seed-resilience.test.mjs regex to accept both upstashSet
(pre-contract) and envelopeWrite (post-contract) call patterns.

* feat(seed-contract): 15 bundle files add canonicalKey for envelope gate

54 bundle sections across 12 files now declare canonicalKey alongside the
existing seedMetaKey. _bundle-runner.mjs (from PR 1) prefers canonicalKey
when both are present — gates section runs on envelope._seed.fetchedAt
read directly from the data key, eliminating the meta-outlives-data class
of bugs.

Files touched:
- climate (5), derived-signals (2), ecb-eu (3), energy-sources (6),
  health (2), imf-extended (4), macro (10), market-backup (9),
  portwatch (4), relay-backup (2), resilience-recovery (5), static-ref (2)

Skipped (14 sections, 3 whole bundles): multi-key writers, dynamic
templated keys (displacement year-scoped), or non-runSeed orchestrators
(regional brief cron, resilience-scores' 222-country publish, validation/
benchmark scripts). These continue to use seedMetaKey or their own gate.

seedMetaKey preserved everywhere — dual-write. _bundle-runner.mjs falls
back to legacy when canonicalKey is absent.

All 15 bundles pass node --check. test:data: 5307/5307. typecheck:all: clean.

* fix(seed-contract): 4 PR #3097 review P1s — transform/declareRecords mismatches + envelope leaks

Addresses both P1 findings and the extra-key seed-meta leak surfaced in review:

1. runSeed helper-level invariant: seed-meta:* keys NEVER envelope.
   scripts/_seed-utils.mjs exports shouldEnvelopeKey(key) — returns false for
   any key starting with 'seed-meta:'. Both atomicPublish (canonical) and
   writeExtraKey (extras) gate the envelope wrap through this helper. Fixes
   seed-iea-oil-stocks' ANALYSIS_META_EXTRA_KEY silently getting enveloped,
   which broke health.js parsing the value as bare {fetchedAt, recordCount}.
   Also defends against any future manual writeExtraKey(..., envelopeMeta)
   call that happens to target a seed-meta:* key.

2. seed-token-panels canonical + extras fixed.
   publishTransform returns data.defi (the defi panel itself, shape {tokens}).
   Old declareRecords counted data.defi.tokens + data.ai.tokens + data.other.tokens
   on the transformed payload → 0 → RETRY path → canonical market:defi-tokens:v1
   never wrote, and because runSeed returned before the extraKeys loop,
   market:ai-tokens:v1 + market:other-tokens:v1 stayed stale too.
   New: declareRecords counts data.tokens on the transformed shape. AI_KEY +
   OTHER_KEY extras reuse the same function (transforms return structurally
   identical panels). Added isMain guard so test imports don't fire runSeed.

3. api/product-catalog.js cached reader unwraps envelope.
   ais-relay.cjs now envelopes product-catalog:v2 via envelopeWrite(). The
   edge reader did raw JSON.parse(result) and returned {_seed, data} to
   clients, breaking the cached path. Fix: import unwrapEnvelope from
   ./_seed-envelope.js, apply after JSON.parse. One site — :238-241 is
   downstream of getFromCache(), so the single reader fix covers both.

4. Regression lock tests/seed-contract-transform-regressions.test.mjs (11 cases):
   - shouldEnvelopeKey invariant: seed-meta:* false, canonical true
   - Token-panels declareRecords works on transformed shape (canonical + both extras)
   - Explicit repro of pre-fix buggy signature returning 0 — guards against revert
   - resolveRecordCount accepts 0, rejects non-integer
   - Product-catalog envelope unwrap returns bare shape; legacy passes through

Verification:
- npm run test:data → 5318/5318 pass (was 5307 — 11 new regressions)
- npm run typecheck:all → clean
- node --check on every modified script

iea-oil-stocks canonical declareRecords was NOT broken (user confirmed during
review — buildIndex preserves .members); only its ANALYSIS_META_EXTRA_KEY
was affected, now covered generically by commit 1's helper invariant.

* fix(seed-contract): seed-token-panels validateFn also runs on post-transform shape

Review finding: fixing declareRecords wasn't sufficient — atomicPublish() runs
validateFn(publishData) on the transformed payload too. seed-token-panels'
validate() checked data.defi/.ai/.other on the transformed {tokens} shape,
returned false, and runSeed took the early skipped-write branch (before even
reaching the declareRecords RETRY logic). Net effect: same as before the
declareRecords fix — canonical + both extras stayed stale.

Fix: validate() now checks the canonical defi panel directly (Array.isArray
(data?.tokens) && has at least one t.price > 0). AI/OTHER panels are validated
implicitly by their own extraKey declareRecords on write.
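
As a small illustration of the rule just stated, a validator over the transformed `{tokens}` panel might look like this. It is a sketch under the assumption that each token carries a numeric `price`; the sample panels are invented.

```javascript
// Sketch: validate the post-transform defi panel, not the pre-transform bundle.
function validate(data) {
  return Array.isArray(data?.tokens) && data.tokens.some((t) => t?.price > 0);
}

const pricedPanel = { tokens: [{ symbol: 'AAA', price: 0 }, { symbol: 'BBB', price: 2.5 }] };
const zeroPanel = { tokens: [{ symbol: 'AAA', price: 0 }] };
const preTransformBundle = { defi: pricedPanel, ai: zeroPanel, other: zeroPanel };

const acceptsPriced = validate(pricedPanel);            // true
const rejectsAllZero = validate(zeroPanel);             // false
const rejectsEmpty = validate({ tokens: [] });          // false
const rejectsOldShape = validate(preTransformBundle);   // false: no top-level tokens
```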

Audited the other 9 seeders with publishTransform (bls-series, bis-extended,
bis-data, gdelt-intel, trade-flows, iea-oil-stocks, jodi-gas, sanctions-pressure,
forecasts): all validateFn's correctly target the post-transform shape. Only
token-panels regressed.

Added 4 regression tests (tests/seed-contract-transform-regressions.test.mjs):
- validate accepts transformed panel with priced tokens
- validate rejects all-zero-price tokens
- validate rejects empty/missing tokens
- Explicit pre-fix repro (buggy old signature fails on transformed shape)

Verification:
- npm run test:data → 5322/5322 pass (was 5318; +4 new)
- npm run typecheck:all → clean
- node --check clean

* feat(seed-contract): add /api/seed-contract-probe validation endpoint

Single machine-readable gate for 'is PR #3097 working in production'.
Replaces the curl/jq ritual with one authenticated edge call that returns
HTTP 200 ok:true or 503 + failing check list.

What it validates:
- 8 canonical keys have {_seed, data} envelopes with required data fields
  and minRecords floors (fsi-eu, zone-normals, 3 token panels + minRecords
  guard against token-panels RETRY regression, product-catalog, wildfire,
  earthquakes).
- 2 seed-meta:* keys remain BARE (shouldEnvelopeKey invariant; guards
  against iea-oil-stocks ANALYSIS_META_EXTRA_KEY-class regressions).
- /api/product-catalog + /api/bootstrap responses contain no '_seed' leak.

Auth: x-probe-secret header must match RELAY_SHARED_SECRET (reuses existing
Vercel↔Railway internal trust boundary).

Probe logic is exported (checkProbe, checkPublicBoundary, DEFAULT_PROBES) for
hermetic testing. tests/seed-contract-probe.test.mjs covers every branch:
envelope pass/fail on field/records/shape, bare pass/fail on shape/field,
missing/malformed JSON, Redis non-2xx, boundary seed-leak detection,
DEFAULT_PROBES sanity (seed-meta invariant present, token-panels minRecords
guard present).

Usage:
  curl -H "x-probe-secret: $RELAY_SHARED_SECRET" \
       https://api.worldmonitor.app/api/seed-contract-probe

PR 3 will extend the probe with a stricter mode that asserts seed-meta:*
keys are GONE (not just bare) once legacy dual-write is removed.

Verification:
- tests/seed-contract-probe.test.mjs → 15/15 pass
- npm run test:data → 5338/5338 (was 5322; +16 new incl. conformance)
- npm run typecheck:all → clean

* fix(seed-contract): tighten probe — minRecords on AI/OTHER + cache-path source header

Review P2 findings: the probe's stated guards were weaker than advertised.

1. market:ai-tokens:v1 + market:other-tokens:v1 probes claimed to guard the
   token-panels extra-key RETRY regression but only checked shape='envelope'
   + dataHas:['tokens']. If an extra-key declareRecords regressed to 0, both
   probes would still pass because checkProbe() only inspects _seed.recordCount
   when minRecords is set. Now both enforce minRecords: 1.

2. /api/product-catalog boundary check only asserted no '_seed' leak — which
   is also true for the static fallback path. A broken cached reader
   (getFromCache returning null or throwing) could serve fallback silently
   and still pass this probe. Now:
   - api/product-catalog.js emits X-Product-Catalog-Source: cache|dodo|fallback
     on the response (the json() helper gained an optional source param wired
     to each of the three branches).
   - checkPublicBoundary declaratively requires that header's value match
     'cache' for /api/product-catalog, so a fallback-serve fails the probe
     with reason 'source:fallback!=cache' or 'source:missing!=cache'.

Test updates (tests/seed-contract-probe.test.mjs):
- Boundary check reworked to use a BOUNDARY_CHECKS config with optional
  requireSourceHeader per endpoint.
- New cases: served-from-cache passes, served-from-fallback fails with source
  mismatch, missing header fails, seed-leak still takes precedence, bad
  status fails.
- Token-panels sanity test now asserts minRecords≥1 on all 3 panels.

Verification:
- tests/seed-contract-probe.test.mjs → 17/17 pass (was 15, +2 net)
- npm run test:data → 5340/5340
- npm run typecheck:all → clean
2026-04-15 09:16:27 +04:00

484 lines
21 KiB
JavaScript
#!/usr/bin/env node
import { loadEnvFile, loadSharedConfig, CHROME_UA, runSeed, sleep, readSeedSnapshot, bulkReadLearnedRoutes, bulkWriteLearnedRoutes, isAllowedRouteHost, processItemRoute, getSharedFxRates, SHARED_FX_FALLBACKS } from './_seed-utils.mjs';
loadEnvFile(import.meta.url);
const config = loadSharedConfig('grocery-basket.json');
const CANONICAL_KEY = 'economic:grocery-basket:v1';
const CACHE_TTL = 864000; // 10 days — weekly seed with 3-day cron-drift buffer
// Bump when basket composition changes materially — invalidates WoW until a new baseline runs.
const BASKET_VERSION = 2; // v2: oil changed from sunflower to canola
const EXA_DELAY_MS = 150;
const FIRECRAWL_DELAY_MS = 500;
const FX_FALLBACKS = SHARED_FX_FALLBACKS;
async function searchExa(query, sites, locationCode) {
const apiKey = (process.env.EXA_API_KEYS || process.env.EXA_API_KEY || '').split(/[\n,]+/)[0].trim();
if (!apiKey) throw new Error('EXA_API_KEYS or EXA_API_KEY not set');
const body = {
query,
numResults: 5,
type: 'auto',
// Restrict to known local supermarket/retailer domains per country — prevents EXA
// neural search from returning USD-priced global comparison pages (Numbeo, Tridge, etc.)
includeDomains: sites,
// Bias results toward the target country's web
userLocation: locationCode,
contents: {
summary: {
// Explicitly request ISO currency code so regex can reliably match
query: 'What is the retail price of this product? State amount and ISO currency code (e.g. GBP 1.50, EUR 2.99, JPY 193).',
},
},
};
const resp = await fetch('https://api.exa.ai/search', {
method: 'POST',
headers: {
'x-api-key': apiKey,
'Content-Type': 'application/json',
'User-Agent': CHROME_UA,
},
body: JSON.stringify(body),
signal: AbortSignal.timeout(15_000),
});
if (!resp.ok) {
const text = await resp.text().catch(() => '');
console.warn(` EXA ${resp.status}: ${text.slice(0, 100)}`);
return null;
}
return resp.json();
}
// Firecrawl fallback — renders JS-heavy SPA pages and extracts prices via LLM schema
async function scrapeFirecrawl(url, expectedCurrency) {
const apiKey = process.env.FIRECRAWL_API_KEY || '';
if (!apiKey) return null;
try {
const resp = await fetch('https://api.firecrawl.dev/v1/scrape', {
method: 'POST',
headers: {
Authorization: `Bearer ${apiKey}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
url,
formats: ['extract'],
extract: {
prompt: `Find the retail unit price of the grocery product on this page. Return the numeric price and ISO 4217 currency code (e.g. ${expectedCurrency}).`,
schema: {
type: 'object',
properties: {
price: { type: 'number', description: 'Retail price as a number' },
currency: { type: 'string', description: 'ISO 4217 currency code, e.g. SAR, KRW, USD' },
},
required: ['price', 'currency'],
},
},
}),
signal: AbortSignal.timeout(30_000),
});
if (!resp.ok) {
const txt = await resp.text().catch(() => '');
console.warn(` [Firecrawl] ${resp.status}: ${txt.slice(0, 80)}`);
return null;
}
const data = await resp.json();
const ex = data?.data?.extract;
if (!ex?.price || ex.price <= 0) return null;
const ccy = (ex.currency || '').toUpperCase().trim();
if (ccy !== expectedCurrency) {
console.warn(` [Firecrawl] currency mismatch: got ${ccy}, expected ${expectedCurrency}`);
return null;
}
const minPrice = CURRENCY_MIN[expectedCurrency] ?? 0;
if (ex.price <= minPrice || ex.price >= 100000) return null;
return { price: ex.price, currency: expectedCurrency, source: url };
} catch (err) {
console.warn(` [Firecrawl] error: ${err.message}`);
return null;
}
}
// Fast learned-route replay: direct fetch + matchPrice + same guardrails as EXA/Firecrawl paths.
// Inline (not in _seed-utils) because it closes over CURRENCY_MIN, ITEM_USD_MAX, matchPrice.
async function tryDirectFetch(url, expectedCurrency, itemId, fxRate) {
try {
const resp = await fetch(url, {
headers: { 'User-Agent': CHROME_UA },
signal: AbortSignal.timeout(8_000),
});
if (!resp.ok) return null;
const text = await resp.text();
const hit = matchPrice(text.slice(0, 10_000), url);
if (!hit || hit.currency !== expectedCurrency) return null;
const minPrice = CURRENCY_MIN[expectedCurrency] ?? 0;
if (hit.price <= minPrice || hit.price >= 100_000) return null;
if (fxRate && ITEM_USD_MAX[itemId] && hit.price * fxRate > ITEM_USD_MAX[itemId]) {
console.warn(` [learned bulk] ${itemId}: ${hit.price} ${expectedCurrency} ($${(hit.price * fxRate).toFixed(2)}) > max — skipping`);
return null;
}
return hit.price;
} catch {
return null;
}
}
// All supported currency codes — keep in sync with grocery-basket.json fxSymbols
const CCY = 'USD|GBP|EUR|JPY|CNY|INR|AUD|CAD|BRL|MXN|ZAR|TRY|NGN|KRW|SGD|PKR|AED|SAR|QAR|KWD|BHD|OMR|EGP|JOD|LBP|KES|ARS|IDR|PHP';
// Currency symbol → ISO code map for sites that use symbols instead of ISO codes
const SYMBOL_MAP = { '£': 'GBP', '€': 'EUR', '¥': 'JPY', '₩': 'KRW', '₹': 'INR', '₦': 'NGN', 'R$': 'BRL' };
// Minimum plausible local price per currency — prevents matching product codes / IDs
// e.g. IDR 4 = $0.0003 (nonsense), NGN 20 = $0.01 (nonsense), KRW 5 = $0.004 (nonsense)
// JPY: grocery items in Japan cost 100+ yen; under 50 = product code or sub-unit price.
// TRY: Turkish supermarket shelf prices ≥ 10 TRY; under 10 = per-100g sub-unit match.
// EGP: Egyptian supermarket prices ≥ 5 EGP; under 5 = subsidised/fractional unit.
// INR: Indian supermarket prices ≥ 12 INR; under 12 = product code or stale clearance.
const CURRENCY_MIN = { NGN: 50, IDR: 500, ARS: 50, KRW: 1000, ZAR: 2, PKR: 20, LBP: 1000, JPY: 50, TRY: 10, EGP: 5, INR: 12 };
// Maximum plausible USD price per item — catches bulk/wholesale/specialty products.
// Set to ~2× the most expensive legitimate retail price globally for each item.
// Previous caps were too loose (e.g. sugar: 8 allowed 5.99 EUR organic sugar from carrefour.fr).
const ITEM_USD_MAX = { sugar: 3.5, salt: 2.5, rice: 6, pasta: 3.5, potatoes: 6, oil: 10, flour: 4.5, eggs: 12, milk: 5, bread: 6 };
// Pattern order matters: try currency-FIRST (e.g. "GBP 1.50") before number-first
// to avoid matching pack sizes / weights that precede a currency token (e.g. "12 SAR" in "eggs 12 SAR 8.99")
const PRICE_PATTERNS = [
new RegExp(`(${CCY})\\s*(\\d+(?:\\.\\d{1,3})?)`, 'i'), // CCY then number (preferred)
new RegExp(`(\\d+(?:\\.\\d{1,3})?)\\s*(${CCY})`, 'i'), // number then CCY (fallback — use last match)
];
function matchPrice(text, url) {
// Pattern 0: currency-first — take the first match (safe, no ambiguity)
const re0 = PRICE_PATTERNS[0];
const m0 = text.match(re0);
if (m0) {
const price = parseFloat(m0[2]);
const currency = m0[1].toUpperCase();
const minPrice = CURRENCY_MIN[currency] ?? 0;
if (price > minPrice && price < 100000) return { price, currency, source: url || '' };
}
// Pattern 1: number-first — collect ALL matches and take the LAST one to avoid
// matching pack counts / weights (e.g. "12" in "eggs 12 pack SAR 8.99")
const re1 = PRICE_PATTERNS[1];
const allMatches = [...text.matchAll(new RegExp(re1.source, 'gi'))];
if (allMatches.length) {
for (const match of allMatches.reverse()) {
const price = parseFloat(match[1]);
const currency = match[2].toUpperCase();
const minPrice = CURRENCY_MIN[currency] ?? 0;
if (price > minPrice && price < 100000) return { price, currency, source: url || '' };
}
}
// Fallback: currency symbols (£, €, ¥, ₹, ₩, ₦, R$)
for (const [sym, iso] of Object.entries(SYMBOL_MAP)) {
const re = new RegExp(`${sym.replace('$', '\\$')}\\s*(\\d+(?:[.,]\\d{1,3})?)`, 'i');
const m = text.match(re);
if (m) {
const price = parseFloat(m[1].replace(',', '.'));
const minPrice = CURRENCY_MIN[iso] ?? 0;
if (price > minPrice && price < 100000) return { price, currency: iso, source: url || '' };
}
}
return null;
}
function extractPrice(result, expectedCurrency) {
const url = result.url || '';
const summary = result?.summary;
if (summary && typeof summary === 'string') {
const hit = matchPrice(summary, url);
if (hit && hit.currency !== expectedCurrency) {
console.warn(` [extractPrice] currency mismatch: got ${hit.currency}, expected ${expectedCurrency} (${url})`);
return null;
}
if (hit) return hit;
}
// Fallback: title
const fromTitle = matchPrice(result.title || '', url);
if (fromTitle && fromTitle.currency !== expectedCurrency) return null;
return fromTitle;
}
async function fetchGroceryBasketPrices(prevSnapshot) {
const fxRates = await getSharedFxRates(config.fxSymbols, FX_FALLBACKS);
const countriesResult = [];
// Load all learned routes in one pipeline request before the country loop.
// Include sentinel keys for one-time migrations so each eviction only fires once.
const OIL_MIGRATION_KEY = '_migration:canola-oil-v1';
const BAD_PRICES_KEY = '_migration:bad-prices-v1'; // JP/TR/EG/IN sub-unit scrapes + JP site change
const routeKeys = [...config.countries.flatMap(c => config.items.map(i => `${c.code}:${i.id}`)), OIL_MIGRATION_KEY, BAD_PRICES_KEY];
const learnedRoutes = await bulkReadLearnedRoutes('grocery-basket', routeKeys).catch((err) => {
console.warn(` [routes] load failed (non-fatal): ${err.message}`);
return new Map();
});
const routeUpdates = new Map();
const routeDeletes = new Set();
console.log(` [routes] loaded ${learnedRoutes.size} learned routes`);
// One-time migration: evict stale oil routes when query changed sunflower → canola.
// Guarded by OIL_MIGRATION_KEY so it only fires once; subsequent runs skip entirely.
if (!learnedRoutes.has(OIL_MIGRATION_KEY)) {
const oilEvictions = new Set(config.countries.map(c => `${c.code}:oil`).filter(k => learnedRoutes.has(k)));
if (oilEvictions.size > 0) {
console.log(` [routes] one-time migration: evicting ${oilEvictions.size} stale oil routes (sunflower → canola)`);
await bulkWriteLearnedRoutes('grocery-basket', new Map(), oilEvictions).catch(err =>
console.warn(` [routes] oil eviction failed (non-fatal): ${err.message}`)
);
for (const k of oilEvictions) learnedRoutes.delete(k);
}
routeUpdates.set(OIL_MIGRATION_KEY, 'done'); // persisted at end of run alongside other route updates
}
// One-time eviction: clear known-bad routes from the previous site config (JP) and
// from confirmed sub-unit price scrapes (TR/EG/IN). Forces fresh EXA searches on next run.
// All JP routes evicted because sites changed from aggregators (kakaku.com) to supermarkets.
if (!learnedRoutes.has(BAD_PRICES_KEY)) {
const knownBad = new Set([
...config.countries.filter(c => c.code === 'JP').flatMap(c => config.items.map(i => `JP:${i.id}`)),
'TR:sugar', 'TR:eggs', 'TR:milk', 'TR:oil',
'EG:salt', 'EG:bread', 'EG:milk',
'IN:potatoes', 'IN:milk',
].filter(k => learnedRoutes.has(k)));
if (knownBad.size > 0) {
console.log(` [routes] one-time eviction: clearing ${knownBad.size} known-bad price routes (JP sites + TR/EG/IN sub-unit)`);
await bulkWriteLearnedRoutes('grocery-basket', new Map(), knownBad).catch(err =>
console.warn(` [routes] bad-prices eviction failed (non-fatal): ${err.message}`)
);
for (const k of knownBad) learnedRoutes.delete(k);
}
routeUpdates.set(BAD_PRICES_KEY, 'done');
}
for (const country of config.countries) {
console.log(`\n Processing ${country.flag} ${country.name} (${country.currency})...`);
const fxRate = fxRates[country.currency] || FX_FALLBACKS[country.currency] || null;
const allowedHosts = country.sites.map(s => s.replace(/^www\./, '').split('/')[0]);
// Process all items concurrently, staggering each start by 200ms to respect EXA/Firecrawl rate limits
const itemPrices = await Promise.all(config.items.map(async (item, idx) => {
await sleep(idx * 200); // stagger starts — 200ms prevents EXA rate limit with 10 concurrent
const routeKey = `${country.code}:${item.id}`;
const learned = learnedRoutes.get(routeKey);
// --- Learned route fast path + EXA fallback ---
const { localPrice, sourceSite, routeUpdate, routeDelete } = await processItemRoute({
learned,
allowedHosts,
currency: country.currency,
itemId: item.id,
fxRate,
itemUsdMax: ITEM_USD_MAX[item.id] || null,
tryDirectFetch,
scrapeFirecrawl,
fetchViaExa: async () => {
let exaPrice = null;
let exaSite = '';
let exaUrls = [];
try {
const exaResult = await searchExa(`${item.query} price`, country.sites, country.code);
if (exaResult?.results?.length) {
exaUrls = exaResult.results.map(r => r.url).filter(Boolean);
for (const result of exaResult.results) {
const extracted = extractPrice(result, country.currency);
if (!extracted) continue;
if (fxRate && ITEM_USD_MAX[item.id]) {
const usdEquiv = extracted.price * fxRate;
if (usdEquiv > ITEM_USD_MAX[item.id]) {
console.warn(` [bulk] ${item.id}: ${extracted.price} ${country.currency} ($${usdEquiv.toFixed(2)}) > max $${ITEM_USD_MAX[item.id]} — skipping`);
continue;
}
}
exaPrice = extracted.price;
exaSite = extracted.source;
break;
}
}
} catch (err) {
console.warn(` [${country.code}/${item.id}] EXA error: ${err.message}`);
}
// Firecrawl fallback for EXA-discovered URLs (handles JS-heavy SPAs)
if (exaPrice === null && exaUrls.length > 0) {
for (const url of exaUrls.slice(0, 2)) {
const fc = await scrapeFirecrawl(url, country.currency);
if (!fc) continue;
if (fxRate && ITEM_USD_MAX[item.id]) {
const usdEquiv = fc.price * fxRate;
if (usdEquiv > ITEM_USD_MAX[item.id]) {
console.warn(` [FC bulk] ${item.id}: ${fc.price} ${country.currency} ($${usdEquiv.toFixed(2)}) > max — skipping`);
continue;
}
}
exaPrice = fc.price;
exaSite = fc.source;
console.log(` [FC✓] ${item.id}: ${url.slice(0, 55)}`);
break;
}
}
return exaPrice !== null ? { localPrice: exaPrice, sourceSite: exaSite } : null;
},
sleep,
firecrawlDelayMs: FIRECRAWL_DELAY_MS,
});
if (routeDelete) routeDeletes.add(routeKey);
if (routeUpdate) routeUpdates.set(routeKey, routeUpdate);
const usdPrice = localPrice !== null && fxRate ? +(localPrice * fxRate).toFixed(4) : null;
const status = localPrice !== null ? `${localPrice} ${country.currency} = $${usdPrice}` : 'N/A';
console.log(` ${item.id}: ${status}`);
return {
itemId: item.id,
itemName: item.name,
unit: item.unit,
localPrice: localPrice !== null ? +localPrice.toFixed(4) : null,
usdPrice,
currency: country.currency,
sourceSite,
available: localPrice !== null,
};
}));
let totalUsd = 0;
for (const ip of itemPrices) if (ip.usdPrice !== null) totalUsd += ip.usdPrice;
countriesResult.push({
code: country.code,
name: country.name,
currency: country.currency,
flag: country.flag,
totalUsd: +totalUsd.toFixed(2),
fxRate: fxRate || 0,
items: itemPrices,
});
}
// Persist learned routes for next run (non-fatal)
await bulkWriteLearnedRoutes('grocery-basket', routeUpdates, routeDeletes).catch(err =>
console.warn(` [routes] write failed (non-fatal): ${err.message}`)
);
// Cross-country outlier gate — bilateral: rejects per-item prices that are either
// > 4× the median (bulk/wholesale/specialty scrape error)
// < ¼ the median (sub-unit price, product code, stale scraped value)
// Both directions evict the learned route so the bad URL isn't replayed next seed.
const itemIds = config.items.map(i => i.id);
const outlierEvictions = new Set();
for (const itemId of itemIds) {
const pricePoints = countriesResult
.map(c => c.items.find(i => i.itemId === itemId)?.usdPrice)
.filter(p => p != null && p > 0);
if (pricePoints.length < 3) continue; // need ≥ 3 data points for meaningful median
pricePoints.sort((a, b) => a - b);
const median = pricePoints[Math.floor(pricePoints.length / 2)];
const ceiling = median * 4;
const floor = median / 4;
for (const country of countriesResult) {
const item = country.items.find(i => i.itemId === itemId);
if (!item?.usdPrice || item.usdPrice <= 0) continue;
const isHigh = item.usdPrice > ceiling;
const isLow = item.usdPrice < floor;
if (!isHigh && !isLow) continue;
const reason = isHigh
? `$${item.usdPrice.toFixed(4)} > 4× median $${median.toFixed(2)}`
: `$${item.usdPrice.toFixed(4)} < ¼ median $${median.toFixed(2)}`;
console.warn(` [outlier] ${country.code}/${itemId}: ${reason} — clearing + evicting learned route`);
item.available = false;
item.localPrice = null;
item.usdPrice = null;
outlierEvictions.add(`${country.code}:${itemId}`);
}
}
if (outlierEvictions.size > 0) {
await bulkWriteLearnedRoutes('grocery-basket', new Map(), outlierEvictions).catch(err =>
console.warn(` [routes] outlier eviction write failed (non-fatal): ${err.message}`)
);
}
// Recompute totals after outlier pass
for (const country of countriesResult) {
country.totalUsd = +country.items.reduce((s, ip) => s + (ip.usdPrice ?? 0), 0).toFixed(2);
}
// Only rank countries with enough items found — a country with 4/10 items
// could appear "cheapest" purely due to missing data, not actual prices.
const MIN_ITEMS_FOR_RANKING = Math.ceil(config.items.length * 0.7); // ≥ 70% coverage
const rankable = countriesResult.filter(c => {
const found = c.items.filter(ip => ip.available).length;
return c.totalUsd > 0 && found >= MIN_ITEMS_FOR_RANKING;
});
const cheapest = rankable.length ? rankable.reduce((a, b) => a.totalUsd < b.totalUsd ? a : b).code : '';
const mostExpensive = rankable.length ? rankable.reduce((a, b) => a.totalUsd > b.totalUsd ? a : b).code : '';
// Compute WoW per country — only valid when prev snapshot used the same basket composition.
// A version mismatch (e.g. oil changed from sunflower to canola) would produce bogus deltas.
const wowAvailable = prevSnapshot?.countries?.length > 0 && prevSnapshot.basketVersion === BASKET_VERSION;
if (wowAvailable) {
const prevMap = Object.fromEntries(prevSnapshot.countries.map(c => [c.code, c.totalUsd]));
for (const country of countriesResult) {
if (country.totalUsd > 0 && prevMap[country.code] != null && prevMap[country.code] > 0) {
country.wowPct = +((country.totalUsd - prevMap[country.code]) / prevMap[country.code] * 100).toFixed(2);
} else {
country.wowPct = null;
}
}
}
const wowCountries = wowAvailable ? countriesResult.filter(c => c.wowPct != null) : [];
const wowAvgPct = wowCountries.length > 0
? +(wowCountries.reduce((s, c) => s + c.wowPct, 0) / wowCountries.length).toFixed(2)
: 0;
return {
countries: countriesResult,
fetchedAt: new Date().toISOString(),
cheapestCountry: cheapest,
mostExpensiveCountry: mostExpensive,
wowAvgPct,
wowAvailable,
prevFetchedAt: wowAvailable ? (prevSnapshot.fetchedAt ?? '') : '',
basketVersion: BASKET_VERSION,
};
}
const prevSnapshot = await readSeedSnapshot(CANONICAL_KEY);
export function declareRecords(data) {
return Array.isArray(data?.countries) ? data.countries.length : 0;
}
await runSeed('economic', 'grocery-basket', CANONICAL_KEY, () => fetchGroceryBasketPrices(prevSnapshot), {
ttlSeconds: CACHE_TTL,
validateFn: (data) => {
if (!data?.countries?.length) return false;
const minItems = Math.ceil(config.items.length * 0.4); // 40% item coverage per country
const covered = data.countries.filter(c => c.items.filter(i => i.available).length >= minItems);
if (covered.length < 5) { console.warn(` [validate] only ${covered.length} countries with ≥40% item coverage — rejecting`); return false; }
return true;
},
recordCount: (data) => data?.countries?.length || 0,
extraKeys: prevSnapshot ? [{
key: `${CANONICAL_KEY}:prev`,
transform: () => prevSnapshot, // write PRE-overwrite snapshot; ignore new data
ttl: CACHE_TTL * 2,
declareRecords,
}] : undefined,
declareRecords,
schemaVersion: 1,
maxStaleMin: 10080,
sourceVersion: 'grocery-basket-v1',
});
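
The cross-country outlier gate inside fetchGroceryBasketPrices can be exercised in isolation with a sketch like the one below. `findOutliers` is a hypothetical extraction, not a function in this file; it mirrors the same rule (upper-middle element as the median, at least 3 data points, reject anything above 4× or below ¼ of the median) on invented sample prices.

```javascript
// Hypothetical standalone version of the bilateral median gate.
// entries: [{ code, usdPrice }] for ONE item across countries.
function findOutliers(entries, factor = 4) {
  const points = entries
    .map((e) => e.usdPrice)
    .filter((p) => p != null && p > 0)
    .sort((a, b) => a - b);
  if (points.length < 3) return []; // too few points for a meaningful median
  const median = points[Math.floor(points.length / 2)];
  const ceiling = median * factor;
  const floor = median / factor;
  return entries
    .filter((e) => e.usdPrice != null && e.usdPrice > 0)
    .filter((e) => e.usdPrice > ceiling || e.usdPrice < floor)
    .map((e) => e.code);
}

// Sample sugar prices in USD (made-up values).
const sugar = [
  { code: 'US', usdPrice: 1.0 },
  { code: 'GB', usdPrice: 1.2 },
  { code: 'FR', usdPrice: 1.1 },
  { code: 'JP', usdPrice: 0.1 }, // likely a sub-unit price
  { code: 'IN', usdPrice: 9.0 }, // likely a bulk pack
];
const flagged = findOutliers(sugar); // ['JP', 'IN'] with median 1.1
```

With an even number of points this picks the upper of the two middle values rather than averaging them, the same simplification the seeder makes.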