eliott/worldmonitor

mirror of https://github.com/koala73/worldmonitor.git synced 2026-04-26 01:24:59 +02:00

Author	SHA1	Message	Date
Elie Habib	5f40f8a13a	feat(seed): BUNDLE_RUN_STARTED_AT_MS env + runSeed SIGTERM cleanup (#3384 ) * feat(seed): BUNDLE_RUN_STARTED_AT_MS env + runSeed SIGTERM cleanup Prereq for the re-export-share Comtrade seeder (plan 2026-04-24-003), usable by any cohort seeder whose consumer needs bundle-level freshness. Two coupled changes: 1. `_bundle-runner.mjs` injects `BUNDLE_RUN_STARTED_AT_MS` into every spawned child. All siblings in a single bundle run share one value (captured at `runBundle` start, not spawn time). Consumers use this to detect stale peer keys — if a peer's seed-meta predates the current bundle run, fall back to a hard default rather than read a cohort-peer's last-week output. 2. `_seed-utils.mjs::runSeed` registers a `process.once('SIGTERM')` handler that releases the acquired lock and extends existing-data TTL before exiting 143. `_bundle-runner.mjs` sends SIGTERM on section timeout, then SIGKILL after KILL_GRACE_MS (5s). Without this handler the `finally` path never runs on SIGKILL, leaving the 30-min acquireLock reservation in place until its own TTL expires — the next cron tick silently skips the resource. Regression guard memory: `bundle-runner-sigkill-leaks-child-lock` (PR #3128 root cause). Tests added: - bundle-runner env injection (value within run bounds) - sibling sections share the same timestamp (critical for the consumer freshness guard) - runSeed SIGTERM path: exit 143 + cleanup log - process.once contract: second SIGTERM does not re-enter handler * fix(seed): address P1/P2 review findings on SIGTERM + bundle contracts Addresses PR #3384 review findings (todos 256, 257, 259, 260): #256 (P1) — SIGTERM handler narrowed to fetch phase only. Was installed at runSeed entry and armed through every `process.exit` path; could race `emptyDataIsFailure: true` strict-floor exits (IMF-External, WB-bulk) and extend seed-meta TTL when the contract forbids it — silently re-masking 30-day outages. 
Now the handler is attached immediately before `withRetry(fetchFn)` and removed in a try/finally that covers all fetch-phase exit branches. #257 (P1) — `BUNDLE_RUN_STARTED_AT_MS` now has a first-class helper. Exported `getBundleRunStartedAtMs()` from `_seed-utils.mjs` with JSDoc describing the bundle-freshness contract. Fleet-wide helper so the next consumer seeder imports instead of rediscovering the idiom. #259 (P2) — SIGTERM cleanup runs `Promise.allSettled` on disjoint-key ops (`releaseLock` + `extendExistingTtl`). Serialising compounded Upstash latency during the exact failure mode (Redis degraded) this handler exists to handle, risking breach of the 5s SIGKILL grace. #260 (P2) — `_bundle-runner.mjs` asserts topological order on optional `dependsOn` section field. Throws on unknown-label refs and on deps appearing at a later index. Fleet-wide contract replacing the previous prose-comment ordering guarantee. Tests added/updated: - New: SIGTERM handler removed after fetchFn completes (narrowed-scope contract — post-fetch SIGTERM must NOT trigger TTL extension) - New: dependsOn unknown-label + out-of-order + happy-path (3 tests) Full test suite: 6,866 tests pass (+4 net). * fix(seed): getBundleRunStartedAtMs returns null outside a bundle run Review follow-up: the earlier `Math.floor(Date.now()/1000)*1000` fallback regressed standalone (non-bundle) runs. A consumer seeder invoked manually just after its peer wrote `fetchedAt = (now - 5s)` would see `bundleStartMs = Date.now()`, reject the perfectly-fresh peer envelope as "stale", and fall back to defaults — defeating the point of the peer-read path outside the bundle. Returning null when `BUNDLE_RUN_STARTED_AT_MS` is unset/invalid keeps the freshness gate scoped to its real purpose (across-bundle-tick staleness) and lets standalone runs skip the gate entirely. Consumers check `bundleStartMs != null` before applying the comparison; see the companion `seed-sovereign-wealth.mjs` change on the stacked PR. 
test(seed): SIGTERM cleanup test now verifies Redis DEL + EXPIRE calls Greptile review P2 on PR #3384: the existing test only asserted exit code + log line, not that the Redis ops were actually issued. The log claim was ahead of the test. Fixture now logs every Upstash fetch call's shape (EVAL / pipeline-EXPIRE / other) to stderr. Test asserts: - >=1 EVAL op was issued during SIGTERM cleanup (releaseLock Lua script on the lock key) - >=1 pipeline-EXPIRE op was issued (extendExistingTtl on canonical + seed-meta keys) - The EVAL body carries the runSeed-generated runId (proves it's THIS run's release, not a phantom op) - The EXPIRE pipeline touches both the canonicalKey AND the seed-meta key (proves the keys[] array was built correctly including the extraKeys merge path) Full test suite: 6,866 tests pass, typecheck clean.	2026-04-25 00:14:04 +04:00
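The bundle-freshness gate described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from `_seed-utils.mjs`: the `env` parameter, the positive-number check, and the `isPeerFresh` helper name are assumptions made for the example. Only `getBundleRunStartedAtMs`, `BUNDLE_RUN_STARTED_AT_MS`, the null return outside a bundle run, and the `bundleStartMs != null` check come from the message above.

```javascript
// Sketch only: the real helper lives in _seed-utils.mjs and may differ.
// The env parameter is an assumption so the function can be tested in isolation.
function getBundleRunStartedAtMs(env = process.env) {
  const raw = env.BUNDLE_RUN_STARTED_AT_MS;
  if (raw === undefined || raw === '') return null; // standalone run
  const ms = Number(raw);
  // Treat NaN, Infinity, and non-positive values as "invalid" (assumed rule).
  return Number.isFinite(ms) && ms > 0 ? ms : null;
}

// Hypothetical consumer-side gate: may the peer's envelope be trusted?
function isPeerFresh(peerFetchedAtMs, bundleStartMs) {
  // Outside a bundle run there is nothing to compare against; skip the gate.
  if (bundleStartMs == null) return true;
  // Inside a bundle run the peer must have written during this run.
  return peerFetchedAtMs >= bundleStartMs;
}
```

With this shape, a manually invoked consumer accepts a peer envelope written five seconds earlier, while a consumer inside a bundle rejects a peer whose `fetchedAt` predates the run start and falls back to its defaults.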
Elie Habib	8a988323d2	chore(bundle-runner): emit reliable per-section summary line on parent stdout (#3320 ) * chore(bundle-runner): emit reliable per-section summary line on parent stdout Fixes observability asymmetry in Railway bundle service logs where some seeders appeared to skip lines like `Run ID`, `Mode`, or the structured `seed_complete` JSON event. Root cause is Railway's log ingestion dropping child-stdout lines when multiple seeders emit at similar timestamps — observed in the PR #3294 launch run where Pipelines-Gas was missing its `=== Seed ===` banner, Pipelines-Oil had `Key:` emitted BEFORE the banner, Storage-Facilities and Energy-Disruptions were missing Run ID + Mode + seed_complete entirely, despite identical code paths. All child processes emit the same lines; Railway just loses some. Fix is to piggy-back on the observation that bundle-level lines (`[Bundle:X] Starting`, `[Bundle:X] Finished`) ARE reliably captured — they come from the parent process's single stdout stream. Changes in scripts/_bundle-runner.mjs: - spawnSeed now captures the child's `{"event":"seed_complete",...}` JSON line while streaming stdout, parses it, and attaches to the settle result. - Main loop emits one bundle-level summary line per section after child exit: [Bundle:X] section=NAME status=OK durationMs=1237 records=15 state=OK (or `status=FAILED elapsed=...s reason=...` for failures). - Summary line survives Railway's log ingestion even when per-section child lines drop, giving monitors a reliable event to key off. Observability consumers (log-based alerts, seed telemetry scrapers) should now key off the bundle-level summary rather than per-section child lines which remain best-effort. The per-section child lines stay as-is for interactive debugging. Verification: parse logic sanity-checked against the exact seed_complete line format. Node syntax check clean. No schema changes. 
* fix(bundle-runner): emit FAILED summary line to stderr, not stdout The prior commit introduced a bundle-level structured summary line per section. On success that correctly goes to stdout; on FAILED it was also going to stdout — but that broke tests/bundle-runner.test.mjs test 140 ("timeout emits terminal reason BEFORE SIGTERM/SIGKILL grace"). The test concatenates stdout+stderr and asserts that `SIGKILL` appears AFTER `Failed after` in the combined string (verifying the kill-decision log line is emitted BEFORE the 10s SIGTERM→SIGKILL grace window, so it survives container termination). My new FAILED summary line — which includes the reason string `timeout after 1s (signal SIGKILL)` — landed on stdout, which comes first in the concatenation, and its `SIGKILL` substring matched before the stderr-side `Did not exit on SIGTERM...SIGKILL` line. Ordering assertion failed. Fix: route the FAILED summary line through console.error (same stream as the pre-kill `Failed after ... sending SIGTERM` and the grace-window `Did not exit...SIGKILL` lines). Chronological ordering in combined output is preserved; test passes. OK summary lines stay on stdout — they're observability data, not error diagnostics, and belong on the normal stream alongside the bundle Starting/Finished lines. Local: `node --test tests/bundle-runner.test.mjs` — 4/4 pass including the previously-failing ordering test.	2026-04-23 08:36:05 +04:00
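A rough sketch of the capture-and-summarise flow from this commit: spot the child's `seed_complete` JSON line, then emit one parent-side summary, with OK on stdout and FAILED on stderr. The function names, the shape of the `result` object, and the event field names (`durationMs`, `recordCount`, `state`) are assumptions for illustration. Only the summary line format and the stream routing are taken from the messages above.

```javascript
// Returns the parsed seed_complete event from one child stdout line, or null.
function parseSeedComplete(line) {
  const trimmed = line.trim();
  if (!trimmed.startsWith('{')) return null; // banner or plain log line
  try {
    const evt = JSON.parse(trimmed);
    return evt && evt.event === 'seed_complete' ? evt : null;
  } catch {
    return null; // looked like JSON but was not parseable
  }
}

// Builds the bundle-level summary line. Field names on `event` are assumed.
function formatSummary(bundle, section, result) {
  const head = `[Bundle:${bundle}] section=${section}`;
  if (result.ok) {
    const e = result.event || {};
    return `${head} status=OK durationMs=${e.durationMs ?? 'n/a'} ` +
      `records=${e.recordCount ?? 'n/a'} state=${e.state ?? 'n/a'}`;
  }
  return `${head} status=FAILED elapsed=${result.elapsedS}s reason=${result.reason}`;
}

// OK summaries go to stdout; FAILED summaries go to stderr so they stay in
// chronological order with the other kill-path diagnostics.
function emitSummary(bundle, section, result) {
  const line = formatSummary(bundle, section, result);
  if (result.ok) console.log(line);
  else console.error(line);
  return line;
}
```

Routing by stream matters for the ordering test described above: a FAILED line on stdout would place its `SIGKILL` substring ahead of the stderr-side diagnostics once the two streams are concatenated.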
Elie Habib	e6a6d4e326	fix(bundle-runner): stream child stdio + SIGKILL escalation on timeout (#3114 ) * fix(bundle-runner): stream child stdio + SIGKILL escalation on timeout Silent Railway crashes in seed-bundle-portwatch — container exits after ~7min with ZERO logs from the hanging section. Root cause in the runner, not the seeder: execFile buffers child stdout until the callback fires, and its default SIGTERM never escalates to SIGKILL, so a child with in-flight HTTPS sockets can outlive the timeout and be killed by the container limit before any error is logged. Switch to spawn + live line-prefixed streaming. On timeout, send SIGTERM, then SIGKILL after a 10s grace. Always log the terminal reason (timeout / exit code / signal) so the next failing bundle surfaces the hung section on its own line instead of going dark. Applies to all 15 seed-bundle-*.mjs services that use this runner. fix(bundle-runner): guard double-resolve, update docstring, add tests Review follow-ups: - Idempotent settle() so spawn 'error' + 'close' can't double-resolve - Header comment reflects spawn + streaming + SIGKILL behavior - tests/bundle-runner.test.mjs covers live streaming, SIGKILL escalation when a child ignores SIGTERM, and non-zero exit reporting * fix(bundle-runner): address PR review — declare softKill before settle, handle stdio error * fix(bundle-runner): log terminal reason BEFORE SIGKILL grace + include grace in budget Review P1 follow-up. Two gaps the previous commit left open: 1. A section with timeoutMs close to Railway's ~10min container cap could be killed by the container mid-grace, before the "Failed ... timeout" line reached the log stream. Fix: emit the terminal Failed line at the moment softKill fires (before SIGTERM), so the reason is flushed BEFORE any grace window that could be truncated by a container kill. 2. The admission check used raw timeoutMs, but worst-case runtime is timeoutMs + KILL_GRACE_MS when the child ignores SIGTERM. 
A section that "fit" the budget could still overrun. Fix: compare elapsed + timeout + grace against maxBundleMs. close handler still settles the promise but no longer re-logs on the timeout path (alreadyLogged flag). New test asserts the Failed line precedes SIGKILL escalation, and that budget accounts for grace.	2026-04-16 07:58:18 +04:00
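The second gap is simple arithmetic, sketched below under stated assumptions. `fitsBudget` and its parameter object are illustrative names, not the runner's API. `KILL_GRACE_MS` is set to the 10s grace this commit describes, and the comparison follows the "elapsed + timeout + grace against maxBundleMs" wording.

```javascript
const KILL_GRACE_MS = 10_000; // 10s SIGTERM -> SIGKILL grace, per this commit

// Hypothetical admission check: true when the section's worst-case runtime
// (timeout plus kill grace) still fits inside the bundle's wall-time budget.
function fitsBudget({ elapsedMs, timeoutMs, maxBundleMs, graceMs = KILL_GRACE_MS }) {
  return elapsedMs + timeoutMs + graceMs <= maxBundleMs;
}

// A 420s section starting 120s into a 540s budget:
//   raw check:   120_000 + 420_000           = 540_000 -> fits
//   with grace:  120_000 + 420_000 + 10_000  = 550_000 -> deferred
const section = { elapsedMs: 120_000, timeoutMs: 420_000, maxBundleMs: 540_000 };
const rawFits = fitsBudget({ ...section, graceMs: 0 });
const graceFits = fitsBudget(section);
```

The example values are made up to sit exactly on the boundary; they show how a section that passes the raw check can still overrun once the grace window is counted.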
Elie Habib	dc10e47197	feat(seed-contract): PR 1 foundation — envelope + contract + conformance test (#3095 ) * feat(seed-contract): PR 1 foundation — envelope helpers + contract validators + static conformance test Adds the foundational pieces for the unified seed contract rollout described in docs/plans/2026-04-14-002-fix-runseed-zero-record-lockout-plan.md. Behavior-preserving by construction: legacy-shape Redis values unwrap as { _seed: null, data: raw } and pass through every helper unchanged. New files: - scripts/_seed-envelope-source.mjs — single source of truth for unwrapEnvelope, stripSeedEnvelope, buildEnvelope. - api/_seed-envelope.js — edge-safe mirror (AGENTS.md:80 forbids api/* importing from server/). - server/_shared/seed-envelope.ts — TS mirror with SeedMeta, SeedEnvelope, UnwrapResult types. - scripts/_seed-contract.mjs — SeedContractError + validateDescriptor (10 required fields, 10 optional, unknown-field rejection) + resolveRecordCount (non-negative integer or throw). - scripts/verify-seed-envelope-parity.mjs — diffs function bodies between the two JS copies; TS copy guarded by tsc. - tests/seed-envelope.test.mjs — 14 tests for the three helpers (null, legacy-passthrough, stringified JSON, round-trip). - tests/seed-contract.test.mjs — 25 tests for validateDescriptor/resolveRecordCount + a soft-warn conformance scan that STATICALLY parses scripts/seed-*.mjs (never dynamic import — several seeders process.exit() at module load). Currently logs 91 seeders awaiting declareRecords migration. Wiring (minimal, behavior-preserving): - api/health.js: imports unwrapEnvelope; routes readSeedMeta's parsed value through it. Legacy meta has no _seed wrapper → passes through unchanged. - scripts/_bundle-runner.mjs: readSectionFreshness prefers envelope at section.canonicalKey when present, falls back to the existing seed-meta:<key> read via section.seedMetaKey (unchanged path today since no bundle defines canonicalKey yet). No seeder modified. 
No writes changed. All 5279 existing data tests still green; both typechecks clean; parity verifier green; 39 new tests pass. PR 2 will migrate seeders, bundles, and readers to envelope semantics. PR 3 removes the legacy path and hard-fails the conformance test. fix(seed-contract): address PR #3095 review — metaTtlSeconds opt, bundle fallback, strict conformance mode Review findings applied: P1 — metaTtlSeconds missing from OPTIONAL_FIELDS whitelist. scripts/seed-jodi-gas.mjs:250 passes metaTtlSeconds to runSeed(); field is consumed by _seed-utils writeSeedMeta. Without it in the whitelist, PR 2's validateDescriptor wiring would throw 'unknown field' the moment jodi-gas migrates. Added with a 'removed in PR 3' note. P2 — Bundle canonicalKey short-circuit over-runs during migration. readSectionFreshness previously returned null if canonicalKey had no envelope yet, even when a legacy seed-meta key was also declared — making every cron re-run the section. Fixed to fall through to seedMetaKey on null envelope so the transition state is safe. P3 — Conformance soft-warn signal was invisible in CI. tests/seed-contract.test.mjs now emits a t.diagnostic summary line ('N/M seeders export declareRecords') visible on every run and gates hard-fail behind SEED_CONTRACT_STRICT=1 so PR 3 can flip to strict without more code. Nitpick — parity regex missed 'export async function'. Added '(?:async\s+)?' to scripts/verify-seed-envelope-parity.mjs function extraction regex. Verified: 39 tests green, parity verifier green, strict mode correctly hard-fails with 91 seeders missing (expected during PR 1). * fix(seed-contract): address review round 2 — NaN/empty-string validation, Error cause, parity CI wiring P2 — Non-finite ttlSeconds/maxStaleMin bypassed validation. `typeof NaN === 'number'` and `NaN > 0 === false` meant a NaN duration passed the old typeof+<=0 checks and would have poisoned TTLs once validateDescriptor is wired into runSeed. 
Now gated by Number.isFinite, which rejects NaN and ±Infinity. Tests added for NaN/Infinity on both fields. P2 — Empty/whitespace-only strings for domain/resource/canonicalKey/sourceVersion bypassed validation. Added .trim() === '' rejection + tests per field. This mattered because canonicalKey='' would have landed writes at the empty key and seed-meta under a blank resource namespace. P3 — SeedContractError silently dropped the Error v2 cause option. Constructor now forwards { cause } through super() so err.cause works with standard tooling (Node's default stack printer, Sentry chained-cause serialization). resolveRecordCount's manual err.cause = err assignment was replaced with the options-bag form. Test added for both constructor direct-use and the resolveRecordCount wrap path. P3 — Parity verifier was not on an automated path. Added tests/seed-envelope-parity.test.mjs which spawns scripts/verify-seed-envelope-parity.mjs via execFile; non-zero exit (drift) → test fails. Now runs as part of `npm run test:data` (tsx --test tests/*.test.mjs). Drift injection confirmed: sed -i modifying api/_seed-envelope.js makes the test fail with 'Command failed' from execFile. 51 tests total (was 39). All green on clean tree. fix(seed-contract): conformance test checks full descriptor, not just declareRecords Previous conformance check green-lit any seeder that exported declareRecords, even if the runSeed(...) call-site omitted other validateDescriptor-required opts (validateFn, ttlSeconds, sourceVersion, schemaVersion, maxStaleMin). That would have produced a false readiness signal for PR 3's strict flip: test goes green, but wiring validateDescriptor() into runSeed in PR 2 would still throw at runtime across the fleet. Examples verified on the PR head: - scripts/seed-cot.mjs:188-192 — no sourceVersion/schemaVersion/maxStaleMin - scripts/seed-market-breadth.mjs:121-124 — same - scripts/seed-jodi-gas.mjs:248-253 — no schemaVersion/maxStaleMin Now the conformance test: 1. 
AST-lite extracts the runSeed(...) call site with balanced parens, tolerating strings and comments. 2. Checks every REQUIRED_OPTS_FIELDS entry (validateFn, declareRecords, ttlSeconds, sourceVersion, schemaVersion, maxStaleMin) is present as an object key in that call-site. 3. Emits a per-file diagnostic listing missing fields. 4. Migration signal is now accurate: 0/91 seeders fully satisfy the descriptor (was claiming 0/91 missing just declareRecords). Matches the underlying validateDescriptor behavior. Verified: strict mode (SEED_CONTRACT_STRICT=1) surfaces 'opt:schemaVersion, opt:maxStaleMin' as missing fields per seeder — actionable for PR 2 migration work. 51 tests total (unchanged count; behavior change is in which seeders the one conformance test considers migrated). * fix(seed-contract): strip comments/strings before parsing runSeed() call site The conformance scanner located the first 'runSeed(' substring in the raw source, which caught commented-out mentions upstream of the real call. Offending files where this produced false 'incomplete' diagnoses: - scripts/seed-bis-data.mjs:209 // runSeed() calls process.exit(0)… real call at :220 - scripts/seed-economy.mjs:788 header comment mentioning runSeed() real call at :891 Three files had the same pattern. Under strict mode these would have been false hard failures in PR 3 even when the real descriptor was migrated. Fix: - stripCommentsAndStrings(src) produces a view where block comments, line comments, and string/template literals are replaced with spaces (line feeds preserved). Indices stay aligned with the original source so extractRunSeedCall can match against the stripped view and then slice the original source for the real call body. - descriptorFieldsPresent() also runs its field-presence regex against the stripped call body so '// TODO: validateFn' inside the call doesn't fool the check. 
- hasRunSeedCall() uses the stripped view too, which correctly excludes 5 seeders that only mentioned runSeed in comments. Count dropped 91→86 real callers. Added 4 targeted tests covering: - runSeed() inside a line comment ahead of the real call - runSeed() inside a block comment - runSeed() inside a string literal ("don't call runSeed() directly") - descriptor field names inside an inline comment don't count as present Verified on the actual files: seed-bis-data.mjs first real runSeed( in stripped source is at line 220 (was line 209 before fix). 40 tests total, all green. * fix(seed-contract): parity verifier survives unbalanced braces in string/template literals Addresses Greptile P2 on PR #3095: the body extractor in scripts/verify-seed-envelope-parity.mjs counted raw { and } on every character. A future helper body that legitimately contains `const marker = '{'` would have pushed depth past zero at the literal brace and truncated the body — silently masking drift in the rest of the function. Extracted the scan into scanBalanced(source, start, open, close) which skips characters inside line comments, block comments, and string / template literals (with escape handling and template-literal ${} recursion for interpolation). Call sites in extractFunctions updated to use the new scanner for both the arg-list parens and the function body braces. Made extractFunctions and scanBalanced exported so the new test file can exercise them directly. Gated main() behind an isMain check so importing the module from tests doesn't trigger process.exit. 
New tests in tests/seed-envelope-parity.test.mjs: - extractFunctions tolerates unbalanced braces in string literals - same for template literals - same for braces inside block comments - same for braces inside line comments - scanBalanced respects backslash-escapes inside strings - scanBalanced recurses into template-literal ${} interpolation Also addresses the other two Greptile P2s which were already fixed in earlier commits on this branch: - Empty-string gap (`99646dd9a`): .trim()==='' rejection added - SeedContractError cause drop (`99646dd9a`): constructor forwards cause through super's options bag per Error v2 spec 61 tests green. Both typechecks clean.	2026-04-14 22:11:56 +04:00
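The review-round-2 validation fixes in this commit (finite-number gating, blank-string rejection, and `cause` forwarding) might look roughly like the sketch below. Only `SeedContractError`, `resolveRecordCount`, `Number.isFinite`, and the `.trim() === ''` check are named in the messages above. The helper names `assertPositiveFinite` and `assertNonEmptyString`, the error messages, and the `resolveRecordCount(declareRecords, data)` signature are assumptions for illustration.

```javascript
// Error subclass that forwards { cause } through super(), so err.cause works
// with standard tooling.
class SeedContractError extends Error {
  constructor(message, options = {}) {
    super(message, options.cause !== undefined ? { cause: options.cause } : undefined);
    this.name = 'SeedContractError';
  }
}

// typeof NaN === 'number' and NaN <= 0 is false, so a typeof + <= 0 check
// lets NaN through. Number.isFinite rejects NaN and +/-Infinity.
function assertPositiveFinite(field, value) {
  if (!Number.isFinite(value) || value <= 0) {
    throw new SeedContractError(`${field} must be a positive finite number`);
  }
  return value;
}

// Rejects empty and whitespace-only strings (e.g. canonicalKey = '  ').
function assertNonEmptyString(field, value) {
  if (typeof value !== 'string' || value.trim() === '') {
    throw new SeedContractError(`${field} must be a non-empty string`);
  }
  return value;
}

// Assumed signature: wraps a throwing declareRecords using the options-bag
// form of cause, then enforces "non-negative integer or throw".
function resolveRecordCount(declareRecords, data) {
  let count;
  try {
    count = declareRecords(data);
  } catch (err) {
    throw new SeedContractError('declareRecords threw', { cause: err });
  }
  if (!Number.isInteger(count) || count < 0) {
    throw new SeedContractError('record count must be a non-negative integer');
  }
  return count;
}
```

A zero count is accepted here because the messages above specify only "non-negative integer"; whether zero records counts as a failure is a separate decision that this sketch does not model.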
Elie Habib	9b180d6ee2	fix(bundle-runner): wall-time budget to prevent Railway 10min SIGKILL (#3094 ) * fix(bundle-runner): enforce wall-time budget to prevent Railway 10min SIGKILL Railway cron services SIGKILL the container at 10min. When a bundle happened to have two heavy sections due in the same tick (e.g. PW-Main + PW-Port-Activity with timeoutMs totaling 15min+), the second section's stdout never flushed and Railway marked the run as crashed — even though earlier sections published successfully. - _bundle-runner.mjs: add maxBundleMs budget (default 9min, 60s headroom under Railway's 10min ceiling). Sections whose worst-case timeout would exceed the remaining budget are deferred to the next tick with a clear log line. Summary now reports ran/skipped/deferred/failed. - seed-bundle-portwatch.mjs: lower PW-Port-Activity timeoutMs 600s→420s so a single section can no longer consume the entire budget. Observed on 2026-04-14 16:03 UTC portwatch run: PW-Disruptions + PW-Main ran cleanly, PW-Port-Activity started with ~9m37s of Railway budget and its 10min execFile timeout, got SIGKILL'd before any output flushed, job marked as crash. * fix(bundle-runner): make maxBundleMs opt-in to avoid deferring other bundles Greptile PR review flagged P1: default maxBundleMs=540_000 silently applied to all runBundle callers. At least 12 sections across 7 other bundles (energy-sources, climate, resilience, resilience-validation, imf-extended, static-ref, health) have timeoutMs >= 540_000, which means 0 + 600_000 > 540_000 is true on every first tick — those sections would be permanently deferred with no alarm. Default to Infinity; portwatch opts in via { maxBundleMs: 540_000 }. Other Railway-constrained bundles can opt in as their timeouts are audited.	2026-04-14 21:08:40 +04:00
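The P1 finding in this commit follows from the deferral comparison itself. The sketch below is illustrative only: `deferredOnFirstTick`, the options-object shape, and the `Heavy-Section` label are made up for the example. The `elapsed + timeoutMs > maxBundleMs` comparison and the `Infinity` default come from the message above.

```javascript
// Hypothetical helper: which sections would be deferred at the very start of
// a tick (elapsed = 0) for a given budget? Defaults to Infinity (opt-in).
function deferredOnFirstTick(sections, { maxBundleMs = Infinity } = {}) {
  const elapsedMs = 0;
  return sections
    .filter((s) => elapsedMs + s.timeoutMs > maxBundleMs)
    .map((s) => s.label);
}

const sections = [
  { label: 'PW-Port-Activity', timeoutMs: 420_000 },
  { label: 'Heavy-Section', timeoutMs: 600_000 }, // made-up label
];

// Opt-in budget: 0 + 600_000 > 540_000, so the 600s section never gets a turn.
const withBudget = deferredOnFirstTick(sections, { maxBundleMs: 540_000 });
// Default budget (Infinity): nothing is deferred.
const withDefault = deferredOnFirstTick(sections);
```

Because elapsed time is zero on every first tick, a section whose timeout alone exceeds the budget is deferred on every run, which is why a silent fleet-wide default of 540_000 was unsafe.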
Elie Habib	6d923108d8	refactor(seeds): bundle orchestrator to consolidate Railway cron services (100→65) (#2891 ) * refactor(seeds): add bundle orchestrator to consolidate Railway cron services Railway is at the 100-service limit. This adds a shared _bundle-runner.mjs orchestrator and 11 bundle scripts that group related seed cron services, reducing the count from 100 to ~65 when deployed. Each bundle spawns sub-seeds via child_process.execFile (proven pattern from ais-relay.cjs), with freshness-gated skipping so monthly seeds in a daily bundle only run when due. Original scripts are unchanged and independently runnable. Bundles: ecb-eu (4→1), portwatch (4→1), climate (5→1), energy-sources (6→1), macro (6→1), health (4→1), static-ref (3→1), resilience (2→1), derived-signals (2→1), market-backup (5→1), relay-backup (4→1). * refactor(seeds): deduplicate time constants across bundle scripts Export MIN/HOUR/DAY/WEEK from _bundle-runner.mjs so all 11 bundle scripts import shared constants instead of re-declaring them locally. Eliminates inconsistent computation styles (24*60*60*1000 vs 24*HOUR). * fix(seeds): correct wb-indicators seedMetaKey in relay-backup bundle The seed writes to seed-meta:economic:worldbank-techreadiness:v1 but the bundle config was missing the :v1 suffix, causing the freshness gate to always return null and the seed to run every cycle instead of daily. Found by architecture-strategist review agent. * fix(seeds): address review findings in bundle runner - Remove em dashes from comment and log line (project convention) - Read Redis creds directly instead of via getRedisCredentials() which calls process.exit(1) on missing env vars, bypassing try/catch and silently killing the entire bundle before any seed runs - Missing creds now gracefully skip freshness check (seeds still run) * fix(seeds): correct intervalMs values and exit code in bundle runner P1 fixes from external review: 1. 
process.exit(0) on failure now exits non-zero (exit 1 when failed > 0) so Railway cron monitoring detects degraded runs. 2. Corrected intervalMs to match actual cron cadences (was using TTL values): - crypto-quotes: 15min -> 5min (actual cron is 5min) - stablecoin-markets: 15min -> 10min (actual cron is 10min) - gulf-quotes: 15min -> 10min (actual cron is 10min) - health-air-quality: 3h -> 1h (actual cron is 1h) - bls-series: 3d -> 1d (actual cron is daily) - eurostat: 3d -> 1d (actual cron is daily) - fao-ffpi: 30d -> 1d (runs daily to catch monthly release window) - imf-macro: 35d -> 30d (monthly data) - national-debt: 35d -> 30d (monthly data) * docs: add Railway seed consolidation runbook Complete migration checklist with: - 46 services to delete (with Railway UUIDs) - 11 bundle services to create (with cron, start cmd, watch paths) - 43 standalone services that stay (with reasons) - Execution order, verification checklist, env var guidance - Watch paths: scripts/ + shared/ (covers loadSharedConfig resolution) - Inventory checksum: 4+4+3+46+43 = 100	2026-04-10 11:50:32 +04:00
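The shared time constants and the freshness-gated skipping from this commit can be sketched as below. The constant names `MIN`/`HOUR`/`DAY`/`WEEK` come from the message. The `isDue` helper, its signature, and the exact `>=` comparison are assumptions, since the runner's real gate is not shown here.

```javascript
// Shared time constants, all in milliseconds, built up from one base unit.
const MIN = 60 * 1000;
const HOUR = 60 * MIN;
const DAY = 24 * HOUR;
const WEEK = 7 * DAY;

// Hypothetical freshness gate: run a section when no seed-meta timestamp
// could be read, or when at least intervalMs has passed since the last fetch.
function isDue(lastFetchedAtMs, intervalMs, nowMs = Date.now()) {
  if (lastFetchedAtMs == null) return true; // missing meta -> always run
  return nowMs - lastFetchedAtMs >= intervalMs;
}

// A monthly seed inside a daily bundle, five days after its last run:
const monthlyDue = isDue(0, 30 * DAY, 5 * DAY);
// A section whose seedMetaKey is wrong reads no timestamp, so it always runs:
const misconfiguredDue = isDue(null, DAY, 5 * DAY);
```

The null branch is what made the missing `:v1` suffix expensive: the gate never found a timestamp, so the section ran on every cycle instead of daily.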