eliott/worldmonitor

mirror of https://github.com/koala73/worldmonitor.git synced 2026-05-09 08:42:03 +02:00

Author	SHA1	Message	Date
Elie Habib	e6a6d4e326	fix(bundle-runner): stream child stdio + SIGKILL escalation on timeout (#3114)	2026-04-16 07:58:18 +04:00

* fix(bundle-runner): stream child stdio + SIGKILL escalation on timeout

Silent Railway crashes in seed-bundle-portwatch: the container exits after ~7min with ZERO logs from the hanging section. Root cause is in the runner, not the seeder: execFile buffers child stdout until the callback fires, and its default SIGTERM never escalates to SIGKILL, so a child with in-flight HTTPS sockets can outlive the timeout and be killed by the container limit before any error is logged.

Switch to spawn + live line-prefixed streaming. On timeout, send SIGTERM, then SIGKILL after a 10s grace. Always log the terminal reason (timeout / exit code / signal) so the next failing bundle surfaces the hung section on its own line instead of going dark.

Applies to all 15 seed-bundle-*.mjs services that use this runner.

* fix(bundle-runner): guard double-resolve, update docstring, add tests

Review follow-ups:
- Idempotent settle() so spawn 'error' + 'close' can't double-resolve
- Header comment reflects spawn + streaming + SIGKILL behavior
- tests/bundle-runner.test.mjs covers live streaming, SIGKILL escalation when a child ignores SIGTERM, and non-zero exit reporting

* fix(bundle-runner): address PR review - declare softKill before settle, handle stdio error

* fix(bundle-runner): log terminal reason BEFORE SIGKILL grace + include grace in budget

Review P1 follow-up. Two gaps the previous commit left open:

1. A section with timeoutMs close to Railway's ~10min container cap could be killed by the container mid-grace, before the "Failed ... timeout" line reached the log stream. Fix: emit the terminal Failed line at the moment softKill fires (before SIGTERM), so the reason is flushed BEFORE any grace window that could be truncated by a container kill.
2. The admission check used raw timeoutMs, but worst-case runtime is timeoutMs + KILL_GRACE_MS when the child ignores SIGTERM. A section that "fit" the budget could still overrun. Fix: compare elapsed + timeout + grace against maxBundleMs.

The close handler still settles the promise but no longer re-logs on the timeout path (alreadyLogged flag). New test asserts the Failed line precedes SIGKILL escalation, and that budget accounts for grace.
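The idempotent settle() and alreadyLogged behavior described in this commit can be sketched roughly as follows. This is an illustrative sketch only, not the runner's actual code: `makeSettleOnce`, its method names, and the log line wording are hypothetical, and the real runner wires the same idea to spawn's 'error' and 'close' events.

```javascript
// Hypothetical helper (not from the repository): guarantees that a section's
// promise settles once and that its terminal reason is logged once.
function makeSettleOnce(resolve, log) {
  let settled = false;
  let alreadyLogged = false;
  return {
    // Called the moment the soft-kill timer fires, before SIGTERM is sent,
    // so the reason reaches the log stream ahead of any grace window.
    logTimeout(label, timeoutMs) {
      if (alreadyLogged) return;
      alreadyLogged = true;
      log(`Failed ${label}: timeout after ${timeoutMs}ms`);
    },
    // Called from every terminal event; only the first call has any effect.
    settle(label, result) {
      if (settled) return false;
      settled = true;
      if (!alreadyLogged) {
        alreadyLogged = true;
        log(`${result.ok ? 'Done' : 'Failed'} ${label}: ${result.reason}`);
      }
      resolve(result);
      return true;
    },
  };
}

// Usage: a timeout followed by a late second terminal event.
const demoLogs = [];
const demoResults = [];
const demoGuard = makeSettleOnce(
  (result) => demoResults.push(result),
  (line) => demoLogs.push(line),
);
demoGuard.logTimeout('PW-Port-Activity', 420000);
demoGuard.settle('PW-Port-Activity', { ok: false, reason: 'timeout' });
demoGuard.settle('PW-Port-Activity', { ok: false, reason: 'error' }); // ignored
```

Keeping the two flags separate means the timeout path can log early while the close handler still resolves the promise later without logging a second line.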
Elie Habib	dc10e47197	feat(seed-contract): PR 1 foundation - envelope + contract + conformance test (#3095)	2026-04-14 22:11:56 +04:00

* feat(seed-contract): PR 1 foundation - envelope helpers + contract validators + static conformance test

Adds the foundational pieces for the unified seed contract rollout described in docs/plans/2026-04-14-002-fix-runseed-zero-record-lockout-plan.md. Behavior-preserving by construction: legacy-shape Redis values unwrap as { _seed: null, data: raw } and pass through every helper unchanged.

New files:
- scripts/_seed-envelope-source.mjs: single source of truth for unwrapEnvelope, stripSeedEnvelope, buildEnvelope.
- api/_seed-envelope.js: edge-safe mirror (AGENTS.md:80 forbids api/* importing from server/).
- server/_shared/seed-envelope.ts: TS mirror with SeedMeta, SeedEnvelope, UnwrapResult types.
- scripts/_seed-contract.mjs: SeedContractError + validateDescriptor (10 required fields, 10 optional, unknown-field rejection) + resolveRecordCount (non-negative integer or throw).
- scripts/verify-seed-envelope-parity.mjs: diffs function bodies between the two JS copies; TS copy guarded by tsc.
- tests/seed-envelope.test.mjs: 14 tests for the three helpers (null, legacy-passthrough, stringified JSON, round-trip).
- tests/seed-contract.test.mjs: 25 tests for validateDescriptor/resolveRecordCount + a soft-warn conformance scan that STATICALLY parses scripts/seed-*.mjs (never dynamic import, since several seeders process.exit() at module load). Currently logs 91 seeders awaiting declareRecords migration.

Wiring (minimal, behavior-preserving):
- api/health.js: imports unwrapEnvelope; routes readSeedMeta's parsed value through it. Legacy meta has no _seed wrapper → passes through unchanged.
- scripts/_bundle-runner.mjs: readSectionFreshness prefers envelope at section.canonicalKey when present, falls back to the existing seed-meta:<key> read via section.seedMetaKey (unchanged path today since no bundle defines canonicalKey yet).

No seeder modified. No writes changed. All 5279 existing data tests still green; both typechecks clean; parity verifier green; 39 new tests pass.

PR 2 will migrate seeders, bundles, and readers to envelope semantics. PR 3 removes the legacy path and hard-fails the conformance test.

* fix(seed-contract): address PR #3095 review - metaTtlSeconds opt, bundle fallback, strict conformance mode

Review findings applied:

P1: metaTtlSeconds missing from OPTIONAL_FIELDS whitelist. scripts/seed-jodi-gas.mjs:250 passes metaTtlSeconds to runSeed(); field is consumed by _seed-utils writeSeedMeta. Without it in the whitelist, PR 2's validateDescriptor wiring would throw 'unknown field' the moment jodi-gas migrates. Added with a 'removed in PR 3' note.

P2: Bundle canonicalKey short-circuit over-runs during migration. readSectionFreshness previously returned null if canonicalKey had no envelope yet, even when a legacy seed-meta key was also declared, making every cron re-run the section. Fixed to fall through to seedMetaKey on null envelope so the transition state is safe.

P3: Conformance soft-warn signal was invisible in CI. tests/seed-contract.test.mjs now emits a t.diagnostic summary line ('N/M seeders export declareRecords') visible on every run and gates hard-fail behind SEED_CONTRACT_STRICT=1 so PR 3 can flip to strict without more code.

Nitpick: parity regex missed 'export async function'. Added '(?:async\s+)?' to scripts/verify-seed-envelope-parity.mjs function extraction regex.

Verified: 39 tests green, parity verifier green, strict mode correctly hard-fails with 91 seeders missing (expected during PR 1).

* fix(seed-contract): address review round 2 - NaN/empty-string validation, Error cause, parity CI wiring

P2: Non-finite ttlSeconds/maxStaleMin bypassed validation. `typeof NaN === 'number'` and `NaN > 0 === false` meant a NaN duration passed the old typeof+<=0 checks and would have poisoned TTLs once validateDescriptor is wired into runSeed. Now gated by Number.isFinite, which rejects NaN and ±Infinity. Tests added for NaN/Infinity on both fields.

P2: Empty/whitespace-only strings for domain/resource/canonicalKey/sourceVersion bypassed validation. Added .trim() === '' rejection + tests per field. This mattered because canonicalKey='' would have landed writes at the empty key and seed-meta under a blank resource namespace.

P3: SeedContractError silently dropped the Error v2 cause option. Constructor now forwards { cause } through super() so err.cause works with standard tooling (Node's default stack printer, Sentry chained-cause serialization). resolveRecordCount's manual err.cause = err assignment was replaced with the options-bag form. Test added for both constructor direct-use and the resolveRecordCount wrap path.

P3: Parity verifier was not on an automated path. Added tests/seed-envelope-parity.test.mjs which spawns scripts/verify-seed-envelope-parity.mjs via execFile; non-zero exit (drift) → test fails. Now runs as part of `npm run test:data` (tsx --test tests/*.test.mjs). Drift injection confirmed: sed -i modifying api/_seed-envelope.js makes the test fail with 'Command failed' from execFile.

51 tests total (was 39). All green on clean tree.

* fix(seed-contract): conformance test checks full descriptor, not just declareRecords

Previous conformance check green-lit any seeder that exported declareRecords, even if the runSeed(...) call-site omitted other validateDescriptor-required opts (validateFn, ttlSeconds, sourceVersion, schemaVersion, maxStaleMin). That would have produced a false readiness signal for PR 3's strict flip: test goes green, but wiring validateDescriptor() into runSeed in PR 2 would still throw at runtime across the fleet.

Examples verified on the PR head:
- scripts/seed-cot.mjs:188-192 - no sourceVersion/schemaVersion/maxStaleMin
- scripts/seed-market-breadth.mjs:121-124 - same
- scripts/seed-jodi-gas.mjs:248-253 - no schemaVersion/maxStaleMin

Now the conformance test:
1. AST-lite extracts the runSeed(...) call site with balanced parens, tolerating strings and comments.
2. Checks every REQUIRED_OPTS_FIELDS entry (validateFn, declareRecords, ttlSeconds, sourceVersion, schemaVersion, maxStaleMin) is present as an object key in that call-site.
3. Emits a per-file diagnostic listing missing fields.
4. Migration signal is now accurate: 0/91 seeders fully satisfy the descriptor (was claiming 0/91 missing just declareRecords). Matches the underlying validateDescriptor behavior.

Verified: strict mode (SEED_CONTRACT_STRICT=1) surfaces 'opt:schemaVersion, opt:maxStaleMin' as missing fields per seeder, which is actionable for PR 2 migration work.

51 tests total (unchanged count; behavior change is in which seeders the one conformance test considers migrated).

* fix(seed-contract): strip comments/strings before parsing runSeed() call site

The conformance scanner located the first 'runSeed(' substring in the raw source, which caught commented-out mentions upstream of the real call. Offending files where this produced false 'incomplete' diagnoses:
- scripts/seed-bis-data.mjs:209 has the comment `// runSeed() calls process.exit(0)…`; the real call is at :220
- scripts/seed-economy.mjs:788 has a header comment mentioning runSeed(); the real call is at :891

Three files had the same pattern. Under strict mode these would have been false hard failures in PR 3 even when the real descriptor was migrated.

Fix:
- stripCommentsAndStrings(src) produces a view where block comments, line comments, and string/template literals are replaced with spaces (line feeds preserved). Indices stay aligned with the original source so extractRunSeedCall can match against the stripped view and then slice the original source for the real call body.
- descriptorFieldsPresent() also runs its field-presence regex against the stripped call body so '// TODO: validateFn' inside the call doesn't fool the check.
- hasRunSeedCall() uses the stripped view too, which correctly excludes 5 seeders that only mentioned runSeed in comments. Count dropped 91→86 real callers.

Added 4 targeted tests covering:
- runSeed() inside a line comment ahead of the real call
- runSeed() inside a block comment
- runSeed() inside a string literal ("don't call runSeed() directly")
- descriptor field names inside an inline comment don't count as present

Verified on the actual files: seed-bis-data.mjs first real runSeed( in stripped source is at line 220 (was line 209 before fix). 40 tests total, all green.

* fix(seed-contract): parity verifier survives unbalanced braces in string/template literals

Addresses Greptile P2 on PR #3095: the body extractor in scripts/verify-seed-envelope-parity.mjs counted raw { and } on every character. A future helper body that legitimately contains `const marker = '{'` would have pushed depth past zero at the literal brace and truncated the body, silently masking drift in the rest of the function.

Extracted the scan into scanBalanced(source, start, open, close) which skips characters inside line comments, block comments, and string / template literals (with escape handling and template-literal ${} recursion for interpolation). Call sites in extractFunctions updated to use the new scanner for both the arg-list parens and the function body braces.

Made extractFunctions and scanBalanced exported so the new test file can exercise them directly. Gated main() behind an isMain check so importing the module from tests doesn't trigger process.exit.

New tests in tests/seed-envelope-parity.test.mjs:
- extractFunctions tolerates unbalanced braces in string literals
- same for template literals
- same for braces inside block comments
- same for braces inside line comments
- scanBalanced respects backslash-escapes inside strings
- scanBalanced recurses into template-literal ${} interpolation

Also addresses the other two Greptile P2s which were already fixed in earlier commits on this branch:
- Empty-string gap (`99646dd9a`): .trim()==='' rejection added
- SeedContractError cause drop (`99646dd9a`): constructor forwards cause through super's options bag per Error v2 spec

61 tests green. Both typechecks clean.
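The review round 2 findings on non-finite durations, blank strings, and the dropped cause are easier to see in code. The sketch below is an assumption-laden illustration, not the contents of scripts/_seed-contract.mjs: `oldDurationCheck`, `assertPositiveFinite`, and `assertNonEmptyString` are hypothetical names, and the error class here only mirrors the described cause-forwarding behavior.

```javascript
// Sketch of an error class that forwards `cause` through super()'s options bag.
class SeedContractError extends Error {
  constructor(message, options = {}) {
    super(message, options.cause !== undefined ? { cause: options.cause } : undefined);
    this.name = 'SeedContractError';
  }
}

// Assumed shape of the old check: typeof + "<= 0" lets NaN through,
// because typeof NaN === 'number' and NaN <= 0 is false.
function oldDurationCheck(value) {
  return typeof value === 'number' && !(value <= 0);
}

// Hypothetical replacement: Number.isFinite rejects NaN and +/-Infinity.
function assertPositiveFinite(name, value) {
  if (typeof value !== 'number' || !Number.isFinite(value) || value <= 0) {
    throw new SeedContractError(`${name} must be a positive finite number`);
  }
}

// Hypothetical replacement: reject empty and whitespace-only strings.
function assertNonEmptyString(name, value) {
  if (typeof value !== 'string' || value.trim() === '') {
    throw new SeedContractError(`${name} must be a non-empty string`);
  }
}

// Usage: NaN slips past the old check but is rejected by the new one.
const nanPassedOldCheck = oldDurationCheck(NaN);
let nanRejectedByNewCheck = false;
try {
  assertPositiveFinite('ttlSeconds', NaN);
} catch (err) {
  nanRejectedByNewCheck = err instanceof SeedContractError;
}
```

The same two helpers would cover every field named in the commit: ttlSeconds and maxStaleMin for durations, and domain, resource, canonicalKey, and sourceVersion for strings.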
Elie Habib	9b180d6ee2	fix(bundle-runner): wall-time budget to prevent Railway 10min SIGKILL (#3094)	2026-04-14 21:08:40 +04:00

* fix(bundle-runner): enforce wall-time budget to prevent Railway 10min SIGKILL

Railway cron services SIGKILL the container at 10min. When a bundle happened to have two heavy sections due in the same tick (e.g. PW-Main + PW-Port-Activity with timeoutMs totaling 15min+), the second section's stdout never flushed and Railway marked the run as crashed, even though earlier sections published successfully.

- _bundle-runner.mjs: add maxBundleMs budget (default 9min, 60s headroom under Railway's 10min ceiling). Sections whose worst-case timeout would exceed the remaining budget are deferred to the next tick with a clear log line. Summary now reports ran/skipped/deferred/failed.
- seed-bundle-portwatch.mjs: lower PW-Port-Activity timeoutMs 600s→420s so a single section can no longer consume the entire budget.

Observed on 2026-04-14 16:03 UTC portwatch run: PW-Disruptions + PW-Main ran cleanly, PW-Port-Activity started with ~9m37s of Railway budget and its 10min execFile timeout, got SIGKILL'd before any output flushed, job marked as crash.

* fix(bundle-runner): make maxBundleMs opt-in to avoid deferring other bundles

Greptile PR review flagged P1: default maxBundleMs=540_000 silently applied to all runBundle callers. At least 12 sections across 7 other bundles (energy-sources, climate, resilience, resilience-validation, imf-extended, static-ref, health) have timeoutMs >= 540_000, which means 0 + 600_000 > 540_000 is true on every first tick, so those sections would be permanently deferred with no alarm. Default to Infinity; portwatch opts in via { maxBundleMs: 540_000 }. Other Railway-constrained bundles can opt in as their timeouts are audited.
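The admission arithmetic behind the wall-time budget, and the reason the default had to become opt-in, can be sketched as below. This is an illustrative sketch: `shouldDefer` is a hypothetical name, and the only parts taken from the commit are the comparison `elapsed + timeoutMs > maxBundleMs`, the Infinity default, and the 540_000 / 600_000 / 420_000 figures.

```javascript
// Hypothetical admission check: defer a section when its worst-case timeout
// would not fit in what remains of the bundle's wall-time budget.
function shouldDefer(elapsedMs, timeoutMs, maxBundleMs = Infinity) {
  return elapsedMs + timeoutMs > maxBundleMs;
}

// With a blanket 540_000 default, a 600_000 ms section is deferred even at
// elapsed = 0, on every tick, so it would never run.
const deferredUnderBlanketDefault = shouldDefer(0, 600_000, 540_000);

// With the Infinity default, bundles that did not opt in are never deferred.
const deferredWithoutOptIn = shouldDefer(0, 600_000);

// portwatch opts in; its lowered 420_000 ms section fits at the start of a tick.
const portActivityDeferredAtStart = shouldDefer(0, 420_000, 540_000);

// After 200_000 ms of earlier sections, the same section no longer fits
// and is pushed to the next tick.
const portActivityDeferredLate = shouldDefer(200_000, 420_000, 540_000);
```

Because any finite number plus a timeout is never greater than Infinity, the default makes the check a no-op for callers that have not audited their timeouts.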
Elie Habib	6d923108d8	refactor(seeds): bundle orchestrator to consolidate Railway cron services (100→65) (#2891)	2026-04-10 11:50:32 +04:00

* refactor(seeds): add bundle orchestrator to consolidate Railway cron services

Railway is at the 100-service limit. This adds a shared _bundle-runner.mjs orchestrator and 11 bundle scripts that group related seed cron services, reducing the count from 100 to ~65 when deployed.

Each bundle spawns sub-seeds via child_process.execFile (proven pattern from ais-relay.cjs), with freshness-gated skipping so monthly seeds in a daily bundle only run when due. Original scripts are unchanged and independently runnable.

Bundles: ecb-eu (4→1), portwatch (4→1), climate (5→1), energy-sources (6→1), macro (6→1), health (4→1), static-ref (3→1), resilience (2→1), derived-signals (2→1), market-backup (5→1), relay-backup (4→1).

* refactor(seeds): deduplicate time constants across bundle scripts

Export MIN/HOUR/DAY/WEEK from _bundle-runner.mjs so all 11 bundle scripts import shared constants instead of re-declaring them locally. Eliminates inconsistent computation styles (24*60*60*1000 vs 24*HOUR).

* fix(seeds): correct wb-indicators seedMetaKey in relay-backup bundle

The seed writes to seed-meta:economic:worldbank-techreadiness:v1 but the bundle config was missing the :v1 suffix, causing the freshness gate to always return null and the seed to run every cycle instead of daily. Found by architecture-strategist review agent.

* fix(seeds): address review findings in bundle runner

- Remove em dashes from comment and log line (project convention)
- Read Redis creds directly instead of via getRedisCredentials() which calls process.exit(1) on missing env vars, bypassing try/catch and silently killing the entire bundle before any seed runs
- Missing creds now gracefully skip freshness check (seeds still run)

* fix(seeds): correct intervalMs values and exit code in bundle runner

P1 fixes from external review:

1. process.exit(0) on failure now exits non-zero (exit 1 when failed > 0) so Railway cron monitoring detects degraded runs.
2. Corrected intervalMs to match actual cron cadences (was using TTL values):
- crypto-quotes: 15min -> 5min (actual cron is 5min)
- stablecoin-markets: 15min -> 10min (actual cron is 10min)
- gulf-quotes: 15min -> 10min (actual cron is 10min)
- health-air-quality: 3h -> 1h (actual cron is 1h)
- bls-series: 3d -> 1d (actual cron is daily)
- eurostat: 3d -> 1d (actual cron is daily)
- fao-ffpi: 30d -> 1d (runs daily to catch monthly release window)
- imf-macro: 35d -> 30d (monthly data)
- national-debt: 35d -> 30d (monthly data)

* docs: add Railway seed consolidation runbook

Complete migration checklist with:
- 46 services to delete (with Railway UUIDs)
- 11 bundle services to create (with cron, start cmd, watch paths)
- 43 standalone services that stay (with reasons)
- Execution order, verification checklist, env var guidance
- Watch paths: scripts/ + shared/ (covers loadSharedConfig resolution)
- Inventory checksum: 4+4+3+46+43 = 100
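The freshness gate, shared time constants, and exit-code rule described in this commit can be sketched as follows. This is a minimal sketch under stated assumptions: `isDue` and `exitCodeFor` are hypothetical names, the `>=` comparison is an assumed due rule, and the constants follow the `24*HOUR` style the commit standardizes on rather than being copied from _bundle-runner.mjs.

```javascript
// Time constants in the shared style (MIN/HOUR/DAY/WEEK, in milliseconds).
const MIN = 60 * 1000;
const HOUR = 60 * MIN;
const DAY = 24 * HOUR;
const WEEK = 7 * DAY;

// Hypothetical freshness gate. A null lastRunAtMs stands for "no usable
// seed-meta" (missing key, missing Redis creds); per the commit, the seed
// then runs instead of being skipped.
function isDue(lastRunAtMs, intervalMs, nowMs) {
  if (lastRunAtMs == null) return true;
  return nowMs - lastRunAtMs >= intervalMs;
}

// Hypothetical exit-code rule: non-zero when any section failed, so cron
// monitoring can detect a degraded run.
function exitCodeFor(summary) {
  return summary.failed > 0 ? 1 : 0;
}

// Usage: a monthly seed inside a daily bundle runs only when 30 days have passed.
const demoNow = 100 * DAY;
const monthlySeedDueAfterOneDay = isDue(demoNow - DAY, 30 * DAY, demoNow);
const monthlySeedDueAfterThirtyDays = isDue(demoNow - 30 * DAY, 30 * DAY, demoNow);
const demoExitCode = exitCodeFor({ ran: 3, skipped: 1, deferred: 0, failed: 1 });
```

This also shows why the missing :v1 suffix mattered: a seedMetaKey that never resolves behaves like a null last-run time, so the seed is treated as due on every cycle.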