mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-26 01:24:59 +02:00
2f5445284b6e62493c8d2e1b31f95b57c283f1fd
6 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
5f40f8a13a |
feat(seed): BUNDLE_RUN_STARTED_AT_MS env + runSeed SIGTERM cleanup (#3384)
* feat(seed): BUNDLE_RUN_STARTED_AT_MS env + runSeed SIGTERM cleanup
Prereq for the re-export-share Comtrade seeder (plan 2026-04-24-003),
usable by any cohort seeder whose consumer needs bundle-level freshness.
Two coupled changes:
1. `_bundle-runner.mjs` injects `BUNDLE_RUN_STARTED_AT_MS` into every
spawned child. All siblings in a single bundle run share one value
(captured at `runBundle` start, not spawn time). Consumers use this
to detect stale peer keys — if a peer's seed-meta predates the
current bundle run, fall back to a hard default rather than read
a cohort-peer's last-week output.
2. `_seed-utils.mjs::runSeed` registers a `process.once('SIGTERM')`
handler that releases the acquired lock and extends existing-data
TTL before exiting 143. `_bundle-runner.mjs` sends SIGTERM on
section timeout, then SIGKILL after KILL_GRACE_MS (5s). Without
this handler the `finally` path never runs on SIGKILL, leaving
the 30-min acquireLock reservation in place until its own TTL
expires — the next cron tick silently skips the resource.
Regression guard memory: `bundle-runner-sigkill-leaks-child-lock` (PR
#3128 root cause).
Tests added:
- bundle-runner env injection (value within run bounds)
- sibling sections share the same timestamp (critical for the
consumer freshness guard)
- runSeed SIGTERM path: exit 143 + cleanup log
- process.once contract: second SIGTERM does not re-enter handler
* fix(seed): address P1/P2 review findings on SIGTERM + bundle contracts
Addresses PR #3384 review findings (todos 256, 257, 259, 260):
#256 (P1) — SIGTERM handler narrowed to fetch phase only. Was installed
at runSeed entry and armed through every `process.exit` path; could
race `emptyDataIsFailure: true` strict-floor exits (IMF-External,
WB-bulk) and extend seed-meta TTL when the contract forbids it —
silently re-masking 30-day outages. Now the handler is attached
immediately before `withRetry(fetchFn)` and removed in a try/finally
that covers all fetch-phase exit branches.
#257 (P1) — `BUNDLE_RUN_STARTED_AT_MS` now has a first-class helper.
Exported `getBundleRunStartedAtMs()` from `_seed-utils.mjs` with JSDoc
describing the bundle-freshness contract. Fleet-wide helper so the
next consumer seeder imports instead of rediscovering the idiom.
#259 (P2) — SIGTERM cleanup runs `Promise.allSettled` on disjoint-key
ops (`releaseLock` + `extendExistingTtl`). Serialising compounded
Upstash latency during the exact failure mode (Redis degraded) this
handler exists to handle, risking breach of the 5s SIGKILL grace.
#260 (P2) — `_bundle-runner.mjs` asserts topological order on
optional `dependsOn` section field. Throws on unknown-label refs and
on deps appearing at a later index. Fleet-wide contract replacing
the previous prose-comment ordering guarantee.
Tests added/updated:
- New: SIGTERM handler removed after fetchFn completes (narrowed-scope
contract — post-fetch SIGTERM must NOT trigger TTL extension)
- New: dependsOn unknown-label + out-of-order + happy-path (3 tests)
Full test suite: 6,866 tests pass (+4 net).
* fix(seed): getBundleRunStartedAtMs returns null outside a bundle run
Review follow-up: the earlier `Math.floor(Date.now()/1000)*1000` fallback
regressed standalone (non-bundle) runs. A consumer seeder invoked
manually just after its peer wrote `fetchedAt = (now - 5s)` would see
`bundleStartMs = Date.now()`, reject the perfectly-fresh peer envelope
as "stale", and fall back to defaults — defeating the point of the
peer-read path outside the bundle.
Returning null when `BUNDLE_RUN_STARTED_AT_MS` is unset/invalid keeps
the freshness gate scoped to its real purpose (across-bundle-tick
staleness) and lets standalone runs skip the gate entirely. Consumers
check `bundleStartMs != null` before applying the comparison; see the
companion `seed-sovereign-wealth.mjs` change on the stacked PR.
* test(seed): SIGTERM cleanup test now verifies Redis DEL + EXPIRE calls
Greptile review P2 on PR #3384: the existing test only asserted exit
code + log line, not that the Redis ops were actually issued. The
log claim was ahead of the test.
Fixture now logs every Upstash fetch call's shape (EVAL / pipeline-
EXPIRE / other) to stderr. Test asserts:
- >=1 EVAL op was issued during SIGTERM cleanup (releaseLock Lua
script on the lock key)
- >=1 pipeline-EXPIRE op was issued (extendExistingTtl on canonical
+ seed-meta keys)
- The EVAL body carries the runSeed-generated runId (proves it's
THIS run's release, not a phantom op)
- The EXPIRE pipeline touches both the canonicalKey AND the
seed-meta key (proves the keys[] array was built correctly
including the extraKeys merge path)
Full test suite: 6,866 tests pass, typecheck clean.
|
||
|
|
8a988323d2 |
chore(bundle-runner): emit reliable per-section summary line on parent stdout (#3320)
* chore(bundle-runner): emit reliable per-section summary line on parent stdout Fixes observability asymmetry in Railway bundle service logs where some seeders appeared to skip lines like \`Run ID\`, \`Mode\`, or the structured \`seed_complete\` JSON event. Root cause is Railway's log ingestion dropping child-stdout lines when multiple seeders emit at similar timestamps — observed in the PR #3294 launch run where Pipelines-Gas was missing its \`=== Seed ===\` banner, Pipelines-Oil had \`Key:\` emitted BEFORE the banner, Storage-Facilities and Energy- Disruptions were missing Run ID + Mode + seed_complete entirely, despite identical code paths. All child processes emit the same lines; Railway just loses some. Fix is to piggy-back on the observation that bundle-level lines (\`[Bundle:X] Starting\`, \`[Bundle:X] Finished\`) ARE reliably captured — they come from the parent process's single stdout stream. Changes in scripts/_bundle-runner.mjs: - spawnSeed now captures the child's \`{"event":"seed_complete",...}\` JSON line while streaming stdout, parses it, and attaches to the settle result. - Main loop emits one bundle-level summary line per section after child exit: [Bundle:X] section=NAME status=OK durationMs=1237 records=15 state=OK (or \`status=FAILED elapsed=...s reason=...\` for failures). - Summary line survives Railway's log ingestion even when per-section child lines drop, giving monitors a reliable event to key off. Observability consumers (log-based alerts, seed telemetry scrapers) should now key off the bundle-level summary rather than per-section child lines which remain best-effort. The per-section child lines stay as-is for interactive debugging. Verification: parse logic sanity-checked against the exact seed_complete line format. Node syntax check clean. No schema changes. * fix(bundle-runner): emit FAILED summary line to stderr, not stdout The prior commit introduced a bundle-level structured summary line per section. On success that correctly goes to stdout; on FAILED it was also going to stdout — but that broke tests/bundle-runner.test.mjs test 140 ("timeout emits terminal reason BEFORE SIGTERM/SIGKILL grace"). The test concatenates stdout+stderr and asserts that `SIGKILL` appears AFTER `Failed after` in the combined string (verifying the kill-decision log line is emitted BEFORE the 10s SIGTERM→SIGKILL grace window, so it survives container termination). My new FAILED summary line — which includes the reason string `timeout after 1s (signal SIGKILL)` — landed on stdout, which comes first in the concatenation, and its `SIGKILL` substring matched before the stderr-side `Did not exit on SIGTERM...SIGKILL` line. Ordering assertion failed. Fix: route the FAILED summary line through console.error (same stream as the pre-kill `Failed after ... sending SIGTERM` and the grace-window `Did not exit...SIGKILL` lines). Chronological ordering in combined output is preserved; test passes. OK summary lines stay on stdout — they're observability data, not error diagnostics, and belong on the normal stream alongside the bundle Starting/Finished lines. Local: `node --test tests/bundle-runner.test.mjs` — 4/4 pass including the previously-failing ordering test. |
||
|
|
e6a6d4e326 |
fix(bundle-runner): stream child stdio + SIGKILL escalation on timeout (#3114)
* fix(bundle-runner): stream child stdio + SIGKILL escalation on timeout Silent Railway crashes in seed-bundle-portwatch — container exits after ~7min with ZERO logs from the hanging section. Root cause in the runner, not the seeder: execFile buffers child stdout until the callback fires, and its default SIGTERM never escalates to SIGKILL, so a child with in-flight HTTPS sockets can outlive the timeout and be killed by the container limit before any error is logged. Switch to spawn + live line-prefixed streaming. On timeout, send SIGTERM, then SIGKILL after a 10s grace. Always log the terminal reason (timeout / exit code / signal) so the next failing bundle surfaces the hung section on its own line instead of going dark. Applies to all 15 seed-bundle-*.mjs services that use this runner. * fix(bundle-runner): guard double-resolve, update docstring, add tests Review follow-ups: - Idempotent settle() so spawn 'error' + 'close' can't double-resolve - Header comment reflects spawn + streaming + SIGKILL behavior - tests/bundle-runner.test.mjs covers live streaming, SIGKILL escalation when a child ignores SIGTERM, and non-zero exit reporting * fix(bundle-runner): address PR review — declare softKill before settle, handle stdio error * fix(bundle-runner): log terminal reason BEFORE SIGKILL grace + include grace in budget Review P1 follow-up. Two gaps the previous commit left open: 1. A section with timeoutMs close to Railway's ~10min container cap could be killed by the container mid-grace, before the "Failed ... timeout" line reached the log stream. Fix: emit the terminal Failed line at the moment softKill fires (before SIGTERM), so the reason is flushed BEFORE any grace window that could be truncated by a container kill. 2. The admission check used raw timeoutMs, but worst-case runtime is timeoutMs + KILL_GRACE_MS when the child ignores SIGTERM. A section that "fit" the budget could still overrun. Fix: compare elapsed + timeout + grace against maxBundleMs. close handler still settles the promise but no longer re-logs on the timeout path (alreadyLogged flag). New test asserts the Failed line precedes SIGKILL escalation, and that budget accounts for grace. |
||
|
|
dc10e47197 |
feat(seed-contract): PR 1 foundation — envelope + contract + conformance test (#3095)
* feat(seed-contract): PR 1 foundation — envelope helpers + contract validators + static conformance test
Adds the foundational pieces for the unified seed contract rollout described in
docs/plans/2026-04-14-002-fix-runseed-zero-record-lockout-plan.md. Behavior-
preserving by construction: legacy-shape Redis values unwrap as { _seed: null,
data: raw } and pass through every helper unchanged.
New files:
- scripts/_seed-envelope-source.mjs — single source of truth for unwrapEnvelope,
stripSeedEnvelope, buildEnvelope.
- api/_seed-envelope.js — edge-safe mirror (AGENTS.md:80 forbids api/* importing
from server/).
- server/_shared/seed-envelope.ts — TS mirror with SeedMeta, SeedEnvelope,
UnwrapResult types.
- scripts/_seed-contract.mjs — SeedContractError + validateDescriptor (10
required fields, 10 optional, unknown-field rejection) + resolveRecordCount
(non-negative integer or throw).
- scripts/verify-seed-envelope-parity.mjs — diffs function bodies between the
two JS copies; TS copy guarded by tsc.
- tests/seed-envelope.test.mjs — 14 tests for the three helpers (null,
legacy-passthrough, stringified JSON, round-trip).
- tests/seed-contract.test.mjs — 25 tests for validateDescriptor/
resolveRecordCount + a soft-warn conformance scan that STATICALLY parses
scripts/seed-*.mjs (never dynamic import — several seeders process.exit() at
module load). Currently logs 91 seeders awaiting declareRecords migration.
Wiring (minimal, behavior-preserving):
- api/health.js: imports unwrapEnvelope; routes readSeedMeta's parsed value
through it. Legacy meta has no _seed wrapper → passes through unchanged.
- scripts/_bundle-runner.mjs: readSectionFreshness prefers envelope at
section.canonicalKey when present, falls back to the existing
seed-meta:<key> read via section.seedMetaKey (unchanged path today since no
bundle defines canonicalKey yet).
No seeder modified. No writes changed. All 5279 existing data tests still
green; both typechecks clean; parity verifier green; 39 new tests pass.
PR 2 will migrate seeders, bundles, and readers to envelope semantics. PR 3
removes the legacy path and hard-fails the conformance test.
* fix(seed-contract): address PR #3095 review — metaTtlSeconds opt, bundle fallback, strict conformance mode
Review findings applied:
P1 — metaTtlSeconds missing from OPTIONAL_FIELDS whitelist.
scripts/seed-jodi-gas.mjs:250 passes metaTtlSeconds to runSeed(); field is
consumed by _seed-utils writeSeedMeta. Without it in the whitelist, PR 2's
validateDescriptor wiring would throw 'unknown field' the moment jodi-gas
migrates. Added with a 'removed in PR 3' note.
P2 — Bundle canonicalKey short-circuit over-runs during migration.
readSectionFreshness previously returned null if canonicalKey had no envelope
yet, even when a legacy seed-meta key was also declared — making every cron
re-run the section. Fixed to fall through to seedMetaKey on null envelope so
the transition state is safe.
P3 — Conformance soft-warn signal was invisible in CI.
tests/seed-contract.test.mjs now emits a t.diagnostic summary line
('N/M seeders export declareRecords') visible on every run and gates hard-fail
behind SEED_CONTRACT_STRICT=1 so PR 3 can flip to strict without more code.
Nitpick — parity regex missed 'export async function'.
Added '(?:async\s+)?' to scripts/verify-seed-envelope-parity.mjs function
extraction regex.
Verified: 39 tests green, parity verifier green, strict mode correctly
hard-fails with 91 seeders missing (expected during PR 1).
* fix(seed-contract): address review round 2 — NaN/empty-string validation, Error cause, parity CI wiring
P2 — Non-finite ttlSeconds/maxStaleMin bypassed validation.
`typeof NaN === 'number'` and `NaN > 0 === false` meant a NaN duration
passed the old typeof+<=0 checks and would have poisoned TTLs once
validateDescriptor is wired into runSeed. Now gated by Number.isFinite,
which rejects NaN and ±Infinity. Tests added for NaN/Infinity on both
fields.
P2 — Empty/whitespace-only strings for domain/resource/canonicalKey/sourceVersion
bypassed validation. Added .trim() === '' rejection + tests per field.
This mattered because canonicalKey='' would have landed writes at the
empty key and seed-meta under a blank resource namespace.
P3 — SeedContractError silently dropped the Error v2 cause option.
Constructor now forwards { cause } through super() so err.cause works
with standard tooling (Node's default stack printer, Sentry chained-cause
serialization). resolveRecordCount's manual err.cause = err assignment
was replaced with the options-bag form. Test added for both constructor
direct-use and the resolveRecordCount wrap path.
P3 — Parity verifier was not on an automated path. Added
tests/seed-envelope-parity.test.mjs which spawns scripts/verify-seed-envelope-parity.mjs
via execFile; non-zero exit (drift) → test fails. Now runs as part of
`npm run test:data` (tsx --test tests/*.test.mjs). Drift injection
confirmed: sed -i modifying api/_seed-envelope.js makes the test fail
with 'Command failed' from execFile.
51 tests total (was 39). All green on clean tree.
* fix(seed-contract): conformance test checks full descriptor, not just declareRecords
Previous conformance check green-lit any seeder that exported
declareRecords, even if the runSeed(...) call-site omitted other
validateDescriptor-required opts (validateFn, ttlSeconds, sourceVersion,
schemaVersion, maxStaleMin). That would have produced a false readiness
signal for PR 3's strict flip: test goes green, but wiring
validateDescriptor() into runSeed in PR 2 would still throw at runtime
across the fleet.
Examples verified on the PR head:
- scripts/seed-cot.mjs:188-192 — no sourceVersion/schemaVersion/maxStaleMin
- scripts/seed-market-breadth.mjs:121-124 — same
- scripts/seed-jodi-gas.mjs:248-253 — no schemaVersion/maxStaleMin
Now the conformance test:
1. AST-lite extracts the runSeed(...) call site with balanced parens,
tolerating strings and comments.
2. Checks every REQUIRED_OPTS_FIELDS entry (validateFn, declareRecords,
ttlSeconds, sourceVersion, schemaVersion, maxStaleMin) is present as
an object key in that call-site.
3. Emits a per-file diagnostic listing missing fields.
4. Migration signal is now accurate: 0/91 seeders fully satisfy the
descriptor (was claiming 0/91 missing just declareRecords). Matches
the underlying validateDescriptor behavior.
Verified: strict mode (SEED_CONTRACT_STRICT=1) surfaces 'opt:schemaVersion,
opt:maxStaleMin' as missing fields per seeder — actionable for PR 2
migration work. 51 tests total (unchanged count; behavior change is in
which seeders the one conformance test considers migrated).
* fix(seed-contract): strip comments/strings before parsing runSeed() call site
The conformance scanner located the first 'runSeed(' substring in the raw
source, which caught commented-out mentions upstream of the real call.
Offending files where this produced false 'incomplete' diagnoses:
- scripts/seed-bis-data.mjs:209 // runSeed() calls process.exit(0)…
real call at :220
- scripts/seed-economy.mjs:788 header comment mentioning runSeed()
real call at :891
Three files had the same pattern. Under strict mode these would have been
false hard failures in PR 3 even when the real descriptor was migrated.
Fix:
- stripCommentsAndStrings(src) produces a view where block comments, line
comments, and string/template literals are replaced with spaces (line
feeds preserved). Indices stay aligned with the original source so
extractRunSeedCall can match against the stripped view and then slice
the original source for the real call body.
- descriptorFieldsPresent() also runs its field-presence regex against
the stripped call body so '// TODO: validateFn' inside the call doesn't
fool the check.
- hasRunSeedCall() uses the stripped view too, which correctly excludes
5 seeders that only mentioned runSeed in comments. Count dropped
91→86 real callers.
Added 4 targeted tests covering:
- runSeed() inside a line comment ahead of the real call
- runSeed() inside a block comment
- runSeed() inside a string literal ("don't call runSeed() directly")
- descriptor field names inside an inline comment don't count as present
Verified on the actual files: seed-bis-data.mjs first real runSeed( in
stripped source is at line 220 (was line 209 before fix).
40 tests total, all green.
* fix(seed-contract): parity verifier survives unbalanced braces in string/template literals
Addresses Greptile P2 on PR #3095: the body extractor in
scripts/verify-seed-envelope-parity.mjs counted raw { and } on every
character. A future helper body that legitimately contains
`const marker = '{'` would have pushed depth past zero at the literal
brace and truncated the body — silently masking drift in the rest of
the function.
Extracted the scan into scanBalanced(source, start, open, close) which
skips characters inside line comments, block comments, and string /
template literals (with escape handling and template-literal ${} recursion
for interpolation). Call sites in extractFunctions updated to use the new
scanner for both the arg-list parens and the function body braces.
Made extractFunctions and scanBalanced exported so the new test file
can exercise them directly. Gated main() behind an isMain check so
importing the module from tests doesn't trigger process.exit.
New tests in tests/seed-envelope-parity.test.mjs:
- extractFunctions tolerates unbalanced braces in string literals
- same for template literals
- same for braces inside block comments
- same for braces inside line comments
- scanBalanced respects backslash-escapes inside strings
- scanBalanced recurses into template-literal ${} interpolation
Also addresses the other two Greptile P2s which were already fixed in
earlier commits on this branch:
- Empty-string gap (
|
||
|
|
9b180d6ee2 |
fix(bundle-runner): wall-time budget to prevent Railway 10min SIGKILL (#3094)
* fix(bundle-runner): enforce wall-time budget to prevent Railway 10min SIGKILL
Railway cron services SIGKILL the container at 10min. When a bundle
happened to have two heavy sections due in the same tick (e.g.
PW-Main + PW-Port-Activity with timeoutMs totaling 15min+), the second
section's stdout never flushed and Railway marked the run as crashed —
even though earlier sections published successfully.
- _bundle-runner.mjs: add maxBundleMs budget (default 9min, 60s headroom
under Railway's 10min ceiling). Sections whose worst-case timeout would
exceed the remaining budget are deferred to the next tick with a clear
log line. Summary now reports ran/skipped/deferred/failed.
- seed-bundle-portwatch.mjs: lower PW-Port-Activity timeoutMs 600s→420s
so a single section can no longer consume the entire budget.
Observed on 2026-04-14 16:03 UTC portwatch run: PW-Disruptions +
PW-Main ran cleanly, PW-Port-Activity started with ~9m37s of Railway
budget and its 10min execFile timeout, got SIGKILL'd before any output
flushed, job marked as crash.
* fix(bundle-runner): make maxBundleMs opt-in to avoid deferring other bundles
Greptile PR review flagged P1: default maxBundleMs=540_000 silently
applied to all runBundle callers. At least 12 sections across 7 other
bundles (energy-sources, climate, resilience, resilience-validation,
imf-extended, static-ref, health) have timeoutMs >= 540_000, which
means 0 + 600_000 > 540_000 is true on every first tick — those
sections would be permanently deferred with no alarm.
Default to Infinity; portwatch opts in via { maxBundleMs: 540_000 }.
Other Railway-constrained bundles can opt in as their timeouts are
audited.
|
||
|
|
6d923108d8 |
refactor(seeds): bundle orchestrator to consolidate Railway cron services (100→65) (#2891)
* refactor(seeds): add bundle orchestrator to consolidate Railway cron services Railway is at the 100-service limit. This adds a shared _bundle-runner.mjs orchestrator and 11 bundle scripts that group related seed cron services, reducing the count from 100 to ~65 when deployed. Each bundle spawns sub-seeds via child_process.execFile (proven pattern from ais-relay.cjs), with freshness-gated skipping so monthly seeds in a daily bundle only run when due. Original scripts are unchanged and independently runnable. Bundles: ecb-eu (4→1), portwatch (4→1), climate (5→1), energy-sources (6→1), macro (6→1), health (4→1), static-ref (3→1), resilience (2→1), derived-signals (2→1), market-backup (5→1), relay-backup (4→1). * refactor(seeds): deduplicate time constants across bundle scripts Export MIN/HOUR/DAY/WEEK from _bundle-runner.mjs so all 11 bundle scripts import shared constants instead of re-declaring them locally. Eliminates inconsistent computation styles (24*60*60*1000 vs 24*HOUR). * fix(seeds): correct wb-indicators seedMetaKey in relay-backup bundle The seed writes to seed-meta:economic:worldbank-techreadiness:v1 but the bundle config was missing the :v1 suffix, causing the freshness gate to always return null and the seed to run every cycle instead of daily. Found by architecture-strategist review agent. * fix(seeds): address review findings in bundle runner - Remove em dashes from comment and log line (project convention) - Read Redis creds directly instead of via getRedisCredentials() which calls process.exit(1) on missing env vars, bypassing try/catch and silently killing the entire bundle before any seed runs - Missing creds now gracefully skip freshness check (seeds still run) * fix(seeds): correct intervalMs values and exit code in bundle runner P1 fixes from external review: 1. process.exit(0) on failure now exits non-zero (exit 1 when failed > 0) so Railway cron monitoring detects degraded runs. 2. Corrected intervalMs to match actual cron cadences (was using TTL values): - crypto-quotes: 15min -> 5min (actual cron is 5min) - stablecoin-markets: 15min -> 10min (actual cron is 10min) - gulf-quotes: 15min -> 10min (actual cron is 10min) - health-air-quality: 3h -> 1h (actual cron is 1h) - bls-series: 3d -> 1d (actual cron is daily) - eurostat: 3d -> 1d (actual cron is daily) - fao-ffpi: 30d -> 1d (runs daily to catch monthly release window) - imf-macro: 35d -> 30d (monthly data) - national-debt: 35d -> 30d (monthly data) * docs: add Railway seed consolidation runbook Complete migration checklist with: - 46 services to delete (with Railway UUIDs) - 11 bundle services to create (with cron, start cmd, watch paths) - 43 standalone services that stay (with reasons) - Execution order, verification checklist, env var guidance - Watch paths: scripts/** + shared/** (covers loadSharedConfig resolution) - Inventory checksum: 4+4+3+46+43 = 100 |