Commit Graph

2 Commits

Author SHA1 Message Date
Elie Habib
5f40f8a13a feat(seed): BUNDLE_RUN_STARTED_AT_MS env + runSeed SIGTERM cleanup (#3384)
* feat(seed): BUNDLE_RUN_STARTED_AT_MS env + runSeed SIGTERM cleanup

Prereq for the re-export-share Comtrade seeder (plan 2026-04-24-003),
usable by any cohort seeder whose consumer needs bundle-level freshness.

Two coupled changes:

1. `_bundle-runner.mjs` injects `BUNDLE_RUN_STARTED_AT_MS` into every
   spawned child. All siblings in a single bundle run share one value
   (captured at `runBundle` start, not spawn time). Consumers use this
   to detect stale peer keys — if a peer's seed-meta predates the
   current bundle run, fall back to a hard default rather than read
   a cohort-peer's last-week output.

2. `_seed-utils.mjs::runSeed` registers a `process.once('SIGTERM')`
   handler that releases the acquired lock and extends existing-data
   TTL before exiting 143. `_bundle-runner.mjs` sends SIGTERM on
   section timeout, then SIGKILL after KILL_GRACE_MS (5s). Without
   this handler the `finally` path never runs on SIGKILL, leaving
   the 30-min acquireLock reservation in place until its own TTL
   expires — the next cron tick silently skips the resource.

Regression guard memory: `bundle-runner-sigkill-leaks-child-lock` (PR
#3128 root cause).

Tests added:
- bundle-runner env injection (value within run bounds)
- sibling sections share the same timestamp (critical for the
  consumer freshness guard)
- runSeed SIGTERM path: exit 143 + cleanup log
- process.once contract: second SIGTERM does not re-enter handler

* fix(seed): address P1/P2 review findings on SIGTERM + bundle contracts

Addresses PR #3384 review findings (todos 256, 257, 259, 260):

#256 (P1) — SIGTERM handler narrowed to fetch phase only. Was installed
at runSeed entry and armed through every `process.exit` path; could
race `emptyDataIsFailure: true` strict-floor exits (IMF-External,
WB-bulk) and extend seed-meta TTL when the contract forbids it —
silently re-masking 30-day outages. Now the handler is attached
immediately before `withRetry(fetchFn)` and removed in a try/finally
that covers all fetch-phase exit branches.

#257 (P1) — `BUNDLE_RUN_STARTED_AT_MS` now has a first-class helper.
Exported `getBundleRunStartedAtMs()` from `_seed-utils.mjs` with JSDoc
describing the bundle-freshness contract. Fleet-wide helper so the
next consumer seeder imports instead of rediscovering the idiom.

#259 (P2) — SIGTERM cleanup runs `Promise.allSettled` on disjoint-key
ops (`releaseLock` + `extendExistingTtl`). Serialising compounded
Upstash latency during the exact failure mode (Redis degraded) this
handler exists to handle, risking breach of the 5s SIGKILL grace.

#260 (P2) — `_bundle-runner.mjs` asserts topological order on
optional `dependsOn` section field. Throws on unknown-label refs and
on deps appearing at a later index. Fleet-wide contract replacing
the previous prose-comment ordering guarantee.

Tests added/updated:
- New: SIGTERM handler removed after fetchFn completes (narrowed-scope
  contract — post-fetch SIGTERM must NOT trigger TTL extension)
- New: dependsOn unknown-label + out-of-order + happy-path (3 tests)

Full test suite: 6,866 tests pass (+4 net).

* fix(seed): getBundleRunStartedAtMs returns null outside a bundle run

Review follow-up: the earlier `Math.floor(Date.now()/1000)*1000` fallback
regressed standalone (non-bundle) runs. A consumer seeder invoked
manually just after its peer wrote `fetchedAt = (now - 5s)` would see
`bundleStartMs = Date.now()`, reject the perfectly-fresh peer envelope
as "stale", and fall back to defaults — defeating the point of the
peer-read path outside the bundle.

Returning null when `BUNDLE_RUN_STARTED_AT_MS` is unset/invalid keeps
the freshness gate scoped to its real purpose (across-bundle-tick
staleness) and lets standalone runs skip the gate entirely. Consumers
check `bundleStartMs != null` before applying the comparison; see the
companion `seed-sovereign-wealth.mjs` change on the stacked PR.

* test(seed): SIGTERM cleanup test now verifies Redis DEL + EXPIRE calls

Greptile review P2 on PR #3384: the existing test only asserted exit
code + log line, not that the Redis ops were actually issued. The
log claim was ahead of the test.

Fixture now logs every Upstash fetch call's shape (EVAL / pipeline-
EXPIRE / other) to stderr. Test asserts:

- >=1 EVAL op was issued during SIGTERM cleanup (releaseLock Lua
  script on the lock key)
- >=1 pipeline-EXPIRE op was issued (extendExistingTtl on canonical
  + seed-meta keys)
- The EVAL body carries the runSeed-generated runId (proves it's
  THIS run's release, not a phantom op)
- The EXPIRE pipeline touches both the canonicalKey AND the
  seed-meta key (proves the keys[] array was built correctly
  including the extraKeys merge path)

Full test suite: 6,866 tests pass, typecheck clean.
2026-04-25 00:14:04 +04:00
Elie Habib
e6a6d4e326 fix(bundle-runner): stream child stdio + SIGKILL escalation on timeout (#3114)
* fix(bundle-runner): stream child stdio + SIGKILL escalation on timeout

Silent Railway crashes in seed-bundle-portwatch — container exits after
~7min with ZERO logs from the hanging section. Root cause in the runner,
not the seeder: execFile buffers child stdout until the callback fires,
and its default SIGTERM never escalates to SIGKILL, so a child with
in-flight HTTPS sockets can outlive the timeout and be killed by the
container limit before any error is logged.

Switch to spawn + live line-prefixed streaming. On timeout, send SIGTERM,
then SIGKILL after a 10s grace. Always log the terminal reason (timeout
/ exit code / signal) so the next failing bundle surfaces the hung
section on its own line instead of going dark.

Applies to all 15 seed-bundle-*.mjs services that use this runner.

* fix(bundle-runner): guard double-resolve, update docstring, add tests

Review follow-ups:
- Idempotent settle() so spawn 'error' + 'close' can't double-resolve
- Header comment reflects spawn + streaming + SIGKILL behavior
- tests/bundle-runner.test.mjs covers live streaming, SIGKILL escalation
  when a child ignores SIGTERM, and non-zero exit reporting

* fix(bundle-runner): address PR review — declare softKill before settle, handle stdio error

* fix(bundle-runner): log terminal reason BEFORE SIGKILL grace + include grace in budget

Review P1 follow-up. Two gaps the previous commit left open:

1. A section with timeoutMs close to Railway's ~10min container cap could
   be killed by the container mid-grace, before the "Failed ... timeout"
   line reached the log stream. Fix: emit the terminal Failed line at the
   moment softKill fires (before SIGTERM), so the reason is flushed BEFORE
   any grace window that could be truncated by a container kill.

2. The admission check used raw timeoutMs, but worst-case runtime is
   timeoutMs + KILL_GRACE_MS when the child ignores SIGTERM. A section
   that "fit" the budget could still overrun. Fix: compare elapsed +
   timeout + grace against maxBundleMs.

close handler still settles the promise but no longer re-logs on the
timeout path (alreadyLogged flag). New test asserts the Failed line
precedes SIGKILL escalation, and that budget accounts for grace.
2026-04-16 07:58:18 +04:00