mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
* feat(seed): BUNDLE_RUN_STARTED_AT_MS env + runSeed SIGTERM cleanup
Prereq for the re-export-share Comtrade seeder (plan 2026-04-24-003),
usable by any cohort seeder whose consumer needs bundle-level freshness.
Two coupled changes:
1. `_bundle-runner.mjs` injects `BUNDLE_RUN_STARTED_AT_MS` into every
spawned child. All siblings in a single bundle run share one value
(captured at `runBundle` start, not spawn time). Consumers use this
to detect stale peer keys — if a peer's seed-meta predates the
current bundle run, fall back to a hard default rather than read
a cohort-peer's last-week output.
2. `_seed-utils.mjs::runSeed` registers a `process.once('SIGTERM')`
handler that releases the acquired lock and extends existing-data
TTL before exiting 143. `_bundle-runner.mjs` sends SIGTERM on
section timeout, then SIGKILL after KILL_GRACE_MS (5s). Without
this handler the `finally` path never runs on SIGKILL, leaving
the 30-min acquireLock reservation in place until its own TTL
expires — the next cron tick silently skips the resource.
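A minimal sketch of that handler's shape, assuming the helper names this PR series uses (`releaseLock`, `extendExistingTtl`) and an injectable `exit` purely so the sketch can be exercised; the shipped handler lives inside `_seed-utils.mjs::runSeed`:

```javascript
// Sketch only — not the shipped code. `exit` is injectable here so the
// example can run without killing the process.
function installSigtermCleanup({ releaseLock, extendExistingTtl, exit = process.exit }) {
  const handler = async () => {
    // The two cleanup ops touch disjoint keys, so they can run in parallel.
    await Promise.allSettled([releaseLock(), extendExistingTtl()]);
    console.error('[seed] SIGTERM: lock released, existing-data TTL extended');
    exit(143); // 128 + 15: conventional exit code for SIGTERM
  };
  process.once('SIGTERM', handler); // `once`: a second SIGTERM cannot re-enter
  return handler;
}
```

Exit 143 is what the bundle runner's close handler sees after it sends SIGTERM on section timeout.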
Regression guard memory: `bundle-runner-sigkill-leaks-child-lock` (PR
#3128 root cause).
Tests added:
- bundle-runner env injection (value within run bounds)
- sibling sections share the same timestamp (critical for the
consumer freshness guard)
- runSeed SIGTERM path: exit 143 + cleanup log
- process.once contract: second SIGTERM does not re-enter handler
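The consumer-side guard this enables might look like the following sketch. The helper name and the `peerMeta` shape are hypothetical (the real consumer lands with the stacked seeder change); the null case reflects the final contract, where a missing `BUNDLE_RUN_STARTED_AT_MS` means a standalone run and the gate is skipped:

```javascript
// Hypothetical consumer-side freshness guard. `peerMeta` is the cohort
// peer's seed-meta read; `bundleStartMs` comes from BUNDLE_RUN_STARTED_AT_MS
// (null outside a bundle run).
function pickPeerValue(peerMeta, bundleStartMs, hardDefault) {
  if (peerMeta == null) return hardDefault;
  // Peer wrote before this bundle run started: that is last tick's output.
  if (bundleStartMs != null && peerMeta.fetchedAt < bundleStartMs) return hardDefault;
  return peerMeta.value;
}
```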
* fix(seed): address P1/P2 review findings on SIGTERM + bundle contracts
Addresses PR #3384 review findings (todos 256, 257, 259, 260):
#256 (P1) — SIGTERM handler narrowed to fetch phase only. Was installed
at runSeed entry and armed through every `process.exit` path; could
race `emptyDataIsFailure: true` strict-floor exits (IMF-External,
WB-bulk) and extend seed-meta TTL when the contract forbids it —
silently re-masking 30-day outages. Now the handler is attached
immediately before `withRetry(fetchFn)` and removed in a try/finally
that covers all fetch-phase exit branches.
#257 (P1) — `BUNDLE_RUN_STARTED_AT_MS` now has a first-class helper.
Exported `getBundleRunStartedAtMs()` from `_seed-utils.mjs` with JSDoc
describing the bundle-freshness contract. Fleet-wide helper so the
next consumer seeder imports instead of rediscovering the idiom.
#259 (P2) — SIGTERM cleanup now runs `Promise.allSettled` over the
disjoint-key ops (`releaseLock` + `extendExistingTtl`). Running them
serially compounded Upstash latency during the exact failure mode (Redis
degraded) this handler exists for, risking a breach of the 5s SIGKILL grace.
#260 (P2) — `_bundle-runner.mjs` asserts topological order on
optional `dependsOn` section field. Throws on unknown-label refs and
on deps appearing at a later index. Fleet-wide contract replacing
the previous prose-comment ordering guarantee.
Tests added/updated:
- New: SIGTERM handler removed after fetchFn completes (narrowed-scope
contract — post-fetch SIGTERM must NOT trigger TTL extension)
- New: dependsOn unknown-label + out-of-order + happy-path (3 tests)
Full test suite: 6,866 tests pass (+4 net).
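The #260 contract as a bundle author would hit it; section labels and scripts here are hypothetical, and `assertTopoOrder` is a standalone restatement of the check `runBundle` performs inline:

```javascript
// Hypothetical sections array: 'swf' consumes 'comtrade' output in-bundle,
// so it declares dependsOn and must appear at a later index.
const sections = [
  { label: 'comtrade', script: 'seed-comtrade.mjs', intervalMs: 86_400_000 },
  { label: 'swf', script: 'seed-sovereign-wealth.mjs', intervalMs: 86_400_000, dependsOn: ['comtrade'] },
];

// Standalone restatement of runBundle's topological-order assertion.
function assertTopoOrder(sections) {
  const idx = new Map(sections.map((s, i) => [s.label, i]));
  sections.forEach((s, i) => {
    for (const dep of s.dependsOn ?? []) {
      if (!idx.has(dep)) throw new Error(`unknown label '${dep}'`);
      if (idx.get(dep) >= i) throw new Error(`'${dep}' must run before '${s.label}'`);
    }
  });
}
```

Reordering the array (say, alphabetizing it) now throws at bundle start instead of silently serving last-tick peer data.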
* fix(seed): getBundleRunStartedAtMs returns null outside a bundle run
Review follow-up: the earlier `Math.floor(Date.now()/1000)*1000` fallback
regressed standalone (non-bundle) runs. A consumer seeder invoked
manually just after its peer wrote `fetchedAt = (now - 5s)` would see
`bundleStartMs = Date.now()`, reject the perfectly-fresh peer envelope
as "stale", and fall back to defaults — defeating the point of the
peer-read path outside the bundle.
Returning null when `BUNDLE_RUN_STARTED_AT_MS` is unset/invalid keeps
the freshness gate scoped to its real purpose (across-bundle-tick
staleness) and lets standalone runs skip the gate entirely. Consumers
check `bundleStartMs != null` before applying the comparison; see the
companion `seed-sovereign-wealth.mjs` change on the stacked PR.
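A sketch of the helper's contract as described (the shipped version is exported from `_seed-utils.mjs`; the `env` parameter here is illustrative, added for testability):

```javascript
// Sketch of the getBundleRunStartedAtMs contract: a positive-ms timestamp
// inside a bundle run, null when unset or invalid (standalone runs).
function getBundleRunStartedAtMs(env = process.env) {
  const raw = env.BUNDLE_RUN_STARTED_AT_MS;
  const ms = Number(raw);
  // Unset or invalid => null: standalone runs skip the freshness gate.
  if (!raw || !Number.isFinite(ms) || ms <= 0) return null;
  return ms;
}
```

Consumers then guard with `bundleStartMs != null` before comparing a peer's `fetchedAt` against it, so the gate only fires across bundle ticks.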
* test(seed): SIGTERM cleanup test now verifies Redis DEL + EXPIRE calls
Greptile review P2 on PR #3384: the existing test only asserted the exit
code and the log line, not that the Redis ops were actually issued, so the
log's claim ran ahead of what the test verified.
Fixture now logs every Upstash fetch call's shape (EVAL / pipeline-
EXPIRE / other) to stderr. Test asserts:
- >=1 EVAL op was issued during SIGTERM cleanup (releaseLock Lua
script on the lock key)
- >=1 pipeline-EXPIRE op was issued (extendExistingTtl on canonical
+ seed-meta keys)
- The EVAL body carries the runSeed-generated runId (proves it's
THIS run's release, not a phantom op)
- The EXPIRE pipeline touches both the canonicalKey AND the
seed-meta key (proves the keys[] array was built correctly
including the extraKeys merge path)
Full test suite: 6,866 tests pass, typecheck clean.
296 lines
13 KiB
JavaScript
#!/usr/bin/env node
/**
 * Bundle orchestrator: spawns multiple seed scripts sequentially via
 * child_process.spawn, with line-streamed stdio, SIGTERM→SIGKILL escalation on
 * timeout, and freshness-gated skipping. Streaming matters because a hanging
 * section would otherwise buffer its logs until exit and look like a silent
 * container crash (see PR that replaced execFile).
 *
 * Usage from a bundle script:
 *   import { runBundle } from './_bundle-runner.mjs';
 *   await runBundle('ecb-eu', [ { label, script, seedMetaKey, intervalMs, timeoutMs } ]);
 *
 * Budget (opt-in): Railway cron services SIGKILL the container at 10min. If
 * the sum of timeoutMs for sections that happen to be due exceeds ~9min, we
 * risk losing the in-flight section's logs AND marking the job as crashed.
 * Callers on Railway cron can pass `{ maxBundleMs }` to enforce a wall-time
 * budget — sections whose worst-case timeout wouldn't fit in the remaining
 * budget are deferred to the next tick. Default is Infinity (no budget) so
 * existing bundles whose individual sections already exceed 9min (e.g.
 * 600_000-1 timeouts in imf-extended, energy-sources) are not silently
 * broken by adopting the runner.
 */

import { spawn } from 'node:child_process';
import { dirname, join } from 'node:path';
import { fileURLToPath } from 'node:url';
import { loadEnvFile } from './_seed-utils.mjs';
import { unwrapEnvelope } from './_seed-envelope-source.mjs';

const __dirname = dirname(fileURLToPath(import.meta.url));

export const MIN = 60_000;
export const HOUR = 3_600_000;
export const DAY = 86_400_000;
export const WEEK = 604_800_000;

loadEnvFile(import.meta.url);

const REDIS_URL = process.env.UPSTASH_REDIS_REST_URL;
const REDIS_TOKEN = process.env.UPSTASH_REDIS_REST_TOKEN;

async function readRedisKey(key) {
  if (!REDIS_URL || !REDIS_TOKEN) return null;
  try {
    const resp = await fetch(`${REDIS_URL}/get/${encodeURIComponent(key)}`, {
      headers: { Authorization: `Bearer ${REDIS_TOKEN}` },
      signal: AbortSignal.timeout(5_000),
    });
    if (!resp.ok) return null;
    const body = await resp.json();
    return body.result ? JSON.parse(body.result) : null;
  } catch {
    return null;
  }
}

/**
 * Read section freshness for the interval gate.
 *
 * Returns `{ fetchedAt }` or null. Prefers envelope-form data when the section
 * declares `canonicalKey` (PR 2+); falls back to the legacy `seed-meta:<key>`
 * read used by every bundle file today. PR 1 keeps legacy as the ONLY live
 * path — `unwrapEnvelope` here is behavior-preserving because legacy seed-meta
 * values have no `_seed` field and pass through as `data` unchanged. When PR 2
 * migrates bundles to `canonicalKey`, this function starts reading envelopes.
 */
async function readSectionFreshness(section) {
  // Try the envelope path first when a canonicalKey is declared. If the canonical
  // key isn't yet written as an envelope (PR 2 writer migration lagging reader
  // migration, or a legacy payload still present), fall through to the legacy
  // seed-meta read so the bundle doesn't over-run during the transition.
  if (section.canonicalKey) {
    const raw = await readRedisKey(section.canonicalKey);
    const { _seed } = unwrapEnvelope(raw);
    if (_seed?.fetchedAt) return { fetchedAt: _seed.fetchedAt };
  }
  if (section.seedMetaKey) {
    const raw = await readRedisKey(`seed-meta:${section.seedMetaKey}`);
    // Legacy seed-meta is `{ fetchedAt, recordCount, sourceVersion }` at top
    // level. It has no `_seed` wrapper so unwrapEnvelope returns it as data.
    const meta = unwrapEnvelope(raw).data;
    if (meta?.fetchedAt) return { fetchedAt: meta.fetchedAt };
  }
  return null;
}

// Stream child stdio line-by-line so hung sections surface progress instead of
// looking like a silent crash. Escalate SIGTERM → SIGKILL on timeout so child
// processes with in-flight HTTPS sockets can't outlive the deadline.
const KILL_GRACE_MS = 10_000;

function streamLines(stream, onLine) {
  let buf = '';
  stream.setEncoding('utf8');
  stream.on('data', (chunk) => {
    buf += chunk;
    let idx;
    while ((idx = buf.indexOf('\n')) !== -1) {
      const line = buf.slice(0, idx);
      buf = buf.slice(idx + 1);
      if (line) onLine(line);
    }
  });
  stream.on('end', () => { if (buf) onLine(buf); });
  // Child-stdio `error` is rare (SIGKILL emits `end`), but Node throws on an
  // unhandled `error` event. Log it instead of crashing the runner.
  stream.on('error', (err) => onLine(`<stdio error: ${err.message}>`));
}

function spawnSeed(scriptPath, { timeoutMs, label, bundleStartedAtMs }) {
  return new Promise((resolve) => {
    const t0 = Date.now();
    // Capture the child's structured `seed_complete` event if emitted, so
    // the parent can re-emit the key fields on a single bundle-level line.
    // Railway log ingestion drops child-stdout lines when many seeders log
    // at similar timestamps (observed across Storage-Facilities /
    // Energy-Disruptions / Pipelines-Gas in PR #3294 launch run: each
    // dropped a different subset of Run ID / Mode / seed_complete lines
    // despite identical code paths). Bundle-level lines survive reliably.
    let lastSeedComplete = null;
    // BUNDLE_RUN_STARTED_AT_MS lets consumer seeders detect when a cohort
    // peer's seed-meta predates the current bundle run and fall back to a
    // hard default instead of reading a stale peer key. See plan
    // 2026-04-24-003 §"Phase 2 — SWF seeder" bundle-freshness guard.
    const child = spawn(process.execPath, [scriptPath], {
      env: {
        ...process.env,
        BUNDLE_RUN_STARTED_AT_MS: String(bundleStartedAtMs ?? Date.now()),
      },
      stdio: ['ignore', 'pipe', 'pipe'],
    });

    streamLines(child.stdout, (line) => {
      console.log(` [${label}] ${line}`);
      const idx = line.indexOf('{"event":"seed_complete"');
      if (idx >= 0) {
        try {
          lastSeedComplete = JSON.parse(line.slice(idx));
        } catch { /* malformed JSON — keep previous */ }
      }
    });
    streamLines(child.stderr, (line) => console.warn(` [${label}] ${line}`));

    let settled = false;
    let timedOut = false;
    let killTimer = null;
    // Fire the terminal "Failed ... timeout" log the moment we decide to kill,
    // BEFORE the SIGTERM→SIGKILL grace window. This guarantees the reason
    // reaches the log stream even if the container itself is killed during
    // the grace period (Railway's ~10min cap can land inside the grace for
    // sections whose timeoutMs is close to 10min).
    const softKill = setTimeout(() => {
      timedOut = true;
      const elapsedAtTimeout = ((Date.now() - t0) / 1000).toFixed(1);
      console.error(` [${label}] Failed after ${elapsedAtTimeout}s: timeout after ${Math.round(timeoutMs / 1000)}s — sending SIGTERM`);
      child.kill('SIGTERM');
      killTimer = setTimeout(() => {
        console.warn(` [${label}] Did not exit on SIGTERM within ${KILL_GRACE_MS / 1000}s — sending SIGKILL`);
        child.kill('SIGKILL');
      }, KILL_GRACE_MS);
    }, timeoutMs);
    const settle = (value) => {
      if (settled) return;
      settled = true;
      clearTimeout(softKill);
      if (killTimer) clearTimeout(killTimer);
      resolve(value);
    };

    child.on('error', (err) => {
      const elapsed = ((Date.now() - t0) / 1000).toFixed(1);
      console.error(` [${label}] Failed after ${elapsed}s: spawn error: ${err.message}`);
      settle({ elapsed, ok: false, reason: `spawn error: ${err.message}`, alreadyLogged: true });
    });

    child.on('close', (code, signal) => {
      const elapsed = ((Date.now() - t0) / 1000).toFixed(1);
      if (timedOut) {
        // Terminal reason already logged by softKill — just record the outcome.
        settle({ elapsed, ok: false, reason: `timeout after ${Math.round(timeoutMs / 1000)}s (signal ${signal || 'SIGTERM'})`, alreadyLogged: true });
      } else if (code === 0) {
        settle({ elapsed, ok: true, seedComplete: lastSeedComplete });
      } else {
        settle({ elapsed, ok: false, reason: `exit ${code ?? 'null'}${signal ? ` (signal ${signal})` : ''}` });
      }
    });
  });
}

/**
 * @param {string} label - Bundle name for logging
 * @param {Array<{
 *   label: string,
 *   script: string,
 *   seedMetaKey?: string,   // legacy (pre-contract); reads `seed-meta:<key>`
 *   canonicalKey?: string,  // PR 2+: reads envelope from the canonical data key
 *   intervalMs: number,
 *   timeoutMs?: number,
 *   dependsOn?: string[],   // labels that MUST run earlier in the array
 * }>} sections
 * @param {{ maxBundleMs?: number }} [opts]
 */
export async function runBundle(label, sections, opts = {}) {
  // Topological-order assertion. A consumer seeder reading a peer's
  // Redis output in-bundle depends on the peer running first; if a
  // future edit (e.g. alphabetizing sections) reorders them, the
  // consumer reads last-bundle's stale output. The freshness-guard in
  // the consumer is a safety net; this assertion is the contract.
  // Throws on violation so misconfiguration surfaces before any cron
  // tick runs.
  const labelIndex = new Map(sections.map((s, i) => [s.label, i]));
  for (let i = 0; i < sections.length; i++) {
    const deps = sections[i].dependsOn;
    if (!Array.isArray(deps)) continue;
    for (const depLabel of deps) {
      const depIdx = labelIndex.get(depLabel);
      if (depIdx == null) {
        throw new Error(`[Bundle:${label}] section '${sections[i].label}' dependsOn unknown label '${depLabel}'`);
      }
      if (depIdx >= i) {
        throw new Error(`[Bundle:${label}] section '${sections[i].label}' dependsOn '${depLabel}' but '${depLabel}' is at index ${depIdx} (must be < ${i})`);
      }
    }
  }

  const t0 = Date.now();
  const maxBundleMs = opts.maxBundleMs ?? Infinity;
  const budgetLabel = Number.isFinite(maxBundleMs) ? `, budget ${Math.round(maxBundleMs / 1000)}s` : '';
  console.log(`[Bundle:${label}] Starting (${sections.length} sections${budgetLabel})`);

  let ran = 0, skipped = 0, deferred = 0, failed = 0;

  for (const section of sections) {
    const scriptPath = join(__dirname, section.script);
    const timeout = section.timeoutMs || 300_000;

    const freshness = await readSectionFreshness(section);
    if (freshness?.fetchedAt) {
      const elapsed = Date.now() - freshness.fetchedAt;
      if (elapsed < section.intervalMs * 0.8) {
        const agoMin = Math.round(elapsed / 60_000);
        const intervalMin = Math.round(section.intervalMs / 60_000);
        console.log(` [${section.label}] Skipped, last seeded ${agoMin}min ago (interval: ${intervalMin}min)`);
        skipped++;
        continue;
      }
    }

    const elapsedBundle = Date.now() - t0;
    // Worst-case runtime is timeoutMs + KILL_GRACE_MS (child may ignore SIGTERM
    // and need SIGKILL after grace). Admit only when the full worst-case fits.
    const worstCase = timeout + KILL_GRACE_MS;
    if (elapsedBundle + worstCase > maxBundleMs) {
      const remainingSec = Math.max(0, Math.round((maxBundleMs - elapsedBundle) / 1000));
      const needSec = Math.round(worstCase / 1000);
      console.log(` [${section.label}] Deferred, needs ${needSec}s (timeout+grace) but only ${remainingSec}s left in bundle budget`);
      deferred++;
      continue;
    }

    const result = await spawnSeed(scriptPath, { timeoutMs: timeout, label: section.label, bundleStartedAtMs: t0 });
    if (result.ok) {
      console.log(` [${section.label}] Done (${result.elapsed}s)`);
      // Bundle-level per-section summary — emitted from parent stdout so
      // Railway log ingestion captures it reliably even when child lines
      // drop. Observability tools should key off this line, not per-section
      // Run ID / Mode / seed_complete lines which are best-effort only.
      const sc = result.seedComplete;
      if (sc && typeof sc === 'object') {
        console.log(`[Bundle:${label}] section=${section.label} status=OK durationMs=${sc.durationMs ?? ''} records=${sc.recordCount ?? ''} state=${sc.state || 'OK'}`);
      } else {
        // Seeder didn't emit seed_complete (legacy non-contract seeders, or
        // the child's event line was dropped before parsing).
        console.log(`[Bundle:${label}] section=${section.label} status=OK elapsed=${result.elapsed}s`);
      }
      ran++;
    } else {
      if (!result.alreadyLogged) {
        console.error(` [${section.label}] Failed after ${result.elapsed}s: ${result.reason}`);
      }
      // Emit the FAILED summary to stderr (same stream as the Failed line
      // and SIGKILL escalation log) so chronological ordering in combined
      // output is preserved. If we went to stdout here, the line would
      // appear before those stderr lines when consumers concatenate
      // stdout+stderr, breaking tests (and log readers) that rely on
      // signal-escalation ordering.
      console.error(`[Bundle:${label}] section=${section.label} status=FAILED elapsed=${result.elapsed}s reason=${(result.reason || 'unknown').replace(/\s+/g, ' ')}`);
      failed++;
    }
  }

  const totalSec = ((Date.now() - t0) / 1000).toFixed(1);
  console.log(`[Bundle:${label}] Finished in ${totalSec}s, ran:${ran} skipped:${skipped} deferred:${deferred} failed:${failed}`);
  process.exit(failed > 0 ? 1 : 0);
}