fix(supply-chain): popup-keyed history re-query + dataAvailable flag (#3187)

* fix(supply-chain): popup-keyed history re-query + dataAvailable flag for partial coverage

Two P1 findings from the #3185 post-merge review:

1. MapPopup cross-chokepoint history contamination
   Popup's async history resolve re-queried [data-transit-chart] without a
   cpId key. User opens popup A → fetch starts for cpA; user opens popup B
   before it resolves → cpA's history mounts into cpB's chart container.
   Fix: add data-transit-chart-id keyed by cpId; re-query by it on resolve.
   Mirrors SupplyChainPanel's existing data-chart-cp-id pattern.
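
   The race and its keyed-resolve guard can be modeled without the DOM (a
   minimal sketch with hypothetical names; the real fix re-queries
   `[data-transit-chart-id="${cpId}"]` on resolve):

```typescript
// Model of the popup race. `ChartSlot` stands in for the popup's chart
// container; names here are illustrative, not the MapPopup API.
interface ChartSlot { cpId: string; content: string }

let live: ChartSlot | null = null; // currently open popup's chart container

function openPopup(cpId: string): void {
  live = { cpId, content: 'loading' }; // kicks off an async history fetch
}

function onHistoryResolve(cpId: string, history: number[]): void {
  // Keyed re-query: mount only if the live container still belongs to the
  // chokepoint this fetch was started for.
  if (!live || live.cpId !== cpId) return; // user moved on; drop silently
  live.content = `chart(${history.length})`;
}

openPopup('cpA');                   // fetch for cpA starts
openPopup('cpB');                   // user switches before cpA resolves
onHistoryResolve('cpA', [1, 2, 3]); // dropped: no cross-contamination
onHistoryResolve('cpB', [4, 5]);    // mounts into cpB's container
```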

2. Partial portwatch coverage still looked healthy
   The previous fix emits all 13 canonical summaries (zero-state fill for
   missing IDs) and records pwCovered in seed-meta, but:
   - get-chokepoint-status still zero-filled missing chokepoints and cached
     the response as healthy — panel rendered silent empty rows.
   - api/health.js only degrades on recordCount=0, so 10/13 partial read
     as OK despite the UI hiding entire chokepoints.
   Fix:
   - proto: TransitSummary.data_available (field 12). Writer tags with
     Boolean(cpData). Status RPC passes through; defaults true for pre-fix
     payloads (absence = covered).
   - Status RPC writes seed-meta recordCount as covered count (not shape
     size), and flips response-level upstreamUnavailable on partial.
   - api/health.js: new minRecordCount field on SEED_META entries + new
     COVERAGE_PARTIAL status (warn rollup). chokepoints entry declares
     minRecordCount: 13. recordCount < 13 → COVERAGE_PARTIAL.
   - Client (panel + popup): skip stats/chart rendering when
     !dataAvailable; show "Transit data unavailable (upstream partial)"
     microcopy so users understand the gap.
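
   The back-compat default and the covered-count rule reduce to a small,
   testable helper (a sketch; `isCovered`/`coveredCount` are illustrative
   names, not code from this PR):

```typescript
// Pre-fix writers only emitted summaries they actually had data for, so a
// missing flag means "covered"; only an explicit false marks zero-state fill.
interface TransitSummaryLike {
  todayTotal: number;
  dataAvailable?: boolean; // absent on pre-fix payloads
}

function isCovered(ts: TransitSummaryLike | undefined): boolean {
  if (!ts) return false;           // no summary at all
  return ts.dataAvailable ?? true; // absence = covered
}

// This covered count is what feeds seed-meta recordCount; a value below
// minRecordCount (13) then trips COVERAGE_PARTIAL in api/health.js.
function coveredCount(summaries: Array<TransitSummaryLike | undefined>): number {
  return summaries.filter(isCovered).length;
}
```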

5759/5759 data tests pass. Typecheck + typecheck:api clean.

* fix(supply-chain): guarantee Simulate Closure button exits Computing state

User reports "Simulate Closure does nothing beyond write Computing…" — the
button sticks at Computing forever. Two causes:

1. The scenario worker appears down: zero scenario-result:* keys in Redis
   despite their 24h TTL (i.e., none written in the last day). Railway-side —
   separate intervention needed to redeploy scripts/scenario-worker.mjs.

2. Client leaked the "Computing…" state on multiple exit paths:
   - signal.aborted early-return inside the poll loop never reset the
     button. Second click fired abort on first → first returned without
     resetting → button stayed "Computing…" until next render.
   - !this.content.isConnected early-return also skipped reset (less
     user-visible but same class of bug).
   - catch block swallowed AbortError without resetting.
   - POST /run had no hard timeout — a hanging edge function left the
     button in Computing indefinitely.

Fix:
- resetButton(text) helper touches the btn only if still connected;
  applied in every exit path (abort, timeout, post-success, catch).
- AbortSignal.any([caller, AbortSignal.timeout(20_000)]) on POST /run.
- console.error on failure so Simulate Closure errors surface in ops.
- Error message includes "scenario worker may be down" on loop timeout
  so operators see the right suspect.
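
The hard-timeout composition is just `AbortSignal.any` over the caller signal
and a deadline (sketch, assuming Node 20+/modern-browser support for
`AbortSignal.any`; the wrapper name is hypothetical):

```typescript
// Whichever fires first, caller abort or the 20s deadline, aborts the POST.
function buildRunSignal(caller: AbortSignal, timeoutMs = 20_000): AbortSignal {
  return AbortSignal.any([caller, AbortSignal.timeout(timeoutMs)]);
}

// Usage sketch:
//   const runSignal = buildRunSignal(controller.signal);
//   await fetch('/api/scenario/v1/run', { method: 'POST', signal: runSignal });
```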

Backend observations (for follow-up):
- Hormuz backend is healthy (/api/health chokepoints OK, 13 records,
  1 min old; live RPC has hormuz_strait.riskLevel=critical, wow=-22,
  flowEstimate present; GetChokepointHistory returns 174 entries).
  User-reported "Hormuz empty" is likely browser/CDN stale cache from
  before PR #3185; hard refresh should resolve.
- scenario-worker.mjs has zero result keys in 24h. Railway service
  needs verification/redeployment.

* fix(scenario): wrong Upstash RPUSH format silently broke every Simulate Closure

Railway scenario-worker log shows every job failing field validation since
at least 03:06Z today:

  [scenario-worker] Job failed field validation, discarding:
    ["{\"jobId\":\"scenario:1776535792087:cynxx5v4\",...

The leading [" in the payload is the smoking gun. api/scenario/v1/run.ts
was POSTing to /rpush/{key} with body `[payload]`, expecting Upstash to
unpack the array and push one string value. Upstash does NOT parse that
form — it stored the literal `["{...}"]` string as a single list value.

Worker BLMOVEs the literal string → JSON.parse → array → destructure
`{jobId, scenarioId, iso2}` on an array returns undefined for all three
→ every job discarded without writing a result. Client poll returns
`pending` for the full 60s timeout, then (on the prior client code path)
leaked the stuck "Computing…" button state indefinitely.
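
The discard path is reproducible in isolation (payload abbreviated):

```typescript
// What Upstash actually stored: the literal array-string, not the payload.
const stored = '["{\\"jobId\\":\\"scenario:1:x\\",\\"scenarioId\\":\\"s\\",\\"iso2\\":\\"IR\\"}"]';

const parsed = JSON.parse(stored);          // an array of one string
const { jobId, scenarioId, iso2 } = parsed; // object destructure on an array:
// jobId, scenarioId, and iso2 are all undefined, so validation discards the job.
```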

Fix: use the standard Upstash REST command format — POST to the base URL
with body `["RPUSH", key, value]`. Matches scripts/ais-relay.cjs upstashLpush.

After this, the scenario-queue:pending list stores the raw payload string,
BLMOVE returns the payload, JSON.parse gives the object, validation passes,
computeScenario runs, result key gets written, client poll sees `done`.
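
Side by side, the two request shapes (helper names and base URL are
illustrative):

```typescript
// Wrong (pre-fix): path-style endpoint + JSON-array body. Upstash does not
// unpack the array, so the list value becomes the literal `["{...}"]` string.
function wrongRequest(baseUrl: string, payload: string) {
  return {
    url: `${baseUrl}/rpush/scenario-queue%3Apending`,
    body: JSON.stringify([payload]),
  };
}

// Right: POST the base URL with the whole command as the body.
function rightRequest(baseUrl: string, payload: string) {
  return {
    url: baseUrl,
    body: JSON.stringify(['RPUSH', 'scenario-queue:pending', payload]),
  };
}

const payload = JSON.stringify({ jobId: 'scenario:1:x', scenarioId: 's', iso2: 'IR' });
const right = rightRequest('https://example.upstash.io', payload);
// right.body round-trips to ['RPUSH', 'scenario-queue:pending', payload]
```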

Zero result keys existed in prod Redis in the last 24h (24h TTL on
scenario-result:*) — confirms the fix addresses the production outage.
Author:  Elie Habib
Date:    2026-04-18 23:38:33 +04:00 (committed by GitHub)
Parent:  d37ffb375e
Commit:  96fca1dc2b
12 changed files with 116 additions and 29 deletions


@@ -245,7 +245,7 @@ const SEED_META = {
 imfLabor: { key: 'seed-meta:economic:imf-labor', maxStaleMin: 100800 }, // monthly seed; 70d threshold matches imfMacro
 imfExternal: { key: 'seed-meta:economic:imf-external', maxStaleMin: 100800 }, // monthly seed; 70d threshold matches imfMacro
 shippingRates: { key: 'seed-meta:supply_chain:shipping', maxStaleMin: 420 },
-chokepoints: { key: 'seed-meta:supply_chain:chokepoints', maxStaleMin: 60 },
+chokepoints: { key: 'seed-meta:supply_chain:chokepoints', maxStaleMin: 60, minRecordCount: 13 }, // 13 canonical chokepoints; get-chokepoint-status writes covered-count → < 13 = upstream partial (portwatch/ArcGIS dropped some)
 // minerals + giving: on-demand cachedFetchJson only, no seed-meta writer — freshness checked via TTL
 // bisExchange + bisCredit: extras written by same BIS script via writeExtraKey, no dedicated seed-meta
 fxYoy: { key: 'seed-meta:economic:fx-yoy', maxStaleMin: 1500 }, // daily cron; 25h tolerance + 1h drift
@@ -554,11 +554,18 @@ function classifyKey(name, redisKey, opts, ctx) {
 else if (isOnDemand) status = 'EMPTY_ON_DEMAND';
 else status = 'EMPTY_DATA';
 } else if (seedStale === true) status = 'STALE_SEED';
+// Coverage threshold: producers that know their canonical shape size can
+// declare minRecordCount. When the writer reports a count below threshold
+// (e.g., 10/13 chokepoints because portwatch dropped some), this degrades
+// to COVERAGE_PARTIAL (warn) instead of reporting OK. Producer must write
+// seed-meta.recordCount using the *covered* count, not the shape size.
+else if (seedCfg?.minRecordCount != null && records < seedCfg.minRecordCount) status = 'COVERAGE_PARTIAL';
 else status = 'OK';
 const entry = { status, records };
 if (seedAge !== null) entry.seedAgeMin = seedAge;
 if (seedCfg) entry.maxStaleMin = seedCfg.maxStaleMin;
+if (seedCfg?.minRecordCount != null) entry.minRecordCount = seedCfg.minRecordCount;
 return entry;
 }
@@ -569,6 +576,7 @@ const STATUS_COUNTS = {
 SEED_ERROR: 'warn',
 EMPTY_ON_DEMAND: 'warn',
 REDIS_PARTIAL: 'warn',
+COVERAGE_PARTIAL: 'warn',
 EMPTY: 'crit',
 EMPTY_DATA: 'crit',
 };


@@ -128,13 +128,20 @@ export default async function handler(req: Request): Promise<Response> {
 enqueuedAt: Date.now(),
 });
-const redisResp = await fetch(`${url}/rpush/scenario-queue%3Apending`, {
+// Upstash REST command format: POST base URL with body `[CMD, ...args]`.
+// The previous `/rpush/{key}` + `body: [payload]` form caused Upstash to
+// store the literal array-string (`["{jobId:...}"]`) as the list value,
+// which broke the scenario-worker's JSON.parse → destructure flow and
+// made every job fail field validation (Railway log 2026-04-18: repeated
+// `[scenario-worker] Job failed field validation, discarding: ["{...`).
+// This command format matches `upstashLpush` in scripts/ais-relay.cjs.
+const redisResp = await fetch(url, {
 method: 'POST',
 headers: {
 Authorization: `Bearer ${token}`,
 'Content-Type': 'application/json',
 },
-body: JSON.stringify([payload]),
+body: JSON.stringify(['RPUSH', 'scenario-queue:pending', payload]),
 signal: AbortSignal.timeout(5_000),
 });

File diff suppressed because one or more lines are too long


@@ -594,6 +594,13 @@ components:
       type: string
     riskReportAction:
       type: string
+    dataAvailable:
+      type: boolean
+      description: |-
+        False when the upstream portwatch/relay source did not return data for
+        this chokepoint in the current cycle — the summary fields are zero-state
+        fill, not a genuine "zero traffic" reading. Client should render a
+        "transit data unavailable" indicator and skip stat/chart rendering.
 TransitDayCount:
   type: object
   properties:


@@ -97,6 +97,11 @@ message TransitSummary {
 double disruption_pct = 9;
 string risk_summary = 10;
 string risk_report_action = 11;
+// False when the upstream portwatch/relay source did not return data for
+// this chokepoint in the current cycle — the summary fields are zero-state
+// fill, not a genuine "zero traffic" reading. Client should render a
+// "transit data unavailable" indicator and skip stat/chart rendering.
+bool data_available = 12;
 }
 message MineralProducer {


@@ -7469,6 +7469,10 @@ async function seedTransitSummaries() {
 // Compact summary: no history field. Consumed by get-chokepoint-status on
 // every request, so keep it small.
+// dataAvailable distinguishes genuine zero-traffic (cpData present, 0
+// crossings) from zero-state fill (upstream missing this cycle). False
+// here makes the RPC response explicit and lets the client render a
+// "data unavailable" indicator instead of silently-empty stat rows.
 summaries[cpId] = {
 todayTotal: relayTransit?.total ?? 0,
 todayTanker: relayTransit?.tanker ?? 0,
@@ -7481,6 +7485,7 @@
 riskSummary: cr?.riskSummary ?? '',
 riskReportAction: cr?.riskReportAction ?? '',
 anomaly,
+dataAvailable: Boolean(cpData),
 };
 // Per-id history key — only fetched on card expand via GetChokepointHistory.


@@ -84,6 +84,10 @@ interface PreBuiltTransitSummary {
 riskSummary: string;
 riskReportAction: string;
 anomaly: { dropPct: number; signal: boolean };
+// Optional for back-compat: writers prior to the partial-coverage fix
+// emitted no flag. Missing = treat as available (pre-fix writers only
+// emitted summaries they had data for).
+dataAvailable?: boolean;
 }
 interface TransitSummariesPayload {
@@ -339,7 +343,10 @@ async function fetchChokepointData(): Promise<ChokepointFetchResult> {
 disruptionPct: ts.disruptionPct,
 riskSummary: ts.riskSummary,
 riskReportAction: ts.riskReportAction,
-} : { todayTotal: 0, todayTanker: 0, todayCargo: 0, todayOther: 0, wowChangePct: 0, history: [], riskLevel: '', incidentCount7d: 0, disruptionPct: 0, riskSummary: '', riskReportAction: '' },
+// Default true for pre-fix writers (absence = covered). New writers
+// explicitly emit false for canonical zero-state fills.
+dataAvailable: ts.dataAvailable ?? true,
+} : { todayTotal: 0, todayTanker: 0, todayCargo: 0, todayOther: 0, wowChangePct: 0, history: [], riskLevel: '', incidentCount7d: 0, disruptionPct: 0, riskSummary: '', riskReportAction: '', dataAvailable: false },
 flowEstimate: flowsData?.[cp.id] ? {
 currentMbd: flowsData[cp.id]!.currentMbd,
 baselineMbd: flowsData[cp.id]!.baselineMbd,
@@ -367,8 +374,22 @@ export async function getChokepointStatus(
 async () => {
 const { chokepoints, upstreamUnavailable } = await fetchChokepointData();
 if (upstreamUnavailable) return null;
-const response = { chokepoints, fetchedAt: new Date().toISOString(), upstreamUnavailable };
-setCachedJson('seed-meta:supply_chain:chokepoints', { fetchedAt: Date.now(), recordCount: chokepoints.length }, 604800).catch(() => {});
+// recordCount reflects the count of chokepoints with REAL upstream data
+// (not the canonical shape size — always 13). Lets api/health.js
+// distinguish 13/13 healthy from partial (e.g., 10/13) via the
+// minRecordCount threshold. Before this, a partial portwatch failure
+// showed as OK despite the UI rendering 3 zero-state rows.
+const coveredCount = chokepoints.filter((c) => c.transitSummary?.dataAvailable !== false).length;
+// Response-level signal: if any canonical chokepoint lost upstream,
+// flip upstreamUnavailable so clients can show a partial-coverage
+// banner without breaking the cached response (data still useful).
+const partialCoverage = coveredCount < chokepoints.length;
+const response = {
+chokepoints,
+fetchedAt: new Date().toISOString(),
+upstreamUnavailable: upstreamUnavailable || partialCoverage,
+};
+setCachedJson('seed-meta:supply_chain:chokepoints', { fetchedAt: Date.now(), recordCount: coveredCount }, 604800).catch(() => {});
 return response;
 },
 );


@@ -296,7 +296,11 @@ export class MapPopup {
 MapPopup.historyInflight.add(cpId);
 void fetchChokepointHistory(cpId).then(resp => {
 MapPopup.historyInflight.delete(cpId);
-const liveEl = this.popup?.querySelector<HTMLElement>('[data-transit-chart]');
+// Re-query keyed by cpId — if the user opened a different popup
+// since this fetch started, the live [data-transit-chart] element
+// belongs to the NEW chokepoint. Matching by id prevents mounting
+// A's history into B's chart container.
+const liveEl = this.popup?.querySelector<HTMLElement>(`[data-transit-chart-id="${cpId}"]`);
 if (!liveEl) return;
 if (resp.history.length) {
 MapPopup.historyCache.set(cpId, resp.history);
@@ -308,7 +312,7 @@
 }
 }).catch(() => {
 MapPopup.historyInflight.delete(cpId);
-const liveEl = this.popup?.querySelector<HTMLElement>('[data-transit-chart]');
+const liveEl = this.popup?.querySelector<HTMLElement>(`[data-transit-chart-id="${cpId}"]`);
 if (liveEl) liveEl.textContent = t('components.supplyChain.historyUnavailable') || 'History unavailable';
 });
 }
@@ -1332,10 +1336,11 @@ export class MapPopup {
 const cp = this.chokepointData?.chokepoints?.find(
 c => c.id === waterway.chokepointId,
 );
-// Chart is now lazy-loaded via GetChokepointHistory on popup mount. Always
-// render the section for any known chokepoint; the initial placeholder
-// swaps to a chart (PRO) or "History unavailable" as the fetch resolves.
-const hasChart = !!cp;
+// Chart is lazy-loaded via GetChokepointHistory on popup mount. Render the
+// section only when the chokepoint is known AND upstream flagged data
+// available — a zero-state fill (partial portwatch) means the per-id
+// history key is also empty, so there's nothing to fetch.
+const hasChart = !!cp && cp.transitSummary?.dataAvailable !== false;
 const isPro = hasPremiumAccess(getAuthState());
 const sectors = CHOKEPOINT_HS2_SECTORS[waterway.chokepointId];
@@ -1349,7 +1354,7 @@
 let chartSection = '';
 if (hasChart) {
 if (isPro) {
-chartSection = `<div data-transit-chart="${escapeHtml(waterway.name)}" style="margin-top:10px;min-height:200px;display:flex;align-items:center;justify-content:center;color:var(--text-dim,#888);font-size:12px">${t('components.supplyChain.loadingHistory') || 'Loading transit history\u2026'}</div>`;
+chartSection = `<div data-transit-chart="${escapeHtml(waterway.name)}" data-transit-chart-id="${escapeHtml(cp?.id ?? '')}" style="margin-top:10px;min-height:200px;display:flex;align-items:center;justify-content:center;color:var(--text-dim,#888);font-size:12px">${t('components.supplyChain.loadingHistory') || 'Loading transit history\u2026'}</div>`;
 } else {
 chartSection = `
 <div class="sector-pro-gate" data-gate="chokepoint-transit-chart" style="position:relative;overflow:hidden;border-radius:6px;margin-top:10px;min-height:120px;background:var(--surface-elevated, #111)">


@@ -342,11 +342,10 @@ export class SupplyChainPanel extends Panel {
 const actionRow = expanded && ts?.riskReportAction
 ? `<div class="sc-routing-advisory">${escapeHtml(ts.riskReportAction)}</div>`
 : '';
-// Always render the chart placeholder when expanded — history is now
-// lazy-loaded via GetChokepointHistory RPC (see mountTransitChart below).
-// The placeholder shows a loading hint that's swapped to a chart once
-// history resolves, or to a graceful "unavailable" message on empty.
-const chartPlaceholder = expanded
+// Render the chart placeholder only when expanded AND upstream reported
+// data available for this chokepoint. If dataAvailable === false, the
+// per-id history key would also be zero (we skip the lazy-fetch).
+const chartPlaceholder = expanded && ts?.dataAvailable !== false
 ? `<div data-chart-cp="${escapeHtml(cp.name)}" data-chart-cp-id="${escapeHtml(cp.id)}" style="margin-top:8px;min-height:200px;display:flex;align-items:center;justify-content:center;color:var(--text-dim,#888);font-size:12px">${t('components.supplyChain.loadingHistory') || 'Loading transit history\u2026'}</div>`
 : '';
@@ -397,7 +396,8 @@
 <span>${cp.activeWarnings} ${t('components.supplyChain.warnings')} · ${aisDisruptions} ${t('components.supplyChain.aisDisruptions')}</span>
 ${cp.directions?.length ? `<span>${cp.directions.map(d => escapeHtml(d)).join('/')}</span>` : ''}
 </div>
-${ts && (ts.todayTotal > 0 || hasWow || disruptPct > 0) ? `<div class="sc-metric-row">
+${ts && ts.dataAvailable === false ? `<div class="sc-metric-row" style="opacity:0.5;font-size:11px"><span>${t('components.supplyChain.transitDataUnavailable') || 'Transit data unavailable (upstream partial)'}</span></div>` : ''}
+${ts && ts.dataAvailable !== false && (ts.todayTotal > 0 || hasWow || disruptPct > 0) ? `<div class="sc-metric-row">
 ${ts.todayTotal > 0 ? `<span>${ts.todayTotal} ${t('components.supplyChain.vessels')}</span>` : ''}
 ${hasWow ? `<span>${t('components.supplyChain.wowChange')}: ${wowSpan}</span>` : ''}
 ${disruptPct > 0 ? `<span>${t('components.supplyChain.disruption')}: <span class="${disruptClass}">${disruptPct.toFixed(1)}%</span></span>` : ''}
@@ -719,18 +719,38 @@
 const scenarioId = trigger.dataset.scenarioId!;
 btn.disabled = true;
 btn.textContent = 'Computing\u2026';
+// Guarantee the button never stays stuck at "Computing…" regardless of
+// exit path. Prior logic early-returned on `signal.aborted` and
+// `!this.content.isConnected` without ever re-enabling the button, and
+// swallowed AbortError in the catch block. When the scenario-worker is
+// down (no result key written in 24h), the polling loop DID fire a
+// timeout but the abort paths above it leaked the stuck state.
+const resetButton = (text: string) => {
+// Only touch the button if it's still the same element in the DOM.
+// A re-render may have replaced it — updating the detached node is
+// invisible and harmless, but we skip to avoid confusion.
+if (btn.isConnected) {
+btn.textContent = text;
+btn.disabled = false;
+}
+};
 try {
+// Hard timeout on POST /run so a hanging edge function can't leave
+// the button in "Computing…" indefinitely.
+const runSignal = AbortSignal.any([signal, AbortSignal.timeout(20_000)]);
 const runResp = await premiumFetch('/api/scenario/v1/run', {
 method: 'POST',
 headers: { 'Content-Type': 'application/json' },
 body: JSON.stringify({ scenarioId }),
-signal,
+signal: runSignal,
 });
-if (!runResp.ok) throw new Error('Run failed');
+if (!runResp.ok) throw new Error(`Run failed: ${runResp.status}`);
 const { jobId } = await runResp.json() as { jobId: string };
 let result: ScenarioResult | null = null;
 for (let i = 0; i < 30; i++) {
-if (signal.aborted || !this.content.isConnected) return;
+if (signal.aborted) { resetButton('Simulate Closure'); return; }
+if (!this.content.isConnected) return; // panel gone — nothing to update
 if (i > 0) await new Promise(r => setTimeout(r, 2000));
 const statusResp = await premiumFetch(`/api/scenario/v1/status?jobId=${encodeURIComponent(jobId)}`, { signal });
 if (!statusResp.ok) throw new Error(`Status poll failed: ${statusResp.status}`);
@@ -743,14 +763,20 @@
 }
 if (status.status === 'failed') throw new Error('Scenario failed');
 }
-if (!result) throw new Error('Timeout');
-if (signal.aborted || !this.content.isConnected) return;
+if (!result) throw new Error('Timeout — scenario worker may be down');
+if (signal.aborted) { resetButton('Simulate Closure'); return; }
+if (!this.content.isConnected) return;
 this.onScenarioActivate?.(scenarioId, result);
-btn.textContent = 'Active';
+resetButton('Active');
 btn.disabled = true; // active state stays disabled until user dismisses
 } catch (err) {
-if (err instanceof Error && err.name === 'AbortError') return;
-btn.textContent = 'Error \u2014 retry';
-btn.disabled = false;
+// Abort from a new click = user-triggered retry, no error banner needed.
+if (err instanceof Error && err.name === 'AbortError') {
+resetButton('Simulate Closure');
+return;
+}
+console.error('[scenario] run failed:', err);
+resetButton('Error \u2014 retry');
 }
 }
 }


@@ -73,6 +73,7 @@ export interface TransitSummary {
 disruptionPct: number;
 riskSummary: string;
 riskReportAction: string;
+dataAvailable: boolean;
 }
 export interface TransitDayCount {


@@ -73,6 +73,7 @@ export interface TransitSummary {
 disruptionPct: number;
 riskSummary: string;
 riskReportAction: string;
+dataAvailable: boolean;
 }
 export interface TransitDayCount {


@@ -949,6 +949,7 @@
 "loadingCorridors": "Loading corridor data...",
 "loadingHistory": "Loading transit history…",
 "historyUnavailable": "Transit history unavailable",
+"transitDataUnavailable": "Transit data unavailable (upstream partial)",
 "mineral": "Mineral",
 "topProducers": "Top Producers",
 "risk": "Risk",