* fix(forecasts): unwrap seed-contract envelope in canonical-key sim patcher
Production bug observed 2026-04-23 across both forecast worker services
(seed-forecasts-simulation + seed-forecasts-deep): every successful run
logs `[SimulationDecorations] Cannot patch canonical key — predictions
missing or not an array` and silently fails to write simulation
adjustments back to forecast:predictions:v2.
Root cause: PR #3097 (seed-contract envelope dual-write) wraps canonical
seed writes in `{_seed: {...}, data: {predictions: [...]}}` via runSeed.
The Lua patcher (_SIM_PATCH_LUA) and its JS test-path mirror both read
`payload.predictions` directly with no envelope unwrap, so they always
return 'MISSING' against the new shape, matching the documented pattern
in the project's worldmonitor-seed-envelope-consumer-drift learning
(91 producers enveloped, private-helper consumers not migrated).
User-visible impact: ForecastPanel renders simulation-adjusted scores
only when a fast-path seed has touched a forecast since the bug landed;
deep-forecast and simulation re-scores never reach the canonical feed.
Fix:
- _SIM_PATCH_LUA detects envelope shape (`type(payload._seed) == 'table'
and type(payload.data) == 'table'`), reads `inner.predictions`, and
re-encodes preserving the wrapper so envelope shape persists across
patches. Legacy bare values still pass through unchanged.
- JS test path mirrors the same unwrap/rewrap.
- New test WD-20b locks the regression: enveloped store fixture, asserts
`_seed` wrapper preserved on write + inner predictions patched.
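The unwrap/rewrap can be sketched as a JS mirror (the function name and return shape here are hypothetical; the real Lua patcher operates on the JSON-encoded Redis value):

```javascript
// Hypothetical mirror of the Lua patcher's envelope handling.
function patchCanonicalPredictions(payload, patchFn) {
  const isTable = (v) => typeof v === 'object' && v !== null;
  // Seed-contract dual-write envelope: {_seed: {...}, data: {predictions: [...]}}
  const enveloped = isTable(payload?._seed) && isTable(payload?.data);
  const inner = enveloped ? payload.data : payload;
  if (!Array.isArray(inner?.predictions)) return { status: 'MISSING' };
  inner.predictions = inner.predictions.map(patchFn);
  // Re-encode the outer object so the _seed wrapper survives the patch;
  // legacy bare values pass through unchanged.
  return { status: 'OK', payload };
}
```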
Also resolves the per-run `[seed-contract] forecast:predictions missing
fields: sourceVersion — required in PR 3` warning by passing
`sourceVersion: 'detectors+llm-pipeline'` to runSeed (PR 3 of the
seed-contract migration will start enforcing this; cheap to fix now).
Verified: typecheck (both tsconfigs) clean; lint 0 errors; test:data
6631/6631 green (forecast suite 309/309 incl new WD-20b); edge-functions
176/176 green; markdown + version-check clean.
* fix(forecasts): tighten JS envelope guard to match Lua's strict table check
PR #3348 review (P2):
JS test path used `!!published._seed` (any truthy value) while the Lua
script requires `type(payload._seed) == 'table'` (strict object check).
Asymmetry: a fixture with `_seed: true`, `_seed: 1`, or `_seed: 'string'`
would be treated as enveloped by JS and bare by Lua — meaning the JS
test mirror could silently miss real Lua regressions that bisect on
fixture shape, defeating the purpose of having a parity test path.
Tighten JS to require both `_seed` and `data` be plain objects (rejecting
truthy non-objects + arrays), matching Lua's `type() == 'table'` semantics
exactly.
New test WD-20c locks the parity: fixture with non-table `_seed` (string)
+ bare-shape `predictions` → must succeed via bare path, identical to
what Lua would do.
Verified: 6632/6632 tests pass; new WD-20c green.
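A minimal sketch of the tightened guard (helper names are illustrative):

```javascript
// Strict plain-object check: rejects truthy non-objects and arrays,
// mirroring the Lua script's type(payload._seed) == 'table' object check.
const isPlainObject = (v) =>
  typeof v === 'object' && v !== null && !Array.isArray(v);

const isEnveloped = (payload) =>
  isPlainObject(payload?._seed) && isPlainObject(payload?.data);
```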
* feat(simulation): add keyActorRoles field to fix actor overlap bonus vocabulary mismatch
The +0.04 actor overlap bonus never reliably fired in production because
stateSummary.actors uses role-category strings ('Commodity traders',
'Policy officials') while simulation keyActors uses named geo-political
entities ('Iran', 'Houthi'). An audit of 53 production runs showed the
bonus fired exactly once.
Fix: add keyActorRoles?: string[] to SimulationTopPath. The Round 2 prompt
now includes a CANDIDATE ACTOR ROLES section with theater-local role vocab
seeded from candidatePacket.stateSummary.actors. The LLM copies matching
roles into keyActorRoles. applySimulationMerge scores overlap against
keyActorRoles when actorSource=stateSummary, preserving the existing
keyActors entity-overlap path for the affectedAssets fallback.
- buildSimulationPackageFromDeepSnapshot: add actorRoles[] to each theater
from candidate.stateSummary.actors (theater-scoped, no cross-theater noise)
- buildSimulationRound2SystemPrompt: inject CANDIDATE ACTOR ROLES section
with exact-copy instruction and keyActorRoles in JSON template
- tryParseSimulationRoundPayload: extract keyActorRoles from round 2 output
- mergedPaths.map(): filter keyActorRoles against theater.actorRoles guardrail
- computeSimulationAdjustment: dual-path overlap — roleOverlapCount for
stateSummary, keyActorsOverlapCount for affectedAssets (backwards compat)
- summarizeImpactPathScore: project roleOverlapCount + keyActorsOverlapCount
into path-scorecards.json simDetail
New fields: roleOverlapCount, keyActorsOverlapCount in SimulationAdjustmentDetail
and ScorecardSimDetail. actorOverlapCount preserved as backwards-compat alias.
Tests: 308 pass (was 301 before). New tests T-P1/T-P2/T-P3 (prompt/parser),
T-RO1/T-RO2/T-RO3 (role overlap logic), T-PKG1 (pkg builder actorRoles),
plus fixture updates for T2/T-F/T-G/T-J/T-K/T-N2/T-SC-4.
🤖 Generated with Claude Sonnet 4.6 via Claude Code (https://claude.ai/claude-code) + Compound Engineering v2.49.0
Co-Authored-By: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
* fix(simulation): address CE review findings from PR #2582
- Add SimulationPackageTheater interface to seed-forecasts.types.d.ts
(actorRoles was untyped under @ts-check)
- Add keyActorRoles to uiTheaters Redis projection in writeSimulationOutcome
(field was stripped from Redis snapshot; only visible in R2 artifact)
- Extract keyActorRoles IIFE to named sanitizeKeyActorRoles() function;
hoist allowedRoles Set computation out of per-path loop
- Harden bonusOverlap ternary: explicit branch for actorSource='none'
prevents silent fallthrough if new actorSource values are added
- Eliminate roleOverlap intermediate array in computeSimulationAdjustment
- Add U+2028/U+2029 Unicode line-separator stripping to sanitizeForPrompt
- Apply sanitizeForPrompt at tryParseSimulationRoundPayload parse boundary;
add JSDoc to newly-exported function
All 308 tests pass, typecheck + typecheck:api clean.
* fix(sim): restore const sanitized in sanitizeKeyActorRoles after early-return guard
A prior edit added `if (!allowedRoles.length) return []` but accidentally removed
the `const sanitized = ...` line, leaving the filter on the line below referencing
an undefined variable. Restores the full function body:

    if (!allowedRoles.length) return [];
    const sanitized = (Array.isArray(rawRoles) ? rawRoles : [])
      .map((s) => sanitizeForPrompt(String(s)).slice(0, 80));
    const allowedNorm = new Set(allowedRoles.map(normalizeActorName));
    return sanitized.filter((s) => allowedNorm.has(normalizeActorName(s))).slice(0, 8);
308/308 tests pass.
---------
Each path-scorecards.json entry now includes a simDetail object containing
the full SimulationAdjustmentDetail fields: bucketChannelMatch, actorOverlapCount,
candidateActorCount, actorSource, resolvedChannel, channelSource, invalidatorHit,
stabilizerHit.
Previously, details from computeSimulationAdjustment were only stored in the
in-memory adjustments[] array and never reached the R2 artifact. This meant
production runs could not be retroactively audited to confirm whether the +0.04
actor-overlap bonus fired, which channel matched, or what actor source was used.
Changes:
- applySimulationMerge: attach details to path.simulationAdjustmentDetail
- summarizeImpactPathScore: project simDetail from simulationAdjustmentDetail
- seed-forecasts.types.d.ts: add simulationAdjustmentDetail to ExpandedPath, add ScorecardSimDetail interface
- Tests T-SC-1 through T-SC-4: 301/301 passing
* feat(forecast): plumb simulation decorations from ExpandedPath to Forecast Redis key
- Add `simulation_adjustment`, `sim_path_confidence`, `demoted_by_simulation` fields
to `Forecast` proto (fields 20-22) and regenerate service client
- `writeSimulationDecorations`: after simulation rescore, builds candidateStateId →
forecastIds map from snapshot.fullRunStateUnits, picks strongest signal per candidate,
writes `forecast:sim-decorations:v1` to Redis (3-day TTL)
- `applySimulationDecorationsToForecasts`: reads decorations at start of fast-path seed,
mutates predictions in-place (stale-by-one-run, non-fatal on failure)
- `buildPublishedForecastPayload`: passes sim fields through to published Redis payload
- Add `__setRedisStoreForTests` / test-store bypass in `getRedisCredentials` for unit tests
- 12 new WD-* tests covering early-exits, strongest-signal selection, candidate conflict
resolution, demotion flag, non-fatal error handling, and full round-trip
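The strongest-signal selection could look like this sketch (comparison by absolute adjustment magnitude is an assumption; field names are illustrative):

```javascript
// Hypothetical: keep one decoration per candidate, preferring the
// largest-magnitude adjustment (assumed tie-break strategy).
function pickStrongestPerCandidate(adjustments) {
  const byCandidate = new Map();
  for (const adj of adjustments) {
    const prev = byCandidate.get(adj.candidateStateId);
    if (!prev || Math.abs(adj.simulationAdjustment) > Math.abs(prev.simulationAdjustment)) {
      byCandidate.set(adj.candidateStateId, adj);
    }
  }
  return byCandidate;
}
```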
* fix(forecast): always overwrite simulation decorations to prevent 3-day staleness
Without this fix, writeSimulationDecorations() returned early on empty adjustments
or zero decorationCount, leaving the prior run's forecast:sim-decorations:v1 intact
for the full 3-day TTL. Subsequent fast-path seeds would blindly reapply old flags.
Fixes:
- Split the `!adjustments?.length` guard: only skip on missing simulationEvidence
(bogus data); write an empty byForecastId map when adjustments is [] so later
runs always overwrite stale entries
- Remove the `decorationCount === 0` early return for the same reason
- Add SIMULATION_DECORATIONS_MAX_AGE_MS (48h) guard in applySimulationDecorationsToForecasts
as defense-in-depth for the edge case where no simulation has run recently
Tests: WD-1, WD-3 updated; WD-2b, WD-13, WD-14 added (287 total, all pass)
* fix(forecast): call writeSimulationDecorations from inline deep-forecast path
processDeepForecastTask called applySimulationMerge but only extracted
simulationEvidence, discarding mergeResult without writing decorations.
Forecasts processed through the inline deep path (the common case) never
got their forecast:sim-decorations:v1 key populated.
Fix: fire-and-forget writeSimulationDecorations(mergeResult, snapshot)
immediately after applySimulationMerge in processDeepForecastTask, same
pattern as applyPostSimulationRescore. The deep path now writes on the
SAME run (no stale-by-one lag).
Test WD-15: exercises applySimulationMerge → writeSimulationDecorations
with snapshot.fullRunStateUnits, verifying both mapped forecast IDs are
decorated via the full chain used by processDeepForecastTask (288 total).
* fix(forecast): patch forecast:predictions:v2 immediately after sim decorations are written
writeSimulationDecorations only wrote the side key (forecast:sim-decorations:v1).
forecast:predictions:v2 is published by runSeed() before deep/rescore paths run,
so same-run simulation evidence never reached the canonical key — panel consumers
saw prior-run data until the next fast seed re-applied decorations.
Fix: add patchPublishedForecastsWithSimDecorations() called immediately after the
side key write in writeSimulationDecorations(). Reads forecast:predictions:v2,
updates sim fields for matched forecasts, resets to 0 for unmatched (clearing any
stale values from prior runs), writes back non-fatal.
Tests WD-16 and WD-17 assert on forecast:predictions:v2 directly — not just the
side key — proving the canonical key is updated on the same run (290 total).
* fix(forecast): await writeSimulationDecorations and applyPostSimulationRescore to prevent process.exit abandonment
Two fire-and-forget chains caused patchPublishedForecastsWithSimDecorations to
silently fail when runSimulationWorker(once=true) returned and process.exit fired:
1. processNextSimulationTask fired-and-forgot applyPostSimulationRescore then
returned immediately, so the rescore and its canonical-key patch never ran.
2. applyPostSimulationRescore fired-and-forgot writeSimulationDecorations and
returned on the no_path_changes branch before the Redis writes completed.
Both call sites are now awaited. The writeSimulationDecorations call in
processDeepForecastTask is also changed to await for consistency -- the deep
path was already safe (heavy artifact writing follows) but the comment said
"fire-and-forget" which contradicted the intent.
* fix(forecast): write seed-meta:forecast:sim-decorations:v1 for health monitoring (AGENTS.md requirement)
* fix(forecast): guard patchPublishedForecastsWithSimDecorations against cross-run stamping
Blind read-modify-write on forecast:predictions:v2 risked an older deep-forecast
or simulation worker overwriting a newer fast-path seed's payload (or zeroing out
newer sim fields) when workers finish out of order.
Fix: patchPublishedForecastsWithSimDecorations now accepts runGeneratedAt and
skips the patch when published.generatedAt > runGeneratedAt — i.e. the canonical
key already belongs to a newer run. writeSimulationDecorations passes
snapshot.generatedAt as the run identity token (same value baked into the
canonical key by buildPublishedSeedPayload for the matching run).
Add WD-18 (mismatch skips) and WD-19 (same-run patches) to cover overlapping-run
scenarios that the prior tests did not exercise.
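A sketch of the guard with a Map standing in for Redis (the real calls are async; the store interface and return strings here are hypothetical):

```javascript
// In-memory stand-in for the Redis-backed canonical key.
function patchPublishedWithGuard(store, runGeneratedAt, byForecastId) {
  const published = store.get('forecast:predictions:v2');
  if (!published) return 'no_published';
  // The canonical key already belongs to a newer run: skip the patch.
  if (published.generatedAt > runGeneratedAt) return 'skipped_newer_run';
  for (const prediction of published.predictions ?? []) {
    const deco = byForecastId[prediction.forecastId];
    // Matched forecasts get sim fields; unmatched reset to 0 to clear stale values.
    prediction.simulation_adjustment = deco ? deco.simulationAdjustment : 0;
  }
  store.set('forecast:predictions:v2', published);
  return 'patched';
}
```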
* fix(forecast): atomic write-if-newer for sim-decorations side key; store run origin timestamp
Two P1 issues closed:
1. generatedAt: Date.now() on the side key made a late older worker's write look
fresh to applySimulationDecorationsToForecasts(), which measures staleness from
write time. Changed to snapshot?.generatedAt so freshness reflects the originating
run's age, not the wall-clock write timestamp.
2. No run-order guard on forecast:sim-decorations:v1 write. A late older worker
could overwrite a newer run's decorations, poisoning the source that the next
fast seed reads. Added redisAtomicWriteSimDecorations (Lua EVAL, symmetric with
the canonical key fix): atomically reads existing generatedAt, skips write if
existing is from a newer run.
writeSimulationDecorations now returns early on side key skip, also bypassing the
canonical patch (which has its own Lua guard regardless).
New tests: WD-21 (end-to-end skip of both side key + canonical), WD-22 (stored
generatedAt equals snapshot.generatedAt), WD-23/WD-24 (atomic helper unit cases).
297 tests passing.
* feat(simulation): confidence-weighted adjustments + simulationSignal trace lane
Scale +0.08 and +0.04 simulation bonuses by simPathConfidence (missing/zero
falls back to 1.0 so old artifacts are not penalized). Negative adjustments
(-0.12/-0.15) remain flat — they are structural, not sim-confidence-dependent.
Attach a compact simulationSignal object (backed, adjustmentDelta, channelSource,
demoted, simPathConfidence) to each ExpandedPath when adjustment != 0. Written
to R2 trace artifacts. ForecastPanel UI chip deferred to follow-up PR (requires
proto field + buf generate + prediction-to-path plumbing).
Add simPathConfidence to SimulationAdjustmentDetail for observability.
Add T-N1..T-N8 (confidence weighting) and T-O1..T-O4 (simulationSignal) tests.
* fix(simulation): correct zero-confidence handling + wire simulationSignal to scorecard summaries
P1: explicit confidence=0 from LLM now correctly yields simConf=0 (no positive bonus)
instead of falling back to 1.0. Absent/non-finite confidence still uses 1.0 fallback
(conservative — old LLM artifacts without the field are not penalized). The previous
rawConf > 0 guard conflated "absent" and "explicitly unsupported" paths.
P2: summarizeImpactPathScore now forwards simulationSignal into scorecard summaries,
so path-scorecards.json and impact-expansion-debug.json include the new lane alongside
simulationAdjustment and mergedAcceptanceScore. Only forecast-eval.json had it before.
Also exports summarizeImpactPathScore for direct unit testing.
Tests: T-N5 corrected (explicit 0 → no bonus), T-N5b (zero conf + invalidator still
fires flat -0.12), T-O5 (summarizeImpactPathScore includes simulationSignal),
T-O6 (omits field when absent). 270 passing.
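The corrected fallback reduces to a finite-number check (sketch; the real helper name may differ):

```javascript
// Absent/non-finite confidence → conservative 1.0 fallback so old LLM
// artifacts without the field are not penalized; an explicit 0 is
// preserved, so no positive bonus fires.
function resolveSimConfidence(rawConf) {
  return Number.isFinite(rawConf) ? rawConf : 1.0;
}
```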
* fix(simulation): clear stale sim fields before re-merge + fix simPathConfidence comment
P1 (blocking): applySimulationMerge now deletes simulationAdjustment, mergedAcceptanceScore,
simulationSignal, demotedBySimulation, and promotedBySimulation from every expanded path
before running computeSimulationAdjustment. This fires before any early-continue
(no theater match, no candidate packet, zero confidence), so reloaded forecast-eval.json
paths from applyPostSimulationRescore never retain stale simulation metadata from a prior cycle.
P3: corrected simPathConfidence JSDoc in SimulationAdjustmentDetail and SimulationSignal —
absent/non-finite → 1.0 fallback, explicit 0 preserved as 0 (was: "missing/null/zero fall back to 1.0").
Tests: T-O7 (zero-confidence match clears stale fields), T-O8 (no-theater match clears
stale fields). 272 passing.
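The clearing pass can be sketched as (the helper name is illustrative; the field list comes from the commit):

```javascript
// Fields deleted from every expanded path before re-running
// computeSimulationAdjustment, ahead of any early-continue.
const SIM_MERGE_FIELDS = [
  'simulationAdjustment', 'mergedAcceptanceScore', 'simulationSignal',
  'demotedBySimulation', 'promotedBySimulation',
];

function clearStaleSimFields(expandedPaths) {
  for (const path of expandedPaths) {
    for (const field of SIM_MERGE_FIELDS) delete path[field];
  }
}
```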
* fix(types): correct SimulationSignal.backed JSDoc — only true for positive adjustments
---------
* fix(simulation): source actor overlap from candidatePacket.stateSummary.actors (#2459)
The +0.04 actor overlap bonus in computeSimulationAdjustment has never
fired in production because extractPathActors read hop.affectedAssets
(financial instruments: "TTF gas futures", "European utility stocks")
while sim keyActors are geo/political actors ("Houthi", "Iran", "Saudi
Arabia") -- disjoint namespaces, actorOverlapCount=0 on every run.
Fix: source actors from candidatePacket.stateSummary.actors (grounded
stateUnit.actors, same domain as simulation keyActors). Strict precedence
with no union: if stateSummary.actors raw list is non-empty, use only
that; else fall back to affectedAssets for backward compat. Precedence
is on raw list presence, not post-normalization length, so entries like
['---'] use stateSummary path without falling through.
Also fixes normalizeActorName to handle underscore-separated LLM output
("Saudi_Arabia") and entity ID prefix stripping ("state-abc:iran").
Pattern /^[a-z][a-z0-9_-]*$/ strips lowercase ID prefixes but rejects
uppercase ("US", "EU") and natural language ("New York") to prevent
false strips. Both sides of the comparison go through the same function.
Guards Array.isArray on both stateSummary.actors and keyActors to prevent
crashes on non-array truthy values from malformed LLM snapshots.
Retires extractPathActors (single call site, now inlined). Adds
candidateActorCount and actorSource to SimulationAdjustmentDetail for
observability. Exports normalizeActorName for test coverage.
Tests: N-1/2/3/4 (normalizeActorName), T-F/G/H/I/K/L/L-pre/M
(computeSimulationAdjustment), T-J (applySimulationMerge end-to-end).
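A sketch of the described normalization (exact trimming and casing details are assumptions):

```javascript
// Lowercase entity-ID prefixes ("state-abc:") are stripped; uppercase
// ("US") and natural language ("New York") fail the pattern, preventing
// false strips.
const ID_PREFIX = /^[a-z][a-z0-9_-]*$/;

function normalizeActorName(raw) {
  let s = String(raw ?? '').trim();
  const colon = s.indexOf(':');
  if (colon > 0 && ID_PREFIX.test(s.slice(0, colon))) {
    s = s.slice(colon + 1);
  }
  // Underscore-separated LLM output ("Saudi_Arabia") → natural spacing.
  return s.replace(/_/g, ' ').trim().toLowerCase();
}
```

Both sides of the overlap comparison go through the same function, so representation differences cancel out.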
* fix(types): narrow channelSource to 'direct' | 'market' | 'none' union
* feat(simulation): post-simulation re-score fixes stale-sim pipeline sequencing
The deep forecast worker runs before the current run's simulation completes,
so applySimulationMerge always received a prior-run simulation (isCurrentRun=false).
Prior-run stateIds don't match current-run paths → adjustments=[], bucketChannelMatch
never fires, near-threshold paths stay rejected.
Fix (Option A): fire-and-forget re-score after writeSimulationOutcome completes.
- buildForecastEvalPayload(): stores full selectedPaths/rejectedPaths (with direct/
second/third hypothesis data) in forecast-eval.json so re-score can rebuild the
impact expansion bundle for promoted paths
- writeForecastTraceArtifacts(): writes forecast-eval.json alongside path-scorecards
- applyPostSimulationRescore(): loads forecast-eval + snapshot from R2, reconstructs
evaluation, calls applySimulationMerge with isCurrentRun=true, re-writes trace
artifacts if any path was promoted or demoted
- processNextSimulationTask(): fire-and-forget call to applyPostSimulationRescore
after writeSimulationOutcome + completeSimulationTask
Guard: only re-scores when status='completed_no_material_change' OR a rejected
expanded path is in the 0.42–0.50 acceptanceScore band (within reach of +0.08
bucketChannelMatch bonus). Skips silently otherwise.
Tests: R-1 through R-5 covering skip conditions, no-change case, promotion of a
near-threshold rejected path, and marketContext fallback for invalid LLM channel keys.
* fix(simulation): correct post-rescore guard and generatedAt derivation (P1+P2)
P1: Replace hasNearThreshold (checked rejected paths for promotion — no-op for
`completed` status since applySimulationMerge has no completed+promotion case)
with hasDemotionRisk (checks selected paths with acceptanceScore < 0.62, where
a -0.12 invalidator could push below the 0.50 acceptance threshold).
P2: Use freshOutcome.generatedAt as primary timestamp instead of
parseForecastRunGeneratedAt(runId). RunId carries the run-start timestamp; R2
artifact paths use data.generatedAt at write time. UTC midnight crossings cause
a key mismatch, silently returning no_eval_data. Fallback to runId parse only
when freshOutcome.generatedAt is absent.
* fix(simulation): add CASE 3 for partial demotion + skip duplicate pointer on rescore
applySimulationMerge CASE 3: handles completed run where some (not all) expanded
paths are demoted or promoted. Previously only CASE 1 (completed_no_material_change
→ completed) and CASE 2 (completed → completed_no_material_change) existed. With
the hasDemotionRisk guard in applyPostSimulationRescore, CASE 3 is now reachable:
selected expanded paths are non-empty after adjustment, status stays 'completed',
but evaluation.selectedPaths was never updated. Fixes the silent no-op: demoted
path stayed in selectedPaths in the published evaluation. Bundle and worldState are
rebuilt from the surviving expanded paths.
writeForecastTraceArtifacts: add context.skipPointer to suppress writeForecastTracePointer
on rescore re-writes. Rescore overwrites the same R2 artifact keys — the LPUSH
was consuming a history slot and evicting an older unique run earlier than intended.
readForecastWorldStateHistory dedupes by worldStateKey so no functional double-read,
but the extra slot consumed is still waste. Pass skipPointer: true from rescore call site.
Add R-6: partial demotion in completed run reproduces the two-path case (0.55 and
0.75 selected, invalidator fires on 0.55) and asserts selectedPaths is updated.
* fix(simulation): add hasPromotionOpportunity guard for completed-run rescore
hasDemotionRisk only checked selected paths, leaving completed runs with
near-threshold rejected paths (0.42-0.50) incorrectly skipped as
no_actionable_paths. applySimulationMerge CASE 3 now correctly handles
completed+promotion, but the gate never let it through.
Add hasPromotionOpportunity: check rejected expanded paths with
acceptanceScore in [0.42, 0.50) — a +0.08 bucketChannelMatch bonus
can push them over threshold. The outer guard now passes if either
hasDemotionRisk or hasPromotionOpportunity is true.
Also drop the evalData.status === 'completed' guard from both checks:
the outer condition (status !== 'completed_no_material_change') already
handles routing; the inner checks don't need to repeat it.
Add R-7: completed eval with safe selected (0.75) + near-threshold
rejected (0.44), simulation matches rejected path → +0.08 → 0.52 →
promoted via CASE 3, status stays completed, both paths in selectedPaths.
* fix(simulation): correct actionable-path guard thresholds to match max adjustments
The guard hardcoded 0.42/0.62 based on the +0.08 (bucketChannelMatch-only) and
-0.12 (invalidator) adjustments. computeSimulationAdjustment can reach +0.12
with actor overlap (>= 2 actors: +0.08 + 0.04) and -0.15 with a stabilizer hit.
Correct bounds derived from actual max adjustments:
max positive: +0.12 → promotion window lower bound: 0.50 - 0.12 = 0.38 (was 0.42)
max negative: -0.15 → demotion risk upper bound: 0.50 + 0.15 = 0.65 (was 0.62)
Production effects of old bounds:
- Rejected path at 0.39 with two matching actors (+0.12 → 0.51) was skipped
- Selected path at 0.64 with a matching stabilizer (-0.15 → 0.49) was skipped
Add R-8: path at 0.39 + two-actor overlap (+0.12 total) → 0.51 → promoted.
Add R-9: path at 0.64 + stabilizer (-0.15) → 0.49 → demoted via CASE 3.
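The corrected guard derivation, as a sketch (constant names are illustrative):

```javascript
// Bounds derived from the actual max adjustments rather than hardcoded.
const ACCEPTANCE_THRESHOLD = 0.50;
const MAX_POSITIVE = 0.12; // bucketChannelMatch +0.08 + actor overlap +0.04
const MAX_NEGATIVE = 0.15; // stabilizer hit

const PROMOTION_LOWER = ACCEPTANCE_THRESHOLD - MAX_POSITIVE; // 0.38 (was 0.42)
const DEMOTION_UPPER = ACCEPTANCE_THRESHOLD + MAX_NEGATIVE;  // 0.65 (was 0.62)

const hasPromotionOpportunity = (rejectedPaths) => rejectedPaths.some(
  (p) => p.acceptanceScore >= PROMOTION_LOWER && p.acceptanceScore < ACCEPTANCE_THRESHOLD);

const hasDemotionRisk = (selectedPaths) => selectedPaths.some(
  (p) => p.acceptanceScore < DEMOTION_UPPER);
```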
THEATER_GEO_GROUPS previously mapped both 'Middle East' and 'Red Sea' to
'MENA', causing buildSimulationPackageFromDeepSnapshot to silently drop one
when both were candidates. With an active Iran/Hormuz conflict, Red Sea's
higher rankingScore could evict the Hormuz theater entirely.
Split into 'MENA_Gulf' (Middle East, Persian Gulf) and 'MENA_RedSea' (Red Sea)
so both theaters are independently eligible for simulation selection.
Tests: updated geo-dedup test to assert 3 selected theaters (Hormuz + Red Sea
+ Malacca), added within-group dedup test (Red Sea vs Suez Canal still compete).
Bug 1 (shape): pathBucket fallback read root candidatePacket.topBucketId
which is always undefined for new-schema packets — the grounded bucket
lives at candidatePacket.marketContext.topBucketId. Fixed by adding the
nested path before the legacy root fallback.
Bug 2 (semantic): path.direct.channel carries LLM marketImpact enum
values (supply_disruption, price_spike, risk_off, etc.) that are disjoint
from CHANNEL_KEYWORDS keys — only 3 of 11 are exact matches. The grounded
fallback marketContext.topChannel was never reached because the LLM value
is always truthy. Fixed with tiered Object.hasOwn() validation: use direct
channel if it is a known key, else fall back to marketContext.topChannel
(nested) or topChannel (legacy flat), else return '' to prevent greedy
literal substring matching.
Also adds resolvedChannel and channelSource to adjustment details for
runtime observability in debug artifacts.
5 new tests cover: valid direct key, invalid→nested market fallback,
invalid→flat legacy fallback, invalid+empty→no match, and the production
case where both bucket and channel are absent from direct and must resolve
from marketContext.
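The tiered resolution can be sketched like this (the keyword table is an illustrative subset, and the exact fallback order beyond what the commit states is assumed):

```javascript
// Illustrative subset of the real keyword table.
const CHANNEL_KEYWORDS = {
  risk_off_rotation: ['risk off', 'flight to safety'],
  security_escalation: ['military', 'geopolit'],
};

function resolveChannel(path, candidatePacket) {
  const direct = path?.direct?.channel;
  // Tier 1: LLM value, but only when it is a known keyword key.
  if (direct && Object.hasOwn(CHANNEL_KEYWORDS, direct)) {
    return { resolvedChannel: direct, channelSource: 'direct' };
  }
  // Tier 2: grounded market context (nested new schema, then legacy flat).
  const market = candidatePacket?.marketContext?.topChannel ?? candidatePacket?.topChannel;
  if (market) return { resolvedChannel: market, channelSource: 'market' };
  // Tier 3: empty string prevents greedy literal substring matching downstream.
  return { resolvedChannel: '', channelSource: 'none' };
}
```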
- Add scripts/seed-forecasts.types.d.ts with ambient interfaces for all
simulation data structures (CandidatePacket, TheaterResult, SimulationOutcome,
ExpandedPath, SimulationEvidence, SimulationAdjustmentDetail)
- Add // @ts-check + @param JSDoc to contradictsPremise, negatesDisruption,
computeSimulationAdjustment, applySimulationMerge
- Add scripts/jsconfig.json to enable tsc --checkJs on seed-forecasts.mjs
- Remove dead legacy fallbacks candidatePacket?.topBucketId and
candidatePacket?.topChannel (these fields were never at top level in
production; the correct path is candidatePacket.marketContext.topBucketId).
TypeScript now enforces this at write time.
- Update all test fixtures to use marketContext: { topBucketId, topChannel }
shape, matching production CandidatePacket structure
Two matching bugs causing adjustment=0 even after #2402+#2404:
1. commodityKey underscore: code checked text.includes('crude_oil') but
LLM-generated text always uses natural language ('crude oil'). Fixed
by .replace(/_/g, ' ') in both contradictsPremise and negatesDisruption.
2. CHANNEL_KEYWORDS too narrow: risk_off_rotation only matched 4 exact
phrases ('risk off', 'risk aversion', 'flight to safety', 'sell off').
Simulation paths use broader language ('capital flight', 'risk premium',
'sell-off', 'retreat', etc.). Expanded to cover LLM output vocabulary.
Also expanded security_escalation to include 'military', 'geopolit'.
Update test: negatesDisruption commodity test now uses natural language
('crude oil') matching what LLM actually generates, not the internal key.
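The underscore fix amounts to normalizing the key before the substring check (sketch; helper name is illustrative):

```javascript
// Internal keys use underscores ('crude_oil'); LLM prose uses natural
// language ('crude oil'). Normalize the key before substring matching.
function mentionsCommodity(text, commodityKey) {
  const needle = String(commodityKey).replace(/_/g, ' ').toLowerCase();
  return String(text).toLowerCase().includes(needle);
}
```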
* fix(simulation): store candidateStateId in theaterResults to enable Phase 3 merge
applySimulationMerge looks up paths by candidateStateId but theaterResults
only stored theaterId ("theater-1"). The map lookup always returned undefined,
silently no-oping all simulationAdjustment writes.
Fix: write candidateStateId alongside theaterId in theaterResults, and key
the simByTheater map by candidateStateId with theaterId as fallback.
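The map keying, sketched (function name is illustrative):

```javascript
// Key by candidateStateId so applySimulationMerge's path lookup matches;
// theaterId ('theater-1') remains only as a legacy fallback.
function buildSimByTheater(theaterResults) {
  const map = new Map();
  for (const result of theaterResults) {
    map.set(result.candidateStateId ?? result.theaterId, result);
  }
  return map;
}
```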
* fix(e2e): add missing Earthquake fields to map-harness test fixture
* Revert "fix(e2e): add missing Earthquake fields to map-harness test fixture"
This reverts commit 69d8930b88.
* test(simulation): add T13 to cover positional theaterId vs candidateStateId mismatch
* feat(simulation): Phase 3 outcome re-ingestion with simulationAdjustment scoring
Parse simulation-outcome.json back into the deep forecast evaluation pipeline
as a simulationEvidence lane. Simulation informs the forecast without replacing
structural validation or observed world signals.
Changes:
- computeSimulationAdjustment: +0.08 bucket/channel match, +0.04 actor overlap,
-0.12 invalidator contradiction, -0.15 stabilizer negation (spec lines 447-478)
- applySimulationMerge: post-evaluation pass that demotes/promotes expanded paths
using mergedAcceptanceScore = clamp01(acceptanceScore + simulationAdjustment)
- fetchSimulationOutcomeForMerge: reads R2 full outcome; accepts current run or
previous run under 6h old; gracefully returns null when unavailable
- processDeepForecastTask: wires in simulation merge between evaluateDeepForecastPaths
and writeForecastTraceArtifacts; wrapped in try/catch for non-blocking degradation
- buildImpactExpansionDebugPayload: adds simulationEvidence field to debug artifacts
- summarizeImpactPathScore: adds simulationAdjustment + mergedAcceptanceScore fields
- 24 new tests (203 total, 0 failures) covering all adjustment rules, demotion,
promotion, matching helpers, and debug payload field
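The four adjustment rules and the merged score reduce to this sketch (the real trigger predicates are the matching helpers above; they are simplified to booleans here):

```javascript
const clamp01 = (x) => Math.min(1, Math.max(0, x));

// Adjustment values from the commit (spec lines 447-478).
function computeAdjustment({ bucketChannelMatch, actorOverlap, invalidatorHit, stabilizerHit }) {
  let adj = 0;
  if (bucketChannelMatch) adj += 0.08;
  if (actorOverlap) adj += 0.04;
  if (invalidatorHit) adj -= 0.12;
  if (stabilizerHit) adj -= 0.15;
  return adj;
}

const mergedAcceptanceScore = (path, adj) => clamp01(path.acceptanceScore + adj);
```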
* feat(simulation): replace maritime type gate with significance-based eligibility (#2342)
* feat(simulation): replace maritime type gate with significance-based eligibility
isMaritimeChokeEnergyCandidate was a hardcoded filter requiring a route
in CHOKEPOINT_MARKET_REGIONS plus an energy/freight bucket. Any theater
driven by rate hikes, sovereign stress, political instability, volcanic
events, or infrastructure failure was silently dropped before simulation.
Replace with isSimulationEligible: rankingScore >= 0.40 + hasBucket +
hasCandidateStateId. The ranking formula already encodes signal lift,
transmission strength, and specificity — theater type is irrelevant.
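A sketch of the gate (the exact field name behind the bucket check is an assumption):

```javascript
// Theater-agnostic eligibility: rankingScore >= 0.40 + hasBucket +
// hasCandidateStateId; theater type no longer matters.
function isSimulationEligible(theater) {
  return Number(theater?.rankingScore) >= 0.40
    && Boolean(theater?.topBucketId)        // assumed bucket field name
    && Boolean(theater?.candidateStateId);
}
```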
Also generalize buildSimulationRequirementText with stateKind-branched
prose templates (market_repricing, political_instability, security_
escalation, infrastructure_fragility, cyber_pressure, fallback) and
extend buildSimulationPackageConstraints with two new constraint classes:
macro_financial_posture (soft) and structural_event_premise (hard:true)
to prevent MiroFish hallucinating maritime artifacts for non-maritime
theaters.
Maritime simulations are fully backward-compatible: maritime_disruption
stateKind still produces identical requirement text and constraints.
Tests: 213 pass (8 new eligibility tests, 5 requirement-text tests,
4 constraint tests replacing 7 old maritime-specific tests).
* test(simulation): fill coverage gaps in theater-agnostic eligibility tests
Three untested paths identified post-review:
T-R6: cyber_pressure branch in buildSimulationRequirementText was
unexercised. Tests "systems availability" and "financial network
continuity" appear.
T-R7: governance_pressure was aliased to the political_instability
template but had no dedicated test. Confirms same prose as T-R3.
T-C5: political_instability + sovereign_risk bucket stacked-constraint
case. sovereign_risk IS in MACRO_FIN_BUCKETS, so this theater generates
both macro_financial_posture (soft) AND structural_event_premise (hard).
T-C2 only asserted the latter; T-C5 explicitly verifies both are
present and have the correct hard/soft values.
Also confirmed: the "critical bug" theory (rankingScore absent on theater
objects) is not a bug — buildSimulationPackageFromDeepSnapshot copies
rankingScore onto each theater object at L12593.
2388 tests pass.
* fix(simulation): theater-agnostic evaluation targets and symmetric demotion
Two P1 issues from PR review:
1. buildSimulationPackageEvaluationTargets still had maritime-specific
questions hard-coded for all theater types: "disruption at ${route}",
"freight rate delta on affected trade lanes", "energy price direction
($/bbl or %)". For market_repricing/political/cyber theaters this gave
MiroFish contradictory instructions — constraints said no chokepoints,
eval targets asked it to model route disruptions.
Fix: extract buildEvalTargetQuestions() branching on stateKind (7
branches: maritime_disruption, market_repricing, political_instability,
governance_pressure, security_escalation, infrastructure_fragility,
cyber_pressure, fallback). Maritime branch is identical to original.
Also generalize the T+24h timing marker description and the round-1
prompt market_cascade instruction.
2. contradictsPremise() and negatesDisruption() only matched on
routeFacilityKey and commodityKey. For the newly eligible theater types
(blank route/commodity), invalidators and stabilizers could never fire
the -0.12 / -0.15 demotion. New theaters were promotable but not
demotable — asymmetric scoring.
Fix: when route+commodity are both blank, fall back to subject keywords
derived from stateKind and topBucketId (split on _, keep >= 4 chars).
Maritime theater behavior is unchanged (early exit on first branch).
Tests: 2393 pass (+5 new: contradictsPremise non-maritime match,
negatesDisruption non-maritime match and non-match, evalTargets
market_repricing has no maritime framing, negatesDisruption
description fix).
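The blank-route keyword fallback described above can be sketched as follows; the function name and exact token rules here are assumptions for illustration, not the shipped implementation:

```javascript
// Derive subject keywords when route + commodity are both blank:
// split stateKind and topBucketId on '_', keep tokens of >= 4 chars.
function deriveSubjectKeywords(stateKind, topBucketId) {
  return [...new Set(
    [stateKind, topBucketId]
      .filter(Boolean)
      .flatMap((key) => key.split('_'))
      .filter((token) => token.length >= 4),
  )];
}

deriveSubjectKeywords('market_repricing', 'sovereign_risk');
// → ['market', 'repricing', 'sovereign', 'risk']
```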
---------
Co-authored-by: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
Both buildSimulationPackageEvaluationTargets and buildSimulationPackageConstraints
returned flat arrays. The prompt builders consumed them as
pkg.evaluationTargets?.[theater.theaterId] — object lookup on an array always
returns undefined, falling back to the generic string on every run.
Fix: both builders now return Record<theaterId, ...>. Prompt builders updated
to properly render constraint class/statement and evalTarget pathType/question.
Result: evaluation questions (escalation/containment/market_cascade) and
WorldMonitor structural constraints now reach the LLM on every simulation run.
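The failure mode reduces to a one-liner: a bracket lookup with a string key on an array is always undefined unless that exact property exists on the array object. A minimal repro (shapes illustrative):

```javascript
// Flat array, as the builders previously returned.
const evaluationTargets = [{ theaterId: 'theater-1', question: 'escalation?' }];
console.log(evaluationTargets['theater-1']); // undefined — arrays index by position

// Record<theaterId, ...>, as the builders return after the fix.
const byTheater = { 'theater-1': { question: 'escalation?' } };
console.log(byTheater['theater-1'].question); // 'escalation?'
```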
* feat(simulation): geographic theater diversity + market_cascade economic paths
Three improvements to the MiroFish simulation pipeline:
1. Geographic deduplication: adds THEATER_GEO_GROUPS constant mapping
CHOKEPOINT_MARKET_REGIONS values to macro-groups (MENA, AsiaPacific,
EastEurope, etc.). buildSimulationPackageFromDeepSnapshot now skips
candidates whose macro-group is already represented, preventing
Red Sea + Middle East (both MENA) from appearing as separate theaters.
2. Label cleanup: strips trailing (stateKind) parenthetical from theater
labels before writing to selectedTheaters, so "Black Sea maritime
disruption state (supply_chain)" becomes "Black Sea maritime disruption
state" in the UI.
3. market_cascade path: renames spillover → market_cascade across 4 sites
(evaluationTargets, Round 1 prompt + JSON template, Round 2 prompt +
JSON template, tryParseSimulationRoundPayload expectedIds). The
market_cascade path instructs the LLM to model 2nd/3rd order economic
consequences: energy price direction ($/bbl), freight rate delta,
downstream sector impacts, and FX stress on import-dependent economies.
Tests: 176 pass (3 net new — geo-dedup, label cleanup, market_cascade
prompt; plus updated entity-collision and path-validation tests).
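The macro-group dedup in item 1 can be sketched like this; THEATER_GEO_GROUPS is abridged to a few illustrative entries and the candidate shape is assumed:

```javascript
// Map region values to macro-groups; unknown regions fall back to themselves.
const THEATER_GEO_GROUPS = {
  'Red Sea': 'MENA',
  'Middle East': 'MENA',
  'Black Sea': 'EastEurope',
};

function dedupeTheatersByMacroGroup(candidates) {
  const seenGroups = new Set();
  return candidates.filter((candidate) => {
    const group = THEATER_GEO_GROUPS[candidate.region] ?? candidate.region;
    if (seenGroups.has(group)) return false; // macro-group already represented
    seenGroups.add(group);
    return true;
  });
}
```

With this, Red Sea and Middle East (both MENA) collapse into one theater slot while Black Sea survives.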
🤖 Generated with Claude Sonnet 4.6 via Claude Code (https://claude.ai/claude-code) + Compound Engineering v2.49.0
Co-Authored-By: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
* docs: fix markdownlint MD032 in simulation diversity plan
---------
Co-authored-by: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
* feat(simulation): MiroFish Phase 2 — theater-limited simulation runner
Adds the simulation execution layer that consumes simulation-package.json
and produces simulation-outcome.json for maritime chokepoint + energy/logistics
theaters, closing the WorldMonitor → MiroFish handoff loop.
Changes:
- scripts/seed-forecasts.mjs: 2-round LLM simulation runner (prompt builders,
JSON extractor, runTheaterSimulation, writeSimulationOutcome, task queue
with NX dedup lock, runSimulationWorker poll loop)
- scripts/process-simulation-tasks.mjs: standalone worker entry point
- proto: GetSimulationOutcome RPC + make generate
- server/worldmonitor/forecast/v1/get-simulation-outcome.ts: RPC handler
- server/gateway.ts: slow tier for get-simulation-outcome
- api/health.js: simulationOutcomeLatest in STANDALONE + ON_DEMAND keys
- tests: 14 new tests for simulation runner functions
* fix(simulation): address P1/P2 code review findings from PR #2220
Security (P1 #018):
- sanitizeForPrompt() applied to all entity/seed fields interpolated into
Round 1 prompt (entityId, class, stance, seedId, type, timing)
- sanitizeForPrompt() applied to actorId and entityIds in Round 2 prompt
- sanitizeForPrompt() + length caps applied to all LLM array fields written
to R2 (dominantReactions, stabilizers, invalidators, keyActors, timingMarkers)
Validation (P1 #019):
- Added validateRunId() regex guard
- Applied in enqueueSimulationTask() and processNextSimulationTask() loop
Type safety (P1 #020):
- Added isOutcomePointer() and isPackagePointer() type guards in TS handlers
- Replaced unsafe as-casts with runtime-validated guards in both handlers
Correctness (P2 #022):
- Log warning when pkgPointer.runId does not match task runId
Architecture (P2 #024):
- isMaritimeChokeEnergyCandidate() accepts both flat and nested topBucketId
- Call site simplified to pass theater directly
Performance (P2 #025):
- SIMULATION_ROUND1_MAX_TOKENS raised from 1800 to 2200
- Added max 3 initialReactions instruction to Round 1 prompt
Maintainability (P2 #026):
- Simulation pointer keys exported from server/_shared/cache-keys.ts
- Both TS handlers import from shared location
Documentation (P2 #027):
- Strengthened runId no-op description in proto and OpenAPI spec
* fix(todos): add blank lines around lists in markdown todo files
* style(api): reformat openapi yaml to match linter output
* test(simulation): add flat-shape filter test + getSimulationOutcome handler coverage
Two tests identified as missing during PR #2220 review:
1. isMaritimeChokeEnergyCandidate flat-shape tests — covers the || candidate.topBucketId
normalization added in the P1/P2 review pass. The existing tests only used the nested
marketContext.topBucketId shape; this adds the flat root-field shape that arrives from
the simulation-package.json JSON (selectedTheaters entries have topBucketId at root).
2. getSimulationOutcome handler structural tests — verifies the isOutcomePointer guard,
found:false NOT_FOUND return, found:true success path, note population on runId mismatch,
and redis_unavailable error string. Follows the readSrc static-analysis pattern used
elsewhere in server-handlers.test.mjs (handler imports Redis so full integration test
would require a test Redis instance).
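The flat/nested normalization those tests cover can be sketched as below; the field names follow the commit text, the fallback ordering is an assumption:

```javascript
// Accept topBucketId nested under marketContext (deep-snapshot shape) or
// at the root (simulation-package.json selectedTheaters shape).
function getTopBucketId(candidate) {
  return candidate?.marketContext?.topBucketId || candidate?.topBucketId || '';
}
```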
* feat(forecast): Phase 2 simulation package read path (getSimulationPackage RPC + Redis existence key)
- writeSimulationPackage now writes forecast:simulation-package:latest to Redis after
successful R2 write, containing { runId, pkgKey, schemaVersion, theaterCount, generatedAt }
with TTL matching TRACE_REDIS_TTL_SECONDS (60 days)
- New getSimulationPackage RPC handler reads Redis key, returns pointer metadata without
requiring an R2 fetch (zero R2 cost for existence check)
- Wired into ForecastServiceHandler and server/gateway.ts cache tier (medium)
- Proto: GetSimulationPackage RPC + get_simulation_package.proto message definitions
- api/health.js: simulationPackageLatest added to STANDALONE_KEYS + ON_DEMAND_KEYS
- Tests: SIMULATION_PACKAGE_LATEST_KEY constant + writeSimulationPackage null-guard test
Closes todo #017 (Phase 2 prerequisites for MiroFish integration)
* chore(generated): regenerate proto types for GetSimulationPackage RPC
* fix(simulation-rpc): distinguish Redis failure from not-found; signal runId mismatch
- Add `error` field to GetSimulationPackageResponse: populated with
"redis_unavailable" on Redis errors so callers can distinguish a
healthy not-found (found=false, error="") from a Redis failure
(found=false, error="redis_unavailable"). Adds console.warn on error.
- Add `note` field: populated when req.runId is supplied but does not
match the latest package's runId, signalling that per-run filtering
is not yet active (Phase 3).
- Add proto comment on run_id: "Currently ignored; reserved for Phase 3"
- Add milliseconds annotation to generated_at description.
- Simplify handler: extract NOT_FOUND constant, remove SimulationPackagePointer
interface, remove || '' / || 0 guards on guaranteed-present fields.
- Regenerate all buf-generated files.
Fixes todos #018 (runId silently ignored) and #019 (error indistinguishable
from not-found). Also resolves todos #022 (simplifications) and #023
(OpenAPI required fields / generatedAt unit annotation).
* fix(simulation-rpc): change cache tier from medium to slow (aligns with deep-run update frequency)
* fix(simulation-rpc): fix key prefixing, make Redis errors reachable, no-cache not-found
Three P1 regressions caught in external review:
1. Key prefix bug: getCachedJson() applies preview:<sha>: prefix in non-production
environments, but writeSimulationPackage writes the raw key via a direct Redis
command. In preview/dev the RPC always returned found:false even when the package
existed. Fix: new getRawJson() in redis.ts always uses the unprefixed key AND throws
on failure instead of swallowing errors.
2. redis_unavailable unreachable: getCachedJson swallows fetch failures
and missing credentials by returning null, so the catch block for
redis_unavailable was dead
code. getRawJson() throws on HTTP errors and missing credentials, making the
error: "redis_unavailable" contract actually reachable.
3. Negative-cache stampede: slow tier caches every 200 GET. A request before any deep
run wrote a package returned { found:false } which the CDN cached for up to 1h,
breaking post-run discovery. Fix: markNoCacheResponse() on both not-found and
error paths so they are served fresh on every request.
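A sketch of the getRawJson contract described in points 1 and 2 — unprefixed key, throws on missing credentials or HTTP failure instead of swallowing them as null. The env var names and REST shape are assumptions for illustration:

```javascript
// Read a raw (unprefixed) key from an Upstash-style Redis REST endpoint.
// Unlike getCachedJson, failures throw so "redis_unavailable" is reachable.
async function getRawJson(key) {
  const url = process.env.UPSTASH_REDIS_REST_URL;
  const token = process.env.UPSTASH_REDIS_REST_TOKEN;
  if (!url || !token) throw new Error('redis_unavailable'); // was: silent null
  const res = await fetch(`${url}/get/${encodeURIComponent(key)}`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error('redis_unavailable');
  const body = await res.json();
  return body.result == null ? null : JSON.parse(body.result);
}
```

A healthy not-found still resolves to null; only transport and credential failures throw.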
* feat(deep-forecast): Phase 1 simulation package export contract
Add buildSimulationPackageFromDeepSnapshot and writeSimulationPackage to
produce simulation-package.json alongside deep-snapshot.json on every eligible
fast run, completing Phase 1 of the WorldMonitor → MiroFish bridge defined in
docs/internal/wm-mirofish-gap.md.
Phase 1 scope: maritime chokepoint + energy/logistics theaters only.
A candidate qualifies if its routeFacilityKey is a known chokepoint in
CHOKEPOINT_MARKET_REGIONS and the top bucket is energy or freight, or the
commodityKey is an energy commodity.
Package shape (schemaVersion: v1):
- selectedTheaters: top 1–3 qualifying candidates with theater ID, route,
commodity, bucket, channel, and rankingScore
- simulationRequirement: deterministic template per theater (no LLM, fully
cacheable), built from label, stateKind, route, commodity, channel, and
criticalSignalTypes
- structuralWorld: filtered stateUnits, worldSignals, transmission edges,
market buckets, situationClusters, situationFamilies touching the theater
- entities: extracted from actorRegistry (forecastId overlap), stateUnit
actors, and evidence table actor entries; classified into 7 entity classes
(state_actor, military_or_security_actor, regulator_or_central_bank,
exporter_or_importer, logistics_operator, market_participant,
media_or_public_bloc); falls back to anchor set if extraction finds nothing
- eventSeeds: headline evidence → live_news, disruption-keyword signal
evidence → observed_disruption; T+0h timing; relative timing format
- constraints: route_chokepoint_status (hard if criticalSignalLift ≥ 0.25),
commodity_exposure (always hard), market_admissibility (soft, channel
routing), known_invalidators (soft, when contradictionScore ≥ 0.10)
- evaluationTargets: deterministic escalation/containment/spillover path
questions + T+24h/T+48h/T+72h timing markers per theater
Also adds 6 missing chokepoints to CHOKEPOINT_MARKET_REGIONS:
Baltic Sea, Danish Straits, Strait of Gibraltar, Panama Canal,
Lombok Strait, Cape of Good Hope.
writeSimulationPackage fires-and-forgets after writeDeepForecastSnapshot
so it does not add latency to the critical seed path.
17 new unit tests covering: theater filter, package shape, simulationRequirement
content, eventSeeds, constraints (hard/soft), evaluationTargets structure,
entity extraction, key format, and 3-theater cap.
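The Phase 1 eligibility rule stated above can be sketched as follows; the registry is abridged to two entries and the and/or grouping is an assumption read from the prose:

```javascript
// Known chokepoints and energy commodities (abridged, illustrative).
const CHOKEPOINT_MARKET_REGIONS = { 'Strait of Hormuz': true, 'Suez Canal': true };
const ENERGY_COMMODITIES = new Set(['crude_oil', 'lng', 'natural_gas']);

// A candidate qualifies if its route is a known chokepoint AND either the
// top bucket is energy/freight or the commodity is an energy commodity.
function isMaritimeChokeEnergyCandidate(candidate) {
  if (!(candidate.routeFacilityKey in CHOKEPOINT_MARKET_REGIONS)) return false;
  const bucketOk =
    candidate.topBucketId === 'energy' || candidate.topBucketId === 'freight';
  return bucketOk || ENERGY_COMMODITIES.has(candidate.commodityKey);
}
```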
🤖 Generated with Claude Sonnet 4.6 via Claude Code (https://claude.ai/claude-code) + Compound Engineering v2.49.0
Co-Authored-By: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
* fix(simulation-package): address P1+P2 code review issues from PR #2204
P1 fixes:
- inferEntityClassFromName: use word-boundary regex to prevent "force"
substring false positives (e.g. "workforce", "Salesforce")
- buildSimulationPackageEntities: key Map entries by candidateStateId
instead of dominantRegion to prevent collision across theaters
sharing the same region
- writeSimulationPackage call site: pass priorWorldState so
actorRegistry is available to buildSimulationPackageFromDeepSnapshot
P2 fixes:
- buildSimulationRequirementText: apply sanitizeForPrompt to
theater.label, stateKind, topChannel, and critTypes before
string interpolation (stored prompt injection risk)
- buildSimulationPackageEventSeeds: apply sanitizeForPrompt to
entry.text before .slice(0, 200)
- isMaritimeChokeEnergyCandidate: replace new Set() allocation
per call with Array.includes for 2-element arrays
- buildSimulationPackageEntities: convert allForecastIds to Set
before actor registry loop (O(n²) → O(n))
- buildSimulationPackageEvaluationTargets: add missing candidate
guard with console.warn when candidate is undefined for theater
- selectedTheaters map: add label fallback to dominantRegion /
'unknown theater' to prevent "undefined" in simulationRequirement
Tests: 6 new unit tests covering the word-boundary fix, entity key
collision, injection stripping, and undefined label guard
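The word-boundary fix reduces to the difference between substring matching and `\b`-anchored matching; the actual keyword list in inferEntityClassFromName is longer than this sketch:

```javascript
// Word-boundary match: "force" as a standalone word, not as a substring.
const MILITARY_WORD_RE = /\b(force|forces|army|navy)\b/i;

MILITARY_WORD_RE.test('Rapid Support Forces'); // true
MILITARY_WORD_RE.test('Salesforce');           // false — no boundary before "force"
MILITARY_WORD_RE.test('workforce');            // false
```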
---------
Co-authored-by: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
* fix(deep-forecast): lower acceptance threshold 0.60→0.50 to match real score distribution
computeDeepPathAcceptanceScore formula: pathScore×0.55 + quality×0.20 + coherence×0.15
With pathScore≈0.65, quality≈0.30, coherence≈0.55:
0.3575 + 0.0600 + 0.0825 = 0.50
The 0.60 threshold was calibrated before understanding that reportableQualityScore
is constrained by world-state simulation geometry (not hypothesis quality), and
coherence loses 0.15 for generic candidates without routeFacilityKey. The threshold
was structurally unreachable with typical expanded paths.
Verified end-to-end: deep worker now returns [DeepForecast] completed.
Also updates T6 gateDetails assertion and renames the rejection-floor test to
correctly describe the new behavior (strong inputs should be accepted).
111/111 tests pass.
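The threshold arithmetic above, as stated in this commit, works out as follows (weights and example inputs taken from the commit text):

```javascript
// pathScore×0.55 + quality×0.20 + coherence×0.15
function computeDeepPathAcceptanceScore(pathScore, quality, coherence) {
  return pathScore * 0.55 + quality * 0.2 + coherence * 0.15;
}

const score = computeDeepPathAcceptanceScore(0.65, 0.3, 0.55);
// 0.3575 + 0.0600 + 0.0825 = 0.50 — exactly at the new threshold,
// structurally unreachable against the old 0.60 gate
```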
* feat(deep-forecast): autoresearch prompt self-improvement loop + T9/T10 tests
- Add scoreImpactExpansionQuality() locked scorer: commodity rate (35%),
variable diversity (35%), chain coverage (20%), mapped rate (10%)
- Add runImpactExpansionPromptRefinement(): rate-limited LLM critic loop
(30min cooldown) that reads learned section from Redis, scores current
run, generates critique if composite < 0.62, tests on same candidates,
commits to forecast:prompt:impact-expansion:learned if score improves
- buildImpactExpansionSystemPrompt() now accepts learnedSection param,
appends it after core rules with separator so model sees prior examples
- buildImpactExpansionCandidateHash() includes learnedFingerprint to
bust cache when learned section changes
- processDeepForecastTask reads learnedSection from Redis before LLM
call, runs refinement after both completed and no_material_change paths
- Export scoreImpactExpansionQuality + runImpactExpansionPromptRefinement
- T9: high commodity rate + chain coverage → composite > 0.70
- T10: no commodity + no chain coverage → composite < 0.40
- 113/113 tests pass
* fix(deep-forecast): raise autoresearch threshold 0.62→0.80 + fix JSON parse
- Threshold 0.62 was too low: commodity=1.00 + chain=1.00 compensated for
diversity=0.50 (all same chains), keeping composite at 0.775 → no critique
- Raise to 0.80 so diversity<0.70 triggers critique even with good commodity/chain
- Fix JSON parser to extract first {…} block (handles Gemini code-fence wrapping)
- Add per-hypothesis log in refinement breakdown for observability
- Add refinementQualityThreshold to gateDetails for self-documenting artifacts
- Verified: critique fires on diversity=0.50 run, committed Hormuz/Baltic/Suez
region-specific chain examples (score 0.592→0.650)
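First-{…}-block extraction for code-fenced LLM replies can be sketched with brace-depth counting; this ignores braces inside JSON strings, which is fine for a sketch but not for adversarial input:

```javascript
// Extract and parse the first balanced {...} block in a text blob,
// tolerating markdown code-fence wrapping around the JSON.
function extractFirstJsonBlock(text) {
  const start = text.indexOf('{');
  if (start === -1) return null;
  let depth = 0;
  for (let i = start; i < text.length; i += 1) {
    if (text[i] === '{') depth += 1;
    if (text[i] === '}' && (depth -= 1) === 0) {
      return JSON.parse(text.slice(start, i + 1));
    }
  }
  return null; // unbalanced braces
}
```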
* feat(deep-forecast): per-candidate parallel LLM calls replace batch expansion
Previously: all candidates → one batch LLM call → LLM averages context →
identical route_disruption → inflation_pass_through chains for all candidates.
Now: each candidate → its own focused LLM call (parallel Promise.all) →
LLM reasons about specific stateKind/region/routeFacility for that candidate.
Results (3 candidates, 3 parallel calls):
- composite: 0.592 → 0.831 (+0.24)
- commodity: 0.17 → 1.00 (all mapped have specific commodity)
- diversity: 0.50 → 0.83 (energy_export_stress, importer_balance_stress
appearing alongside route_disruption — genuinely different chains)
- Baseline updated: 0.831 (above 0.80 threshold → no critique needed)
Also threads learnedSection through extractSingleImpactExpansionCandidate
so the learned examples from autoresearch apply to each focused call.
Per-candidate cache keys (already existed) now serve as primary cache.
* fix(tests): update recovery test for per-candidate LLM call flow
- Change stage mock from impact_expansion → impact_expansion_single
(batch primary path removed, per-candidate is now primary)
- Assert parseMode === per_candidate instead of parseStage /^recovered_/
(recovered_ prefix was only set by old batch_repair path)
- 2257/2257 tests pass
* fix(deep-forecast): add Red Sea, Persian Gulf, South China Sea to chokepoint map
Candidate packets had routeFacilityKey=none for Red Sea / Persian Gulf /
Baltic Sea signals because prediction titles say "Red Sea maritime disruption"
not "Bab el-Mandeb" or "Strait of Hormuz". CHOKEPOINT_MARKET_REGIONS only
had sub-facility names (Bab el-Mandeb, Suez Canal) as keys, not the sea
regions themselves.
Fix: add Red Sea, Persian Gulf, Arabian Sea, Black Sea, South China Sea,
Mediterranean Sea as direct keys so region-level candidate titles resolve.
Result: LLM user prompt now shows routeFacilityKey=Red Sea / Persian Gulf /
Baltic Sea per candidate — giving each focused call the geographic context
needed to generate route-specific chains.
- Autoresearch baseline updated 0.932→0.965 on this run
- T8 extended with Red Sea, Persian Gulf, South China Sea assertions
- 2257/2257 tests pass
* feat(deep-forecast): free-form hypothesis schema + remove registry constraint
- Bump IMPACT_EXPANSION_REGISTRY_VERSION to v4
- Add hypothesisKey, description, geography, affectedAssets, marketImpact, causalLink fields to normalizeImpactHypothesisDraft (keep legacy fields for backward compat)
- Rewrite buildImpactExpansionSystemPrompt: remove IMPACT_VARIABLE_REGISTRY constraint table, use free-form ImpactHypothesis schema with geographic/commodity specificity rules
- Rewrite evaluateImpactHypothesisRejection: use effective key (hypothesisKey || variableKey) for dedup; legacy registry check only for old cached responses without hypothesisKey
- Update validateImpactHypotheses scoring: add geographyScore, commodityScore, causalLinkScore, assetScore terms; channelCoherence/bucketCoherence only apply to legacy responses
- Update parent-must-be-mapped invariant to use hypothesisKey || variableKey as effective key
- Update mapImpactHypothesesToWorldSignals: use effective key for dedup and sourceKey; prefer description/geography over legacy fields
- Update buildImpactPathsForCandidate: match on hypothesisKey || variableKey for parent lookup
- Update buildImpactPathId: use hypothesisKey || variableKey for hash inputs
- Rewrite scoreImpactExpansionQuality: add geographyRate and assetRate metrics; update composite weights
- Update buildImpactPromptCritiqueSystemPrompt/UserPrompt: use hypothesisKey-based chain format in examples
- Add new fields to buildImpactExpansionBundleFromPaths push calls
- Update T7 test assertion: MUST be the exact hypothesisKey instead of variableKey string
* fix(deep-forecast): update breakdown log to show free-form hypothesis fields
* feat(deep-forecast): add commodityDiversity metric to autoresearch scorer
- commodityDiversity = unique commodities / nCandidates (weight 0.35)
Penalizes runs where all candidates default to same commodity.
3 candidates all crude_oil → diversity=0.33 → composite ~0.76 → critique fires.
- Rebalanced composite weights: comDiversity 0.35, geo 0.20, keyDiversity 0.15, chain 0.10, commodityRate 0.10, asset 0.05, mappedRate 0.05
- Breakdown log now shows comDiversity + geo + keyDiversity
- Critique prompt updated: commodity_monoculture failure mode, diagnosis targets commodity homogeneity
- T9: added commodityDiversity=1.0 assertion (2 unique commodities across 2 candidates)
* refactor(deep-forecast): replace commodityDiversity with directCommodityDiversity + directGeoDiversity + candidateSpreadScore
Problem: measuring diversity on all mapped hypotheses misses the case where
one candidate generates 10 implications while others generate 0, or where
all candidates converge on the same commodity due to dominating signals.
Fix: score at the DIRECT hypothesis level (root causes only) and add
a candidate-spread metric:
- directCommodityDiversity: unique commodities among direct hypotheses /
nCandidates. Measures breadth at the root-cause level. 3 candidates all
crude_oil → 0.33 → composite ~0.77 → critique fires.
- directGeoDiversity: unique primary geographies among direct hypotheses /
nCandidates. First segment of compound geography strings (e.g.
'Red Sea, Suez Canal' → 'red sea') to avoid double-counting.
- candidateSpreadScore: normalized inverse-HHI. 1.0 = perfectly even
distribution across candidates. One candidate with 10 implications and
others with 0 → scores near 0 → critique fires.
Weight rationale: comDiversity 0.35, geoDiversity 0.20, spread 0.15,
chain 0.15, comRate 0.08, assetRate 0.04, mappedRate 0.03.
Verified: Run 2 Baltic/Hormuz/Brazil → freight/crude_oil/USD spread=0.98 ✓
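The normalized inverse-HHI spread metric can be sketched as below; the function name and input shape are assumptions:

```javascript
// 1.0 = implications spread evenly across candidates; near 0 = one
// candidate dominates. HHI spans [1/n (even) .. 1 (all in one)].
function candidateSpreadScore(countsByCandidate) {
  const counts = Object.values(countsByCandidate);
  const n = counts.length;
  const total = counts.reduce((a, b) => a + b, 0);
  if (n <= 1) return 1; // a single candidate is trivially "even"
  if (total === 0) return 0;
  const hhi = counts.reduce((a, c) => a + (c / total) ** 2, 0);
  return (1 - hhi) / (1 - 1 / n); // rescale so 1 = perfectly even
}

candidateSpreadScore({ a: 10, b: 0, c: 0 }); // 0 — one candidate dominates
candidateSpreadScore({ a: 3, b: 3, c: 3 });  // 1 — perfectly even
```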
* feat(deep-forecast): add convergence object to R2 debug artifact
Surface autoresearch loop outcome per run: converged (bool), finalComposite,
critiqueIterations (0 or 1), refinementCommitted, and perCandidateMappedCount
(candidateStateId → count). After 5+ runs the artifact alone answers whether
the pipeline is improving.
Architectural changes:
- runImpactExpansionPromptRefinement now returns { iterationCount, committed }
at all exit paths instead of undefined
- Call hoisted before writeForecastTraceArtifacts so the result flows into the
debug payload via dataForWrite.refinementResult
- buildImpactExpansionDebugPayload assembles convergence from validation +
refinementResult; exported for direct testing
- Fix: stale diversityScore reference replaced with directCommodityDiversity
Tests: T-conv-1 (converged=true), T-conv-2 (converged=false + iterations=1),
T-conv-3 (perCandidateMappedCount grouping) — 116/116 pass
* fix(deep-forecast): address P1+P2 review issues from convergence observability PR
P1-A: sanitize LLM-returned proposed_addition before Redis write (prompt injection
guard via sanitizeProposedLlmAddition — strips directive-phrase lines)
P1-B: restore fire-and-forget for runImpactExpansionPromptRefinement; compute
critiqueIterations from quality score (predicted) instead of awaiting result,
eliminating 15-30s critical-path latency on poor-quality runs
P1-C: processDeepForecastTask now returns convergence object to callers; add
convergence_quality_met warn check to evaluateForecastRunArtifacts
P1-D: cap concurrent LLM calls in extractImpactExpansionBundle to 3 (manual
batching — no p-limit) to respect provider rate limits
P2-1: hash full learnedSection in buildImpactExpansionCandidateHash (was sliced
to 80 chars, causing cache collisions on long learned sections)
P2-2: add exitReason field to all runImpactExpansionPromptRefinement return paths
P2-3: sanitizeForPrompt strips directive injection phrases; new
sanitizeProposedLlmAddition applies line-level filtering before Redis write
P2-4: add comment explaining intentional bidirectional affectedAssets/assetsOrSectors
coalescing in normalizeImpactHypothesisDraft
P2-5: extract makeConvTestData helper in T-conv tests; remove refinementCommitted
assertions (field removed from convergence shape)
P2-6: convergence_quality_met check added to evaluateForecastRunArtifacts (warn)
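The manual batching in P1-D (no p-limit dependency) can be sketched as sequential Promise.all over fixed-size slices; names here are illustrative:

```javascript
// Run fn over items at most batchSize at a time: each batch executes
// concurrently, batches execute one after another.
async function mapInBatches(items, batchSize, fn) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(fn))));
  }
  return results;
}
```

Calling it with batchSize 3 caps in-flight LLM requests at 3 regardless of candidate count.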
🤖 Generated with Claude Sonnet 4.6 via Claude Code (https://claude.ai/claude-code) + Compound Engineering v2.49.0
Co-Authored-By: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
* fix(docs): add blank lines around lists in plan (MD032)
* fix(deep-forecast): address P1+P2 reviewer issues in convergence observability
P1-1: mapImpactHypothesesToWorldSignals used free-form marketImpact values
(price_spike, shortage, credit_stress, risk_off) verbatim as signal channel
types, producing unknown types that buildMarketTransmissionGraph cannot
consume. Add IMPACT_SIGNAL_CHANNELS set + resolveImpactChannel() to map
free-form strings to the nearest valid channel before signal materialization.
P1-2: sanitizeForPrompt had directive-phrase stripping added that was too
broad for a function called on headlines, evidence tables, case files, and
geopolitical summaries. Reverted to original safe sanitizer (newline/control
char removal only). Directive stripping remains in sanitizeProposedLlmAddition
where it is scoped to Redis-bound LLM-generated additions only.
P2: Renamed convergence.critiqueIterations to predictedCritiqueIterations to
make clear this is a prediction from the quality score, not a measured count
from actual refinement behavior (refinement is fire-and-forget after artifact
write). Updated T-conv-1/2 test assertions to match.
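The P1-1 channel resolution can be sketched as a lookup plus nearest-match fallbacks; the channel names below are illustrative placeholders, not the project's actual IMPACT_SIGNAL_CHANNELS set:

```javascript
// Map free-form LLM marketImpact strings onto a fixed channel vocabulary
// before signal materialization, so downstream graph builders never see
// unknown channel types.
const IMPACT_SIGNAL_CHANNELS = new Set(['energy_price', 'freight_rate', 'credit', 'fx']);

function resolveImpactChannel(raw) {
  if (IMPACT_SIGNAL_CHANNELS.has(raw)) return raw;
  if (/price|spike|shortage/i.test(raw)) return 'energy_price';
  if (/credit/i.test(raw)) return 'credit';
  if (/fx|currency/i.test(raw)) return 'fx';
  return 'energy_price'; // conservative default for unrecognized strings
}
```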
* feat(deep-forecast): inject live news headlines into evidence table
Wire inputs.newsInsights / inputs.newsDigest through the candidate
selection pipeline so buildImpactExpansionEvidenceTable receives up to
3 commodity-relevant live headlines as 'live_news' evidence entries.
Changes:
- IMPACT_COMMODITY_LEXICON: extend fertilizer pattern (fertiliser,
nitrogen, phosphate, npk); add food_grains and shipping_freight entries
- filterNewsHeadlinesByState: new pure helper that scores headlines by
alert status, LNG/energy/route/sanctions signal match, lexicon commodity
match, and source corroboration count (min score 2 to include)
- buildImpactExpansionEvidenceTable: add newsItems param, inject
live_news entries, raise cap 8→11
- buildImpactExpansionCandidate: add newsInsights/newsDigest params,
compute newsItems via filterNewsHeadlinesByState
- selectImpactExpansionCandidates: add newsInsights/newsDigest to options
- Call site: pass inputs.newsInsights/newsDigest at seed time
- Export filterNewsHeadlinesByState, buildImpactExpansionEvidenceTable
- 9 new tests (T-news-1 through T-lex-3): all pass, 125 total pass
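The additive scoring with the min-score-2 cutoff can be sketched as below; the regexes and field names are placeholders, not the shipped constants:

```javascript
// Crude stand-in for the energy/route signal regex.
const ENERGY_ROUTE_RE = /\b(lng|oil|gas|strait|canal|sanction)s?\b/i;

function scoreHeadline(headline, state) {
  let score = 0;
  if (headline.alert) score += 1;                              // alert status
  if (ENERGY_ROUTE_RE.test(headline.title)) score += 1;        // signal match
  if (state.commodityPattern && state.commodityPattern.test(headline.title)) {
    score += 1;                                                // lexicon commodity match
  }
  if ((headline.sources ?? 1) >= 2) score += 1;                // corroboration
  return score;
}

const filterNewsHeadlinesByState = (headlines, state) =>
  headlines.filter((h) => scoreHeadline(h, state) >= 2); // min score 2
```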
🤖 Generated with Claude Sonnet 4.6 (200K context) via Claude Code (https://claude.ai/claude-code) + Compound Engineering v2.49.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(deep-forecast): remove hardcoded LNG boost from filterNewsHeadlinesByState
The LNG+2 score was commodity-specific and inconsistent with the
intent: headline scoring should be generic, not biased toward any
named commodity. The function already handles the state's detected
commodity dynamically via lexEntry.pattern (IMPACT_COMMODITY_LEXICON).
LNG headlines still score via CRITICAL_NEWS_ENERGY_RE (+1) and
CRITICAL_NEWS_ROUTE_RE (+1) when relevant to the state's region.
All 125 tests pass.
* fix(deep-forecast): address all P1+P2 code review findings from PR #2178
P1 fixes (block-merge):
- Lower third_order mapped floor 0.74→0.70 (max achievable via 0.72 multiplier was 0.72)
- Guard runImpactExpansionPromptRefinement against empty validation (no_mapped exit)
- Replace block-list sanitizeProposedLlmAddition with pattern-based allowlist (HTML/JS/directive takeover)
- Fix TOCTOU on PROMPT_LAST_ATTEMPT_KEY: claim slot before quality check, not after LLM call
P2 fixes:
- Fix learned section overflow: use slice(-MAX) to preserve tail, not discard all prior content
- Add safe_haven_bid and global_crude_spread_stress branches to resolveImpactChannel
- quality_met path now sets rate-limit key (prevents 3 Redis GETs per good run)
- Hoist extractNewsClusterItems outside stateUnit map in selectImpactExpansionCandidates
- Export PROMPT_LEARNED_KEY, PROMPT_BASELINE_KEY, PROMPT_LAST_ATTEMPT_KEY + read/clear helpers
All 125 tests pass.
* fix(todos): add blank lines around lists/headings in todo files (markdownlint)
* fix(todos): fix markdownlint blanks-around-headings/lists in all todo files
---------
Co-authored-by: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
* fix(forecast): fix impact expansion LLM prompt + add pre-filter validation diagnostic
Problem: LLM generates 8 hypotheses but 6 get rejected post-hoc because the
prompt gave 3 flat unlinked enum lists without per-variable constraints, causing
the LLM to hallucinate plausible but invalid channel/bucket combinations. Only
2 direct-only hypotheses survived, producing no expanded paths ->
completed_no_material_change on every deep run.
Also fixes two related bugs:
- Duplicate canonical state unit labels (same region + stateKind) blocked the
deep worker. Added label dedup filter in buildCanonicalStateUnits.
- When all hypotheses failed state-ID filtering in materializeImpactExpansion,
impactExpansionSummary showed 0 hypotheses with no rejection reasons, losing
all diagnostic data.
Changes:
- Bump IMPACT_EXPANSION_REGISTRY_VERSION v1 -> v2 to invalidate cached LLM
responses so the model re-generates with new constraint guidance
- Add buildRegistryConstraintTable() that serializes IMPACT_VARIABLE_REGISTRY
and MARKET_BUCKET_ALLOWED_CHANNELS into a compact constraint block
- Rewrite buildImpactExpansionSystemPrompt() to replace 3 flat enum lists with
the structured constraint table, bucket-channel dual gate rule, and explicit
MiroFish causal chain guidance (direct -> second_order -> third_order example)
- Add validation field to all 3 return paths in evaluateDeepForecastPaths so
pre-state-filter rejection data is always preserved
- Add hypothesisValidation to buildImpactExpansionDebugPayload capturing
totalHypotheses, validatedCount, mappedCount, rejectionReasonCounts, and
per-hypothesis rejection identity (candidateIndex, candidateStateId,
variableKey, channel, targetBucket, order, rejectionReason)
- Filter duplicate state unit labels post-finalization in buildCanonicalStateUnits
- Export buildRegistryConstraintTable, IMPACT_VARIABLE_REGISTRY,
MARKET_BUCKET_ALLOWED_CHANNELS for testability
- Add 4 new tests: mapped=0 early return has validation, no-expanded-accepted
path has validation, hypothesisValidation flows through buildForecastTraceArtifacts,
buildRegistryConstraintTable matches registry
* refactor(forecast): memoize constraint table, clarify dedup filter, tighten test A2
- Extract IMPACT_EXPANSION_REGISTRY_CONSTRAINT_TABLE const so the
constraint table is built once instead of on every prompt invocation
- Replace IIFE dedup filter in buildCanonicalStateUnits with an explicit
seenLabels Set for readability
- Add format-assumption comment to buildRegistryConstraintTable
- Rename test A2 and add mapped > 0 assertion to pin it past the
mapped=0 early-return path
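The memoization is just a module-level const; a minimal sketch (builder body and prompt text are placeholders, not the real implementations):

```typescript
// Invocation counter only exists in this sketch, to show the table is built once.
let buildCalls = 0;

function buildRegistryConstraintTable(): string {
  buildCalls += 1;
  return "VARIABLES ..."; // placeholder for the real serialized table
}

// Built exactly once at module load, as in the refactor.
const IMPACT_EXPANSION_REGISTRY_CONSTRAINT_TABLE = buildRegistryConstraintTable();

function buildImpactExpansionSystemPrompt(): string {
  // Every prompt invocation reuses the memoized string.
  return `Constraints:\n${IMPACT_EXPANSION_REGISTRY_CONSTRAINT_TABLE}`;
}
```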
* fix(forecast): disambiguate duplicate state unit labels instead of dropping units
The previous dedup filter silently dropped state units whose labels
collided after finalizeStateUnit(), losing their id, forecastIds,
signals, and deep-candidate eligibility.
Two distinct clusters can score below the merge threshold (< 5.5) while
still resolving to the same formatStateUnitLabel() output when they
share the same leadRegion and stateKind. The filter kept the first
and discarded the rest, suppressing valid deep paths.
Fix: replace the .filter() with a .map() that disambiguates collision
labels using dominantDomain suffix, falling back to the last 4 chars
of the unit id if domain-based disambiguation would itself collide.
The seenLabels Set tracks all assigned labels to prevent any secondary
collision. The snapshot validator (and deep worker) no longer sees
duplicate labels, and no units are dropped.
Also export buildCanonicalStateUnits for direct test coverage.
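The disambiguating .map() can be sketched as below, with a simplified unit shape (the real buildCanonicalStateUnits operates on full finalized state units, and labels come from formatStateUnitLabel()):

```typescript
// Simplified stand-in for a finalized state unit.
interface StateUnit { id: string; label: string; dominantDomain: string }

function disambiguateLabels(units: StateUnit[]): StateUnit[] {
  // Tracks every assigned label so no secondary collision can slip through.
  const seenLabels = new Set<string>();
  return units.map((unit) => {
    let label = unit.label;
    if (seenLabels.has(label)) {
      // First attempt: dominantDomain suffix.
      const byDomain = `${unit.label} (${unit.dominantDomain})`;
      // Fallback: last 4 chars of the unit id if the domain suffix also collides.
      label = seenLabels.has(byDomain)
        ? `${unit.label} (${unit.id.slice(-4)})`
        : byDomain;
    }
    seenLabels.add(label);
    return { ...unit, label };
  });
}
```

Unlike the dropped .filter(), every unit survives with its id, forecastIds, and signals intact; only the colliding label changes.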
* feat(deep-forecast): phase 2 scoring recalibration + prompt excellence
Fixes a cascade of three scoring gates that caused every deep run to return
completed_no_material_change despite valid LLM hypothesis generation.
Changes:
- Lower second_order validation floors (mapped: 0.66→0.58, internal: 0.58→0.50)
and raise multiplier (0.85→0.88) so typical LLM quality (strength~0.75,
conf~0.75, 2 refs, specificityScore=0.2) now reaches mapped status
- Binary evidenceSupport: refs >= 2 → 1, else → 0 (enforces 2-ref requirement
at scoring layer, not just prompt text; 1-ref hypotheses stay trace_only)
- Parent-must-be-mapped invariant: post-validation pass downgrades mapped
second_order/third_order whose dependsOnKey has no mapped parent
- Lower pathScore threshold from 0.66 to 0.50 to allow barely-mapped pairs
through to expanded path generation
- Add 6 missing maritime chokepoints to CHOKEPOINT_MARKET_REGIONS
- Bump IMPACT_EXPANSION_REGISTRY_VERSION v2 → v3 (invalidates stale LLM cache)
- Prompt v3: explicit dependsOnKey pairing, 2-evidence citation rule,
confidence calibration guidance, direct+second_order pair structure
- Add scoringBreakdown (all hypotheses with scoring factors) and gateDetails
(active thresholds) to debug artifact for observability feedback loop
- Export buildImpactExpansionSystemPrompt, extractImpactRouteFacilityKey,
extractImpactCommodityKey for testability
- 8 new tests (T1-T8) covering all phase 2 changes; 111/111 pass
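Two of the gates above can be sketched as follows. The hypothesis shape and status names are simplified stand-ins for the real scoring pipeline, and this single pass does not cascade downgrades through chains of dependents:

```typescript
type Status = "mapped" | "trace_only";
interface Hypothesis {
  key: string;
  order: "direct" | "second_order" | "third_order";
  refs: number;
  dependsOnKey?: string;
  status: Status;
}

// Binary evidenceSupport: refs >= 2 -> 1, else 0. The 2-reference rule is
// enforced at the scoring layer, so 1-ref hypotheses stay trace_only.
const evidenceSupport = (h: Hypothesis): number => (h.refs >= 2 ? 1 : 0);

// Parent-must-be-mapped invariant: a post-validation pass downgrades mapped
// second/third-order hypotheses whose dependsOnKey has no mapped parent.
function enforceParentMapped(hypotheses: Hypothesis[]): Hypothesis[] {
  const mappedKeys = new Set(
    hypotheses.filter((h) => h.status === "mapped").map((h) => h.key),
  );
  return hypotheses.map((h): Hypothesis => {
    if (
      h.status === "mapped" &&
      (h.order === "second_order" || h.order === "third_order") &&
      (!h.dependsOnKey || !mappedKeys.has(h.dependsOnKey))
    ) {
      return { ...h, status: "trace_only" };
    }
    return h;
  });
}
```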
* fix(seeders): apply gold standard TTL-extend+retry pattern to Aviation, NOTAM, Cyber, PositiveEvents
* feat(consumer-prices): default to All — global comparison table as landing view
- DEFAULT_MARKET = 'all' so panel opens with the global view
- 🌍 All pill added at front of market bar
- All view fetches all 9 markets in parallel via fetchAllMarketsOverview()
and renders a comparison table: Market / Index / WoW / Spread / Updated
- Clicking any market row drills into that market's full tab view
- SINGLE_MARKETS exported for use in All-view iteration
- CSS: .cp-global-table and row styles
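The parallel fetch is a Promise.all over the market list; a minimal sketch with a placeholder per-market fetcher and a trimmed SINGLE_MARKETS (the real list has 9 markets and the fetcher hits the API):

```typescript
interface MarketOverview { market: string; index: number; wow: number }

const SINGLE_MARKETS = ["us", "uk", "de"] as const;

// Placeholder for the real per-market API call.
async function fetchMarketOverview(market: string): Promise<MarketOverview> {
  return { market, index: 100, wow: 0.2 };
}

// All markets fetched in parallel; Promise.all preserves list order, so
// rows map 1:1 onto the comparison-table rows.
async function fetchAllMarketsOverview(): Promise<MarketOverview[]> {
  return Promise.all(SINGLE_MARKETS.map((m) => fetchMarketOverview(m)));
}
```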
* fix(forecast): guarantee military domain representation in publish selection
Military detector forecasts (ADS-B flight tracking + theater posture API)
structurally score near zero on readiness metrics that require LLM-enriched
caseFile content (supporting evidence, news headlines, calibration, triggers).
This causes them to rank below the target count threshold every run despite
a valid elevated posture signal.
Add a domain guarantee post-pass after the 3 selection loops: if no military
forecast was selected and we have room below MAX_TARGET_PUBLISHED_FORECASTS,
inject the highest-scoring eligible military forecast. This does not displace
any already-selected forecast and respects all existing family/situation caps.
Diagnosis: Baltic theater at postureLevel='elevated' with 6 active flights
generates a military forecast (prob=0.41, confidence=0.30, score=0.136) but
gets buried behind 15+ well-grounded situation cluster forecasts at score 0.4+.
Tests: 3 new assertions in 'military domain guarantee in publish selection'.
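The post-pass can be sketched like this, with a simplified forecast shape (MAX_TARGET_PUBLISHED_FORECASTS and the eligibility filtering are stand-ins for the real selection code, which also applies family/situation caps):

```typescript
interface Forecast { id: string; domain: string; score: number }

const MAX_TARGET_PUBLISHED_FORECASTS = 16;

// Runs after the 3 selection loops: if no military forecast was selected and
// there is room under the cap, append the highest-scoring eligible one.
// Nothing already selected is displaced.
function guaranteeMilitaryDomain(
  selected: Forecast[],
  eligible: Forecast[],
): Forecast[] {
  const hasMilitary = selected.some((f) => f.domain === "military");
  const hasRoom = selected.length < MAX_TARGET_PUBLISHED_FORECASTS;
  if (hasMilitary || !hasRoom) return selected;
  const best = eligible
    .filter((f) => f.domain === "military")
    .sort((a, b) => b.score - a.score)[0];
  return best ? [...selected, best] : selected;
}
```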
* fix(forecast): block generic-actor cross-theater interactions + raise enrichment budget
Root cause: actor registry uses name:category as key (e.g. "Incumbent leadership:state"),
causing unrelated situations (Israel conflict, Taiwan political) to share the same actor
ID and fire sharedActor=true in pushInteraction. This propagated into the reportable
ledger and surfaced as junk effects like Israel→Taiwan at 80% confidence.
Two-pronged fix:
1. Specificity gate in pushInteraction: sharedActor now requires avgSpecificity >= 0.75.
Generic blueprint actors ("Incumbent leadership" ~0.68, "Civil protection authorities"
~0.73) no longer qualify as structural cross-situation links. Named domain-specific
actors ("Threat actors:adversarial" ~0.95) continue to qualify.
2. MACRO_REGION_MAP + isCrossTheaterPair + gate in buildCrossSituationEffects: for
cross-theater pairs (different macro-regions) with non-exempt channels, requires
sharedActor=true AND avgActorSpecificity >= 0.90. Exempt channels: cyber_disruption,
market_repricing (legitimately global). Same-macro-region pairs (Brazil/Mexico both
AMERICAS) are unaffected.
Verified against live run 1773983083084-bu6b1f:
BLOCKED: Israel→Taiwan (MENA/EAST_ASIA, spec 0.68)
BLOCKED: Israel→US political (MENA/AMERICAS, spec 0.68)
BLOCKED: Cuba→Iran (AMERICAS/MENA, spec 0.73)
BLOCKED: Brazil→Israel (AMERICAS/MENA, spec 0.85 < 0.90)
ALLOWED: China→US cyber_disruption (exempt channel)
ALLOWED: Brazil→Mexico (same AMERICAS)
Also raises ENRICHMENT_COMBINED_MAX from 3 to 5 (total budget 6→8),
targeting enrichedRate improvement from ~38% to ~60%.
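The two gates can be sketched as below. Thresholds (0.75, 0.90) and the exempt channels come from the change description; the macro-region map here is a small hypothetical excerpt:

```typescript
// Hypothetical excerpt of MACRO_REGION_MAP (situation key -> macro-region).
const MACRO_REGION_MAP: Record<string, string> = {
  israel: "MENA", iran: "MENA", taiwan: "EAST_ASIA", us: "AMERICAS",
  cuba: "AMERICAS", brazil: "AMERICAS", mexico: "AMERICAS", china: "EAST_ASIA",
};

// Channels that are legitimately global and bypass the cross-theater gate.
const EXEMPT_CHANNELS = new Set(["cyber_disruption", "market_repricing"]);

function isCrossTheaterPair(a: string, b: string): boolean {
  return MACRO_REGION_MAP[a] !== MACRO_REGION_MAP[b];
}

// Gate 1: sharedActor only counts as a structural link for specific actors.
function qualifiesAsSharedActor(avgSpecificity: number): boolean {
  return avgSpecificity >= 0.75;
}

// Gate 2: cross-theater pairs on non-exempt channels need a shared actor
// with very high specificity; same-macro-region pairs pass unchanged.
function allowCrossSituationEffect(
  a: string, b: string, channel: string,
  sharedActor: boolean, avgActorSpecificity: number,
): boolean {
  if (!isCrossTheaterPair(a, b)) return true;
  if (EXEMPT_CHANNELS.has(channel)) return true;
  return sharedActor && avgActorSpecificity >= 0.90;
}
```

Replaying the live-run examples above through this sketch gives the same BLOCKED/ALLOWED outcomes.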
* fix(plans): fix markdown lint errors in forecast semantic quality plan
* fix(plans): fix remaining markdown lint error in plan file
* fix(forecast): build worldState before tracing so simulation data reaches caseFile
buildForecastTraceArtifacts was building worldState after tracedPredictions,
so simulation data was never available to buildForecastTraceRecord. Each
forecast's caseFile.worldState had situationId/familyId/simulationSummary
all undefined, making the 3-round MiroFish simulation invisible at the
forecast level.
Fix:
- Compute worldState before tracing (so simulationState is ready)
- Build forecastId → situationSimulation lookup from worldState.simulationState
- Pass lookup into buildForecastTraceRecord; inject situationId, familyId,
familyLabel, simulationSummary, simulationPosture, simulationPostureScore
into caseFile.worldState for each matched forecast
- Add regression assertion to forecast-trace-export tests
All 194 forecast tests pass.
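The reordering plus lookup can be sketched as below. All shapes and names are simplified stand-ins; the real buildForecastTraceRecord injects several more fields (familyId, simulationPosture, etc.):

```typescript
interface SituationSimulation { situationId: string; simulationSummary: string }
interface WorldState {
  simulationState: { forecastIds: string[]; simulation: SituationSimulation }[];
}
interface TraceRecord { forecastId: string; worldState?: SituationSimulation }

// forecastId -> situationSimulation lookup built from worldState.simulationState.
function buildSimulationLookup(
  worldState: WorldState,
): Map<string, SituationSimulation> {
  const lookup = new Map<string, SituationSimulation>();
  for (const entry of worldState.simulationState) {
    for (const id of entry.forecastIds) lookup.set(id, entry.simulation);
  }
  return lookup;
}

// worldState is computed BEFORE tracing, so simulation data is available
// when each trace record is built; unmatched forecasts simply get undefined.
function buildTraceRecords(
  forecastIds: string[],
  worldState: WorldState,
): TraceRecord[] {
  const lookup = buildSimulationLookup(worldState);
  return forecastIds.map((forecastId) => ({
    forecastId,
    worldState: lookup.get(forecastId),
  }));
}
```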