Commit Graph

53 Commits

Author SHA1 Message Date
Elie Habib
01f6057389 feat(simulation): MiroFish Phase 2 — theater-limited simulation runner (#2220)
* feat(simulation): MiroFish Phase 2 — theater-limited simulation runner

Adds the simulation execution layer that consumes simulation-package.json
and produces simulation-outcome.json for maritime chokepoint + energy/logistics
theaters, closing the WorldMonitor → MiroFish handoff loop.

Changes:
- scripts/seed-forecasts.mjs: 2-round LLM simulation runner (prompt builders,
  JSON extractor, runTheaterSimulation, writeSimulationOutcome, task queue
  with NX dedup lock, runSimulationWorker poll loop)
- scripts/process-simulation-tasks.mjs: standalone worker entry point
- proto: GetSimulationOutcome RPC + make generate
- server/worldmonitor/forecast/v1/get-simulation-outcome.ts: RPC handler
- server/gateway.ts: slow tier for get-simulation-outcome
- api/health.js: simulationOutcomeLatest in STANDALONE + ON_DEMAND keys
- tests: 14 new tests for simulation runner functions

* fix(simulation): address P1/P2 code review findings from PR #2220

Security (P1 #018):
- sanitizeForPrompt() applied to all entity/seed fields interpolated into
  Round 1 prompt (entityId, class, stance, seedId, type, timing)
- sanitizeForPrompt() applied to actorId and entityIds in Round 2 prompt
- sanitizeForPrompt() + length caps applied to all LLM array fields written
  to R2 (dominantReactions, stabilizers, invalidators, keyActors, timingMarkers)

Validation (P1 #019):
- Added validateRunId() regex guard
- Applied in enqueueSimulationTask() and processNextSimulationTask() loop

Type safety (P1 #020):
- Added isOutcomePointer() and isPackagePointer() type guards in TS handlers
- Replaced unsafe as-casts with runtime-validated guards in both handlers
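
A runtime guard of this kind might look like the minimal sketch below. The field list follows the pointer shape given for forecast:simulation-package:latest ({ runId, pkgKey, schemaVersion, theaterCount, generatedAt }); the per-field type checks are assumptions for illustration, not the handlers' actual code.

```javascript
// Hypothetical sketch: the real guard in the TS handlers may check more or fewer fields.
function isPackagePointer(value) {
  if (value === null || typeof value !== 'object') return false;
  return (
    typeof value.runId === 'string' &&
    typeof value.pkgKey === 'string' &&
    typeof value.schemaVersion === 'string' &&
    typeof value.theaterCount === 'number' &&
    typeof value.generatedAt === 'number'
  );
}

// Usage: validate the parsed payload before reading any field from it,
// instead of trusting an `as` cast.
const parsedPayload = JSON.parse(
  '{"runId":"run-1","pkgKey":"pkg/run-1.json","schemaVersion":"v1","theaterCount":2,"generatedAt":1}'
);
const packagePointer = isPackagePointer(parsedPayload) ? parsedPayload : null;
```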

Correctness (P2 #022):
- Log warning when pkgPointer.runId does not match task runId

Architecture (P2 #024):
- isMaritimeChokeEnergyCandidate() accepts both flat and nested topBucketId
- Call site simplified to pass theater directly

Performance (P2 #025):
- SIMULATION_ROUND1_MAX_TOKENS raised 1800 to 2200
- Added max 3 initialReactions instruction to Round 1 prompt

Maintainability (P2 #026):
- Simulation pointer keys exported from server/_shared/cache-keys.ts
- Both TS handlers import from shared location

Documentation (P2 #027):
- Strengthened runId no-op description in proto and OpenAPI spec

* fix(todos): add blank lines around lists in markdown todo files

* style(api): reformat openapi yaml to match linter output

* test(simulation): add flat-shape filter test + getSimulationOutcome handler coverage

Two tests identified as missing during PR #2220 review:

1. isMaritimeChokeEnergyCandidate flat-shape tests — covers the || candidate.topBucketId
   normalization added in the P1/P2 review pass. The existing tests only used the nested
   marketContext.topBucketId shape; this adds the flat root-field shape that arrives from
   the simulation-package.json JSON (selectedTheaters entries have topBucketId at root).

2. getSimulationOutcome handler structural tests — verifies the isOutcomePointer guard,
   found:false NOT_FOUND return, found:true success path, note population on runId mismatch,
   and redis_unavailable error string. Follows the readSrc static-analysis pattern used
   elsewhere in server-handlers.test.mjs (handler imports Redis so full integration test
   would require a test Redis instance).
2026-03-25 13:55:59 +04:00
Elie Habib
f87c8c71c4 feat(forecast): Phase 2 simulation package read path (#2219)
* feat(forecast): Phase 2 simulation package read path (getSimulationPackage RPC + Redis existence key)

- writeSimulationPackage now writes forecast:simulation-package:latest to Redis after
  successful R2 write, containing { runId, pkgKey, schemaVersion, theaterCount, generatedAt }
  with TTL matching TRACE_REDIS_TTL_SECONDS (60 days)
- New getSimulationPackage RPC handler reads Redis key, returns pointer metadata without
  requiring an R2 fetch (zero R2 cost for existence check)
- Wired into ForecastServiceHandler and server/gateway.ts cache tier (medium)
- Proto: GetSimulationPackage RPC + get_simulation_package.proto message definitions
- api/health.js: simulationPackageLatest added to STANDALONE_KEYS + ON_DEMAND_KEYS
- Tests: SIMULATION_PACKAGE_LATEST_KEY constant + writeSimulationPackage null-guard test

Closes todo #017 (Phase 2 prerequisites for MiroFish integration)

* chore(generated): regenerate proto types for GetSimulationPackage RPC

* fix(simulation-rpc): distinguish Redis failure from not-found; signal runId mismatch

- Add `error` field to GetSimulationPackageResponse: populated with
  "redis_unavailable" on Redis errors so callers can distinguish a
  healthy not-found (found=false, error="") from a Redis failure
  (found=false, error="redis_unavailable"). Adds console.warn on error.
- Add `note` field: populated when req.runId is supplied but does not
  match the latest package's runId, signalling that per-run filtering
  is not yet active (Phase 3).
- Add proto comment on run_id: "Currently ignored; reserved for Phase 3"
- Add milliseconds annotation to generated_at description.
- Simplify handler: extract NOT_FOUND constant, remove SimulationPackagePointer
  interface, remove || '' / || 0 guards on guaranteed-present fields.
- Regenerate all buf-generated files.

Fixes todos #018 (runId silently ignored) and #019 (error indistinguishable
from not-found). Also resolves todos #022 (simplifications) and #023
(OpenAPI required fields / generatedAt unit annotation).

* fix(simulation-rpc): change cache tier from medium to slow (aligns with deep-run update frequency)

* fix(simulation-rpc): fix key prefixing, make Redis errors reachable, no-cache not-found

Three P1 regressions caught in external review:

1. Key prefix bug: getCachedJson() applies preview:<sha>: prefix in non-production
   environments, but writeSimulationPackage writes the raw key via a direct Redis
   command. In preview/dev the RPC always returned found:false even when the package
   existed. Fix: new getRawJson() in redis.ts always uses the unprefixed key AND throws
   on failure instead of swallowing errors.

2. redis_unavailable unreachable: getCachedJson swallows fetch failures and missing
   credentials by returning null, so the catch block for redis_unavailable was dead
   code. getRawJson() throws on HTTP errors and missing credentials, making the
   error: "redis_unavailable" contract actually reachable.

3. Negative-cache stampede: slow tier caches every 200 GET. A request before any deep
   run wrote a package returned { found:false } which the CDN cached for up to 1h,
   breaking post-run discovery. Fix: markNoCacheResponse() on both not-found and
   error paths so they are served fresh on every request.
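
The contract after these three fixes can be sketched roughly as follows. This is an illustration only: `deps.getRawJson` and `deps.markNoCache` are injected stand-ins for the real helpers, whose signatures are assumed here, and the note text is made up.

```javascript
// Sketch only: deps.getRawJson / deps.markNoCache are hypothetical stand-ins
// for the Redis reader (which throws on failure) and the no-cache marker.
const LATEST_KEY = 'forecast:simulation-package:latest';

async function getSimulationPackage(req, deps) {
  let pointer;
  try {
    pointer = await deps.getRawJson(LATEST_KEY); // unprefixed key, throws on Redis failure
  } catch (err) {
    deps.markNoCache(); // error responses must not be cached
    return { found: false, error: 'redis_unavailable' };
  }
  if (!pointer) {
    deps.markNoCache(); // a not-found must not be negatively cached
    return { found: false, error: '' };
  }
  // runId is not used for filtering yet; a mismatch is only signalled via `note`.
  const mismatch = Boolean(req.runId) && req.runId !== pointer.runId;
  return {
    ...pointer,
    found: true,
    error: '',
    note: mismatch ? 'runId filtering is not active; latest package returned' : '',
  };
}
```

Callers can then tell the three states apart: found=true, found=false with an empty error (healthy not-found), and found=false with error="redis_unavailable".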
2026-03-24 22:45:22 +04:00
Elie Habib
b7e6333877 feat(deep-forecast): Phase 1 simulation package export contract (#2204)
* feat(deep-forecast): Phase 1 simulation package export contract

Add buildSimulationPackageFromDeepSnapshot and writeSimulationPackage to
produce simulation-package.json alongside deep-snapshot.json on every eligible
fast run, completing Phase 1 of the WorldMonitor → MiroFish bridge defined in
docs/internal/wm-mirofish-gap.md.

Phase 1 scope: maritime chokepoint + energy/logistics theaters only.
A candidate qualifies if its routeFacilityKey is a known chokepoint in
CHOKEPOINT_MARKET_REGIONS and the top bucket is energy or freight, or the
commodityKey is an energy commodity.
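
As a rough illustration of this qualification rule, the sketch below reads it as "known chokepoint AND (energy/freight top bucket OR energy commodity)". That grouping is an assumption, since the sentence above can be parsed more than one way. The sets are small illustrative samples rather than the real CHOKEPOINT_MARKET_REGIONS contents, and `lng` is a hypothetical commodity key.

```javascript
// Illustrative samples only; the real lookup tables are larger.
const SAMPLE_CHOKEPOINTS = new Set(['Strait of Hormuz', 'Suez Canal', 'Bab el-Mandeb']);
const SAMPLE_ENERGY_COMMODITIES = ['crude_oil', 'lng'];
const QUALIFYING_BUCKETS = ['energy', 'freight'];

function isMaritimeChokeEnergyCandidate(candidate) {
  // Accept both the nested (marketContext.topBucketId) and flat (topBucketId) shapes.
  const topBucketId = candidate.marketContext?.topBucketId || candidate.topBucketId || '';
  const isChokepoint = SAMPLE_CHOKEPOINTS.has(candidate.routeFacilityKey);
  const bucketOk = QUALIFYING_BUCKETS.includes(topBucketId);
  const commodityOk = SAMPLE_ENERGY_COMMODITIES.includes(candidate.commodityKey);
  return isChokepoint && (bucketOk || commodityOk);
}
```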

Package shape (schemaVersion: v1):
- selectedTheaters: top 1–3 qualifying candidates with theater ID, route,
  commodity, bucket, channel, and rankingScore
- simulationRequirement: deterministic template per theater (no LLM, fully
  cacheable), built from label, stateKind, route, commodity, channel, and
  criticalSignalTypes
- structuralWorld: filtered stateUnits, worldSignals, transmission edges,
  market buckets, situationClusters, situationFamilies touching the theater
- entities: extracted from actorRegistry (forecastId overlap), stateUnit
  actors, and evidence table actor entries; classified into 7 entity classes
  (state_actor, military_or_security_actor, regulator_or_central_bank,
  exporter_or_importer, logistics_operator, market_participant,
  media_or_public_bloc); falls back to anchor set if extraction finds nothing
- eventSeeds: headline evidence → live_news, disruption-keyword signal
  evidence → observed_disruption; T+0h timing; relative timing format
- constraints: route_chokepoint_status (hard if criticalSignalLift ≥ 0.25),
  commodity_exposure (always hard), market_admissibility (soft, channel
  routing), known_invalidators (soft, when contradictionScore ≥ 0.10)
- evaluationTargets: deterministic escalation/containment/spillover path
  questions + T+24h/T+48h/T+72h timing markers per theater

Also adds 6 missing chokepoints to CHOKEPOINT_MARKET_REGIONS:
Baltic Sea, Danish Straits, Strait of Gibraltar, Panama Canal,
Lombok Strait, Cape of Good Hope.

writeSimulationPackage runs fire-and-forget after writeDeepForecastSnapshot
so it does not add latency to the critical seed path.

17 new unit tests covering: theater filter, package shape, simulationRequirement
content, eventSeeds, constraints (hard/soft), evaluationTargets structure,
entity extraction, key format, and 3-theater cap.

🤖 Generated with Claude Sonnet 4.6 via Claude Code (https://claude.ai/claude-code) + Compound Engineering v2.49.0

Co-Authored-By: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>

* fix(simulation-package): address P1+P2 code review issues from PR #2204

P1 fixes:
- inferEntityClassFromName: use word-boundary regex to prevent "force"
  substring false positives (e.g. "workforce", "Salesforce")
- buildSimulationPackageEntities: key Map entries by candidateStateId
  instead of dominantRegion to prevent collision across theaters
  sharing the same region
- writeSimulationPackage call site: pass priorWorldState so
  actorRegistry is available to buildSimulationPackageFromDeepSnapshot

P2 fixes:
- buildSimulationRequirementText: apply sanitizeForPrompt to
  theater.label, stateKind, topChannel, and critTypes before
  string interpolation (stored prompt injection risk)
- buildSimulationPackageEventSeeds: apply sanitizeForPrompt to
  entry.text before .slice(0, 200)
- isMaritimeChokeEnergyCandidate: replace new Set() allocation
  per call with Array.includes for 2-element arrays
- buildSimulationPackageEntities: convert allForecastIds to Set
  before actor registry loop (O(n²) → O(n))
- buildSimulationPackageEvaluationTargets: add missing candidate
  guard with console.warn when candidate is undefined for theater
- selectedTheaters map: add label fallback to dominantRegion /
  'unknown theater' to prevent "undefined" in simulationRequirement

Tests: 6 new unit tests covering the word-boundary fix, entity key
collision, injection stripping, and undefined label guard

---------

Co-authored-by: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
2026-03-24 20:40:52 +04:00
Elie Habib
226cebf9bc feat(deep-forecast): Phase 2+3 scoring recalibration + autoresearch prompt self-improvement (#2178)
* fix(deep-forecast): lower acceptance threshold 0.60→0.50 to match real score distribution

computeDeepPathAcceptanceScore formula: pathScore×0.55 + quality×0.20 + coherence×0.15
With pathScore≈0.65, quality≈0.30, coherence≈0.55:
  0.358 + 0.060 + 0.083 = 0.50

The 0.60 threshold was calibrated before understanding that reportableQualityScore
is constrained by world-state simulation geometry (not hypothesis quality), and
coherence loses 0.15 for generic candidates without routeFacilityKey. The threshold
was structurally unreachable with typical expanded paths.
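
The arithmetic above can be checked with a short sketch. It encodes only the three weighted terms quoted in this message; any other terms in the real computeDeepPathAcceptanceScore are not shown here, and the function name below is illustrative.

```javascript
// Mirrors the formula quoted above: pathScore×0.55 + quality×0.20 + coherence×0.15.
function acceptanceScoreSketch({ pathScore, quality, coherence }) {
  return pathScore * 0.55 + quality * 0.20 + coherence * 0.15;
}

const typicalScore = acceptanceScoreSketch({ pathScore: 0.65, quality: 0.30, coherence: 0.55 });
// 0.3575 + 0.0600 + 0.0825, i.e. about 0.50: well short of the old 0.60 threshold.
const belowOldThreshold = typicalScore < 0.60;
```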

Verified end-to-end: deep worker now returns [DeepForecast] completed.

Also updates T6 gateDetails assertion and renames the rejection-floor test to
correctly describe the new behavior (strong inputs should be accepted).

111/111 tests pass.

* feat(deep-forecast): autoresearch prompt self-improvement loop + T9/T10 tests

- Add scoreImpactExpansionQuality() locked scorer: commodity rate (35%),
  variable diversity (35%), chain coverage (20%), mapped rate (10%)
- Add runImpactExpansionPromptRefinement(): rate-limited LLM critic loop
  (30min cooldown) that reads learned section from Redis, scores current
  run, generates critique if composite < 0.62, tests on same candidates,
  commits to forecast:prompt:impact-expansion:learned if score improves
- buildImpactExpansionSystemPrompt() now accepts learnedSection param,
  appends it after core rules with separator so model sees prior examples
- buildImpactExpansionCandidateHash() includes learnedFingerprint to
  bust cache when learned section changes
- processDeepForecastTask reads learnedSection from Redis before LLM
  call, runs refinement after both completed and no_material_change paths
- Export scoreImpactExpansionQuality + runImpactExpansionPromptRefinement
- T9: high commodity rate + chain coverage → composite > 0.70
- T10: no commodity + no chain coverage → composite < 0.40
- 113/113 tests pass
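
A weighted composite with the four weights listed above could be sketched like this. The metric names and the sample inputs are assumptions chosen for illustration; only the weights come from this message.

```javascript
// Weights as listed above; metric key names are hypothetical.
const QUALITY_WEIGHTS = {
  commodityRate: 0.35,
  variableDiversity: 0.35,
  chainCoverage: 0.20,
  mappedRate: 0.10,
};

function compositeQuality(metrics) {
  return Object.entries(QUALITY_WEIGHTS).reduce(
    (sum, [key, weight]) => sum + weight * (metrics[key] ?? 0),
    0
  );
}

// Illustrative inputs in the spirit of T9 and T10.
const strongRun = compositeQuality({ commodityRate: 1, variableDiversity: 0.5, chainCoverage: 1, mappedRate: 0.5 });
const weakRun = compositeQuality({ commodityRate: 0, variableDiversity: 0.5, chainCoverage: 0, mappedRate: 0.5 });
```

With these inputs the strong run lands above 0.70 and the weak run below 0.40, matching the direction of the two test cases.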

* fix(deep-forecast): raise autoresearch threshold 0.62→0.80 + fix JSON parse

- Threshold 0.62 was too low: commodity=1.00 + chain=1.00 compensated for
  diversity=0.50 (all same chains), keeping composite at 0.775 → no critique
- Raise to 0.80 so diversity<0.70 triggers critique even with good commodity/chain
- Fix JSON parser to extract first {…} block (handles Gemini code-fence wrapping)
- Add per-hypothesis log in refinement breakdown for observability
- Add refinementQualityThreshold to gateDetails for self-documenting artifacts
- Verified: critique fires on diversity=0.50 run, committed Hormuz/Baltic/Suez
  region-specific chain examples (score 0.592→0.650)
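
One way to extract the first {…} block from a code-fenced model response is a small brace matcher that skips braces inside string literals. This is a sketch of the general technique, not the parser in scripts/seed-forecasts.mjs; the function name is hypothetical.

```javascript
// Returns the first balanced top-level JSON object in `text`, or null.
function extractFirstJsonObject(text) {
  const start = text.indexOf('{');
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      // Inside a string literal: only track escapes and the closing quote.
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === '{') depth++;
    else if (ch === '}') {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(text.slice(start, i + 1));
        } catch {
          return null; // balanced braces but not valid JSON
        }
      }
    }
  }
  return null; // unbalanced input
}

// A response wrapped in a markdown code fence, with a brace inside a string value.
const fenceMarker = '`'.repeat(3);
const wrappedResponse = fenceMarker + 'json\n{"chains":[{"key":"a}b"}]}\n' + fenceMarker;
const parsedResponse = extractFirstJsonObject(wrappedResponse);
```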

* feat(deep-forecast): per-candidate parallel LLM calls replace batch expansion

Previously: all candidates → one batch LLM call → LLM averages context →
identical route_disruption → inflation_pass_through chains for all candidates.

Now: each candidate → its own focused LLM call (parallel Promise.all) →
LLM reasons about specific stateKind/region/routeFacility for that candidate.

Results (3 candidates, 3 parallel calls):
- composite: 0.592 → 0.831 (+0.24)
- commodity: 0.17 → 1.00 (all mapped have specific commodity)
- diversity: 0.50 → 0.83 (energy_export_stress, importer_balance_stress
  appearing alongside route_disruption — genuinely different chains)
- Baseline updated: 0.831 (above 0.80 threshold → no critique needed)

Also threads learnedSection through extractSingleImpactExpansionCandidate
so the learned examples from autoresearch apply to each focused call.
Per-candidate cache keys (already existed) now serve as primary cache.

* fix(tests): update recovery test for per-candidate LLM call flow

- Change stage mock from impact_expansion → impact_expansion_single
  (batch primary path removed, per-candidate is now primary)
- Assert parseMode === per_candidate instead of parseStage /^recovered_/
  (recovered_ prefix was only set by old batch_repair path)
- 2257/2257 tests pass

* fix(deep-forecast): add Red Sea, Persian Gulf, South China Sea to chokepoint map

Candidate packets had routeFacilityKey=none for Red Sea / Persian Gulf /
Baltic Sea signals because prediction titles say "Red Sea maritime disruption"
not "Bab el-Mandeb" or "Strait of Hormuz". CHOKEPOINT_MARKET_REGIONS only
had sub-facility names (Bab el-Mandeb, Suez Canal) as keys, not the sea
regions themselves.

Fix: add Red Sea, Persian Gulf, Arabian Sea, Black Sea, South China Sea,
Mediterranean Sea as direct keys so region-level candidate titles resolve.

Result: LLM user prompt now shows routeFacilityKey=Red Sea / Persian Gulf /
Baltic Sea per candidate — giving each focused call the geographic context
needed to generate route-specific chains.

- Autoresearch baseline updated 0.932→0.965 on this run
- T8 extended with Red Sea, Persian Gulf, South China Sea assertions
- 2257/2257 tests pass

* feat(deep-forecast): free-form hypothesis schema + remove registry constraint

- Bump IMPACT_EXPANSION_REGISTRY_VERSION to v4
- Add hypothesisKey, description, geography, affectedAssets, marketImpact, causalLink fields to normalizeImpactHypothesisDraft (keep legacy fields for backward compat)
- Rewrite buildImpactExpansionSystemPrompt: remove IMPACT_VARIABLE_REGISTRY constraint table, use free-form ImpactHypothesis schema with geographic/commodity specificity rules
- Rewrite evaluateImpactHypothesisRejection: use effective key (hypothesisKey || variableKey) for dedup; legacy registry check only for old cached responses without hypothesisKey
- Update validateImpactHypotheses scoring: add geographyScore, commodityScore, causalLinkScore, assetScore terms; channelCoherence/bucketCoherence only apply to legacy responses
- Update parent-must-be-mapped invariant to use hypothesisKey || variableKey as effective key
- Update mapImpactHypothesesToWorldSignals: use effective key for dedup and sourceKey; prefer description/geography over legacy fields
- Update buildImpactPathsForCandidate: match on hypothesisKey || variableKey for parent lookup
- Update buildImpactPathId: use hypothesisKey || variableKey for hash inputs
- Rewrite scoreImpactExpansionQuality: add geographyRate and assetRate metrics; update composite weights
- Update buildImpactPromptCritiqueSystemPrompt/UserPrompt: use hypothesisKey-based chain format in examples
- Add new fields to buildImpactExpansionBundleFromPaths push calls
- Update T7 test assertion: MUST be the exact hypothesisKey instead of variableKey string

* fix(deep-forecast): update breakdown log to show free-form hypothesis fields

* feat(deep-forecast): add commodityDiversity metric to autoresearch scorer

- commodityDiversity = unique commodities / nCandidates (weight 0.35)
  Penalizes runs where all candidates default to same commodity.
  3 candidates all crude_oil → diversity=0.33 → composite ~0.76 → critique fires.
- Rebalanced composite weights: comDiversity 0.35, geo 0.20, keyDiversity 0.15, chain 0.10, commodityRate 0.10, asset 0.05, mappedRate 0.05
- Breakdown log now shows comDiversity + geo + keyDiversity
- Critique prompt updated: commodity_monoculture failure mode, diagnosis targets commodity homogeneity
- T9: added commodityDiversity=1.0 assertion (2 unique commodities across 2 candidates)

* refactor(deep-forecast): replace commodityDiversity with directCommodityDiversity + directGeoDiversity + candidateSpreadScore

Problem: measuring diversity on all mapped hypotheses misses the case where
one candidate generates 10 implications while others generate 0, or where
all candidates converge on the same commodity due to dominating signals.

Fix: score at the DIRECT hypothesis level (root causes only) and add
a candidate-spread metric:

- directCommodityDiversity: unique commodities among direct hypotheses /
  nCandidates. Measures breadth at the root-cause level. 3 candidates all
  crude_oil → 0.33 → composite ~0.77 → critique fires.

- directGeoDiversity: unique primary geographies among direct hypotheses /
  nCandidates. First segment of compound geography strings (e.g.
  'Red Sea, Suez Canal' → 'red sea') to avoid double-counting.

- candidateSpreadScore: normalized inverse-HHI. 1.0 = perfectly even
  distribution across candidates. One candidate with 10 implications and
  others with 0 → scores near 0 → critique fires.
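
A normalized inverse-HHI can be computed as in the sketch below. The specific normalization, (1/HHI − 1)/(n − 1), and the choice to return 0 for fewer than two candidates or no hypotheses are assumptions; the log only states the end points (1.0 for an even distribution, near 0 when one candidate dominates).

```javascript
// counts[i] = number of mapped hypotheses produced by candidate i.
function candidateSpreadSketch(counts) {
  const n = counts.length;
  const total = counts.reduce((a, b) => a + b, 0);
  if (n < 2 || total === 0) return 0; // assumed convention for degenerate inputs
  // Herfindahl-Hirschman index: sum of squared shares, ranges from 1/n to 1.
  const hhi = counts.reduce((sum, c) => sum + (c / total) ** 2, 0);
  // 1/HHI ranges from 1 to n; rescale to the interval [0, 1].
  return (1 / hhi - 1) / (n - 1);
}

const evenSpread = candidateSpreadSketch([3, 3, 3]);
const skewedSpread = candidateSpreadSketch([10, 0, 0]);
```

Here evenSpread comes out at (approximately) 1 and skewedSpread at 0, which is the behaviour the metric description calls for.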

Weight rationale: comDiversity 0.35, geoDiversity 0.20, spread 0.15,
chain 0.15, comRate 0.08, assetRate 0.04, mappedRate 0.03.

Verified: Run 2 Baltic/Hormuz/Brazil → freight/crude_oil/USD spread=0.98 ✓

* feat(deep-forecast): add convergence object to R2 debug artifact

Surface autoresearch loop outcome per run: converged (bool), finalComposite,
critiqueIterations (0 or 1), refinementCommitted, and perCandidateMappedCount
(candidateStateId → count). After 5+ runs the artifact alone answers whether
the pipeline is improving.

Architectural changes:
- runImpactExpansionPromptRefinement now returns { iterationCount, committed }
  at all exit paths instead of undefined
- Call hoisted before writeForecastTraceArtifacts so the result flows into the
  debug payload via dataForWrite.refinementResult
- buildImpactExpansionDebugPayload assembles convergence from validation +
  refinementResult; exported for direct testing
- Fix: stale diversityScore reference replaced with directCommodityDiversity

Tests: T-conv-1 (converged=true), T-conv-2 (converged=false + iterations=1),
T-conv-3 (perCandidateMappedCount grouping) — 116/116 pass

* fix(deep-forecast): address P1+P2 review issues from convergence observability PR

P1-A: sanitize LLM-returned proposed_addition before Redis write (prompt injection
      guard via sanitizeProposedLlmAddition — strips directive-phrase lines)
P1-B: restore fire-and-forget for runImpactExpansionPromptRefinement; compute
      critiqueIterations from quality score (predicted) instead of awaiting result,
      eliminating 15-30s critical-path latency on poor-quality runs
P1-C: processDeepForecastTask now returns convergence object to callers; add
      convergence_quality_met warn check to evaluateForecastRunArtifacts
P1-D: cap concurrent LLM calls in extractImpactExpansionBundle to 3 (manual
      batching — no p-limit) to respect provider rate limits
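
Manual batching of the kind described in P1-D can be sketched without any dependency: slice the work into groups of three and await each group before starting the next. The helper name and the `callModel` usage are hypothetical; the real extractImpactExpansionBundle may structure this differently.

```javascript
// Runs fn over items with at most `batchSize` calls in flight at a time.
// Results keep the input order.
async function mapInBatches(items, batchSize, fn) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Each batch runs in parallel; the next batch starts only once this one settles.
    results.push(...(await Promise.all(batch.map(fn))));
  }
  return results;
}

// Usage (callModel is a hypothetical per-candidate LLM call):
// const expansions = await mapInBatches(candidates, 3, callModel);
```

Compared with a sliding-window limiter, this approach can leave slots idle while the slowest call in a batch finishes, but it is simple and keeps the concurrency bound obvious.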

P2-1: hash full learnedSection in buildImpactExpansionCandidateHash (was sliced
      to 80 chars, causing cache collisions on long learned sections)
P2-2: add exitReason field to all runImpactExpansionPromptRefinement return paths
P2-3: sanitizeForPrompt strips directive injection phrases; new
      sanitizeProposedLlmAddition applies line-level filtering before Redis write
P2-4: add comment explaining intentional bidirectional affectedAssets/assetsOrSectors
      coalescing in normalizeImpactHypothesisDraft
P2-5: extract makeConvTestData helper in T-conv tests; remove refinementCommitted
      assertions (field removed from convergence shape)
P2-6: convergence_quality_met check added to evaluateForecastRunArtifacts (warn)

🤖 Generated with Claude Sonnet 4.6 via Claude Code (https://claude.ai/claude-code) + Compound Engineering v2.49.0

Co-Authored-By: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>

* fix(docs): add blank lines around lists in plan (MD032)

* fix(deep-forecast): address P1+P2 reviewer issues in convergence observability

P1-1: mapImpactHypothesesToWorldSignals used free-form marketImpact values
(price_spike, shortage, credit_stress, risk_off) verbatim as signal channel
types, producing unknown types that buildMarketTransmissionGraph cannot
consume. Add IMPACT_SIGNAL_CHANNELS set + resolveImpactChannel() to map
free-form strings to the nearest valid channel before signal materialization.

P1-2: sanitizeForPrompt had directive-phrase stripping added that was too
broad for a function called on headlines, evidence tables, case files, and
geopolitical summaries. Reverted to original safe sanitizer (newline/control
char removal only). Directive stripping remains in sanitizeProposedLlmAddition
where it is scoped to Redis-bound LLM-generated additions only.

P2: Renamed convergence.critiqueIterations to predictedCritiqueIterations to
make clear this is a prediction from the quality score, not a measured count
from actual refinement behavior (refinement is fire-and-forget after artifact
write). Updated T-conv-1/2 test assertions to match.

* feat(deep-forecast): inject live news headlines into evidence table

Wire inputs.newsInsights / inputs.newsDigest through the candidate
selection pipeline so buildImpactExpansionEvidenceTable receives up to
3 commodity-relevant live headlines as 'live_news' evidence entries.

Changes:
- IMPACT_COMMODITY_LEXICON: extend fertilizer pattern (fertiliser,
  nitrogen, phosphate, npk); add food_grains and shipping_freight entries
- filterNewsHeadlinesByState: new pure helper that scores headlines by
  alert status, LNG/energy/route/sanctions signal match, lexicon commodity
  match, and source corroboration count (min score 2 to include)
- buildImpactExpansionEvidenceTable: add newsItems param, inject
  live_news entries, raise cap 8→11
- buildImpactExpansionCandidate: add newsInsights/newsDigest params,
  compute newsItems via filterNewsHeadlinesByState
- selectImpactExpansionCandidates: add newsInsights/newsDigest to options
- Call site: pass inputs.newsInsights/newsDigest at seed time
- Export filterNewsHeadlinesByState, buildImpactExpansionEvidenceTable
- 9 new tests (T-news-1 through T-lex-3): all pass, 125 total pass

🤖 Generated with Claude Sonnet 4.6 (200K context) via Claude Code (https://claude.ai/claude-code) + Compound Engineering v2.49.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(deep-forecast): remove hardcoded LNG boost from filterNewsHeadlinesByState

The LNG+2 score was commodity-specific and inconsistent with the
intent: headline scoring should be generic, not biased toward any
named commodity. The function already handles the state's detected
commodity dynamically via lexEntry.pattern (IMPACT_COMMODITY_LEXICON).
LNG headlines still score via CRITICAL_NEWS_ENERGY_RE (+1) and
CRITICAL_NEWS_ROUTE_RE (+1) when relevant to the state's region.
All 125 tests pass.

* fix(deep-forecast): address all P1+P2 code review findings from PR #2178

P1 fixes (block-merge):
- Lower third_order mapped floor 0.74→0.70 (max achievable via 0.72 multiplier was 0.72)
- Guard runImpactExpansionPromptRefinement against empty validation (no_mapped exit)
- Replace block-list sanitizeProposedLlmAddition with pattern-based allowlist (HTML/JS/directive takeover)
- Fix TOCTOU on PROMPT_LAST_ATTEMPT_KEY: claim slot before quality check, not after LLM call

P2 fixes:
- Fix learned section overflow: use slice(-MAX) to preserve tail, not discard all prior content
- Add safe_haven_bid and global_crude_spread_stress branches to resolveImpactChannel
- quality_met path now sets rate-limit key (prevents 3 Redis GETs per good run)
- Hoist extractNewsClusterItems outside stateUnit map in selectImpactExpansionCandidates
- Export PROMPT_LEARNED_KEY, PROMPT_BASELINE_KEY, PROMPT_LAST_ATTEMPT_KEY + read/clear helpers

All 125 tests pass.

* fix(todos): add blank lines around lists/headings in todo files (markdownlint)

* fix(todos): fix markdownlint blanks-around-headings/lists in all todo files

---------

Co-authored-by: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
2026-03-24 18:52:02 +04:00
Elie Habib
c9d4fc49a5 fix(forecast): fix impact expansion LLM prompt + pre-filter validation diagnostic (#2169)
* fix(forecast): fix impact expansion LLM prompt + add pre-filter validation diagnostic

Problem: LLM generates 8 hypotheses but 6 get rejected post-hoc because the
prompt gave 3 flat unlinked enum lists without per-variable constraints, causing
the LLM to hallucinate plausible but invalid channel/bucket combinations. Only
2 direct-only hypotheses survived, producing no expanded paths ->
completed_no_material_change on every deep run.

Also fixes two related bugs:
- Duplicate canonical state unit labels (same region + stateKind) blocked the
  deep worker. Added label dedup filter in buildCanonicalStateUnits.
- When all hypotheses failed state-ID filtering in materializeImpactExpansion,
  impactExpansionSummary showed 0 hypotheses with no rejection reasons, losing
  all diagnostic data.

Changes:
- Bump IMPACT_EXPANSION_REGISTRY_VERSION v1 -> v2 to invalidate cached LLM
  responses so the model re-generates with new constraint guidance
- Add buildRegistryConstraintTable() that serializes IMPACT_VARIABLE_REGISTRY
  and MARKET_BUCKET_ALLOWED_CHANNELS into a compact constraint block
- Rewrite buildImpactExpansionSystemPrompt() to replace 3 flat enum lists with
  the structured constraint table, bucket-channel dual gate rule, and explicit
  MiroFish causal chain guidance (direct -> second_order -> third_order example)
- Add validation field to all 3 return paths in evaluateDeepForecastPaths so
  pre-state-filter rejection data is always preserved
- Add hypothesisValidation to buildImpactExpansionDebugPayload capturing
  totalHypotheses, validatedCount, mappedCount, rejectionReasonCounts, and
  per-hypothesis rejection identity (candidateIndex, candidateStateId,
  variableKey, channel, targetBucket, order, rejectionReason)
- Filter duplicate state unit labels post-finalization in buildCanonicalStateUnits
- Export buildRegistryConstraintTable, IMPACT_VARIABLE_REGISTRY,
  MARKET_BUCKET_ALLOWED_CHANNELS for testability
- Add 4 new tests: mapped=0 early return has validation, no-expanded-accepted
  path has validation, hypothesisValidation flows through buildForecastTraceArtifacts,
  buildRegistryConstraintTable matches registry

* refactor(forecast): memoize constraint table, clarify dedup filter, tighten test A2

- Extract IMPACT_EXPANSION_REGISTRY_CONSTRAINT_TABLE const so the
  constraint table is built once instead of on every prompt invocation
- Replace IIFE dedup filter in buildCanonicalStateUnits with an explicit
  seenLabels Set for readability
- Add format-assumption comment to buildRegistryConstraintTable
- Rename test A2 and add mapped > 0 assertion to pin it past the
  mapped=0 early-return path

* fix(forecast): disambiguate duplicate state unit labels instead of dropping units

The previous dedup filter silently dropped state units whose labels
collided after finalizeStateUnit(), losing their id, forecastIds,
signals, and deep-candidate eligibility.

Two distinct clusters can score below the merge threshold (< 5.5) while
still resolving to the same formatStateUnitLabel() output when they
share the same leadRegion and stateKind. The filter kept the first
and discarded the rest, suppressing valid deep paths.

Fix: replace the .filter() with a .map() that disambiguates collision
labels using dominantDomain suffix, falling back to the last 4 chars
of the unit id if domain-based disambiguation would itself collide.
The seenLabels Set tracks all assigned labels to prevent any secondary
collision. The snapshot validator (and deep worker) no longer sees
duplicate labels, and no units are dropped.
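
The disambiguation described above might be sketched as follows. The suffix format ("label (domain)") and the sample units are assumptions; only the strategy (dominantDomain suffix first, last 4 characters of the id as fallback, a seenLabels Set tracking every assigned label) comes from this message.

```javascript
// Keeps every unit; only relabels units whose label was already assigned.
function disambiguateLabels(units) {
  const seenLabels = new Set();
  return units.map((unit) => {
    let label = unit.label;
    if (seenLabels.has(label)) {
      const byDomain = `${unit.label} (${unit.dominantDomain})`;
      // Fall back to the id tail when the domain is missing or would itself collide.
      label = !unit.dominantDomain || seenLabels.has(byDomain)
        ? `${unit.label} (${unit.id.slice(-4)})`
        : byDomain;
    }
    seenLabels.add(label);
    return { ...unit, label };
  });
}

const relabeledUnits = disambiguateLabels([
  { id: 'state-aaaa1111', label: 'Red Sea maritime disruption', dominantDomain: 'supply_chain' },
  { id: 'state-bbbb2222', label: 'Red Sea maritime disruption', dominantDomain: 'market' },
  { id: 'state-cccc3333', label: 'Red Sea maritime disruption', dominantDomain: 'market' },
]);
```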

Also export buildCanonicalStateUnits for direct test coverage.

* feat(deep-forecast): phase 2 scoring recalibration + prompt excellence

Fixes cascade of three scoring gates that caused every deep run to return
completed_no_material_change despite valid LLM hypothesis generation.

Changes:
- Lower second_order validation floors (mapped: 0.66→0.58, internal: 0.58→0.50)
  and raise multiplier (0.85→0.88) so typical LLM quality (strength~0.75,
  conf~0.75, 2 refs, specificityScore=0.2) now reaches mapped status
- Binary evidenceSupport: refs >= 2 → 1, else → 0 (enforces 2-ref requirement
  at scoring layer, not just prompt text; 1-ref hypotheses stay trace_only)
- Parent-must-be-mapped invariant: post-validation pass downgrades mapped
  second_order/third_order whose dependsOnKey has no mapped parent
- Lower pathScore threshold from 0.66 to 0.50 to allow barely-mapped pairs
  through to expanded path generation
- Add 6 missing maritime chokepoints to CHOKEPOINT_MARKET_REGIONS
- Bump IMPACT_EXPANSION_REGISTRY_VERSION v2 → v3 (invalidates stale LLM cache)
- Prompt v3: explicit dependsOnKey pairing, 2-evidence citation rule,
  confidence calibration guidance, direct+second_order pair structure
- Add scoringBreakdown (all hypotheses with scoring factors) and gateDetails
  (active thresholds) to debug artifact for observability feedback loop
- Export buildImpactExpansionSystemPrompt, extractImpactRouteFacilityKey,
  extractImpactCommodityKey for testability
- 8 new tests (T1-T8) covering all phase 2 changes; 111/111 pass
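
Two of the changes above, binary evidenceSupport and the parent-must-be-mapped invariant, can be sketched together. The `status` field, the use of trace_only as the downgrade target, and the flat (not per-candidate) key lookup are assumptions made to keep the example short.

```javascript
// Binary evidence support: two or more refs score 1, anything less scores 0.
const evidenceSupport = (refs) => (refs.length >= 2 ? 1 : 0);

const HYPOTHESIS_ORDERS = ['direct', 'second_order', 'third_order'];

// Post-validation pass: a mapped second_order/third_order hypothesis is
// downgraded unless its dependsOnKey points at a hypothesis that is still mapped.
function enforceParentMapped(hypotheses) {
  const mappedKeys = new Set();
  const result = hypotheses.map((h) => ({ ...h }));
  // Walk orders from direct outward so a downgraded parent also downgrades its children.
  for (const order of HYPOTHESIS_ORDERS) {
    for (const h of result) {
      if (h.order !== order || h.status !== 'mapped') continue;
      if (order !== 'direct' && !mappedKeys.has(h.dependsOnKey)) {
        h.status = 'trace_only';
        continue;
      }
      mappedKeys.add(h.variableKey);
    }
  }
  return result;
}

const checkedHypotheses = enforceParentMapped([
  { variableKey: 'route_disruption', order: 'direct', status: 'mapped' },
  { variableKey: 'inflation_pass_through', order: 'second_order', dependsOnKey: 'route_disruption', status: 'mapped' },
  { variableKey: 'orphan_effect', order: 'second_order', dependsOnKey: 'missing_parent', status: 'mapped' },
  { variableKey: 'orphan_child', order: 'third_order', dependsOnKey: 'orphan_effect', status: 'mapped' },
]);
```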
2026-03-24 08:31:41 +04:00
Elie Habib
2e0bc86d81 feat(forecast): add replayable deep forecast lifecycle (#2161)
* feat(forecast): add replayable deep forecast lifecycle

* fix(forecast): serialize replay snapshot market index
2026-03-23 23:59:21 +04:00
Elie Habib
cdc77bcbaa feat(forecast): add queued deep forecast search (#2155)
* feat(forecast): add queued deep forecast search

* test(forecast): cover deep path scoring
2026-03-23 21:36:46 +04:00
Elie Habib
d29fd4e03f fix(forecast): harden reliability recovery pipeline (#2149) 2026-03-23 19:28:26 +04:00
Elie Habib
636ace7b2c feat(forecast): add impact expansion simulation layer (#2138)
* feat(forecast): add impact expansion simulation layer

* fix(forecast): correct impact bucket coherence gate
2026-03-23 15:19:06 +04:00
Elie Habib
00f9ce7c19 fix(forecast): preserve llm narratives on publish refresh (#2134) 2026-03-23 13:50:57 +04:00
Elie Habib
a202b8ebcc feat(consumer-prices): global All view as default, market selector, per-market cache keys (#2128)
* fix(seeders): apply gold standard TTL-extend+retry pattern to Aviation, NOTAM, Cyber, PositiveEvents

* feat(consumer-prices): default to All — global comparison table as landing view

- DEFAULT_MARKET = 'all' so panel opens with the global view
- 🌍 All pill added at front of market bar
- All view fetches all 9 markets in parallel via fetchAllMarketsOverview()
  and renders a comparison table: Market / Index / WoW / Spread / Updated
- Clicking any market row drills into that market's full tab view
- SINGLE_MARKETS exported for use in All-view iteration
- CSS: .cp-global-table and row styles
2026-03-23 10:58:37 +04:00
Elie Habib
166fc58e92 fix(forecast): tighten state coherence and promotion (#2120)
* fix(forecast): tighten state coherence and promotion

* fix(forecast): harden coherence follow-ups
2026-03-23 10:19:17 +04:00
Elie Habib
1058b648a1 feat(forecast): derive market domains from state units (#2116)
* feat(forecast): derive market domains from state units

* fix(forecast): cover state-derived backfill path
2026-03-23 09:26:13 +04:00
Elie Habib
ea991dc7ce fix(forecast): unclog market promotion and state selection (#2082) 2026-03-23 01:19:28 +04:00
Elie Habib
5e8a106999 feat(forecast): extract critical news signals (#2064)
* feat(forecast): extract critical news signals

* fix(forecast): harden critical signal extraction

* feat(forecast): add structured urgent signal extraction

* docs(env): document critical forecast llm overrides
2026-03-22 22:39:00 +04:00
Elie Habib
a24ea45983 feat(forecast): compress situations into state units (#2037) 2026-03-22 10:11:41 +04:00
Elie Habib
7eef3fd9ca feat(forecast): enrich energy transmission signals (#2021) 2026-03-21 23:05:20 +04:00
Elie Habib
3b762492fe feat(forecast): deepen market transmission simulation (#1996) 2026-03-21 17:16:31 +04:00
Elie Habib
41591e33a9 feat(forecast): add macro market signals (#1980) 2026-03-21 12:26:52 +04:00
Elie Habib
5b987ea434 feat(forecast): drive simulation from market state (#1976) 2026-03-21 11:09:04 +04:00
Elie Habib
3670716daa feat(forecast): add market transmission state (#1971) 2026-03-21 09:48:38 +04:00
Elie Habib
8d86607d21 feat(forecast): drive selection from causal memory (#1958) 2026-03-21 01:20:42 +04:00
Elie Habib
f56f11a596 feat(forecast): add simulation memory replay state (#1945) 2026-03-20 20:37:11 +04:00
Elie Habib
8e8db1b40f feat(forecast): calibrate interaction effect promotion (#1936) 2026-03-20 18:49:58 +04:00
Elie Habib
070248b792 fix(forecast): guarantee military forecast inclusion in publish selection pool (#1917)
Military detector forecasts (ADS-B flight tracking + theater posture API)
structurally score near zero on readiness metrics that require LLM-enriched
caseFile content (supporting evidence, news headlines, calibration, triggers).
This causes them to rank below the target count threshold every run despite
a valid elevated posture signal.

Add a domain guarantee post-pass after the 3 selection loops: if no military
forecast was selected and we have room below MAX_TARGET_PUBLISHED_FORECASTS,
inject the highest-scoring eligible military forecast. This does not displace
any already-selected forecast and respects all existing family/situation caps.

Diagnosis: Baltic theater at postureLevel='elevated' with 6 active flights
generates a military forecast (prob=0.41, confidence=0.30, score=0.136) but
gets buried behind 15+ well-grounded situation cluster forecasts at score 0.4+.
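
A minimal sketch of such a post-pass, assuming forecasts carry id, domain, and score fields. The function name applyMilitaryGuarantee, the maxPublished parameter (standing in for MAX_TARGET_PUBLISHED_FORECASTS), and the isEligible callback (standing in for the existing family/situation cap checks) are hypothetical:

```javascript
// Hypothetical forecast shape: { id, domain, score }.
// isEligible is a placeholder for the existing family/situation cap checks.
function applyMilitaryGuarantee(selected, candidates, maxPublished, isEligible = () => true) {
  // Nothing to do if a military forecast already made the cut.
  if (selected.some((f) => f.domain === 'military')) return selected;
  // Never displace an already-selected forecast: only inject when there is room.
  if (selected.length >= maxPublished) return selected;
  const selectedIds = new Set(selected.map((f) => f.id));
  const best = candidates
    .filter((f) => f.domain === 'military' && !selectedIds.has(f.id) && isEligible(f, selected))
    .sort((a, b) => b.score - a.score)[0];
  return best ? [...selected, best] : selected;
}
```

Under these assumptions, a Baltic military forecast with score 0.136 would be appended even though every situation cluster forecast outranks it, provided the selection is still below the cap.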

Tests: 3 new assertions in 'military domain guarantee in publish selection'.
2026-03-20 14:04:08 +04:00
Elie Habib
01366fcc00 fix(forecast): block generic-actor cross-theater interactions + raise enrichment budget (#1916)
* fix(forecast): block generic-actor cross-theater interactions + raise enrichment budget

Root cause: actor registry uses name:category as key (e.g. "Incumbent leadership:state"),
causing unrelated situations (Israel conflict, Taiwan political) to share the same actor
ID and fire sharedActor=true in pushInteraction. This propagated into the reportable
ledger and surfaced as junk effects like Israel→Taiwan at 80% confidence.

Two-pronged fix:

1. Specificity gate in pushInteraction: sharedActor now requires avgSpecificity >= 0.75.
   Generic blueprint actors ("Incumbent leadership" ~0.68, "Civil protection authorities"
   ~0.73) no longer qualify as structural cross-situation links. Named domain-specific
   actors ("Threat actors:adversarial" ~0.95) continue to qualify.

2. MACRO_REGION_MAP + isCrossTheaterPair + gate in buildCrossSituationEffects: for
   cross-theater pairs (different macro-regions) with non-exempt channels, requires
   sharedActor=true AND avgActorSpecificity >= 0.90. Exempt channels: cyber_disruption,
   market_repricing (legitimately global). Same-macro-region pairs (Brazil/Mexico both
   AMERICAS) are unaffected.

Verified against live run 1773983083084-bu6b1f:
  BLOCKED: Israel→Taiwan (MENA/EAST_ASIA, spec 0.68)
  BLOCKED: Israel→US political (MENA/AMERICAS, spec 0.68)
  BLOCKED: Cuba→Iran (AMERICAS/MENA, spec 0.73)
  BLOCKED: Brazil→Israel (AMERICAS/MENA, spec 0.85 < 0.90)
  ALLOWED: China→US cyber_disruption (exempt channel)
  ALLOWED: Brazil→Mexico (same AMERICAS)
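
The two gates can be sketched as below. This is an illustration under stated assumptions rather than the committed code: the map contents are a small subset inferred from the cases listed above (the China entry is a guess), the helper names qualifiesAsSharedActor and allowsCrossSituationEffect are hypothetical, and treating unmapped regions as same-theater is an assumption.

```javascript
// Illustrative subset only; the real MACRO_REGION_MAP is not shown in this log.
// The China entry is an assumption.
const MACRO_REGION_MAP = {
  Israel: 'MENA', Iran: 'MENA', Taiwan: 'EAST_ASIA', China: 'EAST_ASIA',
  'United States': 'AMERICAS', Cuba: 'AMERICAS', Brazil: 'AMERICAS', Mexico: 'AMERICAS',
};
const EXEMPT_CHANNELS = new Set(['cyber_disruption', 'market_repricing']);

// Gate 1: a shared actor only counts as a structural link when it is specific enough.
function qualifiesAsSharedActor(avgSpecificity) {
  return avgSpecificity >= 0.75;
}

// Two situations are cross-theater when both map to known, different macro-regions.
// Unmapped regions are treated as same-theater here (an assumption).
function isCrossTheaterPair(regionA, regionB) {
  const a = MACRO_REGION_MAP[regionA];
  const b = MACRO_REGION_MAP[regionB];
  return Boolean(a && b && a !== b);
}

// Gate 2: cross-theater effects on non-exempt channels need a highly specific shared actor.
function allowsCrossSituationEffect({ source, target, channel, sharedActor, avgActorSpecificity }) {
  if (!isCrossTheaterPair(source, target)) return true;
  if (EXEMPT_CHANNELS.has(channel)) return true;
  return sharedActor === true && avgActorSpecificity >= 0.9;
}
```

With these definitions, the Brazil→Israel pair at specificity 0.85 passes the first gate but fails the second, which matches the BLOCKED entry above.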

Also raises ENRICHMENT_COMBINED_MAX from 3 to 5 (total budget 6→8),
targeting enrichedRate improvement from ~38% to ~60%.

* fix(plans): fix markdown lint errors in forecast semantic quality plan

* fix(plans): fix remaining markdown lint error in plan file
2026-03-20 13:26:58 +04:00
Elie Habib
46cd3728d6 fix(forecast): tighten reportable effect quality (#1902)
* fix(forecast): tighten reportable effect quality

* fix(forecast): preserve structural political carryover

* chore(forecast): document effect grouping heuristics
2026-03-20 00:44:21 +04:00
Elie Habib
8768d10b7f fix(forecast): tighten interaction semantics (#1896)
* fix(forecast): tighten interaction semantics

* fix(forecast): narrow maritime family inference

* fix(forecast): keep full reportable interaction graph
2026-03-19 23:34:46 +04:00
Elie Habib
e434769e37 feat(forecast): add simulation action ledger (#1891)
* feat(forecast): add simulation action ledger

* fix(forecast): preserve directional interaction effects
2026-03-19 21:01:47 +04:00
Elie Habib
486f5f799f fix(forecast): tighten family effect credibility (#1880)
* fix(forecast): tighten family effect credibility

* fix(forecast): respect domain effect thresholds
2026-03-19 18:24:40 +04:00
Elie Habib
08cc2723cc fix(forecast): wire per-situation simulation into per-forecast worldState (#1879)
buildForecastTraceArtifacts was building worldState after tracedPredictions,
so simulation data was never available to buildForecastTraceRecord. Each
forecast's caseFile.worldState had situationId/familyId/simulationSummary
all undefined, making the 3-round MiroFish simulation invisible at the
forecast level.

Fix:
- Compute worldState before tracing (so simulationState is ready)
- Build forecastId → situationSimulation lookup from worldState.simulationState
- Pass lookup into buildForecastTraceRecord; inject situationId, familyId,
  familyLabel, simulationSummary, simulationPosture, simulationPostureScore
  into caseFile.worldState for each matched forecast
- Add regression assertion to forecast-trace-export tests
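
The lookup-and-inject step can be sketched as follows. The shape of worldState.simulationState (an array named situationSimulations whose entries list their forecastIds) and both helper names are assumptions for illustration; only the injected field names come from the list above.

```javascript
// Assumed shape: worldState.simulationState.situationSimulations is an array of
// { situationId, familyId, familyLabel, forecastIds, summary, posture, postureScore }.
function buildSimulationLookup(worldState) {
  const lookup = new Map();
  const sims = worldState?.simulationState?.situationSimulations ?? [];
  for (const sim of sims) {
    for (const forecastId of sim.forecastIds ?? []) lookup.set(forecastId, sim);
  }
  return lookup;
}

// Returns a copy of caseFile with simulation fields merged into caseFile.worldState.
// Forecasts with no matching simulation are returned unchanged.
function injectSimulation(caseFile, forecastId, lookup) {
  const sim = lookup.get(forecastId);
  if (!sim) return caseFile;
  return {
    ...caseFile,
    worldState: {
      ...caseFile.worldState,
      situationId: sim.situationId,
      familyId: sim.familyId,
      familyLabel: sim.familyLabel,
      simulationSummary: sim.summary,
      simulationPosture: sim.posture,
      simulationPostureScore: sim.postureScore,
    },
  };
}
```

The ordering matters more than the helpers: the lookup can only be populated if worldState is computed before the per-forecast trace records are built.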

All 194 forecast tests pass.
2026-03-19 17:19:49 +04:00
Elie Habib
2deccac691 fix(forecast): allocate publish output by family (#1868)
* fix(forecast): allocate publish output by family

* fix(forecast): backfill deferred family selections
2026-03-19 11:42:12 +04:00
Elie Habib
ee0f124b3f feat(forecast): add family spillover engine (#1866)
* feat(forecast): add family spillover engine

* fix(forecast): require direct spillover links

* fix(forecast): stabilize family spillover wiring
2026-03-19 10:52:06 +04:00
Elie Habib
0b338afed8 fix(forecast): calibrate simulation posture scoring (#1860)
* fix(forecast): calibrate simulation posture scoring

* fix(forecast): version and rebalance simulation scoring
2026-03-19 09:48:41 +04:00
Elie Habib
15e2a6fccb feat(forecast): drive simulation rounds from actor actions (#1858) 2026-03-19 09:08:04 +04:00
Elie Habib
214b17d757 fix(forecast): align published and candidate state surfaces (#1852)
* fix(forecast): align published and candidate state surfaces

* fix(forecast): preserve projected published situations
2026-03-19 08:24:26 +04:00
Elie Habib
568408b0ca fix(forecast): tighten simulation effect links (#1851) 2026-03-19 03:55:19 +04:00
Elie Habib
10958a397b feat(forecast): synthesize reports from simulation state (#1850) 2026-03-19 03:40:22 +04:00
Elie Habib
a40c0a11fb feat(forecast): add simulation state transitions (#1847) 2026-03-19 03:25:52 +04:00
Elie Habib
a3492e8b4e fix(forecast): refine world-state situation clustering (#1843) 2026-03-19 02:55:22 +04:00
Elie Habib
47e942011b fix(forecast): improve situation-aware publish quality (#1840) 2026-03-19 02:11:27 +04:00
Elie Habib
2b228da916 fix(forecast): dedupe situation-overlap forecasts (#1807)
* fix(forecast): dedupe situation-overlap forecasts

* fix(forecast): reuse situation clusters in publish flow

* fix(forecast): reuse publish trace context
2026-03-18 17:57:03 +04:00
Elie Habib
e58a608262 fix(forecast): make trace writing single-writer (#1801)
* fix(forecast): make trace writing single-writer

* fix(forecast): preserve chained refresh requests
2026-03-18 11:00:43 +04:00
Elie Habib
527002f873 fix(forecast): improve trace enrichment diagnostics (#1797)
* fix(forecast): avoid duplicate prior world-state read

* feat(forecast): record llm enrichment failure reasons

* fix(forecast): preserve latest pointer continuity fallback
2026-03-18 09:55:28 +04:00
Elie Habib
6b1ea49397 feat(forecast): add report continuity history (#1788)
* feat(forecast): cluster situations in world state

* feat(forecast): add report continuity history

* fix(forecast): stabilize report continuity matching
2026-03-18 00:27:03 +04:00
Elie Habib
c1f8aa516b feat(forecast): add situation clustering to world state (#1785)
* feat(forecast): cluster situations in world state

* fix(forecast): stabilize situation continuity ids
2026-03-17 22:47:58 +04:00
Elie Habib
3ba56997af feat(forecast): add world-state report synthesis (#1780) 2026-03-17 19:19:28 +04:00
Elie Habib
0296758398 feat(forecast): add world-state continuity to main (#1779)
* feat(forecast): add actor continuity to world state

* fix(forecast): report full actor continuity counts

* feat(forecast): add branch continuity to world state
2026-03-17 18:49:58 +04:00
Elie Habib
8cdca53bd8 feat(forecast): persist run-level world state (#1773)
* feat(forecast): persist run-level world state

* fix(forecast): align world state artifacts
2026-03-17 18:21:56 +04:00
Elie Habib
e2f0811330 fix(forecast): tighten quality and enrichment balance (#1761) 2026-03-17 14:03:13 +04:00