worldmonitor/todos/002-complete-p1-refinement-fires-on-empty-validation.md
Elie Habib 226cebf9bc feat(deep-forecast): Phase 2+3 scoring recalibration + autoresearch prompt self-improvement (#2178)
* fix(deep-forecast): lower acceptance threshold 0.60→0.50 to match real score distribution

computeDeepPathAcceptanceScore formula: pathScore×0.55 + quality×0.20 + coherence×0.15
With pathScore≈0.65, quality≈0.30, coherence≈0.55:
  0.358 + 0.060 + 0.083 = 0.50

The 0.60 threshold was calibrated before understanding that reportableQualityScore
is constrained by world-state simulation geometry (not hypothesis quality), and
coherence loses 0.15 for generic candidates without routeFacilityKey. The threshold
was structurally unreachable with typical expanded paths.
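The arithmetic above can be checked directly. A minimal sketch of the formula as described in this commit (the function name is from the commit text; any additional terms beyond these three weights are unknown):

```javascript
// Acceptance score = pathScore×0.55 + quality×0.20 + coherence×0.15
const DEEP_PATH_ACCEPTANCE_THRESHOLD = 0.50; // lowered from 0.60

function computeDeepPathAcceptanceScore({ pathScore, quality, coherence }) {
  return pathScore * 0.55 + quality * 0.20 + coherence * 0.15;
}

// Typical expanded path from the commit message:
const typicalPath = { pathScore: 0.65, quality: 0.30, coherence: 0.55 };
console.log(computeDeepPathAcceptanceScore(typicalPath).toFixed(3)); // "0.500"
```

At these typical inputs the score lands exactly at the new 0.50 threshold, which is why 0.60 was structurally unreachable.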

Verified end-to-end: deep worker now returns [DeepForecast] completed.

Also updates T6 gateDetails assertion and renames the rejection-floor test to
correctly describe the new behavior (strong inputs should be accepted).

111/111 tests pass.

* feat(deep-forecast): autoresearch prompt self-improvement loop + T9/T10 tests

- Add scoreImpactExpansionQuality() locked scorer: commodity rate (35%),
  variable diversity (35%), chain coverage (20%), mapped rate (10%)
- Add runImpactExpansionPromptRefinement(): rate-limited LLM critic loop
  (30min cooldown) that reads learned section from Redis, scores current
  run, generates critique if composite < 0.62, tests on same candidates,
  commits to forecast:prompt:impact-expansion:learned if score improves
- buildImpactExpansionSystemPrompt() now accepts learnedSection param,
  appends it after core rules with separator so model sees prior examples
- buildImpactExpansionCandidateHash() includes learnedFingerprint to
  bust cache when learned section changes
- processDeepForecastTask reads learnedSection from Redis before LLM
  call, runs refinement after both completed and no_material_change paths
- Export scoreImpactExpansionQuality + runImpactExpansionPromptRefinement
- T9: high commodity rate + chain coverage → composite > 0.70
- T10: no commodity + no chain coverage → composite < 0.40
- 113/113 tests pass
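A sketch of the locked scorer's composite, using the four weights listed above (metric names and how each rate is derived from validation data are assumptions; only the weights come from the commit):

```javascript
// Locked quality weights from the commit: commodity 35%, diversity 35%,
// chain coverage 20%, mapped rate 10% — sums to 1.0.
const IMPACT_EXPANSION_QUALITY_WEIGHTS = {
  commodityRate: 0.35,     // share of mapped hypotheses naming a specific commodity
  variableDiversity: 0.35, // unique variable keys relative to candidate count
  chainCoverage: 0.20,     // share of candidates with a multi-step causal chain
  mappedRate: 0.10,        // share of hypotheses that mapped to world signals
};

function compositeQuality(metrics) {
  return Object.entries(IMPACT_EXPANSION_QUALITY_WEIGHTS)
    .reduce((sum, [key, weight]) => sum + weight * (metrics[key] ?? 0), 0);
}

// T9-style inputs: high commodity rate + chain coverage pushes composite past 0.70
compositeQuality({ commodityRate: 1.0, variableDiversity: 0.5, chainCoverage: 1.0, mappedRate: 1.0 });
```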

* fix(deep-forecast): raise autoresearch threshold 0.62→0.80 + fix JSON parse

- Threshold 0.62 was too low: commodity=1.00 + chain=1.00 compensated for
  diversity=0.50 (all same chains), keeping composite at 0.775 → no critique
- Raise to 0.80 so diversity<0.70 triggers critique even with good commodity/chain
- Fix JSON parser to extract first {…} block (handles Gemini code-fence wrapping)
- Add per-hypothesis log in refinement breakdown for observability
- Add refinementQualityThreshold to gateDetails for self-documenting artifacts
- Verified: critique fires on diversity=0.50 run, committed Hormuz/Baltic/Suez
  region-specific chain examples (score 0.592→0.650)
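A hedged sketch of the "extract first {…} block" parse fix — the repo's actual parser may differ, but the idea is to scan for the first balanced object so Gemini's code-fence wrapping is ignored:

```javascript
// Extract and parse the first balanced {...} block from raw LLM output,
// tolerating surrounding prose or ```json fences.
function extractFirstJsonObject(raw) {
  const start = raw.indexOf('{');
  if (start === -1) return null;
  let depth = 0;
  let inString = false;
  for (let i = start; i < raw.length; i++) {
    const ch = raw[i];
    if (inString) {
      if (ch === '\\') i++;            // skip the escaped character
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === '{') depth++;
    else if (ch === '}') {
      depth--;
      if (depth === 0) {
        try { return JSON.parse(raw.slice(start, i + 1)); }
        catch { return null; }
      }
    }
  }
  return null; // unbalanced braces
}

// Works whether or not the model wraps its answer in a code fence:
extractFirstJsonObject('```json\n{"score": 0.65}\n```');
```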

* feat(deep-forecast): per-candidate parallel LLM calls replace batch expansion

Previously: all candidates → one batch LLM call → LLM averages context →
identical route_disruption → inflation_pass_through chains for all candidates.

Now: each candidate → its own focused LLM call (parallel Promise.all) →
LLM reasons about specific stateKind/region/routeFacility for that candidate.

Results (3 candidates, 3 parallel calls):
- composite: 0.592 → 0.831 (+0.24)
- commodity: 0.17 → 1.00 (all mapped have specific commodity)
- diversity: 0.50 → 0.83 (energy_export_stress, importer_balance_stress
  appearing alongside route_disruption — genuinely different chains)
- Baseline updated: 0.831 (above 0.80 threshold → no critique needed)

Also threads learnedSection through extractSingleImpactExpansionCandidate
so the learned examples from autoresearch apply to each focused call.
Per-candidate cache keys (already existed) now serve as primary cache.
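The fan-out described above can be sketched as follows (function and field names are assumptions drawn from the commit text, not the repo's actual signatures):

```javascript
// One focused LLM call per candidate, run in parallel, replacing the old
// single batch call that averaged context across candidates.
async function expandCandidates(candidates, callLlm) {
  return Promise.all(
    candidates.map((candidate) =>
      callLlm({
        stateKind: candidate.stateKind,
        region: candidate.region,
        routeFacilityKey: candidate.routeFacilityKey,
      }),
    ),
  );
}
```

Because each prompt carries only one candidate's stateKind/region/routeFacility, the model can no longer collapse all candidates into one generic chain.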

* fix(tests): update recovery test for per-candidate LLM call flow

- Change stage mock from impact_expansion → impact_expansion_single
  (batch primary path removed, per-candidate is now primary)
- Assert parseMode === per_candidate instead of parseStage /^recovered_/
  (recovered_ prefix was only set by old batch_repair path)
- 2257/2257 tests pass

* fix(deep-forecast): add Red Sea, Persian Gulf, South China Sea to chokepoint map

Candidate packets had routeFacilityKey=none for Red Sea / Persian Gulf /
Baltic Sea signals because prediction titles say "Red Sea maritime disruption"
not "Bab el-Mandeb" or "Strait of Hormuz". CHOKEPOINT_MARKET_REGIONS only
had sub-facility names (Bab el-Mandeb, Suez Canal) as keys, not the sea
regions themselves.

Fix: add Red Sea, Persian Gulf, Arabian Sea, Black Sea, South China Sea,
Mediterranean Sea as direct keys so region-level candidate titles resolve.

Result: LLM user prompt now shows routeFacilityKey=Red Sea / Persian Gulf /
Baltic Sea per candidate — giving each focused call the geographic context
needed to generate route-specific chains.

- Autoresearch baseline updated 0.932→0.965 on this run
- T8 extended with Red Sea, Persian Gulf, South China Sea assertions
- 2257/2257 tests pass
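A sketch of the region-level lookup this fix enables. The map name comes from the commit; the market-region values and the matching helper here are illustrative assumptions:

```javascript
// Sub-facility keys existed before this fix; the sea-region keys are the
// additions that let region-level candidate titles resolve.
const CHOKEPOINT_MARKET_REGIONS = {
  'Bab el-Mandeb': ['red_sea_freight'],
  'Suez Canal': ['red_sea_freight'],
  'Strait of Hormuz': ['gulf_crude'],
  // Added: direct sea-region keys
  'Red Sea': ['red_sea_freight'],
  'Persian Gulf': ['gulf_crude'],
  'South China Sea': ['asia_container_freight'],
};

function resolveRouteFacilityKey(title) {
  const lower = title.toLowerCase();
  return Object.keys(CHOKEPOINT_MARKET_REGIONS)
    .find((key) => lower.includes(key.toLowerCase())) ?? 'none';
}

resolveRouteFacilityKey('Red Sea maritime disruption'); // resolves instead of 'none'
```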

* feat(deep-forecast): free-form hypothesis schema + remove registry constraint

- Bump IMPACT_EXPANSION_REGISTRY_VERSION to v4
- Add hypothesisKey, description, geography, affectedAssets, marketImpact, causalLink fields to normalizeImpactHypothesisDraft (keep legacy fields for backward compat)
- Rewrite buildImpactExpansionSystemPrompt: remove IMPACT_VARIABLE_REGISTRY constraint table, use free-form ImpactHypothesis schema with geographic/commodity specificity rules
- Rewrite evaluateImpactHypothesisRejection: use effective key (hypothesisKey || variableKey) for dedup; legacy registry check only for old cached responses without hypothesisKey
- Update validateImpactHypotheses scoring: add geographyScore, commodityScore, causalLinkScore, assetScore terms; channelCoherence/bucketCoherence only apply to legacy responses
- Update parent-must-be-mapped invariant to use hypothesisKey || variableKey as effective key
- Update mapImpactHypothesesToWorldSignals: use effective key for dedup and sourceKey; prefer description/geography over legacy fields
- Update buildImpactPathsForCandidate: match on hypothesisKey || variableKey for parent lookup
- Update buildImpactPathId: use hypothesisKey || variableKey for hash inputs
- Rewrite scoreImpactExpansionQuality: add geographyRate and assetRate metrics; update composite weights
- Update buildImpactPromptCritiqueSystemPrompt/UserPrompt: use hypothesisKey-based chain format in examples
- Add new fields to buildImpactExpansionBundleFromPaths push calls
- Update T7 test assertion: MUST be the exact hypothesisKey instead of variableKey string

* fix(deep-forecast): update breakdown log to show free-form hypothesis fields

* feat(deep-forecast): add commodityDiversity metric to autoresearch scorer

- commodityDiversity = unique commodities / nCandidates (weight 0.35)
  Penalizes runs where all candidates default to same commodity.
  3 candidates all crude_oil → diversity=0.33 → composite ~0.76 → critique fires.
- Rebalanced composite weights: comDiversity 0.35, geo 0.20, keyDiversity 0.15, chain 0.10, commodityRate 0.10, asset 0.05, mappedRate 0.05
- Breakdown log now shows comDiversity + geo + keyDiversity
- Critique prompt updated: commodity_monoculture failure mode, diagnosis targets commodity homogeneity
- T9: added commodityDiversity=1.0 assertion (2 unique commodities across 2 candidates)

* refactor(deep-forecast): replace commodityDiversity with directCommodityDiversity + directGeoDiversity + candidateSpreadScore

Problem: measuring diversity on all mapped hypotheses misses the case where
one candidate generates 10 implications while others generate 0, or where
all candidates converge on the same commodity due to dominating signals.

Fix: score at the DIRECT hypothesis level (root causes only) and add
a candidate-spread metric:

- directCommodityDiversity: unique commodities among direct hypotheses /
  nCandidates. Measures breadth at the root-cause level. 3 candidates all
  crude_oil → 0.33 → composite ~0.77 → critique fires.

- directGeoDiversity: unique primary geographies among direct hypotheses /
  nCandidates. First segment of compound geography strings (e.g.
  'Red Sea, Suez Canal' → 'red sea') to avoid double-counting.

- candidateSpreadScore: normalized inverse-HHI. 1.0 = perfectly even
  distribution across candidates. One candidate with 10 implications and
  others with 0 → scores near 0 → critique fires.

Weight rationale: comDiversity 0.35, geoDiversity 0.20, spread 0.15,
chain 0.15, comRate 0.08, assetRate 0.04, mappedRate 0.03.

Verified: Run 2 Baltic/Hormuz/Brazil → freight/crude_oil/USD spread=0.98 ✓
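The normalized inverse-HHI spread metric can be sketched directly from the description above (the production implementation may handle edge cases differently):

```javascript
// candidateSpreadScore: 1.0 = implications spread perfectly evenly across
// candidates; approaches 0 as one candidate dominates.
function candidateSpreadScore(countsByCandidate) {
  const counts = Object.values(countsByCandidate);
  const total = counts.reduce((a, b) => a + b, 0);
  const n = counts.length;
  if (n <= 1 || total === 0) return 0;
  // HHI = sum of squared shares; 1/HHI = "effective number" of candidates.
  const hhi = counts.reduce((sum, c) => sum + (c / total) ** 2, 0);
  const effectiveN = 1 / hhi;
  // Normalize so an even split scores 1 and total concentration scores 0.
  return (effectiveN - 1) / (n - 1);
}

candidateSpreadScore({ a: 5, b: 5, c: 5 });  // ≈ 1 (even spread)
candidateSpreadScore({ a: 10, b: 0, c: 0 }); // 0 (one candidate dominates)
```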

* feat(deep-forecast): add convergence object to R2 debug artifact

Surface autoresearch loop outcome per run: converged (bool), finalComposite,
critiqueIterations (0 or 1), refinementCommitted, and perCandidateMappedCount
(candidateStateId → count). After 5+ runs the artifact alone answers whether
the pipeline is improving.

Architectural changes:
- runImpactExpansionPromptRefinement now returns { iterationCount, committed }
  at all exit paths instead of undefined
- Call hoisted before writeForecastTraceArtifacts so the result flows into the
  debug payload via dataForWrite.refinementResult
- buildImpactExpansionDebugPayload assembles convergence from validation +
  refinementResult; exported for direct testing
- Fix: stale diversityScore reference replaced with directCommodityDiversity

Tests: T-conv-1 (converged=true), T-conv-2 (converged=false + iterations=1),
T-conv-3 (perCandidateMappedCount grouping) — 116/116 pass
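An illustrative shape for the convergence object assembled above (field names come from the commit text; the assembly logic and threshold parameter are assumptions):

```javascript
// Build the convergence summary written into the R2 debug artifact.
function buildConvergence(validation, refinementResult, qualityThreshold) {
  const finalComposite = validation.composite ?? 0;
  return {
    converged: finalComposite >= qualityThreshold,
    finalComposite,
    critiqueIterations: refinementResult?.iterationCount ?? 0, // 0 or 1
    // Group mapped hypotheses by their originating candidate state.
    perCandidateMappedCount: (validation.mapped ?? []).reduce((acc, h) => {
      acc[h.candidateStateId] = (acc[h.candidateStateId] ?? 0) + 1;
      return acc;
    }, {}),
  };
}
```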

* fix(deep-forecast): address P1+P2 review issues from convergence observability PR

P1-A: sanitize LLM-returned proposed_addition before Redis write (prompt injection
      guard via sanitizeProposedLlmAddition — strips directive-phrase lines)
P1-B: restore fire-and-forget for runImpactExpansionPromptRefinement; compute
      critiqueIterations from quality score (predicted) instead of awaiting result,
      eliminating 15-30s critical-path latency on poor-quality runs
P1-C: processDeepForecastTask now returns convergence object to callers; add
      convergence_quality_met warn check to evaluateForecastRunArtifacts
P1-D: cap concurrent LLM calls in extractImpactExpansionBundle to 3 (manual
      batching — no p-limit) to respect provider rate limits

P2-1: hash full learnedSection in buildImpactExpansionCandidateHash (was sliced
      to 80 chars, causing cache collisions on long learned sections)
P2-2: add exitReason field to all runImpactExpansionPromptRefinement return paths
P2-3: sanitizeForPrompt strips directive injection phrases; new
      sanitizeProposedLlmAddition applies line-level filtering before Redis write
P2-4: add comment explaining intentional bidirectional affectedAssets/assetsOrSectors
      coalescing in normalizeImpactHypothesisDraft
P2-5: extract makeConvTestData helper in T-conv tests; remove refinementCommitted
      assertions (field removed from convergence shape)
P2-6: convergence_quality_met check added to evaluateForecastRunArtifacts (warn)
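The manual batching in P1-D can be sketched as chunked Promise.all (a simple pattern under the stated constraint of no p-limit dependency; the helper name is an assumption):

```javascript
// Cap concurrent LLM calls by processing items in fixed-size chunks.
const MAX_CONCURRENT_LLM_CALLS = 3;

async function mapWithConcurrencyCap(items, fn, cap = MAX_CONCURRENT_LLM_CALLS) {
  const results = [];
  for (let i = 0; i < items.length; i += cap) {
    const chunk = items.slice(i, i + cap);
    // Each chunk completes before the next starts — a slow call stalls its
    // whole chunk, the trade-off of batching vs. a sliding-window limiter.
    results.push(...(await Promise.all(chunk.map(fn))));
  }
  return results;
}
```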

🤖 Generated with Claude Sonnet 4.6 via Claude Code (https://claude.ai/claude-code) + Compound Engineering v2.49.0

Co-Authored-By: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>

* fix(docs): add blank lines around lists in plan (MD032)

* fix(deep-forecast): address P1+P2 reviewer issues in convergence observability

P1-1: mapImpactHypothesesToWorldSignals used free-form marketImpact values
(price_spike, shortage, credit_stress, risk_off) verbatim as signal channel
types, producing unknown types that buildMarketTransmissionGraph cannot
consume. Add IMPACT_SIGNAL_CHANNELS set + resolveImpactChannel() to map
free-form strings to the nearest valid channel before signal materialization.
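A sketch of this channel-resolution guard. The valid channel set and matching heuristics below are illustrative assumptions, not the repo's actual list:

```javascript
// Valid signal channels that buildMarketTransmissionGraph can consume
// (assumed membership — the commit only names the set and the resolver).
const IMPACT_SIGNAL_CHANNELS = new Set([
  'route_disruption',
  'inflation_pass_through',
  'energy_export_stress',
  'importer_balance_stress',
]);

function resolveImpactChannel(freeForm) {
  if (IMPACT_SIGNAL_CHANNELS.has(freeForm)) return freeForm;
  // Map common free-form LLM outputs to the nearest valid channel.
  if (/price|spike|inflation/.test(freeForm)) return 'inflation_pass_through';
  if (/shortage|route|blockade/.test(freeForm)) return 'route_disruption';
  if (/export|supply|crude/.test(freeForm)) return 'energy_export_stress';
  return 'importer_balance_stress'; // conservative fallback
}

resolveImpactChannel('price_spike'); // resolves to a consumable channel
```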

P1-2: sanitizeForPrompt had directive-phrase stripping added that was too
broad for a function called on headlines, evidence tables, case files, and
geopolitical summaries. Reverted to original safe sanitizer (newline/control
char removal only). Directive stripping remains in sanitizeProposedLlmAddition
where it is scoped to Redis-bound LLM-generated additions only.

P2: Renamed convergence.critiqueIterations to predictedCritiqueIterations to
make clear this is a prediction from the quality score, not a measured count
from actual refinement behavior (refinement is fire-and-forget after artifact
write). Updated T-conv-1/2 test assertions to match.

* feat(deep-forecast): inject live news headlines into evidence table

Wire inputs.newsInsights / inputs.newsDigest through the candidate
selection pipeline so buildImpactExpansionEvidenceTable receives up to
3 commodity-relevant live headlines as 'live_news' evidence entries.

Changes:
- IMPACT_COMMODITY_LEXICON: extend fertilizer pattern (fertiliser,
  nitrogen, phosphate, npk); add food_grains and shipping_freight entries
- filterNewsHeadlinesByState: new pure helper that scores headlines by
  alert status, LNG/energy/route/sanctions signal match, lexicon commodity
  match, and source corroboration count (min score 2 to include)
- buildImpactExpansionEvidenceTable: add newsItems param, inject
  live_news entries, raise cap 8→11
- buildImpactExpansionCandidate: add newsInsights/newsDigest params,
  compute newsItems via filterNewsHeadlinesByState
- selectImpactExpansionCandidates: add newsInsights/newsDigest to options
- Call site: pass inputs.newsInsights/newsDigest at seed time
- Export filterNewsHeadlinesByState, buildImpactExpansionEvidenceTable
- 9 new tests (T-news-1 through T-lex-3): all pass, 125 total pass
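The additive scoring in filterNewsHeadlinesByState can be sketched as below. The signal fields and regexes are assumptions; only the four scoring dimensions, the minimum score of 2, and the cap of 3 come from the commit:

```javascript
// Score a headline against the candidate state's signals and lexicon entry.
function scoreHeadline(item, state) {
  let score = 0;
  if (item.alert) score += 1;                          // alert status
  if (state.signalRe?.test(item.title)) score += 1;    // energy/route/sanctions match
  if (state.commodityRe?.test(item.title)) score += 1; // lexicon commodity match
  if ((item.sources ?? 1) >= 2) score += 1;            // corroborated by 2+ sources
  return score;
}

// Keep the top headlines scoring at least 2, capped for the evidence table.
function filterNewsHeadlinesByState(items, state, cap = 3) {
  return items
    .map((item) => ({ item, score: scoreHeadline(item, state) }))
    .filter(({ score }) => score >= 2)
    .sort((a, b) => b.score - a.score)
    .slice(0, cap)
    .map(({ item }) => item);
}
```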

🤖 Generated with Claude Sonnet 4.6 (200K context) via Claude Code (https://claude.ai/claude-code) + Compound Engineering v2.49.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(deep-forecast): remove hardcoded LNG boost from filterNewsHeadlinesByState

The LNG+2 score was commodity-specific and inconsistent with the
intent: headline scoring should be generic, not biased toward any
named commodity. The function already handles the state's detected
commodity dynamically via lexEntry.pattern (IMPACT_COMMODITY_LEXICON).
LNG headlines still score via CRITICAL_NEWS_ENERGY_RE (+1) and
CRITICAL_NEWS_ROUTE_RE (+1) when relevant to the state's region.
All 125 tests pass.

* fix(deep-forecast): address all P1+P2 code review findings from PR #2178

P1 fixes (block-merge):
- Lower third_order mapped floor 0.74→0.70 (max achievable via 0.72 multiplier was 0.72)
- Guard runImpactExpansionPromptRefinement against empty validation (no_mapped exit)
- Replace block-list sanitizeProposedLlmAddition with pattern-based allowlist (HTML/JS/directive takeover)
- Fix TOCTOU on PROMPT_LAST_ATTEMPT_KEY: claim slot before quality check, not after LLM call

P2 fixes:
- Fix learned section overflow: use slice(-MAX) to preserve tail, not discard all prior content
- Add safe_haven_bid and global_crude_spread_stress branches to resolveImpactChannel
- quality_met path now sets rate-limit key (prevents 3 Redis GETs per good run)
- Hoist extractNewsClusterItems outside stateUnit map in selectImpactExpansionCandidates
- Export PROMPT_LEARNED_KEY, PROMPT_BASELINE_KEY, PROMPT_LAST_ATTEMPT_KEY + read/clear helpers

All 125 tests pass.
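The learned-section overflow fix (P2, first bullet) reduces to a slice(-MAX) that keeps the newest tail. A sketch under assumed constant and function names:

```javascript
// Cap the learned section while preserving its most recent content.
const MAX_LEARNED_SECTION_CHARS = 4000; // assumed value for illustration

function appendLearnedSection(existing, addition, max = MAX_LEARNED_SECTION_CHARS) {
  const combined = existing ? `${existing}\n${addition}` : addition;
  // slice(-max) keeps the newest examples; the old code discarded all prior
  // content once the cap was exceeded.
  return combined.length > max ? combined.slice(-max) : combined;
}
```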

* fix(todos): add blank lines around lists/headings in todo files (markdownlint)

* fix(todos): fix markdownlint blanks-around-headings/lists in all todo files

---------

Co-authored-by: Claude Sonnet 4.6 (200K context) <noreply@anthropic.com>
2026-03-24 18:52:02 +04:00


| status | priority | issue_id | tags |
| ------ | -------- | -------- | ---- |
| pending | p1 | 002 | code-review, deep-forecast, llm-budget, architecture |

# runImpactExpansionPromptRefinement fires LLM call when validation is empty

## Problem Statement

In processDeepForecastTask, the non-completed branch (status != 'completed') calls runImpactExpansionPromptRefinement(candidatePackets, evaluation.validation || {}). When evaluation.validation is empty (no mapped hypotheses), scoreImpactExpansionQuality({}, candidatePackets) produces composite=0.0, which is below the 0.80 quality threshold, unconditionally triggering the critique LLM call on every failed deep forecast run. This wastes LLM budget with no useful signal.

## Findings

- seed-forecasts.mjs:14215-14219 — non-completed branch calls refinement with evaluation.validation || {}
- scoreImpactExpansionQuality({}, candidatePackets) → composite=0 → triggers critique
- The critique prompt receives empty validation data, so it cannot generate useful guidance
- A rate limit is set (preventing an infinite loop), but one wasted LLM call per non-completed run still occurs

## Proposed Solutions

Option A: Guard at the call site

```js
if (evaluation.validation?.mapped?.length > 0) {
  runImpactExpansionPromptRefinement(candidatePackets, evaluation.validation)
    .catch((err) => console.warn('[PromptRefinement] Error:', err.message));
}
```

Effort: Small | Risk: Low

Option B: Always pass to refinement but guard inside the function

- Add an early exit in runImpactExpansionPromptRefinement when validation.mapped?.length === 0
- Effort: Small | Risk: Low (duplicates the guard logic)

## Acceptance Criteria

- Deep forecast run with no mapped hypotheses does NOT trigger LLM critique call
- Log message confirms skip: [PromptRefinement] Skipping — no mapped hypotheses
- Existing refinement tests still pass

## Technical Details

- File: scripts/seed-forecasts.mjs:14215
- Function: processDeepForecastTask

## Work Log

- 2026-03-24: Found by kieran-typescript-reviewer in PR #2178 review