fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding (#3281)

* fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding

Shadow-diff of 15 v2 pairs (2026-04-22) showed the analyst pattern-matching the loudest context numbers (VIX 19.50, top forecast probability, MidEast FX stress 77) into every story regardless of editorial fit. A Rwanda humanitarian story about refugees cited VIX; an aviation story cited a forecast probability.

Root cause: every story got the same 6-bundle context block, so the LLM had markets / forecasts / macro in-hand and the "cite a specific fact" instruction did the rest.

Two-layer fix:

1. STRUCTURAL: sectionsForCategory() maps the story's category to an editorially-relevant subset of bundles. Humanitarian stories don't see marketData / forecasts / macroSignals; diplomacy gets riskScores only; market/energy gets markets+forecasts but drops riskScores. The model physically cannot cite what it wasn't given. Unknown categories fall back to all six (backcompat).

2. PROMPT: WHY_MATTERS_ANALYST_SYSTEM_V2 adds a RELEVANCE RULE that explicitly permits grounding in headline/description actors when no context fact is a clean fit, and bans dragging off-topic market metrics into humanitarian/aviation/diplomacy stories. The prompt footer (inline, per-call) restates the same guardrail, because models follow inline instructions more reliably than system-prompt constraints on longer outputs.

Cache keys bumped to invalidate the formulaic v5 output: endpoint v5 to v6, shadow v3 to v4.

Adds 11 unit tests pinning the 5 policies + default fallback + humanitarian structural guarantee + market policy does-see-markets + guardrail footer presence.

Observability: endpoint now logs policyLabel per call so operators can confirm in Vercel logs that humanitarian/aviation stories are NOT seeing marketData without dumping the full prompt.

* test(brief): address greptile P2: sync MAX_BODY_BYTES + add parseWhyMattersV2 coverage

Greptile PR #3281 review raised two P2 test-quality issues:

1. Test-side MAX_BODY_BYTES mirror was still 4096, while the endpoint was bumped to 8192 in PR #3269 (v2 output + description). With the stale constant, a payload in the 4097–8192 range was accepted by the real endpoint but looked oversize in the test mirror, letting the body-cap invariant silently drift. Fixed by syncing to 8192 + bumping the bloated fixture to 10_000 bytes so a future endpoint-cap bump doesn't silently re-invalidate the assertion.

2. parseWhyMattersV2 (the only output-validation gate on the analyst path) had no dedicated unit tests. Adds 11 targeted cases covering: valid 2 and 3 sentence output, 100/500 char bounds (incl. boundary assertions), all 6 banned preamble phrases, section-label leaks (SITUATION/ANALYSIS/Watch), markdown leakage (#, -, *, 1.), stub echo rejection, smart/plain quote stripping, non-string defensive branch, and whitespace-only strings.

Suite size: 50 to 61 tests, all green.

* fix(brief): add aviation policy to sectionsForCategory (PR #3281 review P1)

Reviewer caught that aviation was named in WHY_MATTERS_ANALYST_SYSTEM_V2's RELEVANCE RULE as a category banned from off-topic market metrics, but had no matching regex entry in CATEGORY_SECTION_POLICY. So 'Aviation Incident' / 'Airspace Closure' / 'Plane Crash' / 'Drone Incursion' all fell through to DEFAULT_SECTIONS and still got all 6 bundles including marketData, forecasts, and macroSignals, which is exactly the VIX / forecast probability pattern the PR claimed to structurally prevent.

Reproduced on HEAD before fix:

    Aviation Incident -> default
    Airspace Closure -> default
    Plane Crash -> default
    ...etc.

Fix:

1. Adds aviation policy (same 3 bundles as humanitarian/diplomacy/tech: worldBrief, countryBrief, riskScores).

2. Adds dedicated aviation-gating test with 6 category variants.

3. Adds meta-invariant test: every category named in the system prompt's RELEVANCE RULE MUST have a structural policy entry, asserting policyLabel !== 'default'. If someone adds a new category name to the prompt in the future, this test fires until they wire up a regex, which prevents soft-guard drift.

4. Removes 'Aviation Incident' from the default-fall-through test list (it now correctly matches aviation).

No cache bump needed: v6 was published to the feature branch only a few minutes ago, and no production entries have been written yet.
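
The category gating described in this message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in shared/brief-llm-core.js: the names sectionsForCategory, CATEGORY_SECTION_POLICY, DEFAULT_SECTIONS, and policyLabel come from the message, but the regex patterns, the exact bundle list for the market policy, and the `{ policyLabel, sections }` return shape are hypothetical.

```javascript
// Illustrative sketch only. Patterns, the market bundle list, and the return
// shape are assumptions; the real policy table may differ.
const DEFAULT_SECTIONS = [
  'worldBrief', 'countryBrief', 'riskScores',
  'forecasts', 'macroSignals', 'marketData',
];

// Bundles the message lists for humanitarian/diplomacy/tech/aviation.
const NARRATIVE_SECTIONS = ['worldBrief', 'countryBrief', 'riskScores'];

// Assumption: the market policy keeps everything except riskScores.
const MARKET_SECTIONS = DEFAULT_SECTIONS.filter((s) => s !== 'riskScores');

// First matching entry wins. No `g` flag, so RegExp.test() stays stateless.
const CATEGORY_SECTION_POLICY = [
  { label: 'humanitarian', pattern: /humanitarian|refugee|famine/i, sections: NARRATIVE_SECTIONS },
  { label: 'aviation', pattern: /aviation|airspace|plane|drone/i, sections: NARRATIVE_SECTIONS },
  { label: 'diplomacy', pattern: /diplomacy|diplomatic|summit/i, sections: NARRATIVE_SECTIONS },
  { label: 'tech', pattern: /tech|cyber/i, sections: NARRATIVE_SECTIONS },
  { label: 'market', pattern: /market|energy/i, sections: MARKET_SECTIONS },
];

// Map a story category to the context bundles it is allowed to see.
// Unknown or non-string categories fall back to all six bundles.
function sectionsForCategory(category) {
  const name = typeof category === 'string' ? category : '';
  const hit = CATEGORY_SECTION_POLICY.find((p) => p.pattern.test(name));
  return hit
    ? { policyLabel: hit.label, sections: [...hit.sections] }
    : { policyLabel: 'default', sections: [...DEFAULT_SECTIONS] };
}

// Keep only the permitted bundles, so withheld data never reaches the prompt.
function gateContext(context, category) {
  const { policyLabel, sections } = sectionsForCategory(category);
  const gated = {};
  for (const key of sections) {
    if (key in context) gated[key] = context[key];
  }
  return { policyLabel, context: gated };
}

console.log(sectionsForCategory('Plane Crash').policyLabel); // -> 'aviation'
```

Under these assumptions, a meta-invariant check of the kind the third commit describes would loop over the category names in the RELEVANCE RULE (humanitarian, aviation, diplomacy, cyber) and assert that none of them resolves to the 'default' label.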
2026-04-25 17:14:57 +02:00 · 2026-04-22 08:21:01 +04:00
parent fbaf07e106
commit 425507d15a
5 changed files with 518 additions and 39 deletions
--- a/shared/brief-llm-core.js
+++ b/shared/brief-llm-core.js
@@ -144,17 +144,28 @@ export const WHY_MATTERS_ANALYST_SYSTEM_V2 =
  'on why the story matters.\n\n' +
  'STRUCTURE:\n' +
  '1. SITUATION — what is happening right now, grounded in a SPECIFIC named actor, ' +
-  'metric, date, or place drawn from the context.\n' +
+  'metric, date, or place relevant to this story.\n' +
  '2. ANALYSIS — the structural consequence (why this forces a repricing, shifts ' +
  'the balance, triggers a cascade).\n' +
  '3. (Optional) WATCH — the threshold or indicator to track, if clear from the context.\n\n' +
  'HARD CONSTRAINTS:\n' +
  '- Total length 40–70 words across 2–3 sentences.\n' +
  '- MUST reference at least ONE specific: named person / country / organization / ' +
-  'number / percentage / date / city — drawn from the context, NOT invented.\n' +
+  'number / percentage / date / city.\n' +
  '- No preamble ("This matters because…", "The importance of…").\n' +
  '- No markdown, no bullet points, no section labels in the output — plain prose.\n' +
-  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.';
+  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.\n\n' +
+  'RELEVANCE RULE (critical, read carefully):\n' +
+  '- The context block may contain facts from world-brief, country-brief, risk scores, ' +
+  'forecasts, macro signals, and market data. These are BACKGROUND — only cite what is ' +
+  "directly relevant to this story's category and country.\n" +
+  '- If NO context fact clearly fits, ground instead in a named actor, place, date, ' +
+  'or figure drawn from the headline or description. That is a VALID grounding — do ' +
+  'NOT invent a market reading, VIX value, or forecast probability to satisfy the rule.\n' +
+  '- NEVER drag an off-topic market metric, FX reading, or probability into a ' +
+  'humanitarian, aviation, diplomacy, or cyber story. A story about a refugee flow ' +
+  'does not need a VIX number; a story about a drone incursion does not need an FX ' +
+  "stress reading. If it isn't editorially connected to the story, leave it out.";

 /**
 * Parse + validate the analyst-path v2 LLM response. Accepts