fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding (#3281)

* fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding

  Shadow-diff of 15 v2 pairs (2026-04-22) showed the analyst pattern-matching the loudest context numbers (VIX 19.50, top forecast probability, MidEast FX stress 77) into every story regardless of editorial fit. A Rwanda humanitarian story about refugees cited VIX; an aviation story cited a forecast probability.

  Root cause: every story got the same 6-bundle context block, so the LLM had markets / forecasts / macro in-hand and the "cite a specific fact" instruction did the rest.

  Two-layer fix:

  1. STRUCTURAL: sectionsForCategory() maps the story's category to an editorially-relevant subset of bundles. Humanitarian stories don't see marketData / forecasts / macroSignals; diplomacy keeps riskScores but likewise drops marketData / forecasts / macroSignals; market/energy gets markets+forecasts but drops riskScores. The model physically cannot cite what it wasn't given. Unknown categories fall back to all six (backcompat).
  2. PROMPT: WHY_MATTERS_ANALYST_SYSTEM_V2 adds a RELEVANCE RULE that explicitly permits grounding in headline/description actors when no context fact is a clean fit, and bans dragging off-topic market metrics into humanitarian/aviation/diplomacy stories. The prompt footer (inline, per-call) restates the same guardrail, because models follow inline instructions more reliably than system-prompt constraints on longer outputs.

  Cache keys bumped to invalidate the formulaic v5 output: endpoint v5 to v6, shadow v3 to v4.

  Adds 11 unit tests pinning the 5 policies + default fallback + humanitarian structural guarantee + market policy does-see-markets + guardrail footer presence.

  Observability: endpoint now logs policyLabel per call so operators can confirm in Vercel logs that humanitarian/aviation stories are NOT seeing marketData without dumping the full prompt.

* test(brief): address greptile P2 - sync MAX_BODY_BYTES + add parseWhyMattersV2 coverage

  Greptile PR #3281 review raised two P2 test-quality issues:

  1. Test-side MAX_BODY_BYTES mirror was still 4096; the endpoint was bumped to 8192 in PR #3269 (v2 output + description). With the stale constant, a payload in the 4097–8192 range was accepted by the real endpoint but looked oversize in the test mirror, letting the body-cap invariant silently drift. Fixed by syncing to 8192 + bumping the bloated fixture to 10_000 bytes so a future endpoint-cap bump doesn't silently re-invalidate the assertion.
  2. parseWhyMattersV2 (the only output-validation gate on the analyst path) had no dedicated unit tests. Adds 11 targeted cases covering: valid 2 and 3 sentence output, 100/500 char bounds (incl. boundary assertions), all 6 banned preamble phrases, section-label leaks (SITUATION/ANALYSIS/Watch), markdown leakage (#, -, *, 1.), stub echo rejection, smart/plain quote stripping, non-string defensive branch, and whitespace-only strings.

  Suite size: 50 to 61 tests, all green.

* fix(brief): add aviation policy to sectionsForCategory (PR #3281 review P1)

  Reviewer caught that aviation was named in WHY_MATTERS_ANALYST_SYSTEM_V2's RELEVANCE RULE as a category banned from off-topic market metrics, but had no matching regex entry in CATEGORY_SECTION_POLICY. So 'Aviation Incident' / 'Airspace Closure' / 'Plane Crash' / 'Drone Incursion' all fell through to DEFAULT_SECTIONS and still got all 6 bundles including marketData, forecasts, and macroSignals: exactly the VIX / forecast probability pattern the PR claimed to structurally prevent.

  Reproduced on HEAD before fix:

    Aviation Incident -> default
    Airspace Closure -> default
    Plane Crash -> default
    ...etc.

  Fix:

  1. Adds aviation policy (same 3 bundles as humanitarian/diplomacy/tech: worldBrief, countryBrief, riskScores).
  2. Adds dedicated aviation-gating test with 6 category variants.
  3. Adds meta-invariant test: every category named in the system prompt's RELEVANCE RULE MUST have a structural policy entry, asserting policyLabel !== 'default'. If someone adds a new category name to the prompt in the future, this test fires until they wire up a regex, which prevents soft-guard drift.
  4. Removes 'Aviation Incident' from the default-fall-through test list (it now correctly matches aviation).

  No cache bump needed: v6 was published to the feature branch only a few minutes ago, and no production entries have been written yet.
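The meta-invariant described in the last commit (every category named in the RELEVANCE RULE must resolve to a non-default policy) can be sketched as follows. This is a minimal illustration, not the test that ships in the PR: the `Policy` type, the trimmed regex table, and the helpers `policyLabelFor`, `categoriesNamedInRule`, and `ungatedCategories` are hypothetical stand-ins, and the parsing of the rule sentence assumes the "into a X, Y, or Z story" wording quoted in the diff.

```typescript
// Hypothetical, trimmed-down stand-ins for the real policy table and prompt.
type Policy = { label: string; match: RegExp };

const WITH_AVIATION: Policy[] = [
  { label: 'humanitarian', match: /\b(humanitarian|refuge|migrat)/i },
  { label: 'diplomacy', match: /\b(diplomac|negotia|summit|sanction)/i },
  { label: 'tech', match: /\b(tech|cyber)/i },
  { label: 'aviation', match: /\b(aviation|airspace|aircraft|drone)/i },
];

// Simulates the table before the review fix, when aviation had no entry.
const WITHOUT_AVIATION: Policy[] = WITH_AVIATION.filter((p) => p.label !== 'aviation');

// First match wins; anything unmatched falls back to 'default'.
function policyLabelFor(category: string, policies: Policy[]): string {
  for (const { label, match } of policies) {
    if (match.test(category)) return label;
  }
  return 'default';
}

// The RELEVANCE RULE sentence that names the gated categories.
const RULE_SENTENCE =
  'NEVER drag an off-topic market metric, FX reading, or probability into a ' +
  'humanitarian, aviation, diplomacy, or cyber story.';

// Pull the "a, b, c, or d" list out of "into a <list> story".
function categoriesNamedInRule(rule: string): string[] {
  const m = /into a (.+?) story/.exec(rule);
  const list = m?.[1] ?? '';
  return list
    .split(/,\s*(?:or\s+)?|\s+or\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Categories the prompt names but the structural table does not gate.
function ungatedCategories(rule: string, policies: Policy[]): string[] {
  return categoriesNamedInRule(rule).filter(
    (c) => policyLabelFor(c, policies) === 'default',
  );
}

console.log(ungatedCategories(RULE_SENTENCE, WITHOUT_AVIATION)); // one entry: 'aviation'
console.log(ungatedCategories(RULE_SENTENCE, WITH_AVIATION)); // empty
```

A test built on this idea asserts that the ungated list is empty, so adding a category name to the prompt without a matching regex fails the suite instead of silently falling through to all six bundles.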
2026-04-25 17:14:57 +02:00 · 2026-04-22 08:21:01 +04:00
parent fbaf07e106
commit 425507d15a
5 changed files with 518 additions and 39 deletions
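The diff below notes that the category patterns use a leading `\b` but no trailing `\b`, so that they match word stems. A small sketch of that behavior, using two stems taken from the diff; the variable names and sample categories here are illustrative only and are not part of the change.

```typescript
// Leading \b only: the stem must start a word, but may be followed by anything.
const stemOnly = /\b(diplomac|migrat)/i;
// Leading and trailing \b: the stem must be a whole word on its own.
const bothAnchors = /\b(diplomac|migrat)\b/i;

const samples = ['Diplomacy', 'Migration', 'Migrating Populations'];

// Every sample starts a word with one of the stems, so all three match.
const stemHits = samples.filter((c) => stemOnly.test(c));

// In each sample the stem is followed by another letter, so the trailing
// \b fails and nothing matches; such categories would get the default policy.
const anchoredHits = samples.filter((c) => bothAnchors.test(c));

// The leading \b still blocks mid-word hits: "migrat" appears inside
// "Immigration" but does not start a word there.
const midWord = stemOnly.test('Immigration');

console.log(stemHits.length, anchoredHits.length, midWord); // 3 0 false
```

One consequence worth keeping in mind when adding stems: a category such as "Immigration" is not covered by `migrat` and would need its own alternative in the pattern.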
--- a/api/internal/brief-why-matters.ts
+++ b/api/internal/brief-why-matters.ts
@@ -215,7 +215,13 @@ function validateStoryBody(raw: unknown): ValidationOk | ValidationErr {
 async function runAnalystPath(story: StoryPayload, iso2: string | null): Promise<string | null> {
  try {
    const context = await assembleBriefStoryContext({ iso2, category: story.category });
-    const { system, user } = buildAnalystWhyMattersPrompt(story, context);
+    const { system, user, policyLabel } = buildAnalystWhyMattersPrompt(story, context);
+    // One line per call so we can verify in Vercel logs that humanitarian
+    // / aviation stories are NOT seeing marketData, without dumping the
+    // full prompt (which would include upstream-provided text).
+    console.log(
+      `[brief-why-matters] analyst gate policy=${policyLabel} category="${story.category}" promptLen=${user.length}`,
+    );
    const result = await callLlmReasoning({
      messages: [
        { role: 'system', content: system },
@@ -362,16 +368,23 @@ export default async function handler(req: Request, ctx?: EdgeContext): Promise<

  // Cache identity.
  const hash = await hashBriefStory(story);
-  // v5: `hashBriefStory` now includes `description` as a prompt input
-  // so same-story + different description no longer collide on a single
-  // cache entry (P1 caught in PR #3269 review — endpoint could serve
-  // prose grounded in a PREVIOUS caller's description). Bumping v4→v5
-  // invalidates the short-lived v4 entries written under the buggy
-  // 5-field hash so fresh output lands on the next cron tick.
-  const cacheKey = `brief:llm:whymatters:v5:${hash}`;
-  // Shadow v2→v3 for the same reason — any v2 comparison pairs may be
-  // grounded in the wrong description, so the A/B was noisy.
-  const shadowKey = `brief:llm:whymatters:shadow:v3:${hash}`;
+  // v6: category-gated context + prompt-level RELEVANCE RULE (2026-04-22).
+  // Shadow review of 15 v2 pairs showed the analyst pattern-matching the
+  // loudest context numbers (VIX, forecast probabilities, FX stress) into
+  // every story regardless of editorial fit. Fix ships two layers:
+  //   1. structural — buildContextBlock now only exposes sections that are
+  //      editorially relevant to the story's category (humanitarian stories
+  //      don't see market data, aviation doesn't see macro, etc.).
+  //   2. prompt — WHY_MATTERS_ANALYST_SYSTEM_V2 adds a RELEVANCE RULE that
+  //      explicitly permits grounding in headline/description actors when
+  //      no context fact is a clean fit.
+  // Either layer changes the output distribution enough that v5 prose must
+  // be invalidated — otherwise half the tick's stories would still return
+  // the formulaic v5 strings for up to 24h until TTL.
+  const cacheKey = `brief:llm:whymatters:v6:${hash}`;
+  // Shadow v3→v4 for the same reason — a mid-rollout shadow record
+  // comparing v5-analyst vs gemini is not useful once v6 is live.
+  const shadowKey = `brief:llm:whymatters:shadow:v4:${hash}`;

  // Cache read. Any infrastructure failure → treat as miss (logged).
  let cached: WhyMattersEnvelope | null = null;
--- a/scripts/shared/brief-llm-core.js
+++ b/scripts/shared/brief-llm-core.js
@@ -144,17 +144,28 @@ export const WHY_MATTERS_ANALYST_SYSTEM_V2 =
  'on why the story matters.\n\n' +
  'STRUCTURE:\n' +
  '1. SITUATION — what is happening right now, grounded in a SPECIFIC named actor, ' +
-  'metric, date, or place drawn from the context.\n' +
+  'metric, date, or place relevant to this story.\n' +
  '2. ANALYSIS — the structural consequence (why this forces a repricing, shifts ' +
  'the balance, triggers a cascade).\n' +
  '3. (Optional) WATCH — the threshold or indicator to track, if clear from the context.\n\n' +
  'HARD CONSTRAINTS:\n' +
  '- Total length 40–70 words across 2–3 sentences.\n' +
  '- MUST reference at least ONE specific: named person / country / organization / ' +
-  'number / percentage / date / city — drawn from the context, NOT invented.\n' +
+  'number / percentage / date / city.\n' +
  '- No preamble ("This matters because…", "The importance of…").\n' +
  '- No markdown, no bullet points, no section labels in the output — plain prose.\n' +
-  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.';
+  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.\n\n' +
+  'RELEVANCE RULE (critical, read carefully):\n' +
+  '- The context block may contain facts from world-brief, country-brief, risk scores, ' +
+  'forecasts, macro signals, and market data. These are BACKGROUND — only cite what is ' +
+  "directly relevant to this story's category and country.\n" +
+  '- If NO context fact clearly fits, ground instead in a named actor, place, date, ' +
+  'or figure drawn from the headline or description. That is a VALID grounding — do ' +
+  'NOT invent a market reading, VIX value, or forecast probability to satisfy the rule.\n' +
+  '- NEVER drag an off-topic market metric, FX reading, or probability into a ' +
+  'humanitarian, aviation, diplomacy, or cyber story. A story about a refugee flow ' +
+  'does not need a VIX number; a story about a drone incursion does not need an FX ' +
+  "stress reading. If it isn't editorially connected to the story, leave it out.";

 /**
 * Parse + validate the analyst-path v2 LLM response. Accepts
--- a/server/worldmonitor/intelligence/v1/brief-why-matters-prompt.ts
+++ b/server/worldmonitor/intelligence/v1/brief-why-matters-prompt.ts
@@ -52,10 +52,31 @@ export function sanitizeStoryFields(story: StoryForPrompt): StoryForPrompt {
 // the LLM call latency stays under ~6s on typical provider responses.
 const CONTEXT_BUDGET_CHARS = 1700;

+// ── Category-gated context sections ──────────────────────────────────────
+//
+// Shadow-diff (2026-04-22) of 15 v2 pairs showed the LLM pattern-matching
+// the loudest numbers (VIX 19.50, top forecast probability, MidEast FX
+// stress) into every story — even humanitarian / Rwanda / aviation stories
+// with no editorial connection to markets. Root cause: the context block
+// passes ALL six bundles for every story, so the LLM has the market /
+// forecast numbers in-hand and the prompt's "cite a specific fact"
+// instruction does the rest.
+//
+// Fix: structurally exclude bundles the LLM cannot appropriately cite for
+// a given category. Humanitarian stories don't see market data; energy
+// stories don't see domestic risk scores; etc. The model physically cannot
+// cite what it wasn't given.
+//
+// Matching is case-insensitive substring on the story's category slug
+// (shared/brief-filter.js:134 — category is free-form like "Humanitarian
+// Crisis", "Geopolitical Risk", "Energy"). First match wins. Unknown →
+// DEFAULT_SECTIONS (all six — same as pre-gating behavior).
+type SectionKey = Exclude<keyof BriefStoryContext, 'degraded'>;
+
 // Per-section caps so no single heavy bundle (e.g. long worldBrief)
 // crowds out the others. Ordered by editorial importance: a single-
 // sentence summary benefits most from narrative + country framing.
-const SECTION_CAPS: Array<{ key: keyof BriefStoryContext; label: string; cap: number }> = [
+const SECTION_CAPS: Array<{ key: SectionKey; label: string; cap: number }> = [
  { key: 'worldBrief', label: 'World Brief', cap: 500 },
  { key: 'countryBrief', label: 'Country Brief', cap: 400 },
  { key: 'riskScores', label: 'Risk Scores', cap: 250 },
@@ -64,6 +85,90 @@ const SECTION_CAPS: Array<{ key: keyof BriefStoryContext; label: string; cap: nu
  { key: 'marketData', label: 'Market Data', cap: 200 },
 ];

+const DEFAULT_SECTIONS: SectionKey[] = [
+  'worldBrief',
+  'countryBrief',
+  'riskScores',
+  'forecasts',
+  'macroSignals',
+  'marketData',
+];
+
+// NOTE on regex shape: patterns use a LEADING `\b` (start-of-word
+// anchor) but NO TRAILING `\b`, so they match stems. "diplomac" must
+// match "Diplomacy" (not "Diplomatic", which lacks the "c"); "migrat"
+// must match "migration" and "migrating". A trailing `\b` here required
+// the stem to end on a non-word char and caused every inflected form to
+// fall through to the default policy (caught in unit tests 2026-04-22).
+const CATEGORY_SECTION_POLICY: Array<{ match: RegExp; sections: SectionKey[]; label: string }> = [
+  // Energy / commodity / markets / financial — forecasts + markets matter.
+  {
+    label: 'market',
+    match: /\b(energy|commodit|market|financ|trade|oil|gas|fuel)/i,
+    sections: ['worldBrief', 'countryBrief', 'forecasts', 'macroSignals', 'marketData'],
+  },
+  // Humanitarian / civil / social / rights — NO market, NO forecasts.
+  // This is the #1 source of the "77% FX stress dragged into a Rwanda
+  // story" pattern from the 2026-04-22 shadow review.
+  {
+    label: 'humanitarian',
+    match: /\b(humanitarian|refuge|civil|social|rights|genocid|aid\b|migrat)/i,
+    sections: ['worldBrief', 'countryBrief', 'riskScores'],
+  },
+  // Geopolitical risk / conflict / military / security — risk + forecasts
+  // but not market data (the LLM would otherwise tack on a VIX reading to
+  // every conflict story).
+  {
+    label: 'geopolitical',
+    match: /\b(geopolit|military|conflict|war\b|terror|securit|defen[cs]e|nuclear)/i,
+    sections: ['worldBrief', 'countryBrief', 'riskScores', 'forecasts'],
+  },
+  // Diplomacy / negotiations — risk + country framing, no market / macro.
+  {
+    label: 'diplomacy',
+    match: /\b(diplomac|negotia|summit|sanction)/i,
+    sections: ['worldBrief', 'countryBrief', 'riskScores'],
+  },
+  // Technology / cyber — world narrative + risk, not markets.
+  {
+    label: 'tech',
+    match: /\b(tech|cyber|a\.?i\b|artificial|algorith|autonom)/i,
+    sections: ['worldBrief', 'countryBrief', 'riskScores'],
+  },
+  // Aviation / airspace / drones — world narrative + risk, NO market /
+  // forecasts / macro. Named explicitly in the RELEVANCE RULE (shared/
+  // brief-llm-core.js WHY_MATTERS_ANALYST_SYSTEM_V2) — the prior revision
+  // of this file only had the prompt-level guard, so aviation categories
+  // still fell through to DEFAULT_SECTIONS and got all 6 bundles.
+  // Structural fix ensures the LLM physically cannot cite a forecast
+  // probability or VIX reading for an aviation story (PR #3281 review).
+  {
+    label: 'aviation',
+    match: /\b(aviation|airspace|flight\b|aircraft|plane\b|drone)/i,
+    sections: ['worldBrief', 'countryBrief', 'riskScores'],
+  },
+];
+
+/**
+ * Resolve which context sections are editorially relevant for a given
+ * story category. Exported for testability — the category → sections
+ * map is the main lever for tuning analyst output relevance.
+ *
+ * @param category — the story's category slug (free-form, from the cron
+ *   payload). `""` or unknown categories fall back to DEFAULT_SECTIONS.
+ */
+export function sectionsForCategory(category: string): {
+  sections: SectionKey[];
+  policyLabel: string;
+} {
+  if (typeof category === 'string' && category.length > 0) {
+    for (const { match, sections, label } of CATEGORY_SECTION_POLICY) {
+      if (match.test(category)) return { sections, policyLabel: label };
+    }
+  }
+  return { sections: DEFAULT_SECTIONS, policyLabel: 'default' };
+}
+
 function clip(s: string, cap: number): string {
  if (typeof s !== 'string' || s.length === 0) return '';
  if (s.length <= cap) return s;
@@ -71,15 +176,26 @@ function clip(s: string, cap: number): string {
 }

 /**
- * Assemble the compact context block. Skips empty sections. Respects
- * a total-chars budget so a bloated single section can't push the
- * prompt over its token limit.
+ * Assemble the compact context block, filtered to sections that
+ * editorially matter for the story's category. Skips empty sections.
+ * Respects a total-chars budget so a bloated single section can't push
+ * the prompt over its token limit.
+ *
+ * @param context — the full BriefStoryContext from assembleBriefStoryContext
+ * @param allowedSections — whitelist from sectionsForCategory(category).
+ *   When omitted, all sections allowed (pre-gating behavior — kept for
+ *   test backcompat).
 */
-export function buildContextBlock(context: BriefStoryContext): string {
+export function buildContextBlock(
+  context: BriefStoryContext,
+  allowedSections?: SectionKey[],
+): string {
  if (!context) return '';
+  const allow = allowedSections ? new Set<SectionKey>(allowedSections) : null;
  const parts: string[] = [];
  let used = 0;
  for (const { key, label, cap } of SECTION_CAPS) {
+    if (allow && !allow.has(key)) continue;
    const raw = context[key];
    if (typeof raw !== 'string' || raw.trim() === '') continue;
    const clipped = clip(raw, cap);
@@ -105,9 +221,10 @@ export function buildContextBlock(context: BriefStoryContext): string {
 export function buildAnalystWhyMattersPrompt(
  story: StoryForPrompt,
  context: BriefStoryContext,
-): { system: string; user: string } {
+): { system: string; user: string; policyLabel: string } {
  const safe = sanitizeStoryFields(story);
-  const contextBlock = buildContextBlock(context);
+  const { sections: allowedSections, policyLabel } = sectionsForCategory(safe.category);
+  const contextBlock = buildContextBlock(context, allowedSections);

  const storyLineList = [
    `Headline: ${safe.headline}`,
@@ -119,24 +236,39 @@ export function buildAnalystWhyMattersPrompt(
  ];
  const storyLines = storyLineList.join('\n');

-  const sections = [];
+  const parts: string[] = [];
  if (contextBlock) {
-    sections.push('# Live WorldMonitor Context', contextBlock);
+    parts.push('# Live WorldMonitor Context', contextBlock);
  }
-  sections.push('# Story', storyLines);
-  // Prompt footer matches the system prompt's SITUATION → ANALYSIS →
-  // (optional) WATCH arc, but explicitly restates the grounding
-  // requirement so the model can't ignore it from the system message
-  // alone. Models follow inline instructions more reliably than
-  // system-prompt constraints on longer outputs.
-  sections.push(
-    'Write 2–3 sentences (40–70 words) on why this story matters, grounded in at ' +
-      'least ONE specific actor / metric / date / place drawn from the context above. ' +
-      'Plain prose, no section labels in the output:',
+  parts.push('# Story', storyLines);
+  // Prompt footer restates the grounding requirement inline (models
+  // follow inline instructions more reliably than system-prompt
+  // constraints on longer outputs), and adds a relevance guardrail.
+  //
+  // Shadow review (2026-04-22, 15 v2 pairs) showed the analyst pattern-
+  // matching loud context numbers — VIX 19.50, top forecast probability,
+  // MidEast FX stress 77 — into humanitarian / aviation / Rwanda stories
+  // regardless of editorial fit. Structural category gating above strips
+  // the worst offenders (markets never reach humanitarian stories), but
+  // category overlap is imperfect: a "Security" story that regex-matches
+  // as geopolitical still gets forecasts, and not every forecast belongs.
+  //
+  // The guardrail below is the second layer: tell the model explicitly
+  // that the category framing it was handed limits what's worth citing,
+  // and that a named actor from the headline / description is a valid
+  // grounding target when no context fact is a clean fit.
+  parts.push(
+    `Write 2–3 sentences (40–70 words) on why this ${safe.category || 'story'} matters, grounded in at ` +
+      "least ONE specific reference. Prefer a fact drawn from the context block above WHEN it clearly " +
+      "relates to this story's category and country. If no context fact is a clean fit, ground " +
+      'instead in a named actor, place, date, or figure from the headline or description. ' +
+      'DO NOT force an off-topic market metric, VIX value, FX reading, or forecast probability ' +
+      "into a story where it does not belong. Plain prose, no section labels in the output:",
  );

  return {
    system: WHY_MATTERS_ANALYST_SYSTEM_V2,
-    user: sections.join('\n\n'),
+    user: parts.join('\n\n'),
+    policyLabel,
  };
 }
--- a/shared/brief-llm-core.js
+++ b/shared/brief-llm-core.js
@@ -144,17 +144,28 @@ export const WHY_MATTERS_ANALYST_SYSTEM_V2 =
  'on why the story matters.\n\n' +
  'STRUCTURE:\n' +
  '1. SITUATION — what is happening right now, grounded in a SPECIFIC named actor, ' +
-  'metric, date, or place drawn from the context.\n' +
+  'metric, date, or place relevant to this story.\n' +
  '2. ANALYSIS — the structural consequence (why this forces a repricing, shifts ' +
  'the balance, triggers a cascade).\n' +
  '3. (Optional) WATCH — the threshold or indicator to track, if clear from the context.\n\n' +
  'HARD CONSTRAINTS:\n' +
  '- Total length 40–70 words across 2–3 sentences.\n' +
  '- MUST reference at least ONE specific: named person / country / organization / ' +
-  'number / percentage / date / city — drawn from the context, NOT invented.\n' +
+  'number / percentage / date / city.\n' +
  '- No preamble ("This matters because…", "The importance of…").\n' +
  '- No markdown, no bullet points, no section labels in the output — plain prose.\n' +
-  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.';
+  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.\n\n' +
+  'RELEVANCE RULE (critical, read carefully):\n' +
+  '- The context block may contain facts from world-brief, country-brief, risk scores, ' +
+  'forecasts, macro signals, and market data. These are BACKGROUND — only cite what is ' +
+  "directly relevant to this story's category and country.\n" +
+  '- If NO context fact clearly fits, ground instead in a named actor, place, date, ' +
+  'or figure drawn from the headline or description. That is a VALID grounding — do ' +
+  'NOT invent a market reading, VIX value, or forecast probability to satisfy the rule.\n' +
+  '- NEVER drag an off-topic market metric, FX reading, or probability into a ' +
+  'humanitarian, aviation, diplomacy, or cyber story. A story about a refugee flow ' +
+  'does not need a VIX number; a story about a drone incursion does not need an FX ' +
+  "stress reading. If it isn't editorially connected to the story, leave it out.";

 /**
 * Parse + validate the analyst-path v2 LLM response. Accepts
--- a/tests/brief-why-matters-analyst.test.mjs
+++ b/tests/brief-why-matters-analyst.test.mjs
@@ -18,6 +18,7 @@ import { generateWhyMatters } from '../scripts/lib/brief-llm.mjs';
 import {
  hashBriefStory,
  parseWhyMatters,
+  parseWhyMattersV2,
  WHY_MATTERS_SYSTEM,
 } from '../shared/brief-llm-core.js';

@@ -127,6 +128,106 @@ describe('cache key identity', () => {
  });
 });

+// ── parseWhyMattersV2 — analyst-path output validator ───────────────────
+//
+// This is the only output-validation gate between the analyst LLM and
+// the cache envelope: if it returns null the whole response falls back
+// to the gemini layer. Its rejection rules differ from v1 (100–500
+// char range, multi-sentence preamble list, section-label check) and
+// were not previously covered by unit tests (greptile P2, PR #3281).
+
+describe('parseWhyMattersV2 — analyst output validator', () => {
+  const VALID_MULTI =
+    "Iran's closure of the Strait of Hormuz on April 21 halts roughly 20% of global seaborne oil. " +
+    'The disruption forces an immediate repricing of sovereign risk across Gulf energy exporters.';
+
+  it('accepts a valid 2-sentence, ~40–70 word output', () => {
+    const out = parseWhyMattersV2(VALID_MULTI);
+    assert.equal(out, VALID_MULTI);
+  });
+
+  it('accepts a valid 3-sentence output with optional WATCH arc', () => {
+    const three =
+      "Iran's closure of the Strait of Hormuz on April 21 halts roughly 20% of global seaborne oil. " +
+      'The disruption forces an immediate repricing of sovereign risk across Gulf energy exporters. ' +
+      'Watch IMF commentary in the next 48 hours for cascading guidance.';
+    assert.equal(parseWhyMattersV2(three), three);
+  });
+
+  it('rejects output under the 100-char minimum (distinguishes it from v1)', () => {
+    // v1 accepts short outputs; v2 requires 100+ chars so the model has
+    // room for SITUATION + ANALYSIS. A short string is "too terse".
+    assert.equal(parseWhyMattersV2('Short sentence under 100 chars.'), null);
+    assert.equal(parseWhyMattersV2('x'.repeat(99)), null);
+    // Boundary: exactly 100 passes.
+    assert.equal(typeof parseWhyMattersV2('x'.repeat(100)), 'string');
+  });
+
+  it('rejects output over the 500-char cap (prevents runaway essays)', () => {
+    assert.equal(parseWhyMattersV2('x'.repeat(501)), null);
+    // Boundary: exactly 500 passes.
+    assert.equal(typeof parseWhyMattersV2('x'.repeat(500)), 'string');
+  });
+
+  it('rejects banned preamble phrases (v2-specific)', () => {
+    for (const preamble of [
+      'This matters because the Strait of Hormuz closure would halt 20% of global oil supply right now and this is very important for analysts.',
+      'The importance of this event is that oil tankers cannot transit the strait, which forces a global supply rerouting and price shock.',
+      'It is important to note that Iran has blockaded a critical global shipping chokepoint with real consequences for supply.',
+      'Importantly, the closure of the Strait of Hormuz disrupts roughly 20% of global seaborne oil flows starting April 21.',
+      'In summary, the analyst sees this as a major geopolitical escalation with wide-reaching market and security implications.',
+      'To summarize, the blockade represents a sharp departure from the prior six months of relative calm in the Persian Gulf region.',
+    ]) {
+      assert.equal(parseWhyMattersV2(preamble), null, `should reject preamble: "${preamble.slice(0, 40)}..."`);
+    }
+  });
+
+  it('rejects section-label leaks (SITUATION/ANALYSIS/WATCH prefixes)', () => {
+    for (const leak of [
+      'SITUATION: Iran has closed the Strait of Hormuz effective April 21, halting roughly 20% of seaborne global oil supply today.',
+      'ANALYSIS — the disruption forces an immediate global sovereign risk repricing across Gulf exporters including Saudi Arabia and UAE.',
+      'Watch: IMF commentary for the next 48 hours should give the earliest signal on the cascading global guidance implications.',
+    ]) {
+      assert.equal(parseWhyMattersV2(leak), null, `should reject label leak: "${leak.slice(0, 40)}..."`);
+    }
+  });
+
+  it('rejects markdown leakage (bullets, headers, numbered lists)', () => {
+    for (const md of [
+      '# The closure of the Strait of Hormuz is the single most material geopolitical event of the quarter for sovereign credit.',
+      '- Iran has blockaded the Strait of Hormuz, halting roughly 20% of the world seaborne oil on April 21 effective immediately.',
+      '* The closure of the Strait of Hormuz halts roughly 20% of the world seaborne oil, which forces an immediate price shock today.',
+      '1. The closure of the Strait of Hormuz halts roughly 20% of seaborne global oil, which forces an immediate sovereign risk repricing.',
+    ]) {
+      assert.equal(parseWhyMattersV2(md), null, `should reject markdown: "${md.slice(0, 40)}..."`);
+    }
+  });
+
+  it('rejects the stub echo (same as v1)', () => {
+    const stub =
+      'Story flagged by your sensitivity settings — the analyst could not find a clean grounding fact and returned the pre-canned fallback.';
+    assert.equal(parseWhyMattersV2(stub), null);
+  });
+
+  it('trims surrounding quote marks the model sometimes wraps output in', () => {
+    const quoted = `"${VALID_MULTI}"`;
+    assert.equal(parseWhyMattersV2(quoted), VALID_MULTI);
+    const smart = `\u201C${VALID_MULTI}\u201D`;
+    assert.equal(parseWhyMattersV2(smart), VALID_MULTI);
+  });
+
+  it('rejects non-string inputs (defensive)', () => {
+    for (const v of [null, undefined, 123, {}, [], true]) {
+      assert.equal(parseWhyMattersV2(v), null, `should reject ${typeof v}`);
+    }
+  });
+
+  it('rejects whitespace-only strings', () => {
+    assert.equal(parseWhyMattersV2(''), null);
+    assert.equal(parseWhyMattersV2('   \n\t  '), null);
+  });
+});
+
 // ── Deterministic shadow sampling ────────────────────────────────────────

 describe('shadow sample deterministic hashing', () => {
@@ -296,7 +397,12 @@ describe('endpoint validation contract', () => {
  // test regression on the endpoint flow (see "endpoint end-to-end" below).
  const VALID_THREAT = new Set(['critical', 'high', 'medium', 'low']);
  const CAPS = { headline: 400, source: 120, category: 80, country: 80 };
-  const MAX_BODY_BYTES = 4096;
+  // Must match `api/internal/brief-why-matters.ts:116` — bumped to 8192 in
+  // PR #3269 to accommodate v2 output + description. If this ever drifts
+  // again, the bloated-fixture assertion below silently passes for
+  // payloads in the (OLD_VALUE, NEW_VALUE] range that the real endpoint
+  // now accepts (greptile P2, PR #3281).
+  const MAX_BODY_BYTES = 8192;

  function validate(raw) {
    if (!raw || typeof raw !== 'object') return { ok: false, msg: 'body' };
@@ -364,7 +470,9 @@ describe('endpoint validation contract', () => {
        ...story(),
        // Artificial oversize payload — would need headline cap bypassed
        // to reach in practice, but the total body-byte cap must still fire.
-        extra: 'x'.repeat(5000),
+        // Sized well above MAX_BODY_BYTES (8192) so a future bump doesn't
+        // silently invalidate the assertion.
+        extra: 'x'.repeat(10_000),
      },
    };
    assert.ok(measureBytes(bloated) > MAX_BODY_BYTES, 'fixture is oversize');
@@ -480,6 +588,210 @@ describe('buildAnalystWhyMattersPrompt — shape and budget', () => {
  });
 });

+// ── Category-gated context (2026-04-22 formulaic-grounding fix) ──────
+//
+// Shadow-diff of 15 v2 pairs showed the analyst pattern-matching loud
+// context numbers (VIX, top forecast probability, MidEast FX stress)
+// into every story regardless of editorial fit. The structural fix is
+// to only feed editorially-relevant context bundles per category; the
+// prompt-level RELEVANCE RULE is a second-layer guard.
+//
+// These tests pin the category → sections map so a future "loosen this
+// one little thing" edit can't silently re-introduce market metrics
+// into humanitarian stories.
+
+describe('sectionsForCategory — structural relevance gating', () => {
+  let sectionsForCategory;
+  let builder;
+  it('loads', async () => {
+    const mod = await import('../server/worldmonitor/intelligence/v1/brief-why-matters-prompt.ts');
+    sectionsForCategory = mod.sectionsForCategory;
+    builder = mod.buildAnalystWhyMattersPrompt;
+    assert.ok(typeof sectionsForCategory === 'function');
+  });
+
+  it('market/commodity/finance → includes marketData + forecasts, excludes riskScores', () => {
+    for (const cat of ['Energy', 'Commodity Squeeze', 'Market Activity', 'Financial Stress', 'Oil Markets', 'Trade Policy']) {
+      const { sections, policyLabel } = sectionsForCategory(cat);
+      assert.equal(policyLabel, 'market', `${cat} should match market policy`);
+      assert.ok(sections.includes('marketData'), `${cat} should include marketData`);
+      assert.ok(sections.includes('forecasts'), `${cat} should include forecasts`);
+      assert.ok(sections.includes('macroSignals'), `${cat} should include macroSignals`);
+      assert.ok(!sections.includes('riskScores'), `${cat} should NOT include riskScores`);
+    }
+  });
+
+  it('humanitarian → excludes marketData AND forecasts (the #1 drift pattern)', () => {
+    for (const cat of ['Humanitarian Crisis', 'Refugee Flow', 'Civil Unrest', 'Social Upheaval', 'Rights Violation', 'Aid Delivery', 'Migration']) {
+      const { sections, policyLabel } = sectionsForCategory(cat);
+      assert.equal(policyLabel, 'humanitarian', `${cat} should match humanitarian policy`);
+      assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
+      assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
+      assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
+      assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
+    }
+  });
+
+  it('geopolitical → includes forecasts + riskScores, excludes marketData', () => {
+    for (const cat of ['Geopolitical Risk', 'Military Posture', 'Conflict', 'War', 'Terrorism', 'Security', 'Nuclear Policy', 'Defense']) {
+      const { sections, policyLabel } = sectionsForCategory(cat);
+      assert.equal(policyLabel, 'geopolitical', `${cat} should match geopolitical policy`);
+      assert.ok(sections.includes('forecasts'), `${cat} should include forecasts`);
+      assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
+      assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
+      assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
+    }
+  });
+
+  it('diplomacy → riskScores only, no markets/forecasts', () => {
+    for (const cat of ['Diplomacy', 'Negotiations', 'Summit Meetings', 'Sanctions']) {
+      const { sections, policyLabel } = sectionsForCategory(cat);
+      assert.equal(policyLabel, 'diplomacy', `${cat} should match diplomacy policy`);
+      assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
+      assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
+      assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
+    }
+  });
+
+  it('tech → riskScores only, no markets/forecasts/macro', () => {
+    for (const cat of ['Tech Policy', 'Cyber Attack', 'AI Regulation', 'Artificial Intelligence', 'Algorithm Abuse', 'Autonomous Systems']) {
+      const { sections, policyLabel } = sectionsForCategory(cat);
+      assert.equal(policyLabel, 'tech', `${cat} should match tech policy`);
+      assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
+      const leaked = ['marketData', 'forecasts', 'macroSignals'].filter((k) => sections.includes(k));
+      assert.deepEqual(leaked, [], `${cat} must NOT include marketData, forecasts, or macroSignals`);
+    }
+  });
+
+  it('aviation / airspace / drone → riskScores only, NO markets/forecasts/macro (PR #3281 review fix)', () => {
+    // Reviewer caught that aviation was named in the RELEVANCE RULE as a
+    // category banned from off-topic metrics, but had no structural
+    // regex entry — so "Aviation Incident" / "Airspace Closure" / etc.
+    // fell through to DEFAULT_SECTIONS and still got all 6 bundles
+    // including marketData + forecasts + macroSignals. Direct repro
+    // test so a future regex rewrite can't silently regress.
+    for (const cat of ['Aviation Incident', 'Airspace Closure', 'Plane Crash', 'Flight Disruption', 'Drone Incursion', 'Aircraft Shot Down']) {
+      const { sections, policyLabel } = sectionsForCategory(cat);
+      assert.equal(policyLabel, 'aviation', `${cat} should match aviation policy`);
+      assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
+      assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
+      assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
+      assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
+    }
+  });
+
+  it('unknown / empty category → default (all 6 sections, backcompat)', () => {
+    for (const cat of ['', 'General', 'Sports Event', 'Unknown Thing']) {
+      const { sections, policyLabel } = sectionsForCategory(cat);
+      assert.equal(policyLabel, 'default', `"${cat}" should fall through to default`);
+      // Default must include everything — prevents a regression where
+      // a refactor accidentally empties the default.
+      for (const k of ['worldBrief', 'countryBrief', 'riskScores', 'forecasts', 'macroSignals', 'marketData']) {
+        assert.ok(sections.includes(k), `default policy should include ${k}`);
+      }
+    }
+  });
+
+  it('RELEVANCE RULE categories have structural coverage (no prompt-only guards)', () => {
+    // Meta-invariant: every category named in the system prompt's
+    // RELEVANCE RULE as banned-from-off-topic-metrics MUST have a
+    // matching policy entry. A prompt-only guard is too soft — models
+    // follow inline instructions imperfectly. If someone adds a new
+    // category to the prompt, this test fires until they add a regex.
+    for (const cat of ['Humanitarian Crisis', 'Aviation Incident', 'Diplomatic Summit', 'Cyber Attack']) {
+      const { policyLabel } = sectionsForCategory(cat);
+      assert.notEqual(
+        policyLabel,
+        'default',
+        `"${cat}" is named in WHY_MATTERS_ANALYST_SYSTEM_V2 as banned from market metrics — it must have a structural policy, not fall through to default`,
+      );
+    }
+  });
+
+  it('non-string / null / undefined category → default fallback (defensive)', () => {
+    for (const cat of [null, undefined, 123, {}, []]) {
+      const { policyLabel } = sectionsForCategory(cat);
+      assert.equal(policyLabel, 'default', `non-string ${JSON.stringify(cat)} should fall through to default`);
+    }
+  });
+
+  it('buildAnalystWhyMattersPrompt — humanitarian story must not see marketData or forecasts', () => {
+    const humanitarian = {
+      headline: 'Rwanda hosts fresh Congolese refugees',
+      source: 'UNHCR',
+      threatLevel: 'high',
+      category: 'Humanitarian Crisis',
+      country: 'RW',
+    };
+    const fullContext = {
+      worldBrief: 'Global migration pressure is at a decade high.',
+      countryBrief: 'Rwanda has absorbed 100K refugees this quarter.',
+      riskScores: 'Risk index 62/100 (elevated).',
+      forecasts: 'Top forecast: Congo ceasefire holds (72% by Q3).',
+      // Use distinctive values that would never appear in the guardrail
+      // text — the guardrail mentions "VIX value" / "FX reading" in the
+      // abstract, so we assert on the concrete numeric fingerprint.
+      marketData: 'VIX-READING-19-50. EUR/USD 1.0732. Gold $2,380.',
+      macroSignals: 'MidEastFxStressSentinel-77.',
+      degraded: false,
+    };
+    const { user, policyLabel } = builder(humanitarian, fullContext);
+    assert.equal(policyLabel, 'humanitarian');
+    // Structural guarantee: the distinctive context values physically
+    // cannot appear in the prompt because we didn't pass them to the LLM.
+    assert.doesNotMatch(user, /VIX-READING-19-50/, 'humanitarian prompt must not include marketData sentinel');
+    assert.doesNotMatch(user, /EUR\/USD/, 'humanitarian prompt must not include FX pair');
+    assert.doesNotMatch(user, /Top forecast/, 'humanitarian prompt must not include forecasts');
+    assert.doesNotMatch(user, /MidEastFxStressSentinel/, 'humanitarian prompt must not include macro signals');
+    assert.doesNotMatch(user, /## Market Data/, 'humanitarian prompt must not have a Market Data section heading');
+    assert.doesNotMatch(user, /## Forecasts/, 'humanitarian prompt must not have a Forecasts section heading');
+    assert.doesNotMatch(user, /## Macro Signals/, 'humanitarian prompt must not have a Macro Signals section heading');
+    // But country + risk framing must survive.
+    assert.match(user, /Rwanda has absorbed/);
+    assert.match(user, /Risk index/);
+  });
+
+  it('buildAnalystWhyMattersPrompt — market story DOES see marketData', () => {
+    const marketStory = {
+      headline: 'Crude oil jumps 4% on Houthi tanker strike',
+      source: 'FT',
+      threatLevel: 'high',
+      category: 'Energy',
+      country: 'YE',
+    };
+    const ctx = {
+      worldBrief: 'Red Sea shipping activity down 35% YoY.',
+      countryBrief: 'Yemen remains active conflict zone.',
+      riskScores: 'Risk index 88/100.',
+      forecasts: 'Top forecast: Houthi attacks continue (83%).',
+      marketData: 'Brent $87.40. VIX 19.50. USD/SAR flat.',
+      macroSignals: 'Shipping-stress index at 3-month high.',
+      degraded: false,
+    };
+    const { user, policyLabel } = builder(marketStory, ctx);
+    assert.equal(policyLabel, 'market');
+    assert.match(user, /Brent/);
+    assert.match(user, /Shipping-stress/);
+    assert.match(user, /Top forecast/);
+    // Market policy excludes riskScores; otherwise the LLM would tack
+    // a "country risk 88/100" line onto every commodity story.
+    assert.doesNotMatch(user, /Risk index 88/);
+  });
+
+  it('buildAnalystWhyMattersPrompt — prompt footer includes relevance guardrail', () => {
+    const { user } = builder(
+      { headline: 'X', source: 'Y', threatLevel: 'low', category: 'General', country: 'US' },
+      { worldBrief: '', countryBrief: '', riskScores: '', forecasts: '', marketData: '', macroSignals: '', degraded: false },
+    );
+    // Guardrail phrases — if any of these drops out, the prompt-level
+    // second-layer guard is broken and we're back to the formulaic v5
+    // behavior for any story that still hits the default policy.
+    assert.match(user, /DO NOT force/i, 'guardrail phrase "DO NOT force" must be in footer');
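+    assert.match(user, /off-topic market metric|VIX|forecast probability/i, 'footer must mention an off-topic metric ("off-topic market metric", VIX, or forecast probability)');
+    assert.match(user, /named actor, place, date, or figure/, 'guardrail phrase "named actor, place, date, or figure" must be in footer');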
+  });
+});
+
 // ── Env flag parsing (endpoint config resolution) ─────────────────────

 describe('endpoint env flag parsing', () => {