fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding (#3281)

* fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding

Shadow-diff of 15 v2 pairs (2026-04-22) showed the analyst pattern-
matching the loudest context numbers — VIX 19.50, top forecast
probability, MidEast FX stress 77 — into every story regardless of
editorial fit. A Rwanda humanitarian story about refugees cited VIX;
an aviation story cited a forecast probability.

Root cause: every story got the same 6-bundle context block, so the
LLM had markets / forecasts / macro in-hand and the "cite a specific
fact" instruction did the rest.

Two-layer fix:

  1. STRUCTURAL — sectionsForCategory() maps the story's category to
     an editorially-relevant subset of bundles. Humanitarian stories
     don't see marketData / forecasts / macroSignals; diplomacy keeps
     riskScores but likewise drops those three; market/energy gets
     forecasts + macroSignals + marketData but drops riskScores. The
     model physically cannot cite what it wasn't given. Unknown
     categories fall back to all six (backcompat).

  2. PROMPT — WHY_MATTERS_ANALYST_SYSTEM_V2 adds a RELEVANCE RULE
     that explicitly permits grounding in headline/description
     actors when no context fact is a clean fit, and bans dragging
     off-topic market metrics into humanitarian/aviation/diplomacy
     stories. The prompt footer (inline, per-call) restates the
     same guardrail — models follow inline instructions more
     reliably than system-prompt constraints on longer outputs.
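
The structural layer can be sketched as a first-match-wins lookup. This is a minimal illustration rather than the shipped module: the names `POLICY_SKETCH` and `gateSections` are hypothetical, and only two of the policies are reproduced.

```typescript
type SectionKey =
  | 'worldBrief' | 'countryBrief' | 'riskScores'
  | 'forecasts' | 'macroSignals' | 'marketData';

const ALL_SECTIONS: SectionKey[] = [
  'worldBrief', 'countryBrief', 'riskScores',
  'forecasts', 'macroSignals', 'marketData',
];

// Hypothetical two-entry subset of the policy table; first match wins.
const POLICY_SKETCH: Array<{ label: string; match: RegExp; sections: SectionKey[] }> = [
  {
    label: 'market',
    match: /\b(energy|commodit|market|financ|trade|oil|gas|fuel)/i,
    sections: ['worldBrief', 'countryBrief', 'forecasts', 'macroSignals', 'marketData'],
  },
  {
    label: 'humanitarian',
    match: /\b(humanitarian|refuge|civil|social|rights|genocid|aid\b|migrat)/i,
    sections: ['worldBrief', 'countryBrief', 'riskScores'],
  },
];

// Returns the allowed bundles for a category, or all six when nothing matches.
function gateSections(category: string): { policyLabel: string; sections: SectionKey[] } {
  for (const { label, match, sections } of POLICY_SKETCH) {
    if (match.test(category)) return { policyLabel: label, sections };
  }
  return { policyLabel: 'default', sections: ALL_SECTIONS };
}
```

With this table, `gateSections('Refugee Flow')` resolves to the humanitarian policy, so `marketData` never reaches the prompt, while an unmatched category such as `'Sports Event'` falls back to all six bundles.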

Cache keys bumped to invalidate the formulaic v5 output: endpoint
v5 to v6, shadow v3 to v4. Adds 11 unit tests pinning the 5
policies + default fallback + humanitarian structural guarantee +
market policy does-see-markets + guardrail footer presence.
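
Why a version bump acts as an invalidation can be sketched as follows. The key formats are taken from this change; the helper name `whyMattersKeys` and its parameters are hypothetical.

```typescript
// Hypothetical helper: the version segment is part of the key, so entries
// written under an older version are simply never looked up again.
function whyMattersKeys(
  hash: string,
  endpointVersion: string,
  shadowVersion: string,
): { cacheKey: string; shadowKey: string } {
  return {
    cacheKey: `brief:llm:whymatters:${endpointVersion}:${hash}`,
    shadowKey: `brief:llm:whymatters:shadow:${shadowVersion}:${hash}`,
  };
}

const oldKeys = whyMattersKeys('abc123', 'v5', 'v3');
const newKeys = whyMattersKeys('abc123', 'v6', 'v4');
```

The same story hash yields different keys under v5 and v6, so the formulaic v5 prose is left to expire instead of being served.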

Observability: endpoint now logs policyLabel per call so operators
can confirm in Vercel logs that humanitarian/aviation stories are
NOT seeing marketData without dumping the full prompt.

* test(brief): address greptile P2 — sync MAX_BODY_BYTES + add parseWhyMattersV2 coverage

Greptile PR #3281 review raised two P2 test-quality issues:

1. Test-side MAX_BODY_BYTES mirror was still 4096 — the endpoint
   was bumped to 8192 in PR #3269 (v2 output + description). With
   the stale constant, a payload in the 4097–8192 range was
   accepted by the real endpoint but looked oversize in the test
   mirror, letting the body-cap invariant silently drift. Fixed
   by syncing to 8192 + bumping the bloated fixture to 10_000
   bytes so a future endpoint-cap bump doesn't silently
   re-invalidate the assertion.

2. parseWhyMattersV2 (the only output-validation gate on the
   analyst path) had no dedicated unit tests. Adds 11 targeted
   cases covering: valid 2 and 3 sentence output, 100/500 char
   bounds (incl. boundary assertions), all 6 banned preamble
   phrases, section-label leaks (SITUATION/ANALYSIS/Watch),
   markdown leakage (#, -, *, 1.), stub echo rejection, smart/
   plain quote stripping, non-string defensive branch, and
   whitespace-only strings.
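
The body-cap drift in item 1 can be reproduced with a small sketch. `measureBytesSketch` is an assumption about how the byte count is taken (UTF-8 length of the JSON-serialized body); the real helper is not shown in this message, and the payload shape is illustrative.

```typescript
const STALE_CAP = 4096; // test-side mirror before the sync
const REAL_CAP = 8192;  // endpoint cap since PR #3269

// Assumed measurement: UTF-8 byte length of the serialized payload.
function measureBytesSketch(payload: unknown): number {
  return new TextEncoder().encode(JSON.stringify(payload)).length;
}

// Builds a filler string of n 'x' characters.
function filler(n: number): string {
  return new Array(n + 1).join('x');
}

const midRange = { story: { extra: filler(5000) } };   // old fixture size
const wellOver = { story: { extra: filler(10000) } };  // new fixture size

// Oversize under the stale mirror, yet within the real endpoint cap.
const midRangeDrift =
  measureBytesSketch(midRange) > STALE_CAP && measureBytesSketch(midRange) <= REAL_CAP;

// Oversize under the real cap, with headroom for a later bump.
const wellOverRejected = measureBytesSketch(wellOver) > REAL_CAP;
```

The 5000-character fixture sits in the gap between the two caps, which is why it stops being a valid oversize fixture once the mirror is synced to 8192.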

Suite size: 50 to 61 tests, all green.
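
The rejection rules listed in item 2 could be sketched roughly as below. This is not `parseWhyMattersV2` itself, whose source is not shown in this message: `sketchValidate` is a hypothetical stand-in that covers only the non-string branch, quote stripping, length bounds, markdown prefixes, and section labels, and it omits the preamble and stub-echo checks.

```typescript
// Hypothetical stand-in for the analyst output validator.
function sketchValidate(raw: unknown): string | null {
  if (typeof raw !== 'string') return null;
  // Strip surrounding plain or smart double quotes, then whitespace.
  const text = raw.trim().replace(/^["\u201C]+|["\u201D]+$/g, '').trim();
  // Length bounds from the commit message: 100 to 500 characters inclusive.
  if (text.length < 100 || text.length > 500) return null;
  // Markdown leakage: '#', '-', '*', or '1.' style prefixes.
  if (/^(#|-|\*|\d+\.)\s/.test(text)) return null;
  // Section-label leak such as "SITUATION:" or "Watch:".
  if (/^(situation|analysis|watch)\s*[:\u2014-]/i.test(text)) return null;
  return text;
}
```

Returning `null` for every rejection keeps the caller's fallback decision to a single check.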

* fix(brief): add aviation policy to sectionsForCategory (PR #3281 review P1)

Reviewer caught that aviation was named in WHY_MATTERS_ANALYST_SYSTEM_V2's
RELEVANCE RULE as a category banned from off-topic market metrics, but
had no matching regex entry in CATEGORY_SECTION_POLICY. So 'Aviation
Incident' / 'Airspace Closure' / 'Plane Crash' / 'Drone Incursion' all
fell through to DEFAULT_SECTIONS and still got all 6 bundles including
marketData, forecasts, and macroSignals — exactly the VIX / forecast
probability pattern the PR claimed to structurally prevent.

Reproduced on HEAD before fix:
  Aviation Incident -> default
  Airspace Closure  -> default
  Plane Crash       -> default
  ...etc.
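
The fall-through is easy to confirm in isolation. The sketch below copies the five pre-fix patterns and the new aviation pattern from this change; the wrapper `labelFor` and the table names are hypothetical.

```typescript
type Policy = { label: string; match: RegExp };

// The five policies that existed before the aviation entry was added.
const PRE_FIX: Policy[] = [
  { label: 'market', match: /\b(energy|commodit|market|financ|trade|oil|gas|fuel)/i },
  { label: 'humanitarian', match: /\b(humanitarian|refuge|civil|social|rights|genocid|aid\b|migrat)/i },
  { label: 'geopolitical', match: /\b(geopolit|military|conflict|war\b|terror|securit|defen[cs]e|nuclear)/i },
  { label: 'diplomacy', match: /\b(diplomac|negotia|summit|sanction)/i },
  { label: 'tech', match: /\b(tech|cyber|a\.?i\b|artificial|algorith|autonom)/i },
];

// Same table with the aviation policy appended.
const POST_FIX: Policy[] = PRE_FIX.concat([
  { label: 'aviation', match: /\b(aviation|airspace|flight\b|aircraft|plane\b|drone)/i },
]);

// First match wins; no match means the default (all six bundles) policy.
function labelFor(category: string, table: Policy[]): string {
  for (const { label, match } of table) {
    if (match.test(category)) return label;
  }
  return 'default';
}

const cases = ['Aviation Incident', 'Airspace Closure', 'Plane Crash', 'Drone Incursion'];
const beforeFix = cases.map((c) => labelFor(c, PRE_FIX));
const afterFix = cases.map((c) => labelFor(c, POST_FIX));
```

None of the four categories match any pre-fix pattern, so each resolves to `default` until the aviation entry is present.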

Fix:
  1. Adds aviation policy (same 3 bundles as humanitarian/diplomacy/
     tech: worldBrief, countryBrief, riskScores).
  2. Adds dedicated aviation-gating test with 6 category variants.
  3. Adds meta-invariant test: every category named in the system
     prompt's RELEVANCE RULE MUST have a structural policy entry,
     asserting policyLabel !== 'default'. If someone adds a new
     category name to the prompt in the future, this test fires
     until they wire up a regex — prevents soft-guard drift.
  4. Removes 'Aviation Incident' from the default-fall-through test
     list (it now correctly matches aviation).
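
The meta-invariant in item 3 can be sketched as follows. The rule sentence is quoted from WHY_MATTERS_ANALYST_SYSTEM_V2 and the patterns from CATEGORY_SECTION_POLICY; the parsing helper `categoriesNamedInRule` and the stubbed `policyLabelFor` are hypothetical, and the real test may extract the names differently.

```typescript
const RULE_SENTENCE =
  'NEVER drag an off-topic market metric, FX reading, or probability into a ' +
  'humanitarian, aviation, diplomacy, or cyber story.';

// Pulls "humanitarian, aviation, diplomacy, or cyber" out of the sentence.
function categoriesNamedInRule(rule: string): string[] {
  const m = /into a (.+?) story/.exec(rule);
  const list = m ? m[1] : undefined;
  if (!list) return [];
  return list
    .split(/,\s*(?:or\s+)?/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Stub of the structural lookup, limited to the policies the rule names.
const RULE_POLICIES: Array<{ label: string; match: RegExp }> = [
  { label: 'humanitarian', match: /\b(humanitarian|refuge|civil|social|rights|genocid|aid\b|migrat)/i },
  { label: 'diplomacy', match: /\b(diplomac|negotia|summit|sanction)/i },
  { label: 'tech', match: /\b(tech|cyber|a\.?i\b|artificial|algorith|autonom)/i },
  { label: 'aviation', match: /\b(aviation|airspace|flight\b|aircraft|plane\b|drone)/i },
];

function policyLabelFor(category: string): string {
  for (const { label, match } of RULE_POLICIES) {
    if (match.test(category)) return label;
  }
  return 'default';
}

// Any name listed here is guarded by the prompt only, not structurally.
const unguarded = categoriesNamedInRule(RULE_SENTENCE).filter(
  (name) => policyLabelFor(name) === 'default',
);
```

An empty `unguarded` list is the passing state; adding a category name to the rule without a matching regex would make it non-empty.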

No cache bump needed: v6 was published to the feature branch only a
few minutes ago, so no production entries have been written yet.
Elie Habib
2026-04-22 08:21:01 +04:00
committed by GitHub
parent fbaf07e106
commit 425507d15a
5 changed files with 518 additions and 39 deletions


@@ -215,7 +215,13 @@ function validateStoryBody(raw: unknown): ValidationOk | ValidationErr {
async function runAnalystPath(story: StoryPayload, iso2: string | null): Promise<string | null> {
try {
const context = await assembleBriefStoryContext({ iso2, category: story.category });
- const { system, user } = buildAnalystWhyMattersPrompt(story, context);
+ const { system, user, policyLabel } = buildAnalystWhyMattersPrompt(story, context);
+ // One line per call so we can verify in Vercel logs that humanitarian
+ // / aviation stories are NOT seeing marketData, without dumping the
+ // full prompt (which would include upstream-provided text).
+ console.log(
+ `[brief-why-matters] analyst gate policy=${policyLabel} category="${story.category}" promptLen=${user.length}`,
+ );
const result = await callLlmReasoning({
messages: [
{ role: 'system', content: system },
@@ -362,16 +368,23 @@ export default async function handler(req: Request, ctx?: EdgeContext): Promise<
// Cache identity.
const hash = await hashBriefStory(story);
- // v5: `hashBriefStory` now includes `description` as a prompt input
- // so same-story + different description no longer collide on a single
- // cache entry (P1 caught in PR #3269 review — endpoint could serve
- // prose grounded in a PREVIOUS caller's description). Bumping v4→v5
- // invalidates the short-lived v4 entries written under the buggy
- // 5-field hash so fresh output lands on the next cron tick.
- const cacheKey = `brief:llm:whymatters:v5:${hash}`;
- // Shadow v2→v3 for the same reason — any v2 comparison pairs may be
- // grounded in the wrong description, so the A/B was noisy.
- const shadowKey = `brief:llm:whymatters:shadow:v3:${hash}`;
+ // v6: category-gated context + prompt-level RELEVANCE RULE (2026-04-22).
+ // Shadow review of 15 v2 pairs showed the analyst pattern-matching the
+ // loudest context numbers (VIX, forecast probabilities, FX stress) into
+ // every story regardless of editorial fit. Fix ships two layers:
+ // 1. structural — buildContextBlock now only exposes sections that are
+ // editorially relevant to the story's category (humanitarian stories
+ // don't see market data, aviation doesn't see macro, etc.).
+ // 2. prompt — WHY_MATTERS_ANALYST_SYSTEM_V2 adds a RELEVANCE RULE that
+ // explicitly permits grounding in headline/description actors when
+ // no context fact is a clean fit.
+ // Either layer changes the output distribution enough that v5 prose must
+ // be invalidated — otherwise half the tick's stories would still return
+ // the formulaic v5 strings for up to 24h until TTL.
+ const cacheKey = `brief:llm:whymatters:v6:${hash}`;
+ // Shadow v3→v4 for the same reason — a mid-rollout shadow record
+ // comparing v5-analyst vs gemini is not useful once v6 is live.
+ const shadowKey = `brief:llm:whymatters:shadow:v4:${hash}`;
// Cache read. Any infrastructure failure → treat as miss (logged).
let cached: WhyMattersEnvelope | null = null;


@@ -144,17 +144,28 @@ export const WHY_MATTERS_ANALYST_SYSTEM_V2 =
'on why the story matters.\n\n' +
'STRUCTURE:\n' +
'1. SITUATION — what is happening right now, grounded in a SPECIFIC named actor, ' +
- 'metric, date, or place drawn from the context.\n' +
+ 'metric, date, or place relevant to this story.\n' +
'2. ANALYSIS — the structural consequence (why this forces a repricing, shifts ' +
'the balance, triggers a cascade).\n' +
'3. (Optional) WATCH — the threshold or indicator to track, if clear from the context.\n\n' +
'HARD CONSTRAINTS:\n' +
'- Total length 40–70 words across 2–3 sentences.\n' +
'- MUST reference at least ONE specific: named person / country / organization / ' +
- 'number / percentage / date / city — drawn from the context, NOT invented.\n' +
+ 'number / percentage / date / city.\n' +
'- No preamble ("This matters because…", "The importance of…").\n' +
'- No markdown, no bullet points, no section labels in the output — plain prose.\n' +
- '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.';
+ '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.\n\n' +
+ 'RELEVANCE RULE (critical, read carefully):\n' +
+ '- The context block may contain facts from world-brief, country-brief, risk scores, ' +
+ 'forecasts, macro signals, and market data. These are BACKGROUND — only cite what is ' +
+ "directly relevant to this story's category and country.\n" +
+ '- If NO context fact clearly fits, ground instead in a named actor, place, date, ' +
+ 'or figure drawn from the headline or description. That is a VALID grounding — do ' +
+ 'NOT invent a market reading, VIX value, or forecast probability to satisfy the rule.\n' +
+ '- NEVER drag an off-topic market metric, FX reading, or probability into a ' +
+ 'humanitarian, aviation, diplomacy, or cyber story. A story about a refugee flow ' +
+ 'does not need a VIX number; a story about a drone incursion does not need an FX ' +
+ "stress reading. If it isn't editorially connected to the story, leave it out.";
/**
* Parse + validate the analyst-path v2 LLM response. Accepts


@@ -52,10 +52,31 @@ export function sanitizeStoryFields(story: StoryForPrompt): StoryForPrompt {
// the LLM call latency stays under ~6s on typical provider responses.
const CONTEXT_BUDGET_CHARS = 1700;
// ── Category-gated context sections ──────────────────────────────────────
//
// Shadow-diff (2026-04-22) of 15 v2 pairs showed the LLM pattern-matching
// the loudest numbers (VIX 19.50, top forecast probability, MidEast FX
// stress) into every story — even humanitarian / Rwanda / aviation stories
// with no editorial connection to markets. Root cause: the context block
// passes ALL six bundles for every story, so the LLM has the market /
// forecast numbers in-hand and the prompt's "cite a specific fact"
// instruction does the rest.
//
// Fix: structurally exclude bundles the LLM cannot appropriately cite for
// a given category. Humanitarian stories don't see market data; energy
// stories don't see domestic risk scores; etc. The model physically cannot
// cite what it wasn't given.
//
// Matching is case-insensitive and anchored at a word start, so stems
// match inflected forms anywhere in the story's category slug
// (shared/brief-filter.js:134 — category is free-form like "Humanitarian
// Crisis", "Geopolitical Risk", "Energy"). First match wins. Unknown →
// DEFAULT_SECTIONS (all six — same as pre-gating behavior).
type SectionKey = Exclude<keyof BriefStoryContext, 'degraded'>;
// Per-section caps so no single heavy bundle (e.g. long worldBrief)
// crowds out the others. Ordered by editorial importance: a single-
// sentence summary benefits most from narrative + country framing.
- const SECTION_CAPS: Array<{ key: keyof BriefStoryContext; label: string; cap: number }> = [
+ const SECTION_CAPS: Array<{ key: SectionKey; label: string; cap: number }> = [
{ key: 'worldBrief', label: 'World Brief', cap: 500 },
{ key: 'countryBrief', label: 'Country Brief', cap: 400 },
{ key: 'riskScores', label: 'Risk Scores', cap: 250 },
@@ -64,6 +85,90 @@ const SECTION_CAPS: Array<{ key: keyof BriefStoryContext; label: string; cap: nu
{ key: 'marketData', label: 'Market Data', cap: 200 },
];
const DEFAULT_SECTIONS: SectionKey[] = [
'worldBrief',
'countryBrief',
'riskScores',
'forecasts',
'macroSignals',
'marketData',
];
// NOTE on regex shape: patterns use a LEADING `\b` (start-of-word
// anchor) but NO TRAILING `\b`, so they match stems. "Diplomac" must
// match "Diplomacy" and "Diplomatic"; "migrat" must match "migration"
// and "migrating". A trailing `\b` here incorrectly required the stem
// to end on a non-word char and caused every inflected form to fall
// through to the default policy (caught in unit tests 2026-04-22).
const CATEGORY_SECTION_POLICY: Array<{ match: RegExp; sections: SectionKey[]; label: string }> = [
// Energy / commodity / markets / financial — forecasts + markets matter.
{
label: 'market',
match: /\b(energy|commodit|market|financ|trade|oil|gas|fuel)/i,
sections: ['worldBrief', 'countryBrief', 'forecasts', 'macroSignals', 'marketData'],
},
// Humanitarian / civil / social / rights — NO market, NO forecasts.
// This is the #1 source of the "77% FX stress dragged into a Rwanda
// story" pattern from the 2026-04-22 shadow review.
{
label: 'humanitarian',
match: /\b(humanitarian|refuge|civil|social|rights|genocid|aid\b|migrat)/i,
sections: ['worldBrief', 'countryBrief', 'riskScores'],
},
// Geopolitical risk / conflict / military / security — risk + forecasts
// but not market data (the LLM would otherwise tack on a VIX reading to
// every conflict story).
{
label: 'geopolitical',
match: /\b(geopolit|military|conflict|war\b|terror|securit|defen[cs]e|nuclear)/i,
sections: ['worldBrief', 'countryBrief', 'riskScores', 'forecasts'],
},
// Diplomacy / negotiations — risk + country framing, no market / macro.
{
label: 'diplomacy',
match: /\b(diplomac|negotia|summit|sanction)/i,
sections: ['worldBrief', 'countryBrief', 'riskScores'],
},
// Technology / cyber — world narrative + risk, not markets.
{
label: 'tech',
match: /\b(tech|cyber|a\.?i\b|artificial|algorith|autonom)/i,
sections: ['worldBrief', 'countryBrief', 'riskScores'],
},
// Aviation / airspace / drones — world narrative + risk, NO market /
// forecasts / macro. Named explicitly in the RELEVANCE RULE (shared/
// brief-llm-core.js WHY_MATTERS_ANALYST_SYSTEM_V2) — the prior revision
// of this file only had the prompt-level guard, so aviation categories
// still fell through to DEFAULT_SECTIONS and got all 6 bundles.
// Structural fix ensures the LLM physically cannot cite a forecast
// probability or VIX reading for an aviation story (PR #3281 review).
{
label: 'aviation',
match: /\b(aviation|airspace|flight\b|aircraft|plane\b|drone)/i,
sections: ['worldBrief', 'countryBrief', 'riskScores'],
},
];
/**
* Resolve which context sections are editorially relevant for a given
* story category. Exported for testability — the category → sections
* map is the main lever for tuning analyst output relevance.
*
* @param category — the story's category slug (free-form, from the cron
* payload). `""` or unknown categories fall back to DEFAULT_SECTIONS.
*/
export function sectionsForCategory(category: string): {
sections: SectionKey[];
policyLabel: string;
} {
if (typeof category === 'string' && category.length > 0) {
for (const { match, sections, label } of CATEGORY_SECTION_POLICY) {
if (match.test(category)) return { sections, policyLabel: label };
}
}
return { sections: DEFAULT_SECTIONS, policyLabel: 'default' };
}
function clip(s: string, cap: number): string {
if (typeof s !== 'string' || s.length === 0) return '';
if (s.length <= cap) return s;
@@ -71,15 +176,26 @@ function clip(s: string, cap: number): string {
}
/**
- * Assemble the compact context block. Skips empty sections. Respects
- * a total-chars budget so a bloated single section can't push the
- * prompt over its token limit.
+ * Assemble the compact context block, filtered to sections that
+ * editorially matter for the story's category. Skips empty sections.
+ * Respects a total-chars budget so a bloated single section can't push
+ * the prompt over its token limit.
+ *
+ * @param context — the full BriefStoryContext from assembleBriefStoryContext
+ * @param allowedSections — whitelist from sectionsForCategory(category).
+ * When omitted, all sections allowed (pre-gating behavior — kept for
+ * test backcompat).
*/
- export function buildContextBlock(context: BriefStoryContext): string {
+ export function buildContextBlock(
+ context: BriefStoryContext,
+ allowedSections?: SectionKey[],
+ ): string {
if (!context) return '';
+ const allow = allowedSections ? new Set<SectionKey>(allowedSections) : null;
const parts: string[] = [];
let used = 0;
for (const { key, label, cap } of SECTION_CAPS) {
+ if (allow && !allow.has(key)) continue;
const raw = context[key];
if (typeof raw !== 'string' || raw.trim() === '') continue;
const clipped = clip(raw, cap);
@@ -105,9 +221,10 @@ export function buildContextBlock(context: BriefStoryContext): string {
export function buildAnalystWhyMattersPrompt(
story: StoryForPrompt,
context: BriefStoryContext,
- ): { system: string; user: string } {
+ ): { system: string; user: string; policyLabel: string } {
const safe = sanitizeStoryFields(story);
- const contextBlock = buildContextBlock(context);
+ const { sections: allowedSections, policyLabel } = sectionsForCategory(safe.category);
+ const contextBlock = buildContextBlock(context, allowedSections);
const storyLineList = [
`Headline: ${safe.headline}`,
@@ -119,24 +236,39 @@ export function buildAnalystWhyMattersPrompt(
];
const storyLines = storyLineList.join('\n');
- const sections = [];
+ const parts: string[] = [];
if (contextBlock) {
- sections.push('# Live WorldMonitor Context', contextBlock);
+ parts.push('# Live WorldMonitor Context', contextBlock);
}
- sections.push('# Story', storyLines);
- // Prompt footer matches the system prompt's SITUATION → ANALYSIS →
- // (optional) WATCH arc, but explicitly restates the grounding
- // requirement so the model can't ignore it from the system message
- // alone. Models follow inline instructions more reliably than
- // system-prompt constraints on longer outputs.
- sections.push(
- 'Write 2–3 sentences (40–70 words) on why this story matters, grounded in at ' +
- 'least ONE specific actor / metric / date / place drawn from the context above. ' +
- 'Plain prose, no section labels in the output:',
+ parts.push('# Story', storyLines);
+ // Prompt footer restates the grounding requirement inline (models
+ // follow inline instructions more reliably than system-prompt
+ // constraints on longer outputs), and adds a relevance guardrail.
+ //
+ // Shadow review (2026-04-22, 15 v2 pairs) showed the analyst pattern-
+ // matching loud context numbers — VIX 19.50, top forecast probability,
+ // MidEast FX stress 77 — into humanitarian / aviation / Rwanda stories
+ // regardless of editorial fit. Structural category gating above strips
+ // the worst offenders (markets never reach humanitarian stories), but
+ // category overlap is imperfect: a "Security" story that regex-matches
+ // as geopolitical still gets forecasts, and not every forecast belongs.
+ //
+ // The guardrail below is the second layer: tell the model explicitly
+ // that the category framing it was handed limits what's worth citing,
+ // and that a named actor from the headline / description is a valid
+ // grounding target when no context fact is a clean fit.
+ parts.push(
+ `Write 2–3 sentences (40–70 words) on why this ${safe.category || 'story'} matters, grounded in at ` +
+ "least ONE specific reference. Prefer a fact drawn from the context block above WHEN it clearly " +
+ "relates to this story's category and country. If no context fact is a clean fit, ground " +
+ 'instead in a named actor, place, date, or figure from the headline or description. ' +
+ 'DO NOT force an off-topic market metric, VIX value, FX reading, or forecast probability ' +
+ "into a story where it does not belong. Plain prose, no section labels in the output:",
);
return {
system: WHY_MATTERS_ANALYST_SYSTEM_V2,
- user: sections.join('\n\n'),
+ user: parts.join('\n\n'),
+ policyLabel,
};
}


@@ -144,17 +144,28 @@ export const WHY_MATTERS_ANALYST_SYSTEM_V2 =
'on why the story matters.\n\n' +
'STRUCTURE:\n' +
'1. SITUATION — what is happening right now, grounded in a SPECIFIC named actor, ' +
- 'metric, date, or place drawn from the context.\n' +
+ 'metric, date, or place relevant to this story.\n' +
'2. ANALYSIS — the structural consequence (why this forces a repricing, shifts ' +
'the balance, triggers a cascade).\n' +
'3. (Optional) WATCH — the threshold or indicator to track, if clear from the context.\n\n' +
'HARD CONSTRAINTS:\n' +
'- Total length 40–70 words across 2–3 sentences.\n' +
'- MUST reference at least ONE specific: named person / country / organization / ' +
- 'number / percentage / date / city — drawn from the context, NOT invented.\n' +
+ 'number / percentage / date / city.\n' +
'- No preamble ("This matters because…", "The importance of…").\n' +
'- No markdown, no bullet points, no section labels in the output — plain prose.\n' +
- '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.';
+ '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.\n\n' +
+ 'RELEVANCE RULE (critical, read carefully):\n' +
+ '- The context block may contain facts from world-brief, country-brief, risk scores, ' +
+ 'forecasts, macro signals, and market data. These are BACKGROUND — only cite what is ' +
+ "directly relevant to this story's category and country.\n" +
+ '- If NO context fact clearly fits, ground instead in a named actor, place, date, ' +
+ 'or figure drawn from the headline or description. That is a VALID grounding — do ' +
+ 'NOT invent a market reading, VIX value, or forecast probability to satisfy the rule.\n' +
+ '- NEVER drag an off-topic market metric, FX reading, or probability into a ' +
+ 'humanitarian, aviation, diplomacy, or cyber story. A story about a refugee flow ' +
+ 'does not need a VIX number; a story about a drone incursion does not need an FX ' +
+ "stress reading. If it isn't editorially connected to the story, leave it out.";
/**
* Parse + validate the analyst-path v2 LLM response. Accepts


@@ -18,6 +18,7 @@ import { generateWhyMatters } from '../scripts/lib/brief-llm.mjs';
import {
hashBriefStory,
parseWhyMatters,
parseWhyMattersV2,
WHY_MATTERS_SYSTEM,
} from '../shared/brief-llm-core.js';
@@ -127,6 +128,106 @@ describe('cache key identity', () => {
});
});
// ── parseWhyMattersV2 — analyst-path output validator ───────────────────
//
// This is the only output-validation gate between the analyst LLM and
// the cache envelope: if it returns null the whole response falls back
// to the gemini layer. Its rejection rules differ from v1 (100–500
// char range, multi-sentence preamble list, section-label check) and
// were not previously covered by unit tests (greptile P2, PR #3281).
describe('parseWhyMattersV2 — analyst output validator', () => {
const VALID_MULTI =
"Iran's closure of the Strait of Hormuz on April 21 halts roughly 20% of global seaborne oil. " +
'The disruption forces an immediate repricing of sovereign risk across Gulf energy exporters.';
it('accepts a valid 2-sentence, ~40–70 word output', () => {
const out = parseWhyMattersV2(VALID_MULTI);
assert.equal(out, VALID_MULTI);
});
it('accepts a valid 3-sentence output with optional WATCH arc', () => {
const three =
"Iran's closure of the Strait of Hormuz on April 21 halts roughly 20% of global seaborne oil. " +
'The disruption forces an immediate repricing of sovereign risk across Gulf energy exporters. ' +
'Watch IMF commentary in the next 48 hours for cascading guidance.';
assert.equal(parseWhyMattersV2(three), three);
});
it('rejects output under the 100-char minimum (distinguishes it from v1)', () => {
// v1 accepts short outputs; v2 requires 100+ chars so the model has
// room for SITUATION + ANALYSIS. A short string is "too terse".
assert.equal(parseWhyMattersV2('Short sentence under 100 chars.'), null);
assert.equal(parseWhyMattersV2('x'.repeat(99)), null);
// Boundary: exactly 100 passes.
assert.equal(typeof parseWhyMattersV2('x'.repeat(100)), 'string');
});
it('rejects output over the 500-char cap (prevents runaway essays)', () => {
assert.equal(parseWhyMattersV2('x'.repeat(501)), null);
// Boundary: exactly 500 passes.
assert.equal(typeof parseWhyMattersV2('x'.repeat(500)), 'string');
});
it('rejects banned preamble phrases (v2-specific)', () => {
for (const preamble of [
'This matters because the Strait of Hormuz closure would halt 20% of global oil supply right now and this is very important for analysts.',
'The importance of this event is that oil tankers cannot transit the strait, which forces a global supply rerouting and price shock.',
'It is important to note that Iran has blockaded a critical global shipping chokepoint with real consequences for supply.',
'Importantly, the closure of the Strait of Hormuz disrupts roughly 20% of global seaborne oil flows starting April 21.',
'In summary, the analyst sees this as a major geopolitical escalation with wide-reaching market and security implications.',
'To summarize, the blockade represents a sharp departure from the prior six months of relative calm in the Persian Gulf region.',
]) {
assert.equal(parseWhyMattersV2(preamble), null, `should reject preamble: "${preamble.slice(0, 40)}..."`);
}
});
it('rejects section-label leaks (SITUATION/ANALYSIS/WATCH prefixes)', () => {
for (const leak of [
'SITUATION: Iran has closed the Strait of Hormuz effective April 21, halting roughly 20% of seaborne global oil supply today.',
'ANALYSIS — the disruption forces an immediate global sovereign risk repricing across Gulf exporters including Saudi Arabia and UAE.',
'Watch: IMF commentary for the next 48 hours should give the earliest signal on the cascading global guidance implications.',
]) {
assert.equal(parseWhyMattersV2(leak), null, `should reject label leak: "${leak.slice(0, 40)}..."`);
}
});
it('rejects markdown leakage (bullets, headers, numbered lists)', () => {
for (const md of [
'# The closure of the Strait of Hormuz is the single most material geopolitical event of the quarter for sovereign credit.',
'- Iran has blockaded the Strait of Hormuz, halting roughly 20% of the world seaborne oil on April 21 effective immediately.',
'* The closure of the Strait of Hormuz halts roughly 20% of the world seaborne oil, which forces an immediate price shock today.',
'1. The closure of the Strait of Hormuz halts roughly 20% of seaborne global oil, which forces an immediate sovereign risk repricing.',
]) {
assert.equal(parseWhyMattersV2(md), null, `should reject markdown: "${md.slice(0, 40)}..."`);
}
});
it('rejects the stub echo (same as v1)', () => {
const stub =
'Story flagged by your sensitivity settings — the analyst could not find a clean grounding fact and returned the pre-canned fallback.';
assert.equal(parseWhyMattersV2(stub), null);
});
it('trims surrounding quote marks the model sometimes wraps output in', () => {
const quoted = `"${VALID_MULTI}"`;
assert.equal(parseWhyMattersV2(quoted), VALID_MULTI);
const smart = `\u201C${VALID_MULTI}\u201D`;
assert.equal(parseWhyMattersV2(smart), VALID_MULTI);
});
it('rejects non-string inputs (defensive)', () => {
for (const v of [null, undefined, 123, {}, [], true]) {
assert.equal(parseWhyMattersV2(v), null, `should reject ${typeof v}`);
}
});
it('rejects whitespace-only strings', () => {
assert.equal(parseWhyMattersV2(''), null);
assert.equal(parseWhyMattersV2(' \n\t '), null);
});
});
// ── Deterministic shadow sampling ────────────────────────────────────────
describe('shadow sample deterministic hashing', () => {
@@ -296,7 +397,12 @@ describe('endpoint validation contract', () => {
// test regression on the endpoint flow (see "endpoint end-to-end" below).
const VALID_THREAT = new Set(['critical', 'high', 'medium', 'low']);
const CAPS = { headline: 400, source: 120, category: 80, country: 80 };
- const MAX_BODY_BYTES = 4096;
+ // Must match `api/internal/brief-why-matters.ts:116` — bumped to 8192 in
+ // PR #3269 to accommodate v2 output + description. If this ever drifts
+ // again, the bloated-fixture assertion below silently passes for
+ // payloads in the (OLD_VALUE, NEW_VALUE] range that the real endpoint
+ // now accepts (greptile P2, PR #3281).
+ const MAX_BODY_BYTES = 8192;
function validate(raw) {
if (!raw || typeof raw !== 'object') return { ok: false, msg: 'body' };
@@ -364,7 +470,9 @@ describe('endpoint validation contract', () => {
...story(),
// Artificial oversize payload — would need headline cap bypassed
// to reach in practice, but the total body-byte cap must still fire.
- extra: 'x'.repeat(5000),
+ // Sized well above MAX_BODY_BYTES (8192) so a future bump doesn't
+ // silently invalidate the assertion.
+ extra: 'x'.repeat(10_000),
},
};
assert.ok(measureBytes(bloated) > MAX_BODY_BYTES, 'fixture is oversize');
@@ -480,6 +588,210 @@ describe('buildAnalystWhyMattersPrompt — shape and budget', () => {
});
});
// ── Category-gated context (2026-04-22 formulaic-grounding fix) ──────
//
// Shadow-diff of 15 v2 pairs showed the analyst pattern-matching loud
// context numbers (VIX, top forecast probability, MidEast FX stress)
// into every story regardless of editorial fit. The structural fix is
// to only feed editorially-relevant context bundles per category; the
// prompt-level RELEVANCE RULE is a second-layer guard.
//
// These tests pin the category → sections map so a future "loosen this
// one little thing" edit can't silently re-introduce market metrics
// into humanitarian stories.
describe('sectionsForCategory — structural relevance gating', () => {
let sectionsForCategory;
let builder;
it('loads', async () => {
const mod = await import('../server/worldmonitor/intelligence/v1/brief-why-matters-prompt.ts');
sectionsForCategory = mod.sectionsForCategory;
builder = mod.buildAnalystWhyMattersPrompt;
assert.ok(typeof sectionsForCategory === 'function');
});
it('market/commodity/finance → includes marketData + forecasts, excludes riskScores', () => {
for (const cat of ['Energy', 'Commodity Squeeze', 'Market Activity', 'Financial Stress', 'Oil Markets', 'Trade Policy']) {
const { sections, policyLabel } = sectionsForCategory(cat);
assert.equal(policyLabel, 'market', `${cat} should match market policy`);
assert.ok(sections.includes('marketData'), `${cat} should include marketData`);
assert.ok(sections.includes('forecasts'), `${cat} should include forecasts`);
assert.ok(sections.includes('macroSignals'), `${cat} should include macroSignals`);
assert.ok(!sections.includes('riskScores'), `${cat} should NOT include riskScores`);
}
});
it('humanitarian → excludes marketData AND forecasts (the #1 drift pattern)', () => {
for (const cat of ['Humanitarian Crisis', 'Refugee Flow', 'Civil Unrest', 'Social Upheaval', 'Rights Violation', 'Aid Delivery', 'Migration']) {
const { sections, policyLabel } = sectionsForCategory(cat);
assert.equal(policyLabel, 'humanitarian', `${cat} should match humanitarian policy`);
assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
}
});
it('geopolitical → includes forecasts + riskScores, excludes marketData', () => {
for (const cat of ['Geopolitical Risk', 'Military Posture', 'Conflict', 'War', 'Terrorism', 'Security', 'Nuclear Policy', 'Defense']) {
const { sections, policyLabel } = sectionsForCategory(cat);
assert.equal(policyLabel, 'geopolitical', `${cat} should match geopolitical policy`);
assert.ok(sections.includes('forecasts'), `${cat} should include forecasts`);
assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
}
});
it('diplomacy → riskScores only, no markets/forecasts', () => {
for (const cat of ['Diplomacy', 'Negotiations', 'Summit Meetings', 'Sanctions']) {
const { sections, policyLabel } = sectionsForCategory(cat);
assert.equal(policyLabel, 'diplomacy', `${cat} should match diplomacy policy`);
assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
}
});
it('tech → riskScores only, no markets/forecasts/macro', () => {
for (const cat of ['Tech Policy', 'Cyber Attack', 'AI Regulation', 'Artificial Intelligence', 'Algorithm Abuse', 'Autonomous Systems']) {
const { sections, policyLabel } = sectionsForCategory(cat);
assert.equal(policyLabel, 'tech', `${cat} should match tech policy`);
assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
}
});
it('aviation / airspace / drone → riskScores only, NO markets/forecasts/macro (PR #3281 review fix)', () => {
// Reviewer caught that aviation was named in the RELEVANCE RULE as a
// category banned from off-topic metrics, but had no structural
// regex entry — so "Aviation Incident" / "Airspace Closure" / etc.
// fell through to DEFAULT_SECTIONS and still got all 6 bundles
// including marketData + forecasts + macroSignals. Direct repro
// test so a future regex rewrite can't silently regress.
for (const cat of ['Aviation Incident', 'Airspace Closure', 'Plane Crash', 'Flight Disruption', 'Drone Incursion', 'Aircraft Shot Down']) {
const { sections, policyLabel } = sectionsForCategory(cat);
assert.equal(policyLabel, 'aviation', `${cat} should match aviation policy`);
assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
}
});
it('unknown / empty category → default (all 6 sections, backcompat)', () => {
for (const cat of ['', 'General', 'Sports Event', 'Unknown Thing']) {
const { sections, policyLabel } = sectionsForCategory(cat);
assert.equal(policyLabel, 'default', `"${cat}" should fall through to default`);
// Default must include everything — prevents a regression where
// a refactor accidentally empties the default.
for (const k of ['worldBrief', 'countryBrief', 'riskScores', 'forecasts', 'macroSignals', 'marketData']) {
assert.ok(sections.includes(k), `default policy should include ${k}`);
}
}
});
it('RELEVANCE RULE categories have structural coverage (no prompt-only guards)', () => {
// Meta-invariant: every category named in the system prompt's
// RELEVANCE RULE as banned-from-off-topic-metrics MUST have a
// matching policy entry. A prompt-only guard is too soft because
// models follow inline instructions imperfectly. The list below is
// maintained by hand: when a new category is added to the prompt,
// add it here too, and this test fails until a regex covers it.
for (const cat of ['Humanitarian Crisis', 'Aviation Incident', 'Diplomatic Summit', 'Cyber Attack']) {
const { policyLabel } = sectionsForCategory(cat);
assert.notEqual(
policyLabel,
'default',
`"${cat}" is named in WHY_MATTERS_ANALYST_SYSTEM_V2 as banned from market metrics — it must have a structural policy, not fall through to default`,
);
}
});
it('non-string / null / undefined category → default fallback (defensive)', () => {
for (const cat of [null, undefined, 123, {}, []]) {
const { policyLabel } = sectionsForCategory(cat);
assert.equal(policyLabel, 'default', `non-string ${JSON.stringify(cat)} should fall through to default`);
}
});
it('buildAnalystWhyMattersPrompt — humanitarian story must not see marketData or forecasts', () => {
const humanitarian = {
headline: 'Rwanda hosts fresh Congolese refugees',
source: 'UNHCR',
threatLevel: 'high',
category: 'Humanitarian Crisis',
country: 'RW',
};
const fullContext = {
worldBrief: 'Global migration pressure is at a decade high.',
countryBrief: 'Rwanda has absorbed 100K refugees this quarter.',
riskScores: 'Risk index 62/100 (elevated).',
forecasts: 'Top forecast: Congo ceasefire holds (72% by Q3).',
// Use distinctive values that would never appear in the guardrail
// text — the guardrail mentions "VIX value" / "FX reading" in the
// abstract, so we assert on the concrete numeric fingerprint.
marketData: 'VIX-READING-19-50. EUR/USD 1.0732. Gold $2,380.',
macroSignals: 'MidEastFxStressSentinel-77.',
degraded: false,
};
const { user, policyLabel } = builder(humanitarian, fullContext);
assert.equal(policyLabel, 'humanitarian');
// Structural guarantee: the distinctive context values physically
// cannot appear in the prompt because we didn't pass them to the LLM.
assert.doesNotMatch(user, /VIX-READING-19-50/, 'humanitarian prompt must not include marketData sentinel');
assert.doesNotMatch(user, /EUR\/USD/, 'humanitarian prompt must not include FX pair');
assert.doesNotMatch(user, /Top forecast/, 'humanitarian prompt must not include forecasts');
assert.doesNotMatch(user, /MidEastFxStressSentinel/, 'humanitarian prompt must not include macro signals');
assert.doesNotMatch(user, /## Market Data/, 'humanitarian prompt must not have a Market Data section heading');
assert.doesNotMatch(user, /## Forecasts/, 'humanitarian prompt must not have a Forecasts section heading');
assert.doesNotMatch(user, /## Macro Signals/, 'humanitarian prompt must not have a Macro Signals section heading');
// But country + risk framing must survive.
assert.match(user, /Rwanda has absorbed/);
assert.match(user, /Risk index/);
});
it('buildAnalystWhyMattersPrompt — market story DOES see marketData', () => {
const marketStory = {
headline: 'Crude oil jumps 4% on Houthi tanker strike',
source: 'FT',
threatLevel: 'high',
category: 'Energy',
country: 'YE',
};
const ctx = {
worldBrief: 'Red Sea shipping activity down 35% YoY.',
countryBrief: 'Yemen remains active conflict zone.',
riskScores: 'Risk index 88/100.',
forecasts: 'Top forecast: Houthi attacks continue (83%).',
marketData: 'Brent $87.40. VIX 19.50. USD/SAR flat.',
macroSignals: 'Shipping-stress index at 3-month high.',
degraded: false,
};
const { user, policyLabel } = builder(marketStory, ctx);
assert.equal(policyLabel, 'market');
assert.match(user, /Brent/);
assert.match(user, /Shipping-stress/);
assert.match(user, /Top forecast/);
// Market policy excludes riskScores; otherwise the LLM would tack a
// "country risk 88/100" figure onto every commodity story.
assert.doesNotMatch(user, /Risk index 88/);
});
it('buildAnalystWhyMattersPrompt — prompt footer includes relevance guardrail', () => {
const { user } = builder(
{ headline: 'X', source: 'Y', threatLevel: 'low', category: 'General', country: 'US' },
{ worldBrief: '', countryBrief: '', riskScores: '', forecasts: '', marketData: '', macroSignals: '', degraded: false },
);
// Guardrail phrases — if any of these drops out, the prompt-level
// second-layer guard is broken and we're back to the formulaic v5
// behavior for any story that still hits the default policy.
assert.match(user, /DO NOT force/i, 'guardrail phrase "DO NOT force" must be in footer');
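assert.match(user, /off-topic market metric|VIX|forecast probability/i, 'footer must mention "off-topic market metric", "VIX", or "forecast probability"');
assert.match(user, /named actor, place, date, or figure/, 'guardrail phrase "named actor, place, date, or figure" must be in footer');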
});
});
// ── Env flag parsing (endpoint config resolution) ─────────────────────
describe('endpoint env flag parsing', () => {