mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding (#3281)
* fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding
Shadow-diff of 15 v2 pairs (2026-04-22) showed the analyst pattern-
matching the loudest context numbers — VIX 19.50, top forecast
probability, MidEast FX stress 77 — into every story regardless of
editorial fit. A Rwanda humanitarian story about refugees cited VIX;
an aviation story cited a forecast probability.
Root cause: every story got the same 6-bundle context block, so the
LLM had markets / forecasts / macro in-hand and the "cite a specific
fact" instruction did the rest.
Two-layer fix:
1. STRUCTURAL — sectionsForCategory() maps the story's category to
an editorially-relevant subset of bundles. Humanitarian stories
don't see marketData / forecasts / macroSignals; diplomacy gets the
two briefs plus riskScores only; market/energy gets markets+forecasts but drops
riskScores. The model physically cannot cite what it wasn't
given. Unknown categories fall back to all six (backcompat).
2. PROMPT — WHY_MATTERS_ANALYST_SYSTEM_V2 adds a RELEVANCE RULE
that explicitly permits grounding in headline/description
actors when no context fact is a clean fit, and bans dragging
off-topic market metrics into humanitarian/aviation/diplomacy
stories. The prompt footer (inline, per-call) restates the
same guardrail — models follow inline instructions more
reliably than system-prompt constraints on longer outputs.
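A reduced sketch of the structural layer, for orientation. The three regexes and section lists are copied from this commit's policy table (the other policies are omitted); `Ctx`, `sectionsFor`, and `gateContext` are hypothetical stand-ins for `BriefStoryContext`, `sectionsForCategory`, and `buildContextBlock`, not the real implementations.

```typescript
// Reduced sketch of category gating. `Ctx`, `sectionsFor`, and `gateContext`
// are illustrative names, not part of the real module.
type SectionKey =
  | "worldBrief" | "countryBrief" | "riskScores"
  | "forecasts" | "macroSignals" | "marketData";
type Ctx = Partial<Record<SectionKey, string>>;

const ALL_SECTIONS: SectionKey[] = [
  "worldBrief", "countryBrief", "riskScores", "forecasts", "macroSignals", "marketData",
];

// First match wins. Patterns carry a leading \b only, so stems such as
// "diplomac" also match "Diplomacy" and "Diplomatic".
const POLICY: Array<{ label: string; match: RegExp; sections: SectionKey[] }> = [
  { label: "market", match: /\b(energy|commodit|market|financ|trade|oil|gas|fuel)/i,
    sections: ["worldBrief", "countryBrief", "forecasts", "macroSignals", "marketData"] },
  { label: "humanitarian", match: /\b(humanitarian|refuge|civil|social|rights|genocid|aid\b|migrat)/i,
    sections: ["worldBrief", "countryBrief", "riskScores"] },
  { label: "diplomacy", match: /\b(diplomac|negotia|summit|sanction)/i,
    sections: ["worldBrief", "countryBrief", "riskScores"] },
];

function sectionsFor(category: string): { label: string; sections: SectionKey[] } {
  for (const p of POLICY) {
    if (p.match.test(category)) return { label: p.label, sections: p.sections };
  }
  return { label: "default", sections: ALL_SECTIONS };
}

// Drop every bundle the category's policy does not allow, so the prompt
// never contains it in the first place.
function gateContext(ctx: Ctx, category: string): Ctx {
  const allowed = new Set<SectionKey>(sectionsFor(category).sections);
  const out: Ctx = {};
  for (const key of ALL_SECTIONS) {
    const value = ctx[key];
    if (allowed.has(key) && value !== undefined) out[key] = value;
  }
  return out;
}

const full: Ctx = { worldBrief: "...", riskScores: "...", marketData: "VIX 19.50" };
const gated = gateContext(full, "Refugee Flow"); // humanitarian policy: no marketData key
```

The point of filtering before prompt assembly, rather than instructing the model to ignore sections, is that a bundle absent from the prompt cannot be cited at all.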
Cache keys bumped to invalidate the formulaic v5 output: endpoint
v5 to v6, shadow v3 to v4. Adds 11 unit tests pinning the 5
policies + default fallback + humanitarian structural guarantee +
market policy does-see-markets + guardrail footer presence.
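Why a version bump is enough to invalidate old prose, as a toy sketch. Only the key format comes from the diff; the `Map`-backed cache and the `cacheKey` helper are assumptions for illustration, since the endpoint's real cache client is not part of this commit.

```typescript
// Toy illustration of version-prefixed cache keys. `cacheKey` is a
// hypothetical helper; only the key format is taken from the commit.
function cacheKey(version: number, hash: string): string {
  return `brief:llm:whymatters:v${version}:${hash}`;
}

const cache = new Map<string, string>();
const storyHash = "abc123"; // placeholder for a hashBriefStory() result

// An entry written before the bump lives under the v5 key.
cache.set(cacheKey(5, storyHash), "formulaic v5 prose");

// After the bump, readers look up v6, so the v5 entry is never hit and
// simply ages out at its TTL.
const hit = cache.get(cacheKey(6, storyHash));
```

No explicit delete is needed: the old entries become unreachable the moment readers switch prefixes.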
Observability: endpoint now logs policyLabel per call so operators
can confirm in Vercel logs that humanitarian/aviation stories are
NOT seeing marketData without dumping the full prompt.
* test(brief): address greptile P2 — sync MAX_BODY_BYTES + add parseWhyMattersV2 coverage
Greptile PR #3281 review raised two P2 test-quality issues:
1. Test-side MAX_BODY_BYTES mirror was still 4096 — the endpoint
was bumped to 8192 in PR #3269 (v2 output + description). With
the stale constant, a payload in the 4097–8192 range was
accepted by the real endpoint but looked oversize in the test
mirror, letting the body-cap invariant silently drift. Fixed
by syncing to 8192 + bumping the bloated fixture to 10_000
bytes so a future endpoint-cap bump doesn't silently
re-invalidate the assertion.
2. parseWhyMattersV2 (the only output-validation gate on the
analyst path) had no dedicated unit tests. Adds 11 targeted
cases covering: valid 2 and 3 sentence output, 100/500 char
bounds (incl. boundary assertions), all 6 banned preamble
phrases, section-label leaks (SITUATION/ANALYSIS/Watch),
markdown leakage (#, -, *, 1.), stub echo rejection, smart/
plain quote stripping, non-string defensive branch, and
whitespace-only strings.
Suite size: 50 to 61 tests, all green.
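For readers without the source open, a hypothetical validator that would satisfy most of the cases listed above. The real parseWhyMattersV2 is not shown in this commit, so every rule here is inferred from the test descriptions, the name `sketchParseV2` is invented, and the stub-echo check is left out because the stub matching rule is not visible.

```typescript
// Hypothetical validator sketch; NOT the real parseWhyMattersV2. Rules are
// inferred from the test list (length bounds, preambles, labels, markdown).
const BANNED_PREAMBLES: RegExp[] = [
  /^this matters because/i,
  /^the importance of/i,
  /^it is important to note/i,
  /^importantly\b/i,
  /^in summary\b/i,
  /^to summarize\b/i,
];
// "SITUATION:", "ANALYSIS \u2014", "Watch:" and similar leading labels.
const SECTION_LABEL = /^(situation|analysis|watch)\s*[:\u2014-]/i;
// Leading "#", "- ", "* ", or "1. ".
const MARKDOWN_LEAD = /^(#|[-*]\s|\d+\.\s)/;

function sketchParseV2(raw: unknown): string | null {
  if (typeof raw !== "string") return null;
  const trimmed = raw.trim();
  // Strip one pair of surrounding plain or smart double quotes, if present.
  const quoted = /^["\u201C]([\s\S]*)["\u201D]$/.exec(trimmed);
  const text = (quoted?.[1] ?? trimmed).trim();
  if (text.length < 100 || text.length > 500) return null;
  if (BANNED_PREAMBLES.some((re) => re.test(text))) return null;
  if (SECTION_LABEL.test(text)) return null;
  if (MARKDOWN_LEAD.test(text)) return null;
  return text;
}
```

Whitespace-only input falls out of the length check after trimming, so it needs no separate branch in this sketch.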
* fix(brief): add aviation policy to sectionsForCategory (PR #3281 review P1)
Reviewer caught that aviation was named in WHY_MATTERS_ANALYST_SYSTEM_V2's
RELEVANCE RULE as a category banned from off-topic market metrics, but
had no matching regex entry in CATEGORY_SECTION_POLICY. So 'Aviation
Incident' / 'Airspace Closure' / 'Plane Crash' / 'Drone Incursion' all
fell through to DEFAULT_SECTIONS and still got all 6 bundles including
marketData, forecasts, and macroSignals — exactly the VIX / forecast
probability pattern the PR claimed to structurally prevent.
Reproduced on HEAD before fix:
Aviation Incident -> default
Airspace Closure -> default
Plane Crash -> default
...etc.
Fix:
1. Adds aviation policy (same 3 bundles as humanitarian/diplomacy/
tech: worldBrief, countryBrief, riskScores).
2. Adds dedicated aviation-gating test with 6 category variants.
3. Adds meta-invariant test: every category named in the system
prompt's RELEVANCE RULE MUST have a structural policy entry,
asserting policyLabel !== 'default'. If someone adds a new
category name to the prompt in the future, this test fires
until they wire up a regex — prevents soft-guard drift.
4. Removes 'Aviation Incident' from the default-fall-through test
list (it now correctly matches aviation).
No cache bump needed — v6 was published to the feature branch only a
few minutes ago, no production entries have been written yet.
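The meta-invariant from item 3 can be sketched as follows. The rule sentence and the four regexes are quoted from this commit; `policyLabelFor` and `namedCategories` are hypothetical helpers, and the actual test may enumerate the category names differently (for example from a hard-coded list rather than by parsing the prompt).

```typescript
// Sketch of the meta-invariant: every category the prompt's RELEVANCE RULE
// names must resolve to a non-default structural policy. The table is
// abridged; `policyLabelFor` and `namedCategories` are hypothetical helpers.
const RULE_SENTENCE =
  "NEVER drag an off-topic market metric, FX reading, or probability into a " +
  "humanitarian, aviation, diplomacy, or cyber story.";

const RULE_POLICY: Array<{ label: string; match: RegExp }> = [
  { label: "humanitarian", match: /\b(humanitarian|refuge|civil|social|rights|genocid|aid\b|migrat)/i },
  { label: "diplomacy", match: /\b(diplomac|negotia|summit|sanction)/i },
  { label: "tech", match: /\b(tech|cyber|a\.?i\b|artificial|algorith|autonom)/i },
  { label: "aviation", match: /\b(aviation|airspace|flight\b|aircraft|plane\b|drone)/i },
];

function policyLabelFor(category: string): string {
  return RULE_POLICY.find((p) => p.match.test(category))?.label ?? "default";
}

// Pull the comma-separated list out of "into a <list> story".
function namedCategories(rule: string): string[] {
  const m = /into a ([a-z, ]+) story/i.exec(rule);
  if (!m || m[1] === undefined) return [];
  return m[1]
    .split(/,\s*(?:or\s+)?/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Any name listed here has a prompt-level guard but no structural one.
const unguarded = namedCategories(RULE_SENTENCE).filter(
  (c) => policyLabelFor(c) === "default",
);
```

With the aviation entry removed from the table, `unguarded` would contain "aviation", which is the drift this commit fixes.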
@@ -215,7 +215,13 @@ function validateStoryBody(raw: unknown): ValidationOk | ValidationErr {
 async function runAnalystPath(story: StoryPayload, iso2: string | null): Promise<string | null> {
   try {
     const context = await assembleBriefStoryContext({ iso2, category: story.category });
-    const { system, user } = buildAnalystWhyMattersPrompt(story, context);
+    const { system, user, policyLabel } = buildAnalystWhyMattersPrompt(story, context);
+    // One line per call so we can verify in Vercel logs that humanitarian
+    // / aviation stories are NOT seeing marketData, without dumping the
+    // full prompt (which would include upstream-provided text).
+    console.log(
+      `[brief-why-matters] analyst gate policy=${policyLabel} category="${story.category}" promptLen=${user.length}`,
+    );
     const result = await callLlmReasoning({
       messages: [
         { role: 'system', content: system },
@@ -362,16 +368,23 @@ export default async function handler(req: Request, ctx?: EdgeContext): Promise<

   // Cache identity.
   const hash = await hashBriefStory(story);
-  // v5: `hashBriefStory` now includes `description` as a prompt input
-  // so same-story + different description no longer collide on a single
-  // cache entry (P1 caught in PR #3269 review — endpoint could serve
-  // prose grounded in a PREVIOUS caller's description). Bumping v4→v5
-  // invalidates the short-lived v4 entries written under the buggy
-  // 5-field hash so fresh output lands on the next cron tick.
-  const cacheKey = `brief:llm:whymatters:v5:${hash}`;
-  // Shadow v2→v3 for the same reason — any v2 comparison pairs may be
-  // grounded in the wrong description, so the A/B was noisy.
-  const shadowKey = `brief:llm:whymatters:shadow:v3:${hash}`;
+  // v6: category-gated context + prompt-level RELEVANCE RULE (2026-04-22).
+  // Shadow review of 15 v2 pairs showed the analyst pattern-matching the
+  // loudest context numbers (VIX, forecast probabilities, FX stress) into
+  // every story regardless of editorial fit. Fix ships two layers:
+  // 1. structural — buildContextBlock now only exposes sections that are
+  // editorially relevant to the story's category (humanitarian stories
+  // don't see market data, aviation doesn't see macro, etc.).
+  // 2. prompt — WHY_MATTERS_ANALYST_SYSTEM_V2 adds a RELEVANCE RULE that
+  // explicitly permits grounding in headline/description actors when
+  // no context fact is a clean fit.
+  // Either layer changes the output distribution enough that v5 prose must
+  // be invalidated — otherwise half the tick's stories would still return
+  // the formulaic v5 strings for up to 24h until TTL.
+  const cacheKey = `brief:llm:whymatters:v6:${hash}`;
+  // Shadow v3→v4 for the same reason — a mid-rollout shadow record
+  // comparing v5-analyst vs gemini is not useful once v6 is live.
+  const shadowKey = `brief:llm:whymatters:shadow:v4:${hash}`;

   // Cache read. Any infrastructure failure → treat as miss (logged).
   let cached: WhyMattersEnvelope | null = null;
@@ -144,17 +144,28 @@ export const WHY_MATTERS_ANALYST_SYSTEM_V2 =
   'on why the story matters.\n\n' +
   'STRUCTURE:\n' +
   '1. SITUATION — what is happening right now, grounded in a SPECIFIC named actor, ' +
-  'metric, date, or place drawn from the context.\n' +
+  'metric, date, or place relevant to this story.\n' +
   '2. ANALYSIS — the structural consequence (why this forces a repricing, shifts ' +
   'the balance, triggers a cascade).\n' +
   '3. (Optional) WATCH — the threshold or indicator to track, if clear from the context.\n\n' +
   'HARD CONSTRAINTS:\n' +
   '- Total length 40–70 words across 2–3 sentences.\n' +
   '- MUST reference at least ONE specific: named person / country / organization / ' +
-  'number / percentage / date / city — drawn from the context, NOT invented.\n' +
+  'number / percentage / date / city.\n' +
   '- No preamble ("This matters because…", "The importance of…").\n' +
   '- No markdown, no bullet points, no section labels in the output — plain prose.\n' +
-  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.';
+  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.\n\n' +
+  'RELEVANCE RULE (critical, read carefully):\n' +
+  '- The context block may contain facts from world-brief, country-brief, risk scores, ' +
+  'forecasts, macro signals, and market data. These are BACKGROUND — only cite what is ' +
+  "directly relevant to this story's category and country.\n" +
+  '- If NO context fact clearly fits, ground instead in a named actor, place, date, ' +
+  'or figure drawn from the headline or description. That is a VALID grounding — do ' +
+  'NOT invent a market reading, VIX value, or forecast probability to satisfy the rule.\n' +
+  '- NEVER drag an off-topic market metric, FX reading, or probability into a ' +
+  'humanitarian, aviation, diplomacy, or cyber story. A story about a refugee flow ' +
+  'does not need a VIX number; a story about a drone incursion does not need an FX ' +
+  "stress reading. If it isn't editorially connected to the story, leave it out.";

 /**
  * Parse + validate the analyst-path v2 LLM response. Accepts
@@ -52,10 +52,31 @@ export function sanitizeStoryFields(story: StoryForPrompt): StoryForPrompt {
 // the LLM call latency stays under ~6s on typical provider responses.
 const CONTEXT_BUDGET_CHARS = 1700;

+// ── Category-gated context sections ──────────────────────────────────────
+//
+// Shadow-diff (2026-04-22) of 15 v2 pairs showed the LLM pattern-matching
+// the loudest numbers (VIX 19.50, top forecast probability, MidEast FX
+// stress) into every story — even humanitarian / Rwanda / aviation stories
+// with no editorial connection to markets. Root cause: the context block
+// passes ALL six bundles for every story, so the LLM has the market /
+// forecast numbers in-hand and the prompt's "cite a specific fact"
+// instruction does the rest.
+//
+// Fix: structurally exclude bundles the LLM cannot appropriately cite for
+// a given category. Humanitarian stories don't see market data; energy
+// stories don't see domestic risk scores; etc. The model physically cannot
+// cite what it wasn't given.
+//
+// Matching is case-insensitive substring on the story's category slug
+// (shared/brief-filter.js:134 — category is free-form like "Humanitarian
+// Crisis", "Geopolitical Risk", "Energy"). First match wins. Unknown →
+// DEFAULT_SECTIONS (all six — same as pre-gating behavior).
+type SectionKey = Exclude<keyof BriefStoryContext, 'degraded'>;
+
 // Per-section caps so no single heavy bundle (e.g. long worldBrief)
 // crowds out the others. Ordered by editorial importance: a single-
 // sentence summary benefits most from narrative + country framing.
-const SECTION_CAPS: Array<{ key: keyof BriefStoryContext; label: string; cap: number }> = [
+const SECTION_CAPS: Array<{ key: SectionKey; label: string; cap: number }> = [
   { key: 'worldBrief', label: 'World Brief', cap: 500 },
   { key: 'countryBrief', label: 'Country Brief', cap: 400 },
   { key: 'riskScores', label: 'Risk Scores', cap: 250 },
@@ -64,6 +85,90 @@ const SECTION_CAPS: Array<{ key: keyof BriefStoryContext; label: string; cap: nu
   { key: 'marketData', label: 'Market Data', cap: 200 },
 ];

+const DEFAULT_SECTIONS: SectionKey[] = [
+  'worldBrief',
+  'countryBrief',
+  'riskScores',
+  'forecasts',
+  'macroSignals',
+  'marketData',
+];
+
+// NOTE on regex shape: patterns use a LEADING `\b` (start-of-word
+// anchor) but NO TRAILING `\b`, so they match stems. "Diplomac" must
+// match "Diplomacy" and "Diplomatic"; "migrat" must match "migration"
+// and "migrating". A trailing `\b` here incorrectly required the stem
+// to end on a non-word char and caused every inflected form to fall
+// through to the default policy (caught in unit tests 2026-04-22).
+const CATEGORY_SECTION_POLICY: Array<{ match: RegExp; sections: SectionKey[]; label: string }> = [
+  // Energy / commodity / markets / financial — forecasts + markets matter.
+  {
+    label: 'market',
+    match: /\b(energy|commodit|market|financ|trade|oil|gas|fuel)/i,
+    sections: ['worldBrief', 'countryBrief', 'forecasts', 'macroSignals', 'marketData'],
+  },
+  // Humanitarian / civil / social / rights — NO market, NO forecasts.
+  // This is the #1 source of the "77% FX stress dragged into a Rwanda
+  // story" pattern from the 2026-04-22 shadow review.
+  {
+    label: 'humanitarian',
+    match: /\b(humanitarian|refuge|civil|social|rights|genocid|aid\b|migrat)/i,
+    sections: ['worldBrief', 'countryBrief', 'riskScores'],
+  },
+  // Geopolitical risk / conflict / military / security — risk + forecasts
+  // but not market data (the LLM would otherwise tack on a VIX reading to
+  // every conflict story).
+  {
+    label: 'geopolitical',
+    match: /\b(geopolit|military|conflict|war\b|terror|securit|defen[cs]e|nuclear)/i,
+    sections: ['worldBrief', 'countryBrief', 'riskScores', 'forecasts'],
+  },
+  // Diplomacy / negotiations — risk + country framing, no market / macro.
+  {
+    label: 'diplomacy',
+    match: /\b(diplomac|negotia|summit|sanction)/i,
+    sections: ['worldBrief', 'countryBrief', 'riskScores'],
+  },
+  // Technology / cyber — world narrative + risk, not markets.
+  {
+    label: 'tech',
+    match: /\b(tech|cyber|a\.?i\b|artificial|algorith|autonom)/i,
+    sections: ['worldBrief', 'countryBrief', 'riskScores'],
+  },
+  // Aviation / airspace / drones — world narrative + risk, NO market /
+  // forecasts / macro. Named explicitly in the RELEVANCE RULE (shared/
+  // brief-llm-core.js WHY_MATTERS_ANALYST_SYSTEM_V2) — the prior revision
+  // of this file only had the prompt-level guard, so aviation categories
+  // still fell through to DEFAULT_SECTIONS and got all 6 bundles.
+  // Structural fix ensures the LLM physically cannot cite a forecast
+  // probability or VIX reading for an aviation story (PR #3281 review).
+  {
+    label: 'aviation',
+    match: /\b(aviation|airspace|flight\b|aircraft|plane\b|drone)/i,
+    sections: ['worldBrief', 'countryBrief', 'riskScores'],
+  },
+];
+
+/**
+ * Resolve which context sections are editorially relevant for a given
+ * story category. Exported for testability — the category → sections
+ * map is the main lever for tuning analyst output relevance.
+ *
+ * @param category — the story's category slug (free-form, from the cron
+ * payload). `""` or unknown categories fall back to DEFAULT_SECTIONS.
+ */
+export function sectionsForCategory(category: string): {
+  sections: SectionKey[];
+  policyLabel: string;
+} {
+  if (typeof category === 'string' && category.length > 0) {
+    for (const { match, sections, label } of CATEGORY_SECTION_POLICY) {
+      if (match.test(category)) return { sections, policyLabel: label };
+    }
+  }
+  return { sections: DEFAULT_SECTIONS, policyLabel: 'default' };
+}
+
 function clip(s: string, cap: number): string {
   if (typeof s !== 'string' || s.length === 0) return '';
   if (s.length <= cap) return s;
@@ -71,15 +176,26 @@ function clip(s: string, cap: number): string {
 }

 /**
- * Assemble the compact context block. Skips empty sections. Respects
- * a total-chars budget so a bloated single section can't push the
- * prompt over its token limit.
+ * Assemble the compact context block, filtered to sections that
+ * editorially matter for the story's category. Skips empty sections.
+ * Respects a total-chars budget so a bloated single section can't push
+ * the prompt over its token limit.
+ *
+ * @param context — the full BriefStoryContext from assembleBriefStoryContext
+ * @param allowedSections — whitelist from sectionsForCategory(category).
+ * When omitted, all sections allowed (pre-gating behavior — kept for
+ * test backcompat).
  */
-export function buildContextBlock(context: BriefStoryContext): string {
+export function buildContextBlock(
+  context: BriefStoryContext,
+  allowedSections?: SectionKey[],
+): string {
   if (!context) return '';
+  const allow = allowedSections ? new Set<SectionKey>(allowedSections) : null;
   const parts: string[] = [];
   let used = 0;
   for (const { key, label, cap } of SECTION_CAPS) {
+    if (allow && !allow.has(key)) continue;
     const raw = context[key];
     if (typeof raw !== 'string' || raw.trim() === '') continue;
     const clipped = clip(raw, cap);
@@ -105,9 +221,10 @@ export function buildContextBlock(context: BriefStoryContext): string {
 export function buildAnalystWhyMattersPrompt(
   story: StoryForPrompt,
   context: BriefStoryContext,
-): { system: string; user: string } {
+): { system: string; user: string; policyLabel: string } {
   const safe = sanitizeStoryFields(story);
-  const contextBlock = buildContextBlock(context);
+  const { sections: allowedSections, policyLabel } = sectionsForCategory(safe.category);
+  const contextBlock = buildContextBlock(context, allowedSections);

   const storyLineList = [
     `Headline: ${safe.headline}`,
@@ -119,24 +236,39 @@ export function buildAnalystWhyMattersPrompt(
   ];
   const storyLines = storyLineList.join('\n');

-  const sections = [];
+  const parts: string[] = [];
   if (contextBlock) {
-    sections.push('# Live WorldMonitor Context', contextBlock);
+    parts.push('# Live WorldMonitor Context', contextBlock);
   }
-  sections.push('# Story', storyLines);
-  // Prompt footer matches the system prompt's SITUATION → ANALYSIS →
-  // (optional) WATCH arc, but explicitly restates the grounding
-  // requirement so the model can't ignore it from the system message
-  // alone. Models follow inline instructions more reliably than
-  // system-prompt constraints on longer outputs.
-  sections.push(
-    'Write 2–3 sentences (40–70 words) on why this story matters, grounded in at ' +
-    'least ONE specific actor / metric / date / place drawn from the context above. ' +
-    'Plain prose, no section labels in the output:',
+  parts.push('# Story', storyLines);
+  // Prompt footer restates the grounding requirement inline (models
+  // follow inline instructions more reliably than system-prompt
+  // constraints on longer outputs), and adds a relevance guardrail.
+  //
+  // Shadow review (2026-04-22, 15 v2 pairs) showed the analyst pattern-
+  // matching loud context numbers — VIX 19.50, top forecast probability,
+  // MidEast FX stress 77 — into humanitarian / aviation / Rwanda stories
+  // regardless of editorial fit. Structural category gating above strips
+  // the worst offenders (markets never reach humanitarian stories), but
+  // category overlap is imperfect: a "Security" story that regex-matches
+  // as geopolitical still gets forecasts, and not every forecast belongs.
+  //
+  // The guardrail below is the second layer: tell the model explicitly
+  // that the category framing it was handed limits what's worth citing,
+  // and that a named actor from the headline / description is a valid
+  // grounding target when no context fact is a clean fit.
+  parts.push(
+    `Write 2–3 sentences (40–70 words) on why this ${safe.category || 'story'} matters, grounded in at ` +
+    "least ONE specific reference. Prefer a fact drawn from the context block above WHEN it clearly " +
+    "relates to this story's category and country. If no context fact is a clean fit, ground " +
+    'instead in a named actor, place, date, or figure from the headline or description. ' +
+    'DO NOT force an off-topic market metric, VIX value, FX reading, or forecast probability ' +
+    "into a story where it does not belong. Plain prose, no section labels in the output:",
   );

   return {
     system: WHY_MATTERS_ANALYST_SYSTEM_V2,
-    user: sections.join('\n\n'),
+    user: parts.join('\n\n'),
+    policyLabel,
   };
 }
@@ -18,6 +18,7 @@ import { generateWhyMatters } from '../scripts/lib/brief-llm.mjs';
 import {
   hashBriefStory,
   parseWhyMatters,
+  parseWhyMattersV2,
   WHY_MATTERS_SYSTEM,
 } from '../shared/brief-llm-core.js';

@@ -127,6 +128,106 @@ describe('cache key identity', () => {
   });
 });

+// ── parseWhyMattersV2 — analyst-path output validator ───────────────────
+//
+// This is the only output-validation gate between the analyst LLM and
+// the cache envelope: if it returns null the whole response falls back
+// to the gemini layer. Its rejection rules differ from v1 (100–500
+// char range, multi-sentence preamble list, section-label check) and
+// were not previously covered by unit tests (greptile P2, PR #3281).
+
+describe('parseWhyMattersV2 — analyst output validator', () => {
+  const VALID_MULTI =
+    "Iran's closure of the Strait of Hormuz on April 21 halts roughly 20% of global seaborne oil. " +
+    'The disruption forces an immediate repricing of sovereign risk across Gulf energy exporters.';
+
+  it('accepts a valid 2-sentence, ~40–70 word output', () => {
+    const out = parseWhyMattersV2(VALID_MULTI);
+    assert.equal(out, VALID_MULTI);
+  });
+
+  it('accepts a valid 3-sentence output with optional WATCH arc', () => {
+    const three =
+      "Iran's closure of the Strait of Hormuz on April 21 halts roughly 20% of global seaborne oil. " +
+      'The disruption forces an immediate repricing of sovereign risk across Gulf energy exporters. ' +
+      'Watch IMF commentary in the next 48 hours for cascading guidance.';
+    assert.equal(parseWhyMattersV2(three), three);
+  });
+
+  it('rejects output under the 100-char minimum (distinguishes it from v1)', () => {
+    // v1 accepts short outputs; v2 requires 100+ chars so the model has
+    // room for SITUATION + ANALYSIS. A short string is "too terse".
+    assert.equal(parseWhyMattersV2('Short sentence under 100 chars.'), null);
+    assert.equal(parseWhyMattersV2('x'.repeat(99)), null);
+    // Boundary: exactly 100 passes.
+    assert.equal(typeof parseWhyMattersV2('x'.repeat(100)), 'string');
+  });
+
+  it('rejects output over the 500-char cap (prevents runaway essays)', () => {
+    assert.equal(parseWhyMattersV2('x'.repeat(501)), null);
+    // Boundary: exactly 500 passes.
+    assert.equal(typeof parseWhyMattersV2('x'.repeat(500)), 'string');
+  });
+
+  it('rejects banned preamble phrases (v2-specific)', () => {
+    for (const preamble of [
+      'This matters because the Strait of Hormuz closure would halt 20% of global oil supply right now and this is very important for analysts.',
+      'The importance of this event is that oil tankers cannot transit the strait, which forces a global supply rerouting and price shock.',
+      'It is important to note that Iran has blockaded a critical global shipping chokepoint with real consequences for supply.',
+      'Importantly, the closure of the Strait of Hormuz disrupts roughly 20% of global seaborne oil flows starting April 21.',
+      'In summary, the analyst sees this as a major geopolitical escalation with wide-reaching market and security implications.',
+      'To summarize, the blockade represents a sharp departure from the prior six months of relative calm in the Persian Gulf region.',
+    ]) {
+      assert.equal(parseWhyMattersV2(preamble), null, `should reject preamble: "${preamble.slice(0, 40)}..."`);
+    }
+  });
+
+  it('rejects section-label leaks (SITUATION/ANALYSIS/WATCH prefixes)', () => {
+    for (const leak of [
+      'SITUATION: Iran has closed the Strait of Hormuz effective April 21, halting roughly 20% of seaborne global oil supply today.',
+      'ANALYSIS — the disruption forces an immediate global sovereign risk repricing across Gulf exporters including Saudi Arabia and UAE.',
+      'Watch: IMF commentary for the next 48 hours should give the earliest signal on the cascading global guidance implications.',
+    ]) {
+      assert.equal(parseWhyMattersV2(leak), null, `should reject label leak: "${leak.slice(0, 40)}..."`);
+    }
+  });
+
+  it('rejects markdown leakage (bullets, headers, numbered lists)', () => {
+    for (const md of [
+      '# The closure of the Strait of Hormuz is the single most material geopolitical event of the quarter for sovereign credit.',
+      '- Iran has blockaded the Strait of Hormuz, halting roughly 20% of the world seaborne oil on April 21 effective immediately.',
+      '* The closure of the Strait of Hormuz halts roughly 20% of the world seaborne oil, which forces an immediate price shock today.',
+      '1. The closure of the Strait of Hormuz halts roughly 20% of seaborne global oil, which forces an immediate sovereign risk repricing.',
+    ]) {
+      assert.equal(parseWhyMattersV2(md), null, `should reject markdown: "${md.slice(0, 40)}..."`);
+    }
+  });
+
+  it('rejects the stub echo (same as v1)', () => {
+    const stub =
+      'Story flagged by your sensitivity settings — the analyst could not find a clean grounding fact and returned the pre-canned fallback.';
+    assert.equal(parseWhyMattersV2(stub), null);
+  });
+
+  it('trims surrounding quote marks the model sometimes wraps output in', () => {
+    const quoted = `"${VALID_MULTI}"`;
+    assert.equal(parseWhyMattersV2(quoted), VALID_MULTI);
+    const smart = `\u201C${VALID_MULTI}\u201D`;
+    assert.equal(parseWhyMattersV2(smart), VALID_MULTI);
+  });
+
+  it('rejects non-string inputs (defensive)', () => {
+    for (const v of [null, undefined, 123, {}, [], true]) {
+      assert.equal(parseWhyMattersV2(v), null, `should reject ${typeof v}`);
+    }
+  });
+
+  it('rejects whitespace-only strings', () => {
+    assert.equal(parseWhyMattersV2(''), null);
+    assert.equal(parseWhyMattersV2(' \n\t '), null);
+  });
+});
+
 // ── Deterministic shadow sampling ────────────────────────────────────────

 describe('shadow sample deterministic hashing', () => {
@@ -296,7 +397,12 @@ describe('endpoint validation contract', () => {
   // test regression on the endpoint flow (see "endpoint end-to-end" below).
   const VALID_THREAT = new Set(['critical', 'high', 'medium', 'low']);
   const CAPS = { headline: 400, source: 120, category: 80, country: 80 };
-  const MAX_BODY_BYTES = 4096;
+  // Must match `api/internal/brief-why-matters.ts:116` — bumped to 8192 in
+  // PR #3269 to accommodate v2 output + description. If this ever drifts
+  // again, the bloated-fixture assertion below silently passes for
+  // payloads in the (OLD_VALUE, NEW_VALUE] range that the real endpoint
+  // now accepts (greptile P2, PR #3281).
+  const MAX_BODY_BYTES = 8192;

   function validate(raw) {
     if (!raw || typeof raw !== 'object') return { ok: false, msg: 'body' };
@@ -364,7 +470,9 @@ describe('endpoint validation contract', () => {
         ...story(),
         // Artificial oversize payload — would need headline cap bypassed
         // to reach in practice, but the total body-byte cap must still fire.
-        extra: 'x'.repeat(5000),
+        // Sized well above MAX_BODY_BYTES (8192) so a future bump doesn't
+        // silently invalidate the assertion.
+        extra: 'x'.repeat(10_000),
       },
     };
     assert.ok(measureBytes(bloated) > MAX_BODY_BYTES, 'fixture is oversize');
@@ -480,6 +588,210 @@ describe('buildAnalystWhyMattersPrompt — shape and budget', () => {
   });
 });

+// ── Category-gated context (2026-04-22 formulaic-grounding fix) ──────
+//
+// Shadow-diff of 15 v2 pairs showed the analyst pattern-matching loud
+// context numbers (VIX, top forecast probability, MidEast FX stress)
+// into every story regardless of editorial fit. The structural fix is
+// to only feed editorially-relevant context bundles per category; the
+// prompt-level RELEVANCE RULE is a second-layer guard.
+//
+// These tests pin the category → sections map so a future "loosen this
+// one little thing" edit can't silently re-introduce market metrics
+// into humanitarian stories.
+
+describe('sectionsForCategory — structural relevance gating', () => {
+  let sectionsForCategory;
+  let builder;
+  it('loads', async () => {
+    const mod = await import('../server/worldmonitor/intelligence/v1/brief-why-matters-prompt.ts');
+    sectionsForCategory = mod.sectionsForCategory;
+    builder = mod.buildAnalystWhyMattersPrompt;
+    assert.ok(typeof sectionsForCategory === 'function');
+  });
+
+  it('market/commodity/finance → includes marketData + forecasts, excludes riskScores', () => {
+    for (const cat of ['Energy', 'Commodity Squeeze', 'Market Activity', 'Financial Stress', 'Oil Markets', 'Trade Policy']) {
+      const { sections, policyLabel } = sectionsForCategory(cat);
+      assert.equal(policyLabel, 'market', `${cat} should match market policy`);
+      assert.ok(sections.includes('marketData'), `${cat} should include marketData`);
+      assert.ok(sections.includes('forecasts'), `${cat} should include forecasts`);
+      assert.ok(sections.includes('macroSignals'), `${cat} should include macroSignals`);
+      assert.ok(!sections.includes('riskScores'), `${cat} should NOT include riskScores`);
+    }
+  });
+
+  it('humanitarian → excludes marketData AND forecasts (the #1 drift pattern)', () => {
+    for (const cat of ['Humanitarian Crisis', 'Refugee Flow', 'Civil Unrest', 'Social Upheaval', 'Rights Violation', 'Aid Delivery', 'Migration']) {
+      const { sections, policyLabel } = sectionsForCategory(cat);
+      assert.equal(policyLabel, 'humanitarian', `${cat} should match humanitarian policy`);
+      assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
|
||||
assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
|
||||
assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
|
||||
assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
|
||||
}
|
||||
});
|
||||
|
||||
it('geopolitical → includes forecasts + riskScores, excludes marketData', () => {
|
||||
for (const cat of ['Geopolitical Risk', 'Military Posture', 'Conflict', 'War', 'Terrorism', 'Security', 'Nuclear Policy', 'Defense']) {
|
||||
const { sections, policyLabel } = sectionsForCategory(cat);
|
||||
assert.equal(policyLabel, 'geopolitical', `${cat} should match geopolitical policy`);
|
||||
assert.ok(sections.includes('forecasts'), `${cat} should include forecasts`);
|
||||
assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
|
||||
assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
|
||||
assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
|
||||
}
|
||||
});
|
||||
|
||||
it('diplomacy → riskScores only, no markets/forecasts', () => {
|
||||
for (const cat of ['Diplomacy', 'Negotiations', 'Summit Meetings', 'Sanctions']) {
|
||||
const { sections, policyLabel } = sectionsForCategory(cat);
|
||||
assert.equal(policyLabel, 'diplomacy', `${cat} should match diplomacy policy`);
|
||||
assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
|
||||
assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
|
||||
assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
|
||||
}
|
||||
});
|
||||
|
||||
it('tech → riskScores only, no markets/forecasts/macro', () => {
|
||||
for (const cat of ['Tech Policy', 'Cyber Attack', 'AI Regulation', 'Artificial Intelligence', 'Algorithm Abuse', 'Autonomous Systems']) {
|
||||
const { sections, policyLabel } = sectionsForCategory(cat);
|
||||
assert.equal(policyLabel, 'tech', `${cat} should match tech policy`);
|
||||
assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
|
||||
assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
|
||||
assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
|
||||
}
|
||||
});
|
||||
|
||||
it('aviation / airspace / drone → riskScores only, NO markets/forecasts/macro (PR #3281 review fix)', () => {
|
||||
// Reviewer caught that aviation was named in the RELEVANCE RULE as a
|
||||
// category banned from off-topic metrics, but had no structural
|
||||
// regex entry — so "Aviation Incident" / "Airspace Closure" / etc.
|
||||
// fell through to DEFAULT_SECTIONS and still got all 6 bundles
|
||||
// including marketData + forecasts + macroSignals. Direct repro
|
||||
// test so a future regex rewrite can't silently regress.
|
||||
for (const cat of ['Aviation Incident', 'Airspace Closure', 'Plane Crash', 'Flight Disruption', 'Drone Incursion', 'Aircraft Shot Down']) {
|
||||
const { sections, policyLabel } = sectionsForCategory(cat);
|
||||
assert.equal(policyLabel, 'aviation', `${cat} should match aviation policy`);
|
||||
assert.ok(sections.includes('riskScores'), `${cat} should include riskScores`);
|
||||
assert.ok(!sections.includes('marketData'), `${cat} must NOT include marketData`);
|
||||
assert.ok(!sections.includes('forecasts'), `${cat} must NOT include forecasts`);
|
||||
assert.ok(!sections.includes('macroSignals'), `${cat} must NOT include macroSignals`);
|
||||
}
|
||||
});
|
||||
|
||||
it('unknown / empty category → default (all 6 sections, backcompat)', () => {
|
||||
for (const cat of ['', 'General', 'Sports Event', 'Unknown Thing']) {
|
||||
const { sections, policyLabel } = sectionsForCategory(cat);
|
||||
assert.equal(policyLabel, 'default', `"${cat}" should fall through to default`);
|
||||
// Default must include everything — prevents a regression where
|
||||
// a refactor accidentally empties the default.
|
||||
for (const k of ['worldBrief', 'countryBrief', 'riskScores', 'forecasts', 'macroSignals', 'marketData']) {
|
||||
assert.ok(sections.includes(k), `default policy should include ${k}`);
|
||||
}
|
||||
}
|
||||
});
|
||||
|
||||
it('RELEVANCE RULE categories have structural coverage (no prompt-only guards)', () => {
|
||||
// Meta-invariant: every category named in the system prompt's
|
||||
// RELEVANCE RULE as banned-from-off-topic-metrics MUST have a
|
||||
// matching policy entry. A prompt-only guard is too soft — models
|
||||
// follow inline instructions imperfectly. If someone adds a new
|
||||
// category to the prompt, this test fires until they add a regex.
|
||||
for (const cat of ['Humanitarian Crisis', 'Aviation Incident', 'Diplomatic Summit', 'Cyber Attack']) {
|
||||
const { policyLabel } = sectionsForCategory(cat);
|
||||
assert.notEqual(
|
||||
policyLabel,
|
||||
'default',
|
||||
`"${cat}" is named in WHY_MATTERS_ANALYST_SYSTEM_V2 as banned from market metrics — it must have a structural policy, not fall through to default`,
|
||||
);
|
||||
}
|
||||
});
|
||||
|
||||
it('non-string / null / undefined category → default fallback (defensive)', () => {
|
||||
for (const cat of [null, undefined, 123, {}, []]) {
|
||||
const { policyLabel } = sectionsForCategory(cat);
|
||||
assert.equal(policyLabel, 'default', `non-string ${JSON.stringify(cat)} should fall through to default`);
|
||||
}
|
||||
});
|
||||
|
||||
it('buildAnalystWhyMattersPrompt — humanitarian story must not see marketData or forecasts', () => {
|
||||
const humanitarian = {
|
||||
headline: 'Rwanda hosts fresh Congolese refugees',
|
||||
source: 'UNHCR',
|
||||
threatLevel: 'high',
|
||||
category: 'Humanitarian Crisis',
|
||||
country: 'RW',
|
||||
};
|
||||
const fullContext = {
|
||||
worldBrief: 'Global migration pressure is at a decade high.',
|
||||
countryBrief: 'Rwanda has absorbed 100K refugees this quarter.',
|
||||
riskScores: 'Risk index 62/100 (elevated).',
|
||||
forecasts: 'Top forecast: Congo ceasefire holds (72% by Q3).',
|
||||
// Use distinctive values that would never appear in the guardrail
|
||||
// text — the guardrail mentions "VIX value" / "FX reading" in the
|
||||
// abstract, so we assert on the concrete numeric fingerprint.
|
||||
marketData: 'VIX-READING-19-50. EUR/USD 1.0732. Gold $2,380.',
|
||||
macroSignals: 'MidEastFxStressSentinel-77.',
|
||||
degraded: false,
|
||||
};
|
||||
const { user, policyLabel } = builder(humanitarian, fullContext);
|
||||
assert.equal(policyLabel, 'humanitarian');
|
||||
// Structural guarantee: the distinctive context values physically
|
||||
// cannot appear in the prompt because we didn't pass them to the LLM.
|
||||
assert.doesNotMatch(user, /VIX-READING-19-50/, 'humanitarian prompt must not include marketData sentinel');
|
||||
assert.doesNotMatch(user, /EUR\/USD/, 'humanitarian prompt must not include FX pair');
|
||||
assert.doesNotMatch(user, /Top forecast/, 'humanitarian prompt must not include forecasts');
|
||||
assert.doesNotMatch(user, /MidEastFxStressSentinel/, 'humanitarian prompt must not include macro signals');
|
||||
assert.doesNotMatch(user, /## Market Data/, 'humanitarian prompt must not have a Market Data section heading');
|
||||
assert.doesNotMatch(user, /## Forecasts/, 'humanitarian prompt must not have a Forecasts section heading');
|
||||
assert.doesNotMatch(user, /## Macro Signals/, 'humanitarian prompt must not have a Macro Signals section heading');
|
||||
// But country + risk framing must survive.
|
||||
assert.match(user, /Rwanda has absorbed/);
|
||||
assert.match(user, /Risk index/);
|
||||
});
|
||||
|
||||
it('buildAnalystWhyMattersPrompt — market story DOES see marketData', () => {
|
||||
const marketStory = {
|
||||
headline: 'Crude oil jumps 4% on Houthi tanker strike',
|
||||
source: 'FT',
|
||||
threatLevel: 'high',
|
||||
category: 'Energy',
|
||||
country: 'YE',
|
||||
};
|
||||
const ctx = {
|
||||
worldBrief: 'Red Sea shipping activity down 35% YoY.',
|
||||
countryBrief: 'Yemen remains active conflict zone.',
|
||||
riskScores: 'Risk index 88/100.',
|
||||
forecasts: 'Top forecast: Houthi attacks continue (83%).',
|
||||
marketData: 'Brent $87.40. VIX 19.50. USD/SAR flat.',
|
||||
macroSignals: 'Shipping-stress index at 3-month high.',
|
||||
degraded: false,
|
||||
};
|
||||
const { user, policyLabel } = builder(marketStory, ctx);
|
||||
assert.equal(policyLabel, 'market');
|
||||
assert.match(user, /Brent/);
|
||||
assert.match(user, /Shipping-stress/);
|
||||
assert.match(user, /Top forecast/);
|
||||
// Market policy excludes riskScores — the LLM would otherwise tack
|
||||
// on a "country risk 88/100" into every commodity story.
|
||||
assert.doesNotMatch(user, /Risk index 88/);
|
||||
});
|
||||
|
||||
it('buildAnalystWhyMattersPrompt — prompt footer includes relevance guardrail', () => {
|
||||
const { user } = builder(
|
||||
{ headline: 'X', source: 'Y', threatLevel: 'low', category: 'General', country: 'US' },
|
||||
{ worldBrief: '', countryBrief: '', riskScores: '', forecasts: '', marketData: '', macroSignals: '', degraded: false },
|
||||
);
|
||||
// Guardrail phrases — if any of these drops out, the prompt-level
|
||||
// second-layer guard is broken and we're back to the formulaic v5
|
||||
// behavior for any story that still hits the default policy.
|
||||
assert.match(user, /DO NOT force/i, 'guardrail phrase "DO NOT force" must be in footer');
|
||||
assert.match(user, /off-topic market metric|VIX|forecast probability/i);
|
||||
assert.match(user, /named actor, place, date, or figure/);
|
||||
});
|
||||
});
|
||||
|
||||
// ── Env flag parsing (endpoint config resolution) ─────────────────────
|
||||
|
||||
describe('endpoint env flag parsing', () => {
|
||||
|
||||
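The `sectionsForCategory` tests above pin a category-to-sections map without showing the function itself. The block below is a minimal sketch of one shape that would satisfy those assertions, not the project's actual implementation. The regex patterns, the `POLICIES` table, the type names, and the choice to keep `worldBrief` and `countryBrief` in every policy are assumptions made for illustration; only the policy labels, the six section names, and the include/exclude rules come from the tests.

```typescript
// Hypothetical sketch: first-match-wins regex table mapping a story
// category to the context bundles the prompt builder may include.
type ContextSection =
  | 'worldBrief' | 'countryBrief' | 'riskScores'
  | 'forecasts' | 'macroSignals' | 'marketData';

interface CategoryPolicy {
  policyLabel: string;
  sections: readonly ContextSection[];
}

// Unknown categories fall back to all six bundles (backcompat).
const DEFAULT_SECTIONS: readonly ContextSection[] = [
  'worldBrief', 'countryBrief', 'riskScores',
  'forecasts', 'macroSignals', 'marketData',
];

// Assumed patterns, chosen only so that the category names used in the
// tests resolve to the expected label. The real patterns may differ.
const POLICIES: ReadonlyArray<{
  label: string;
  pattern: RegExp;
  sections: readonly ContextSection[];
}> = [
  {
    label: 'market',
    pattern: /\b(?:energy|commodit|market|financ|oil|trade)/i,
    sections: ['worldBrief', 'countryBrief', 'forecasts', 'macroSignals', 'marketData'],
  },
  {
    label: 'humanitarian',
    pattern: /\b(?:humanitarian|refugee|civil|unrest|social|upheaval|rights|aid\b|migra)/i,
    sections: ['worldBrief', 'countryBrief', 'riskScores'],
  },
  {
    label: 'geopolitical',
    pattern: /\b(?:geopolit|military|conflict|war\b|terror|security|nuclear|defen[cs]e)/i,
    sections: ['worldBrief', 'countryBrief', 'riskScores', 'forecasts'],
  },
  {
    label: 'diplomacy',
    pattern: /\b(?:diploma|negotiat|summit|sanction)/i,
    sections: ['worldBrief', 'countryBrief', 'riskScores'],
  },
  {
    label: 'tech',
    pattern: /\b(?:tech|cyber|ai\b|artificial|algorithm|autonomous)/i,
    sections: ['worldBrief', 'countryBrief', 'riskScores'],
  },
  {
    label: 'aviation',
    pattern: /\b(?:aviation|airspace|aircraft|plane|flight|drone)/i,
    sections: ['worldBrief', 'countryBrief', 'riskScores'],
  },
];

function sectionsForCategory(category: unknown): CategoryPolicy {
  // Non-string input (null, numbers, objects) goes straight to default.
  if (typeof category === 'string') {
    for (const p of POLICIES) {
      if (p.pattern.test(category)) {
        return { policyLabel: p.label, sections: p.sections };
      }
    }
  }
  return { policyLabel: 'default', sections: DEFAULT_SECTIONS };
}
```

Because a gated bundle is simply absent from `sections`, a prompt builder that iterates over the returned list cannot emit a `## Market Data` heading for a humanitarian story, which is the structural guarantee the tests assert on.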