worldmonitor/shared/brief-llm-core.js
Elie Habib 425507d15a fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding (#3281)
* fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding

Shadow-diff of 15 v2 pairs (2026-04-22) showed the analyst pattern-
matching the loudest context numbers — VIX 19.50, top forecast
probability, MidEast FX stress 77 — into every story regardless of
editorial fit. A Rwanda humanitarian story about refugees cited VIX;
an aviation story cited a forecast probability.

Root cause: every story got the same 6-bundle context block, so the
LLM had markets / forecasts / macro in-hand and the "cite a specific
fact" instruction did the rest.

Two-layer fix:

  1. STRUCTURAL — sectionsForCategory() maps the story's category to
     an editorially-relevant subset of bundles. Humanitarian stories
     don't see marketData / forecasts / macroSignals; diplomacy gets
     riskScores only; market/energy gets markets+forecasts but drops
     riskScores. The model physically cannot cite what it wasn't
     given. Unknown categories fall back to all six (backcompat).

  2. PROMPT — WHY_MATTERS_ANALYST_SYSTEM_V2 adds a RELEVANCE RULE
     that explicitly permits grounding in headline/description
     actors when no context fact is a clean fit, and bans dragging
     off-topic market metrics into humanitarian/aviation/diplomacy
     stories. The prompt footer (inline, per-call) restates the
     same guardrail — models follow inline instructions more
     reliably than system-prompt constraints on longer outputs.
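
The structural layer can be sketched as below. The real `sectionsForCategory` lives in the endpoint and its exact regexes are not shown here, so the policy table is a hypothetical shape; only the bundle names, the per-category subsets, and the default fallback are taken from the commit message.

```javascript
// Hypothetical sketch of the category-gating layer; regexes and the exact
// policy-object shape are assumptions, not the real implementation.
const DEFAULT_SECTIONS = [
  'worldBrief', 'countryBrief', 'riskScores',
  'forecasts', 'macroSignals', 'marketData',
];

const CATEGORY_SECTION_POLICY = [
  { label: 'humanitarian', re: /humanitarian|refugee|famine/i,
    sections: ['worldBrief', 'countryBrief', 'riskScores'] },
  { label: 'diplomacy', re: /diplomacy|summit|treaty/i,
    sections: ['worldBrief', 'countryBrief', 'riskScores'] },
  { label: 'market', re: /market|energy/i,
    sections: ['worldBrief', 'countryBrief', 'forecasts', 'macroSignals', 'marketData'] },
];

function sectionsForCategory(category) {
  const hit = CATEGORY_SECTION_POLICY.find((p) => p.re.test(category ?? ''));
  // Unknown categories keep all six bundles for backwards compatibility.
  return hit
    ? { policyLabel: hit.label, sections: hit.sections }
    : { policyLabel: 'default', sections: DEFAULT_SECTIONS };
}
```

The guarantee is structural: a humanitarian story's prompt is assembled only from the three bundles returned here, so the model cannot cite a VIX value it was never given.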

Cache keys bumped to invalidate the formulaic v5 output: endpoint
v5 to v6, shadow v3 to v4. Adds 11 unit tests pinning the 5
policies + default fallback + humanitarian structural guarantee +
market policy does-see-markets + guardrail footer presence.

Observability: endpoint now logs policyLabel per call so operators
can confirm in Vercel logs that humanitarian/aviation stories are
NOT seeing marketData without dumping the full prompt.
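
A minimal sketch of what such a per-call log line could look like; the commit only promises that policyLabel is logged, so the event name and other fields here are assumptions.

```javascript
// Hypothetical shape of the observability line; only policyLabel is
// promised by the commit, the remaining fields are assumed.
function logPolicyChoice(policyLabel, category) {
  const line = JSON.stringify({ evt: 'brief.whyMatters.policy', policyLabel, category });
  console.log(line);
  return line;
}
```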

* test(brief): address greptile P2 — sync MAX_BODY_BYTES + add parseWhyMattersV2 coverage

Greptile PR #3281 review raised two P2 test-quality issues:

1. Test-side MAX_BODY_BYTES mirror was still 4096 — the endpoint
   was bumped to 8192 in PR #3269 (v2 output + description). With
   the stale constant, a payload in the 4097–8192 range was
   accepted by the real endpoint but looked oversize in the test
   mirror, letting the body-cap invariant silently drift. Fixed
   by syncing to 8192 + bumping the bloated fixture to 10_000
   bytes so a future endpoint-cap bump doesn't silently
   re-invalidate the assertion.

2. parseWhyMattersV2 (the only output-validation gate on the
   analyst path) had no dedicated unit tests. Adds 11 targeted
   cases covering: valid 2 and 3 sentence output, 100/500 char
   bounds (incl. boundary assertions), all 6 banned preamble
   phrases, section-label leaks (SITUATION/ANALYSIS/Watch),
   markdown leakage (#, -, *, 1.), stub echo rejection, smart/
   plain quote stripping, non-string defensive branch, and
   whitespace-only strings.

Suite size: 50 to 61 tests, all green.
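
The cap-sync fix in point 1 reduces to keeping two numbers in lockstep, with the fixture deliberately above both. A sketch of the invariant (constant names assumed from the commit text):

```javascript
// Sketch of the body-cap invariant from point 1: the test-side mirror must
// track the endpoint's cap, and the oversize fixture must stay above it so
// a future cap bump cannot silently turn the oversize test into a no-op.
const MAX_BODY_BYTES = 8192;          // synced to the endpoint cap from PR #3269
const BLOATED_FIXTURE_BYTES = 10_000; // headroom above the cap on purpose

const bloatedBody = 'x'.repeat(BLOATED_FIXTURE_BYTES);
const size = new TextEncoder().encode(bloatedBody).length;
if (size <= MAX_BODY_BYTES) {
  throw new Error('fixture no longer exceeds the cap; the oversize assertion is dead');
}
```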

* fix(brief): add aviation policy to sectionsForCategory (PR #3281 review P1)

Reviewer caught that aviation was named in WHY_MATTERS_ANALYST_SYSTEM_V2's
RELEVANCE RULE as a category banned from off-topic market metrics, but
had no matching regex entry in CATEGORY_SECTION_POLICY. So 'Aviation
Incident' / 'Airspace Closure' / 'Plane Crash' / 'Drone Incursion' all
fell through to DEFAULT_SECTIONS and still got all 6 bundles including
marketData, forecasts, and macroSignals — exactly the VIX / forecast
probability pattern the PR claimed to structurally prevent.

Reproduced on HEAD before fix:
  Aviation Incident -> default
  Airspace Closure  -> default
  Plane Crash       -> default
  ...etc.

Fix:
  1. Adds aviation policy (same 3 bundles as humanitarian/diplomacy/
     tech: worldBrief, countryBrief, riskScores).
  2. Adds dedicated aviation-gating test with 6 category variants.
  3. Adds meta-invariant test: every category named in the system
     prompt's RELEVANCE RULE MUST have a structural policy entry,
     asserting policyLabel !== 'default'. If someone adds a new
     category name to the prompt in the future, this test fires
     until they wire up a regex — prevents soft-guard drift.
  4. Removes 'Aviation Incident' from the default-fall-through test
     list (it now correctly matches aviation).

No cache bump needed — v6 was published to the feature branch only a
few minutes ago, no production entries have been written yet.
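
The meta-invariant in point 3 can be sketched as follows. The category parsing and the label set here are stand-ins; the real test calls sectionsForCategory per category and asserts policyLabel !== 'default'.

```javascript
// Sketch of the meta-invariant: every category the RELEVANCE RULE bans from
// market metrics must resolve to a structural policy entry. The label set is
// a stand-in for calling the real sectionsForCategory.
const relevanceRule =
  'NEVER drag an off-topic market metric, FX reading, or probability into a ' +
  'humanitarian, aviation, diplomacy, or cyber story.';

const namedCategories = relevanceRule
  .match(/into a ([a-z, ]+?) story/)[1]
  .split(',')
  .map((part) => part.trim().replace(/^or\s+/, ''));

const structuralPolicyLabels = new Set(['humanitarian', 'aviation', 'diplomacy', 'cyber']);
for (const cat of namedCategories) {
  if (!structuralPolicyLabels.has(cat)) {
    throw new Error(`soft-guard drift: prompt names "${cat}" but no policy matches it`);
  }
}
```

If someone later adds a category name to the prompt without wiring a regex, the loop fires until the structural side catches up.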
2026-04-22 08:21:01 +04:00
// @ts-check
/**
 * Edge-safe pure helpers for the brief LLM enrichment path. Shared by:
 * - scripts/lib/brief-llm.mjs (Railway cron, Node)
 * - api/internal/brief-why-matters.ts (Vercel edge)
 *
 * No `node:*` imports. Hashing via Web Crypto (`crypto.subtle.digest`),
 * which is available in both Edge and modern Node. Everything else is
 * pure string manipulation.
 *
 * Any change here MUST be mirrored byte-for-byte to
 * `scripts/shared/brief-llm-core.js` (enforced by the shared-mirror
 * parity test; see `feedback_shared_dir_mirror_requirement`).
 */
/**
 * System prompt for the one-sentence "why this matters" enrichment.
 * Moved verbatim from scripts/lib/brief-llm.mjs so the edge endpoint
 * and the cron fallback emit the identical editorial voice.
 */
export const WHY_MATTERS_SYSTEM =
  'You are the editor of WorldMonitor Brief, a geopolitical intelligence magazine. ' +
  'For each story below, write ONE concise sentence (18–30 words) explaining the ' +
  'regional or global stakes. Editorial, impersonal, serious. No preamble ' +
  '("This matters because…"), no questions, no calls to action, no markdown, ' +
  'no quotes. One sentence only.';
/**
 * @param {{
 *   headline: string;
 *   source: string;
 *   threatLevel: string;
 *   category: string;
 *   country: string;
 * }} story
 * @returns {{ system: string; user: string }}
 */
export function buildWhyMattersUserPrompt(story) {
  const user = [
    `Headline: ${story.headline}`,
    `Source: ${story.source}`,
    `Severity: ${story.threatLevel}`,
    `Category: ${story.category}`,
    `Country: ${story.country}`,
    '',
    'One editorial sentence on why this matters:',
  ].join('\n');
  return { system: WHY_MATTERS_SYSTEM, user };
}
/**
 * Parse + validate the LLM response into a single editorial sentence.
 * Returns null when the output is obviously wrong (empty, boilerplate
 * preamble that survived stripReasoningPreamble, too short / too long).
 *
 * @param {unknown} text
 * @returns {string | null}
 */
export function parseWhyMatters(text) {
  if (typeof text !== 'string') return null;
  let s = text.trim();
  if (!s) return null;
  s = s.replace(/^[\u201C"']+/, '').replace(/[\u201D"']+$/, '').trim();
  const match = s.match(/^[^.!?]+[.!?]/);
  const sentence = match ? match[0].trim() : s;
  if (sentence.length < 30 || sentence.length > 400) return null;
  if (/^story flagged by your sensitivity/i.test(sentence)) return null;
  return sentence;
}
/**
 * Deterministic 16-char hex hash of the SIX story fields that flow
 * into the whyMatters prompt (5 core + description). Cache identity
 * MUST cover every field that shapes the LLM output, or two requests
 * with the same core fields but different descriptions will share a
 * cache entry and the second caller gets prose grounded in the first
 * caller's description (P1 regression caught in PR #3269 review).
 *
 * History:
 * - pre-v3: 5 fields, sync `node:crypto.createHash`.
 * - v3: moved to Web Crypto (async), same 5 fields.
 * - v5 (with endpoint cache bump to brief:llm:whymatters:v5:):
 *   6 fields — `description` added to match the analyst path's
 *   v2 prompt which interpolates `Description: <desc>` between
 *   headline and source.
 *
 * Uses Web Crypto so the module is edge-safe. Returns a Promise because
 * `crypto.subtle.digest` is async; cron call sites are already in an
 * async context so the await is free.
 *
 * @param {{
 *   headline?: string;
 *   source?: string;
 *   threatLevel?: string;
 *   category?: string;
 *   country?: string;
 *   description?: string;
 * }} story
 * @returns {Promise<string>}
 */
export async function hashBriefStory(story) {
  const material = [
    story.headline ?? '',
    story.source ?? '',
    story.threatLevel ?? '',
    story.category ?? '',
    story.country ?? '',
    // New in v5: description is a prompt input on the analyst path,
    // so MUST be part of cache identity. Absent on legacy paths →
    // empty string → deterministic; same-story-same-description pairs
    // still collide on purpose, different descriptions don't.
    story.description ?? '',
  ].join('||');
  const bytes = new TextEncoder().encode(material);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  let hex = '';
  const view = new Uint8Array(digest);
  for (let i = 0; i < view.length; i++) {
    hex += view[i].toString(16).padStart(2, '0');
  }
  return hex.slice(0, 16);
}
// ── Analyst-path prompt v2 (multi-sentence, grounded) ──────────────────────
//
// Shadow-diff on 12 prod stories (2026-04-21) showed the v1 analyst output
// was indistinguishable from the legacy Gemini-only output: identical
// single-sentence abstraction-speak ("destabilize / systemic / sovereign
// risk repricing") with no named actors, metrics, or dates. Root cause:
// the 18–30 word cap compressed the context's specifics out of the LLM's
// response. v2 loosens to 40–70 words across 2–3 sentences and REQUIRES
// the LLM to ground at least one specific reference from the live context.
/**
 * System prompt for the analyst-path v2 (2–3 sentences, ~40–70 words,
 * grounded in a specific named actor / metric / date / place drawn
 * from the live context). Shape nudged toward the WMAnalyst chat voice
 * (SITUATION → ANALYSIS → optional WATCH) but rendered as plain prose,
 * no section labels in the output.
 */
export const WHY_MATTERS_ANALYST_SYSTEM_V2 =
  'You are the lead analyst at WorldMonitor Brief, a geopolitical intelligence magazine. ' +
  'Using the Live WorldMonitor Context AND the story, write 2–3 sentences (40–70 words total) ' +
  'on why the story matters.\n\n' +
  'STRUCTURE:\n' +
  '1. SITUATION — what is happening right now, grounded in a SPECIFIC named actor, ' +
  'metric, date, or place relevant to this story.\n' +
  '2. ANALYSIS — the structural consequence (why this forces a repricing, shifts ' +
  'the balance, triggers a cascade).\n' +
  '3. (Optional) WATCH — the threshold or indicator to track, if clear from the context.\n\n' +
  'HARD CONSTRAINTS:\n' +
  '- Total length 40–70 words across 2–3 sentences.\n' +
  '- MUST reference at least ONE specific: named person / country / organization / ' +
  'number / percentage / date / city.\n' +
  '- No preamble ("This matters because…", "The importance of…").\n' +
  '- No markdown, no bullet points, no section labels in the output — plain prose.\n' +
  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.\n\n' +
  'RELEVANCE RULE (critical, read carefully):\n' +
  '- The context block may contain facts from world-brief, country-brief, risk scores, ' +
  'forecasts, macro signals, and market data. These are BACKGROUND — only cite what is ' +
  "directly relevant to this story's category and country.\n" +
  '- If NO context fact clearly fits, ground instead in a named actor, place, date, ' +
  'or figure drawn from the headline or description. That is a VALID grounding — do ' +
  'NOT invent a market reading, VIX value, or forecast probability to satisfy the rule.\n' +
  '- NEVER drag an off-topic market metric, FX reading, or probability into a ' +
  'humanitarian, aviation, diplomacy, or cyber story. A story about a refugee flow ' +
  'does not need a VIX number; a story about a drone incursion does not need an FX ' +
  "stress reading. If it isn't editorially connected to the story, leave it out.";
/**
 * Parse + validate the analyst-path v2 LLM response. Accepts
 * multi-sentence output (2–3 sentences), 100–500 chars. Otherwise
 * same rejection semantics as v1 (stub echo, empty) plus explicit
 * rejection of preamble boilerplate and leaked section labels.
 *
 * Returns null when the output is obviously wrong so the caller can
 * fall through to the next layer.
 *
 * @param {unknown} text
 * @returns {string | null}
 */
export function parseWhyMattersV2(text) {
  if (typeof text !== 'string') return null;
  let s = text.trim();
  if (!s) return null;
  // Drop surrounding quotes if the model insisted.
  s = s.replace(/^[\u201C"']+/, '').replace(/[\u201D"']+$/, '').trim();
  if (s.length < 100 || s.length > 500) return null;
  // Reject the stub echo (same as v1).
  if (/^story flagged by your sensitivity/i.test(s)) return null;
  // Reject common preamble the system prompt explicitly banned.
  if (/^(this matters because|the importance of|it is important|importantly,|in summary,|to summarize)/i.test(s)) {
    return null;
  }
  // Reject markdown / section-label leakage (we told it to use plain prose).
  if (/^(#|-|\*|\d+\.\s)/.test(s)) return null;
  if (/^(situation|analysis|watch)\s*[:\-–—]/i.test(s)) return null;
  return s;
}
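// A usage sketch of the rejection semantics. The function is inlined below
// in condensed form so the snippet runs standalone; the inputs are
// illustrative, not real stories:

```javascript
// Condensed copy of parseWhyMattersV2 above so this sketch runs standalone.
function parseV2(text) {
  if (typeof text !== 'string') return null;
  let s = text.trim().replace(/^[\u201C"']+/, '').replace(/[\u201D"']+$/, '').trim();
  if (s.length < 100 || s.length > 500) return null;
  if (/^story flagged by your sensitivity/i.test(s)) return null;
  if (/^(this matters because|the importance of|it is important|importantly,|in summary,|to summarize)/i.test(s)) return null;
  if (/^(#|-|\*|\d+\.\s)/.test(s)) return null;
  if (/^(situation|analysis|watch)\s*[:\-–—]/i.test(s)) return null;
  return s;
}

// Illustrative analyst-style output (within the 100–500 char bounds).
const good =
  'Egypt closed the Rafah crossing on 14 April, cutting the main aid corridor ' +
  'into Gaza; with 60 percent of incoming convoys routed through it, relief ' +
  'agencies now face a structural supply gap.';

if (parseV2(good) === null) throw new Error('valid analyst output was rejected');
if (parseV2('This matters because ' + good) !== null) throw new Error('banned preamble must be rejected');
if (parseV2('SITUATION: ' + good) !== null) throw new Error('leaked section label must be rejected');
if (parseV2('too short') !== null) throw new Error('sub-100-char output must be rejected');
```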