feat(brief): analyst prompt v2 — multi-sentence, grounded, story description (#3269)

* feat(brief): analyst prompt v2 — multi-sentence, grounded, includes story description

Shadow-diff of 12 prod stories on 2026-04-21 showed v1 analyst output
indistinguishable from legacy Gemini: identical single-sentence
abstraction ("destabilize / systemic / sovereign risk repricing") with
no named actors, metrics, or dates — in several cases Gemini was MORE
specific.

Root cause: 18–30 word cap compressed context specifics out.

v2 loosens three dials at once so we can settle the A/B:

1. New system prompt WHY_MATTERS_ANALYST_SYSTEM_V2 — 2–3 sentences,
   40–70 words, implicit SITUATION→ANALYSIS→(optional) WATCH arc,
   MUST cite one specific named actor / metric / date / place from
   the context. Analyst path only; gemini path stays on v1.

2. New parser parseWhyMattersV2 — accepts 100–500 chars, rejects
   preamble boilerplate + leaked section labels + markdown.

3. Story description plumbed through — endpoint body accepts optional
   story.description (≤ 1000 chars, body cap bumped 4 KB → 8 KB).
   Cron forwards it when upstream has one (skipped when it equals the
   headline — no new signal).
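
The forwarding rule in item 3 can be sketched as a small guard. This is a hypothetical shape, not the actual cron code: only the 1000-char cap and the equals-the-headline skip come from this commit.

```javascript
// Hypothetical helper illustrating the cron-side forwarding rule.
// The 1000-char limit and the "equals the headline" skip are from the
// commit message; the function name and shape are assumptions.
function descriptionToForward(story) {
  const d = typeof story?.description === 'string' ? story.description.trim() : '';
  if (!d || d.length > 1000) return undefined; // absent or over the endpoint cap
  if (d === story.headline) return undefined;  // no new signal vs the headline
  return d;
}
```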

Cache + shadow bumped v3 → v4 / v1 → v2 so fresh output lands on the
first post-deploy cron tick. maxTokens 180 → 260 for ~3× output length.

If shadow-diff 24h after deploy still shows no delta vs gemini, the
kill switch is BRIEF_WHY_MATTERS_PRIMARY=gemini on Vercel (instant, no
redeploy).
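
The kill switch is just an env read; a minimal sketch (BRIEF_WHY_MATTERS_PRIMARY is the real variable, the helper name is hypothetical):

```javascript
// Hypothetical sketch of how the primary-path toggle could be resolved.
// Anything other than an explicit 'gemini' keeps the analyst path primary,
// so flipping the Vercel env var reroutes traffic without a redeploy.
function whyMattersPrimary(env = process.env) {
  return env.BRIEF_WHY_MATTERS_PRIMARY === 'gemini' ? 'gemini' : 'analyst';
}
```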

Tests: 6059 pass (was 6022 + 37 new). typecheck × 2 clean.

* fix(brief): stop truncating v2 multi-sentence output + description in cache hash

Two P1s caught in PR #3269 review.

P1a — cron reparsed endpoint output with v1 single-sentence parser,
silently dropping sentences 2+3 of v2 analyst output. The endpoint had
ALREADY validated the string (parseWhyMattersV2 for analyst path;
parseWhyMatters for gemini). Re-parsing with v1 kept only the first
sentence — the exact regression #3269 was meant to fix.

Fix: trust the endpoint. Replace re-parse with bounds check (30–500
chars) + stub-echo reject. Added a regression test asserting
multi-sentence output reaches the envelope unchanged.
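
The replacement check can be sketched like this; the 30–500 char bounds and the stub-echo reject are from this commit, while the function name and exact shape are illustrative:

```javascript
// Hypothetical cron-side acceptance check that replaces the v1 re-parse.
// The endpoint already validated the prose (parseWhyMattersV2 for the
// analyst path, parseWhyMatters for gemini), so only sanity bounds and
// the stub echo are re-checked here.
function acceptWhyMattersFromEndpoint(text) {
  if (typeof text !== 'string') return null;
  const s = text.trim();
  if (s.length < 30 || s.length > 500) return null;               // bounds only
  if (/^story flagged by your sensitivity/i.test(s)) return null; // stub echo
  return s; // multi-sentence v2 output passes through unchanged
}
```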

P1b — `story.description` flowed into the analyst prompt but NOT into
the cache hash. Two requests with identical core fields but different
descriptions collided on one cache slot → second caller got prose
grounded in the FIRST caller's description.

Fix: add `description` as the 6th field of `hashBriefStory`. Bump
endpoint cache v4→v5 and shadow v2→v3 so buggy 5-field entries are
dropped. Updated the parity sentinel in brief-llm-core.test.mjs to
match 6-field semantics. Added regression tests covering
different-descriptions-differ and present-vs-absent-differ.
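
A self-contained sketch of the six-field identity and the two regression cases. It uses sync `node:crypto` for brevity (the real `hashBriefStory` is async Web Crypto), and the exact field order, with headline and source first, is an assumption:

```javascript
import { createHash } from 'node:crypto';

// Hedged sketch of the v5 cache identity: 5 core fields + description.
// `description` defaults to '' so legacy callers stay deterministic;
// different descriptions now produce different keys.
function hashBriefStorySketch(story) {
  const material = [
    story.headline ?? '',
    story.source ?? '',
    story.threatLevel ?? '',
    story.category ?? '',
    story.country ?? '',
    story.description ?? '', // new in v5: part of cache identity
  ].join('||');
  return createHash('sha256').update(material).digest('hex').slice(0, 16);
}
```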

Tests: 6083 pass. typecheck × 2 clean.
Author: Elie Habib
Date: 2026-04-21 22:25:54 +04:00 (committed by GitHub)
Commit: ec35cf4158 (parent: 048bb8bb52)
10 changed files with 454 additions and 40 deletions


@@ -4,6 +4,9 @@ export interface BriefStoryHashInput {
  threatLevel?: string;
  category?: string;
  country?: string;
  /** v5: part of cache identity so same-story + different description
   * don't collide on cached analyst output. */
  description?: string;
}
export interface BriefStoryPromptInput {
@@ -24,3 +27,7 @@ export function buildWhyMattersUserPrompt(story: BriefStoryPromptInput): {
export function parseWhyMatters(text: unknown): string | null;
export function hashBriefStory(story: BriefStoryHashInput): Promise<string>;
// ── v2 (analyst path only) ────────────────────────────────────────────────
export const WHY_MATTERS_ANALYST_SYSTEM_V2: string;
export function parseWhyMattersV2(text: unknown): string | null;


@@ -69,11 +69,20 @@ export function parseWhyMatters(text) {
 }
 /**
- * Deterministic 16-char hex hash of the five story fields that flow
- * into the whyMatters prompt. Same material as the pre-v3 sync
- * implementation (`scripts/lib/brief-llm.mjs:hashBriefStory`) — a
- * fixed fixture in tests/brief-llm-core.test.mjs pins the output so a
- * future refactor cannot silently invalidate every cached entry.
+ * Deterministic 16-char hex hash of the SIX story fields that flow
+ * into the whyMatters prompt (5 core + description). Cache identity
+ * MUST cover every field that shapes the LLM output, or two requests
+ * with the same core fields but different descriptions will share a
+ * cache entry and the second caller gets prose grounded in the first
+ * caller's description (P1 regression caught in PR #3269 review).
+ *
+ * History:
+ * - pre-v3: 5 fields, sync `node:crypto.createHash`.
+ * - v3: moved to Web Crypto (async), same 5 fields.
+ * - v5 (with endpoint cache bump to brief:llm:whymatters:v5:):
+ *   6 fields — `description` added to match the analyst path's
+ *   v2 prompt which interpolates `Description: <desc>` between
+ *   headline and source.
  *
  * Uses Web Crypto so the module is edge-safe. Returns a Promise because
  * `crypto.subtle.digest` is async; cron call sites are already in an
@@ -85,6 +94,7 @@ export function parseWhyMatters(text) {
* threatLevel?: string;
* category?: string;
* country?: string;
* description?: string;
* }} story
* @returns {Promise<string>}
*/
@@ -95,6 +105,11 @@ export async function hashBriefStory(story) {
    story.threatLevel ?? '',
    story.category ?? '',
    story.country ?? '',
    // New in v5: description is a prompt input on the analyst path,
    // so MUST be part of cache identity. Absent on legacy paths →
    // empty string → deterministic; same-story-same-description pairs
    // still collide on purpose, different descriptions don't.
    story.description ?? '',
  ].join('||');
  const bytes = new TextEncoder().encode(material);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
@@ -105,3 +120,69 @@ export async function hashBriefStory(story) {
  }
  return hex.slice(0, 16);
}
// ── Analyst-path prompt v2 (multi-sentence, grounded) ──────────────────────
//
// Shadow-diff on 12 prod stories (2026-04-21) showed the v1 analyst output
// was indistinguishable from the legacy Gemini-only output: identical
// single-sentence abstraction-speak ("destabilize / systemic / sovereign
// risk repricing") with no named actors, metrics, or dates. Root cause:
// the 18–30 word cap compressed the context's specifics out of the LLM's
// response. v2 loosens to 40–70 words across 2–3 sentences and REQUIRES
// the LLM to ground at least one specific reference from the live context.
/**
 * System prompt for the analyst-path v2 (2–3 sentences, ~40–70 words,
 * grounded in a specific named actor / metric / date / place drawn
 * from the live context). Shape nudged toward the WMAnalyst chat voice
 * (SITUATION → ANALYSIS → optional WATCH) but rendered as plain prose,
 * no section labels in the output.
 */
export const WHY_MATTERS_ANALYST_SYSTEM_V2 =
  'You are the lead analyst at WorldMonitor Brief, a geopolitical intelligence magazine. ' +
  'Using the Live WorldMonitor Context AND the story, write 2–3 sentences (40–70 words total) ' +
  'on why the story matters.\n\n' +
  'STRUCTURE:\n' +
  '1. SITUATION — what is happening right now, grounded in a SPECIFIC named actor, ' +
  'metric, date, or place drawn from the context.\n' +
  '2. ANALYSIS — the structural consequence (why this forces a repricing, shifts ' +
  'the balance, triggers a cascade).\n' +
  '3. (Optional) WATCH — the threshold or indicator to track, if clear from the context.\n\n' +
  'HARD CONSTRAINTS:\n' +
  '- Total length 40–70 words across 2–3 sentences.\n' +
  '- MUST reference at least ONE specific: named person / country / organization / ' +
  'number / percentage / date / city — drawn from the context, NOT invented.\n' +
  '- No preamble ("This matters because…", "The importance of…").\n' +
  '- No markdown, no bullet points, no section labels in the output — plain prose.\n' +
  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.';
/**
 * Parse + validate the analyst-path v2 LLM response. Accepts
 * multi-sentence output (2–3 sentences), 100–500 chars. Otherwise
 * same rejection semantics as v1 (stub echo, empty) plus explicit
 * rejection of preamble boilerplate and leaked section labels.
 *
 * Returns null when the output is obviously wrong so the caller can
 * fall through to the next layer.
 *
 * @param {unknown} text
 * @returns {string | null}
 */
export function parseWhyMattersV2(text) {
  if (typeof text !== 'string') return null;
  let s = text.trim();
  if (!s) return null;
  // Drop surrounding quotes if the model insisted.
  s = s.replace(/^[\u201C"']+/, '').replace(/[\u201D"']+$/, '').trim();
  if (s.length < 100 || s.length > 500) return null;
  // Reject the stub echo (same as v1).
  if (/^story flagged by your sensitivity/i.test(s)) return null;
  // Reject common preamble the system prompt explicitly banned.
  if (/^(this matters because|the importance of|it is important|importantly,|in summary,|to summarize)/i.test(s)) {
    return null;
  }
  // Reject markdown / section-label leakage (we told it to use plain prose).
  if (/^(#|-|\*|\d+\.\s)/.test(s)) return null;
  if (/^(situation|analysis|watch)\s*[:\-–—]/i.test(s)) return null;
  return s;
}
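
For reference, typical accept/reject behavior of the v2 parser. The function body is reproduced from above so the example runs standalone; the sample strings are invented for illustration.

```javascript
// parseWhyMattersV2 reproduced verbatim from brief-llm-core above so this
// example is self-contained.
function parseWhyMattersV2(text) {
  if (typeof text !== 'string') return null;
  let s = text.trim();
  if (!s) return null;
  s = s.replace(/^[\u201C"']+/, '').replace(/[\u201D"']+$/, '').trim();
  if (s.length < 100 || s.length > 500) return null;
  if (/^story flagged by your sensitivity/i.test(s)) return null;
  if (/^(this matters because|the importance of|it is important|importantly,|in summary,|to summarize)/i.test(s)) return null;
  if (/^(#|-|\*|\d+\.\s)/.test(s)) return null;
  if (/^(situation|analysis|watch)\s*[:\-–—]/i.test(s)) return null;
  return s;
}

// Illustrative inputs (invented for the example):
const grounded =
  'Brent crude settled above $95 on 2026-04-20 after loadings at Ras Tanura slipped 8 percent. ' +
  'Sustained prints at that level force importers to reprice freight and insurance within the quarter.';
const labeled =
  'SITUATION: prices rose sharply across the board this week as traders repriced ' +
  'the entire sovereign curve in response to the surprise announcement from the finance ministry.';
```

`grounded` passes (multi-sentence, in bounds, no banned opener); `labeled` is rejected for the leaked section label, as is anything under 100 chars.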