feat(brief): analyst prompt v2 — multi-sentence, grounded, story description (#3269)

* feat(brief): analyst prompt v2 — multi-sentence, grounded, includes story description

Shadow-diff of 12 prod stories on 2026-04-21 showed v1 analyst output
indistinguishable from legacy Gemini: identical single-sentence
abstraction ("destabilize / systemic / sovereign risk repricing") with
no named actors, metrics, or dates — in several cases Gemini was MORE
specific.

Root cause: 18–30 word cap compressed context specifics out.

v2 loosens three dials at once so we can settle the A/B:

1. New system prompt WHY_MATTERS_ANALYST_SYSTEM_V2 — 2–3 sentences,
   40–70 words, implicit SITUATION→ANALYSIS→(optional) WATCH arc,
   MUST cite one specific named actor / metric / date / place from
   the context. Analyst path only; gemini path stays on v1.

2. New parser parseWhyMattersV2 — accepts 100–500 chars, rejects
   preamble boilerplate + leaked section labels + markdown.

3. Story description plumbed through — endpoint body accepts optional
   story.description (≤ 1000 chars, body cap bumped 4 KB → 8 KB).
   Cron forwards it when upstream has one (skipped when it equals the
   headline — no new signal).
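
The forwarding rule in item 3 can be sketched as a small guard. This is a hypothetical shape, not the actual cron code: only the 1000-char cap and the equals-the-headline skip come from this commit.

```javascript
// Hypothetical helper illustrating the cron-side forwarding rule.
// The 1000-char limit and the "equals the headline" skip are from the
// commit message; the function name and shape are assumptions.
function descriptionToForward(story) {
  const d = typeof story?.description === 'string' ? story.description.trim() : '';
  if (!d || d.length > 1000) return undefined; // absent or over the endpoint cap
  if (d === story.headline) return undefined;  // no new signal vs the headline
  return d;
}
```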

Cache + shadow bumped v3 → v4 / v1 → v2 so fresh output lands on the
first post-deploy cron tick. maxTokens 180 → 260 for ~3× output length.

If shadow-diff 24h after deploy still shows no delta vs gemini, the
kill switch is BRIEF_WHY_MATTERS_PRIMARY=gemini on Vercel (instant, no
redeploy).
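
The kill switch is just an env read; a minimal sketch (BRIEF_WHY_MATTERS_PRIMARY is the real variable, the helper name is hypothetical):

```javascript
// Hypothetical sketch of how the primary-path toggle could be resolved.
// Anything other than an explicit 'gemini' keeps the analyst path primary,
// so flipping the Vercel env var reroutes traffic without a redeploy.
function whyMattersPrimary(env = process.env) {
  return env.BRIEF_WHY_MATTERS_PRIMARY === 'gemini' ? 'gemini' : 'analyst';
}
```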

Tests: 6059 pass (was 6022 + 37 new). typecheck × 2 clean.

* fix(brief): stop truncating v2 multi-sentence output + description in cache hash

Two P1s caught in PR #3269 review.

P1a — cron reparsed endpoint output with v1 single-sentence parser,
silently dropping sentences 2+3 of v2 analyst output. The endpoint had
ALREADY validated the string (parseWhyMattersV2 for analyst path;
parseWhyMatters for gemini). Re-parsing with v1 kept only the first
sentence — the exact regression #3269 was meant to fix.

Fix: trust the endpoint. Replace re-parse with bounds check (30–500
chars) + stub-echo reject. Added a regression test asserting
multi-sentence output reaches the envelope unchanged.
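
The replacement check can be sketched like this; the 30–500 char bounds and the stub-echo reject are from this commit, while the function name and exact shape are illustrative:

```javascript
// Hypothetical cron-side acceptance check that replaces the v1 re-parse.
// The endpoint already validated the prose (parseWhyMattersV2 for the
// analyst path, parseWhyMatters for gemini), so only sanity bounds and
// the stub echo are re-checked here.
function acceptWhyMattersFromEndpoint(text) {
  if (typeof text !== 'string') return null;
  const s = text.trim();
  if (s.length < 30 || s.length > 500) return null;               // bounds only
  if (/^story flagged by your sensitivity/i.test(s)) return null; // stub echo
  return s; // multi-sentence v2 output passes through unchanged
}
```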

P1b — `story.description` flowed into the analyst prompt but NOT into
the cache hash. Two requests with identical core fields but different
descriptions collided on one cache slot → second caller got prose
grounded in the FIRST caller's description.

Fix: add `description` as the 6th field of `hashBriefStory`. Bump
endpoint cache v4→v5 and shadow v2→v3 so buggy 5-field entries are
dropped. Updated the parity sentinel in brief-llm-core.test.mjs to
match 6-field semantics. Added regression tests covering
different-descriptions-differ and present-vs-absent-differ.
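
A self-contained sketch of the six-field identity and the two regression cases. It uses sync `node:crypto` for brevity (the real `hashBriefStory` is async Web Crypto), and the exact field order, with headline and source first, is an assumption:

```javascript
import { createHash } from 'node:crypto';

// Hedged sketch of the v5 cache identity: 5 core fields + description.
// `description` defaults to '' so legacy callers stay deterministic;
// different descriptions now produce different keys.
function hashBriefStorySketch(story) {
  const material = [
    story.headline ?? '',
    story.source ?? '',
    story.threatLevel ?? '',
    story.category ?? '',
    story.country ?? '',
    story.description ?? '', // new in v5: part of cache identity
  ].join('||');
  return createHash('sha256').update(material).digest('hex').slice(0, 16);
}
```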

Tests: 6083 pass. typecheck × 2 clean.
Author: Elie Habib
Date: 2026-04-21 22:25:54 +04:00 (committed by GitHub)
Commit: ec35cf4158 (parent: 048bb8bb52)
10 changed files with 454 additions and 40 deletions


@@ -4,6 +4,9 @@ export interface BriefStoryHashInput {
  threatLevel?: string;
  category?: string;
  country?: string;
  /** v5: part of cache identity so same-story + different description
   * don't collide on cached analyst output. */
  description?: string;
}
export interface BriefStoryPromptInput {
@@ -24,3 +27,7 @@ export function buildWhyMattersUserPrompt(story: BriefStoryPromptInput): {
export function parseWhyMatters(text: unknown): string | null;
export function hashBriefStory(story: BriefStoryHashInput): Promise<string>;
// ── v2 (analyst path only) ────────────────────────────────────────────────
export const WHY_MATTERS_ANALYST_SYSTEM_V2: string;
export function parseWhyMattersV2(text: unknown): string | null;


@@ -69,11 +69,20 @@ export function parseWhyMatters(text) {
 }
 /**
- * Deterministic 16-char hex hash of the five story fields that flow
- * into the whyMatters prompt. Same material as the pre-v3 sync
- * implementation (`scripts/lib/brief-llm.mjs:hashBriefStory`) — a
- * fixed fixture in tests/brief-llm-core.test.mjs pins the output so a
- * future refactor cannot silently invalidate every cached entry.
+ * Deterministic 16-char hex hash of the SIX story fields that flow
+ * into the whyMatters prompt (5 core + description). Cache identity
+ * MUST cover every field that shapes the LLM output, or two requests
+ * with the same core fields but different descriptions will share a
+ * cache entry and the second caller gets prose grounded in the first
+ * caller's description (P1 regression caught in PR #3269 review).
+ *
+ * History:
+ * - pre-v3: 5 fields, sync `node:crypto.createHash`.
+ * - v3: moved to Web Crypto (async), same 5 fields.
+ * - v5 (with endpoint cache bump to brief:llm:whymatters:v5:):
+ *   6 fields — `description` added to match the analyst path's
+ *   v2 prompt which interpolates `Description: <desc>` between
+ *   headline and source.
  *
  * Uses Web Crypto so the module is edge-safe. Returns a Promise because
  * `crypto.subtle.digest` is async; cron call sites are already in an
@@ -85,6 +94,7 @@ export function parseWhyMatters(text) {
* threatLevel?: string;
* category?: string;
* country?: string;
* description?: string;
* }} story
* @returns {Promise<string>}
*/
@@ -95,6 +105,11 @@ export async function hashBriefStory(story) {
    story.threatLevel ?? '',
    story.category ?? '',
    story.country ?? '',
    // New in v5: description is a prompt input on the analyst path,
    // so MUST be part of cache identity. Absent on legacy paths →
    // empty string → deterministic; same-story-same-description pairs
    // still collide on purpose, different descriptions don't.
    story.description ?? '',
  ].join('||');
  const bytes = new TextEncoder().encode(material);
  const digest = await crypto.subtle.digest('SHA-256', bytes);
@@ -105,3 +120,69 @@ export async function hashBriefStory(story) {
  }
  return hex.slice(0, 16);
}
// ── Analyst-path prompt v2 (multi-sentence, grounded) ──────────────────────
//
// Shadow-diff on 12 prod stories (2026-04-21) showed the v1 analyst output
// was indistinguishable from the legacy Gemini-only output: identical
// single-sentence abstraction-speak ("destabilize / systemic / sovereign
// risk repricing") with no named actors, metrics, or dates. Root cause:
// the 18–30 word cap compressed the context's specifics out of the LLM's
// response. v2 loosens to 40–70 words across 2–3 sentences and REQUIRES
// the LLM to ground at least one specific reference from the live context.
/**
 * System prompt for the analyst-path v2 (2–3 sentences, ~40–70 words,
 * grounded in a specific named actor / metric / date / place drawn
 * from the live context). Shape nudged toward the WMAnalyst chat voice
 * (SITUATION → ANALYSIS → optional WATCH) but rendered as plain prose,
 * no section labels in the output.
 */
export const WHY_MATTERS_ANALYST_SYSTEM_V2 =
  'You are the lead analyst at WorldMonitor Brief, a geopolitical intelligence magazine. ' +
  'Using the Live WorldMonitor Context AND the story, write 2–3 sentences (40–70 words total) ' +
  'on why the story matters.\n\n' +
  'STRUCTURE:\n' +
  '1. SITUATION — what is happening right now, grounded in a SPECIFIC named actor, ' +
  'metric, date, or place drawn from the context.\n' +
  '2. ANALYSIS — the structural consequence (why this forces a repricing, shifts ' +
  'the balance, triggers a cascade).\n' +
  '3. (Optional) WATCH — the threshold or indicator to track, if clear from the context.\n\n' +
  'HARD CONSTRAINTS:\n' +
  '- Total length 40–70 words across 2–3 sentences.\n' +
  '- MUST reference at least ONE specific: named person / country / organization / ' +
  'number / percentage / date / city — drawn from the context, NOT invented.\n' +
  '- No preamble ("This matters because…", "The importance of…").\n' +
  '- No markdown, no bullet points, no section labels in the output — plain prose.\n' +
  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.';
/**
 * Parse + validate the analyst-path v2 LLM response. Accepts
 * multi-sentence output (2–3 sentences), 100–500 chars. Otherwise
 * same rejection semantics as v1 (stub echo, empty) plus explicit
 * rejection of preamble boilerplate and leaked section labels.
 *
 * Returns null when the output is obviously wrong so the caller can
 * fall through to the next layer.
 *
 * @param {unknown} text
 * @returns {string | null}
 */
export function parseWhyMattersV2(text) {
  if (typeof text !== 'string') return null;
  let s = text.trim();
  if (!s) return null;
  // Drop surrounding quotes if the model insisted.
  s = s.replace(/^[\u201C"']+/, '').replace(/[\u201D"']+$/, '').trim();
  if (s.length < 100 || s.length > 500) return null;
  // Reject the stub echo (same as v1).
  if (/^story flagged by your sensitivity/i.test(s)) return null;
  // Reject common preamble the system prompt explicitly banned.
  if (/^(this matters because|the importance of|it is important|importantly,|in summary,|to summarize)/i.test(s)) {
    return null;
  }
  // Reject markdown / section-label leakage (we told it to use plain prose).
  if (/^(#|-|\*|\d+\.\s)/.test(s)) return null;
  if (/^(situation|analysis|watch)\s*[:\-–—]/i.test(s)) return null;
  return s;
}
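
For reference, typical accept/reject behavior of the v2 parser. The function body is reproduced from above so the example runs standalone; the sample strings are invented for illustration.

```javascript
// parseWhyMattersV2 reproduced verbatim from brief-llm-core above so this
// example is self-contained.
function parseWhyMattersV2(text) {
  if (typeof text !== 'string') return null;
  let s = text.trim();
  if (!s) return null;
  s = s.replace(/^[\u201C"']+/, '').replace(/[\u201D"']+$/, '').trim();
  if (s.length < 100 || s.length > 500) return null;
  if (/^story flagged by your sensitivity/i.test(s)) return null;
  if (/^(this matters because|the importance of|it is important|importantly,|in summary,|to summarize)/i.test(s)) return null;
  if (/^(#|-|\*|\d+\.\s)/.test(s)) return null;
  if (/^(situation|analysis|watch)\s*[:\-–—]/i.test(s)) return null;
  return s;
}

// Illustrative inputs (invented for the example):
const grounded =
  'Brent crude settled above $95 on 2026-04-20 after loadings at Ras Tanura slipped 8 percent. ' +
  'Sustained prints at that level force importers to reprice freight and insurance within the quarter.';
const labeled =
  'SITUATION: prices rose sharply across the board this week as traders repriced ' +
  'the entire sovereign curve in response to the surprise announcement from the finance ministry.';
```

`grounded` passes (multi-sentence, in bounds, no banned opener); `labeled` is rejected for the leaked section label, as is anything under 100 chars.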