mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding (#3281)
* fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding
Shadow-diff of 15 v2 pairs (2026-04-22) showed the analyst pattern-
matching the loudest context numbers — VIX 19.50, top forecast
probability, MidEast FX stress 77 — into every story regardless of
editorial fit. A Rwanda humanitarian story about refugees cited VIX;
an aviation story cited a forecast probability.
Root cause: every story got the same 6-bundle context block, so the
LLM had markets / forecasts / macro in-hand and the "cite a specific
fact" instruction did the rest.
Two-layer fix:
1. STRUCTURAL — sectionsForCategory() maps the story's category to
an editorially-relevant subset of bundles. Humanitarian stories
don't see marketData / forecasts / macroSignals; diplomacy gets
riskScores only (alongside worldBrief and countryBrief);
market/energy gets markets+forecasts but drops
riskScores. The model physically cannot cite what it wasn't
given. Unknown categories fall back to all six (backcompat).
2. PROMPT — WHY_MATTERS_ANALYST_SYSTEM_V2 adds a RELEVANCE RULE
that explicitly permits grounding in headline/description
actors when no context fact is a clean fit, and bans dragging
off-topic market metrics into humanitarian/aviation/diplomacy
stories. The prompt footer (inline, per-call) restates the
same guardrail — models follow inline instructions more
reliably than system-prompt constraints on longer outputs.
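The structural layer described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the six bundle names, `sectionsForCategory`, `CATEGORY_SECTION_POLICY`, `DEFAULT_SECTIONS`, and `policyLabel` come from this commit message, while the regexes, the return shape, and the exact bundle list for the market policy are assumptions.

```typescript
// Hypothetical sketch of category-gated context selection.
// Bundle names are taken from the commit message; everything else is assumed.
type Bundle =
  | 'worldBrief'
  | 'countryBrief'
  | 'riskScores'
  | 'forecasts'
  | 'marketData'
  | 'macroSignals';

interface SectionPolicy {
  label: string;
  pattern: RegExp; // assumed: policies are matched by regex on the category string
  sections: readonly Bundle[];
}

// Unknown categories fall back to all six bundles (backcompat).
const DEFAULT_SECTIONS: readonly Bundle[] = [
  'worldBrief', 'countryBrief', 'riskScores',
  'forecasts', 'marketData', 'macroSignals',
];

// Only three illustrative policies are shown; the regexes are placeholders.
const CATEGORY_SECTION_POLICY: readonly SectionPolicy[] = [
  {
    label: 'humanitarian',
    pattern: /humanitarian|refugee/i,
    sections: ['worldBrief', 'countryBrief', 'riskScores'],
  },
  {
    label: 'diplomacy',
    pattern: /diplomacy|diplomatic/i,
    sections: ['worldBrief', 'countryBrief', 'riskScores'],
  },
  {
    label: 'market',
    pattern: /market|energy/i,
    // Assumption: the message only says "markets+forecasts but drops riskScores".
    sections: ['worldBrief', 'countryBrief', 'marketData', 'forecasts'],
  },
];

function sectionsForCategory(
  category: string,
): { policyLabel: string; sections: readonly Bundle[] } {
  for (const policy of CATEGORY_SECTION_POLICY) {
    if (policy.pattern.test(category)) {
      return { policyLabel: policy.label, sections: policy.sections };
    }
  }
  return { policyLabel: 'default', sections: DEFAULT_SECTIONS };
}
```

With a table like this, a humanitarian category never receives `marketData`, so the prompt builder has nothing market-related to hand the model; the returned `policyLabel` is what the endpoint could log per call.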
Cache keys bumped to invalidate the formulaic v5 output: endpoint
v5 to v6, shadow v3 to v4. Adds 11 unit tests pinning the 5
policies + default fallback + humanitarian structural guarantee +
market policy does-see-markets + guardrail footer presence.
Observability: endpoint now logs policyLabel per call so operators
can confirm in Vercel logs that humanitarian/aviation stories are
NOT seeing marketData without dumping the full prompt.
* test(brief): address greptile P2 — sync MAX_BODY_BYTES + add parseWhyMattersV2 coverage
Greptile PR #3281 review raised two P2 test-quality issues:
1. Test-side MAX_BODY_BYTES mirror was still 4096 — the endpoint
was bumped to 8192 in PR #3269 (v2 output + description). With
the stale constant, a payload in the 4097–8192 range was
accepted by the real endpoint but looked oversize in the test
mirror, letting the body-cap invariant silently drift. Fixed
by syncing to 8192 + bumping the bloated fixture to 10_000
bytes so a future endpoint-cap bump doesn't silently
re-invalidate the assertion.
2. parseWhyMattersV2 (the only output-validation gate on the
analyst path) had no dedicated unit tests. Adds 11 targeted
cases covering: valid 2 and 3 sentence output, 100/500 char
bounds (incl. boundary assertions), all 6 banned preamble
phrases, section-label leaks (SITUATION/ANALYSIS/Watch),
markdown leakage (#, -, *, 1.), stub echo rejection, smart/
plain quote stripping, non-string defensive branch, and
whitespace-only strings.
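For orientation, an output gate covering several of the cases listed above might look like the sketch below. It is an assumption-laden illustration rather than the real `parseWhyMattersV2`: the function name `parseWhyMattersSketch` is hypothetical, only the two preamble phrases quoted in the system prompt are included (the commit mentions six), the 100/500 bounds are treated as inclusive, and stub-echo rejection is omitted.

```typescript
// Hypothetical validator sketch; not the repository's parseWhyMattersV2.
// Only two of the six banned preambles are known from the prompt text.
const BANNED_PREAMBLES: readonly RegExp[] = [
  /^this matters because/i,
  /^the importance of/i,
];

// Leading section labels such as "SITUATION:" or "Watch -" (assumed shapes).
const SECTION_LABEL = /^(situation|analysis|watch)\b\s*[:\u2014-]/i;

// Leading markdown markers: heading, bullet, or numbered list item.
const MARKDOWN_LEAD = /^(#|[-*]\s|\d+\.\s)/;

function parseWhyMattersSketch(raw: unknown): string | null {
  // Defensive branch: anything that is not a string is rejected.
  if (typeof raw !== 'string') return null;

  // Strip surrounding whitespace, then wrapping plain or smart quotes.
  const text = raw
    .trim()
    .replace(/^["'\u201C\u2018]+/, '')
    .replace(/["'\u201D\u2019]+$/, '')
    .trim();

  // Length bounds (assumed inclusive); whitespace-only input fails here.
  if (text.length < 100 || text.length > 500) return null;

  if (BANNED_PREAMBLES.some((re) => re.test(text))) return null;
  if (SECTION_LABEL.test(text)) return null;
  if (MARKDOWN_LEAD.test(text)) return null;

  return text;
}
```

Returning `null` instead of throwing keeps the caller's fallback path simple: a rejected response can be treated the same way as a missing one.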
Suite size: 50 to 61 tests, all green.
* fix(brief): add aviation policy to sectionsForCategory (PR #3281 review P1)
Reviewer caught that aviation was named in WHY_MATTERS_ANALYST_SYSTEM_V2's
RELEVANCE RULE as a category banned from off-topic market metrics, but
had no matching regex entry in CATEGORY_SECTION_POLICY. So 'Aviation
Incident' / 'Airspace Closure' / 'Plane Crash' / 'Drone Incursion' all
fell through to DEFAULT_SECTIONS and still got all 6 bundles including
marketData, forecasts, and macroSignals — exactly the VIX / forecast
probability pattern the PR claimed to structurally prevent.
Reproduced on HEAD before fix:
Aviation Incident -> default
Airspace Closure -> default
Plane Crash -> default
...etc.
Fix:
1. Adds aviation policy (same 3 bundles as humanitarian/diplomacy/
tech: worldBrief, countryBrief, riskScores).
2. Adds dedicated aviation-gating test with 6 category variants.
3. Adds meta-invariant test: every category named in the system
prompt's RELEVANCE RULE MUST have a structural policy entry,
asserting policyLabel !== 'default'. If someone adds a new
category name to the prompt in the future, this test fires
until they wire up a regex — prevents soft-guard drift.
4. Removes 'Aviation Incident' from the default-fall-through test
list (it now correctly matches aviation).
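The meta-invariant in item 3 can be illustrated with a small sketch. The rule sentence is copied from the prompt in this commit; the helper names (`policyLabelFor`, `categoriesNamedInRule`, `unguardedCategories`), the regexes, and the mapping of "cyber" onto a tech policy are assumptions made for the example.

```typescript
// Hypothetical sketch of the "prompt category must have a structural policy" check.
const RELEVANCE_RULE_EXCERPT =
  'NEVER drag an off-topic market metric, FX reading, or probability into a ' +
  'humanitarian, aviation, diplomacy, or cyber story.';

// Placeholder policy table; the real regexes are not shown in the commit.
const POLICY_PATTERNS: ReadonlyArray<[string, RegExp]> = [
  ['humanitarian', /humanitarian|refugee/i],
  ['aviation', /aviation|airspace|plane|drone/i],
  ['diplomacy', /diploma/i],
  ['tech', /cyber|tech/i], // assumption: cyber stories match the tech policy
];

function policyLabelFor(category: string): string {
  for (const [label, pattern] of POLICY_PATTERNS) {
    if (pattern.test(category)) return label;
  }
  return 'default';
}

// Pull the category names out of the "... into a X, Y, or Z story" sentence.
function categoriesNamedInRule(rule: string): string[] {
  const match = rule.match(/into a ([^.]+?) story/);
  if (!match) return [];
  return match[1]
    .split(/,\s*(?:or\s+)?|\s+or\s+/)
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

// Categories the prompt names but no policy matches; should stay empty.
function unguardedCategories(rule: string): string[] {
  return categoriesNamedInRule(rule).filter(
    (name) => policyLabelFor(name) === 'default',
  );
}
```

A test asserting that `unguardedCategories(RELEVANCE_RULE_EXCERPT)` is empty would have failed before the aviation policy existed, which is the drift this commit guards against.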
No cache bump needed — v6 was published to the feature branch only a
few minutes ago, no production entries have been written yet.
@@ -144,17 +144,28 @@ export const WHY_MATTERS_ANALYST_SYSTEM_V2 =
   'on why the story matters.\n\n' +
   'STRUCTURE:\n' +
   '1. SITUATION — what is happening right now, grounded in a SPECIFIC named actor, ' +
-  'metric, date, or place drawn from the context.\n' +
+  'metric, date, or place relevant to this story.\n' +
   '2. ANALYSIS — the structural consequence (why this forces a repricing, shifts ' +
   'the balance, triggers a cascade).\n' +
   '3. (Optional) WATCH — the threshold or indicator to track, if clear from the context.\n\n' +
   'HARD CONSTRAINTS:\n' +
   '- Total length 40–70 words across 2–3 sentences.\n' +
   '- MUST reference at least ONE specific: named person / country / organization / ' +
-  'number / percentage / date / city — drawn from the context, NOT invented.\n' +
+  'number / percentage / date / city.\n' +
   '- No preamble ("This matters because…", "The importance of…").\n' +
   '- No markdown, no bullet points, no section labels in the output — plain prose.\n' +
-  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.';
+  '- Editorial, impersonal, serious. No calls to action, no questions, no quotes.\n\n' +
+  'RELEVANCE RULE (critical, read carefully):\n' +
+  '- The context block may contain facts from world-brief, country-brief, risk scores, ' +
+  'forecasts, macro signals, and market data. These are BACKGROUND — only cite what is ' +
+  "directly relevant to this story's category and country.\n" +
+  '- If NO context fact clearly fits, ground instead in a named actor, place, date, ' +
+  'or figure drawn from the headline or description. That is a VALID grounding — do ' +
+  'NOT invent a market reading, VIX value, or forecast probability to satisfy the rule.\n' +
+  '- NEVER drag an off-topic market metric, FX reading, or probability into a ' +
+  'humanitarian, aviation, diplomacy, or cyber story. A story about a refugee flow ' +
+  'does not need a VIX number; a story about a drone incursion does not need an FX ' +
+  "stress reading. If it isn't editorially connected to the story, leave it out.";

 /**
  * Parse + validate the analyst-path v2 LLM response. Accepts