Files
worldmonitor/api
Elie Habib 425507d15a fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding (#3281)
* fix(brief): category-gated context + RELEVANCE RULE to stop formulaic grounding

Shadow-diff of 15 v2 pairs (2026-04-22) showed the analyst pattern-
matching the loudest context numbers — VIX 19.50, top forecast
probability, MidEast FX stress 77 — into every story regardless of
editorial fit. A Rwanda humanitarian story about refugees cited VIX;
an aviation story cited a forecast probability.

Root cause: every story got the same 6-bundle context block, so the
LLM had markets / forecasts / macro in-hand and the "cite a specific
fact" instruction did the rest.

Two-layer fix:

  1. STRUCTURAL — sectionsForCategory() maps the story's category to
     an editorially-relevant subset of bundles. Humanitarian stories
     don't see marketData / forecasts / macroSignals; diplomacy gets
     riskScores only; market/energy gets markets+forecasts but drops
     riskScores. The model physically cannot cite what it wasn't
     given. Unknown categories fall back to all six (backcompat).

  2. PROMPT — WHY_MATTERS_ANALYST_SYSTEM_V2 adds a RELEVANCE RULE
     that explicitly permits grounding in headline/description
     actors when no context fact is a clean fit, and bans dragging
     off-topic market metrics into humanitarian/aviation/diplomacy
     stories. The prompt footer (inline, per-call) restates the
     same guardrail — models follow inline instructions more
     reliably than system-prompt constraints on longer outputs.

Cache keys bumped to invalidate the formulaic v5 output: endpoint
v5 to v6, shadow v3 to v4. Adds 11 unit tests pinning the 5
policies + default fallback + humanitarian structural guarantee +
market policy does-see-markets + guardrail footer presence.

Observability: endpoint now logs policyLabel per call so operators
can confirm in Vercel logs that humanitarian/aviation stories are
NOT seeing marketData without dumping the full prompt.

* test(brief): address greptile P2 — sync MAX_BODY_BYTES + add parseWhyMattersV2 coverage

Greptile PR #3281 review raised two P2 test-quality issues:

1. Test-side MAX_BODY_BYTES mirror was still 4096 — the endpoint
   was bumped to 8192 in PR #3269 (v2 output + description). With
   the stale constant, a payload in the 4097–8192 range was
   accepted by the real endpoint but looked oversize in the test
   mirror, letting the body-cap invariant silently drift. Fixed
   by syncing to 8192 + bumping the bloated fixture to 10_000
   bytes so a future endpoint-cap bump doesn't silently
   re-invalidate the assertion.

2. parseWhyMattersV2 (the only output-validation gate on the
   analyst path) had no dedicated unit tests. Adds 11 targeted
   cases covering: valid 2 and 3 sentence output, 100/500 char
   bounds (incl. boundary assertions), all 6 banned preamble
   phrases, section-label leaks (SITUATION/ANALYSIS/Watch),
   markdown leakage (#, -, *, 1.), stub echo rejection, smart/
   plain quote stripping, non-string defensive branch, and
   whitespace-only strings.

Suite size: 50 to 61 tests, all green.

* fix(brief): add aviation policy to sectionsForCategory (PR #3281 review P1)

Reviewer caught that aviation was named in WHY_MATTERS_ANALYST_SYSTEM_V2's
RELEVANCE RULE as a category banned from off-topic market metrics, but
had no matching regex entry in CATEGORY_SECTION_POLICY. So 'Aviation
Incident' / 'Airspace Closure' / 'Plane Crash' / 'Drone Incursion' all
fell through to DEFAULT_SECTIONS and still got all 6 bundles including
marketData, forecasts, and macroSignals — exactly the VIX / forecast
probability pattern the PR claimed to structurally prevent.

Reproduced on HEAD before fix:
  Aviation Incident -> default
  Airspace Closure  -> default
  Plane Crash       -> default
  ...etc.

Fix:
  1. Adds aviation policy (same 3 bundles as humanitarian/diplomacy/
     tech: worldBrief, countryBrief, riskScores).
  2. Adds dedicated aviation-gating test with 6 category variants.
  3. Adds meta-invariant test: every category named in the system
     prompt's RELEVANCE RULE MUST have a structural policy entry,
     asserting policyLabel !== 'default'. If someone adds a new
     category name to the prompt in the future, this test fires
     until they wire up a regex — prevents soft-guard drift.
  4. Removes 'Aviation Incident' from the default-fall-through test
     list (it now correctly matches aviation).

No cache bump needed — v6 was published to the feature branch only a
few minutes ago, no production entries have been written yet.
2026-04-22 08:21:01 +04:00
..
2026-03-20 12:37:24 +04:00