worldmonitor

eliott/worldmonitor


mirror of https://github.com/koala73/worldmonitor.git synced 2026-04-25 17:14:57 +02:00

Commit Graph

Author

SHA1

Message

Date

Elie Habib

7aa8dd1bf2

fix(scoring): relay recomputes importanceScore post-LLM + shadow-log v2 + parity test (#3069)

* fix(scoring): relay recomputes importanceScore post-LLM + shadow-log v2 + parity test

Before this change, classified rss_alert events published by ais-relay carried
a stale importanceScore: the digest computed it from the keyword-level threat
before the LLM upgrade, and the relay republished that value unchanged. Shadow
log (2,850 entries / 7 days) had Pearson 0.31 vs human rating with zero events
reaching the ≥85 critical gate — the score being measured was the keyword
fallback, not the AI classification.
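
For reference, the Pearson figure quoted above measures linear agreement between two equal-length series. The following is a minimal sketch, assuming the logged scores and the human ratings have already been paired up by event; the function name `pearson` is illustrative and is not taken from scripts/shadow-score-rank.mjs.

```javascript
// Pearson correlation between two equal-length numeric arrays.
// Hypothetical helper: the repo's own scripts may compute this differently.
function pearson(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error('pearson: need two arrays of equal length >= 2');
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;

  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }

  // A constant series has no variance, so the correlation is undefined.
  if (varX === 0 || varY === 0) return NaN;
  return cov / Math.sqrt(varX * varY);
}
```

A value near 1 would mean the score ranks events much as the human raters do; 0.31 indicates only weak agreement.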

Fixes:
- ais-relay.cjs: recompute importanceScore locally from the post-LLM level
  using an exact mirror of the digest formula (SEVERITY_SCORES, SCORE_WEIGHTS,
  SOURCE_TIERS, formula). Publish includes corroborationCount for downstream
  shadow-log enrichment.
- notification-relay.cjs: delete the duplicate shadowLogScore() call that
  produced ~50% near-duplicate pairs. Move shadow log to v2 key with
  JSON-encoded members carrying severity, source, corroborationCount,
  variant — future calibration cycles get cleaner data.
- shadow-score-{report,rank}.mjs: parse both v2 JSON and legacy v1 string
  members; default to v2, override via SHADOW_SCORE_KEY env.
- _classifier.ts: narrow keyword additions — blockade, siege, sanction
  (singular), escalation → HIGH; evacuation orders (plural) → CRITICAL.
- tests/importance-score-parity.test.mjs: extracts tier map and formula from
  both TS digest and CJS relay sources, asserts identical output across 12
  sample cases. Catches any future drift.
- tests/relay-importance-recompute.test.mjs +
  notification-relay-shadow-log.test.mjs: regression tests for the publish
  path and single-write discipline.
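
The dual-format parsing described for shadow-score-{report,rank}.mjs can be sketched roughly as follows. This is an illustration under assumptions: the layout of legacy v1 string members is not described in this message, so the sketch only passes them through as `raw`, and the function name `parseShadowMember` and the `format` tag are hypothetical.

```javascript
// Parse one sorted-set member from the shadow log.
// v2 members are JSON objects (severity, source, corroborationCount, variant);
// anything else is treated as a legacy v1 string and returned untouched.
function parseShadowMember(member) {
  const text = String(member);
  if (text.trimStart().startsWith('{')) {
    try {
      const parsed = JSON.parse(text);
      if (parsed && typeof parsed === 'object' && !Array.isArray(parsed)) {
        return { ...parsed, format: 'v2' };
      }
    } catch {
      // Not valid JSON after all: fall through to the legacy branch.
    }
  }
  return { format: 'v1', raw: text };
}
```

Tagging each record with its format lets a report count how many legacy entries are still present while the v1 key prunes itself.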

Gates remain OFF. After deploy, collect 48h of fresh shadow:score-log:v2
data, re-run scripts/shadow-score-rank.mjs for calibration, then set final
IMPORTANCE_SCORE_MIN / high / critical thresholds before enabling
IMPORTANCE_SCORE_LIVE=1.

See docs/internal/scoringDiagnostic.md (local) for full diagnosis.

🤖 Generated with Claude Sonnet 4.6 via Claude Code + Compound Engineering v2.49.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(scoring): PR #3069 review amendments — revert risky keywords + extract SOURCE_TIERS

Addresses review findings on PR #3069 (todos/193 through 204).

BLOCKING (P1):
- Revert 5 keyword additions in _classifier.ts. Review showed `escalation`,
  `sanction`, `siege`, `blockade`, `evacuation orders` fire HIGH on
  `de-escalation`, `sanctioned`, `besieged`, `blockaded` (substring matches),
  and the plural `evacuation orders` is already covered by the singular.
  Classifier work will land in a separate PR with phrase-based rules.
- Delete dead `digestImportanceScore` field from relay allTitles metadata
  (written in two places, read nowhere).
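
The substring problem behind the keyword revert is easy to reproduce. The sketch below is illustrative only: `substringHit` mimics naive matching, and `standaloneHit` is one hypothetical stricter alternative, not the phrase-based rules planned for the follow-up PR.

```javascript
// Naive substring matching, the failure mode the review describes.
function substringHit(title, keyword) {
  return title.toLowerCase().includes(keyword.toLowerCase());
}

// Hypothetical stricter matcher: the keyword must not be glued to a letter,
// digit, underscore, or hyphen on either side. A plain \b is not enough,
// because "-" is a non-word character, so /\bescalation\b/ still matches
// inside "de-escalation".
function standaloneHit(title, keyword) {
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const re = new RegExp(`(?<![\\w-])${escaped}(?![\\w-])`, 'i');
  return re.test(title);
}

const samples = [
  ['Talks aim at de-escalation', 'escalation'],
  ['Sanctioned tanker docks', 'sanction'],
  ['Besieged town gets aid', 'siege'],
  ['Port remains blockaded', 'blockade'],
];

for (const [title, keyword] of samples) {
  // Each line prints: substring=true standalone=false
  console.log(`${keyword}: substring=${substringHit(title, keyword)} standalone=${standaloneHit(title, keyword)}`);
}
```

The stricter matcher trades recall for precision: it would also miss a plural such as "sanctions" unless that form is listed separately.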

IMPORTANT (P2):
- Extract SOURCE_TIERS to shared/source-tiers.{json,cjs} using the
  existing shared/rss-allowed-domains.* precedent. Dockerfile.relay
  already `COPY shared/` (whole dir), so no infra change. Deletes
  255-line inline duplicate from ais-relay.cjs; TS digest imports the
  same JSON via resolveJsonModule. Tier-map parity is now structural.
- Simplify parity test — tier extraction no longer needed. SEVERITY_SCORES
  + SCORE_WEIGHTS + scoring function parity retained across 12 cases
  plus an unknown-level defensiveness case. Deleted no-op regex replace
  (`.replace(/X/g, 'X')`). Fixed misleading recency docstring.
- Pipeline shadow log: ZADD + ZREMRANGEBYSCORE + belt+suspenders EXPIRE
  now go in a single Upstash /pipeline POST (~50% RT reduction, no
  billing delta).
- Bounded ZRANGE in shadow-score-report.mjs (20k cap + warn if reached).
- Bump outbound webhook envelope v1→v2 to signal the new
  `corroborationCount` field on rss_alert payloads.
- Restore rss_alert eventType gate at shadowLogScore caller (skip
  promise cost for non-rss events).
- Align ais-relay scorer comment with reality: it has ONE intentional
  deviation from digest (`?? 0` on severity for defensiveness, returning
  0 vs NaN on unknown levels). Documented + tested.
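
The `?? 0` deviation can be illustrated with a toy scorer. Everything numeric here is a placeholder: the tables, the two-term formula, and the function names are assumptions made for the example and do not reproduce the repo's SEVERITY_SCORES, SCORE_WEIGHTS, or actual formula.

```javascript
// Placeholder tables, NOT the real values from the digest or the relay.
const SEVERITY_SCORES = { critical: 100, high: 75, medium: 50, low: 25 };
const SCORE_WEIGHTS = { severity: 0.6, corroboration: 0.4 };

// Map a corroboration count onto a 0..100 scale (illustrative only).
function corroborationPoints(count) {
  return Math.min(count, 5) * 20;
}

// Digest-style lookup: an unknown level yields undefined, and
// undefined * number is NaN, which then poisons the whole sum.
function scoreStrict(level, corroborationCount) {
  return Math.round(
    SEVERITY_SCORES[level] * SCORE_WEIGHTS.severity +
      corroborationPoints(corroborationCount) * SCORE_WEIGHTS.corroboration
  );
}

// Relay-style lookup: `?? 0` turns an unknown level into a zero severity
// term, so the result stays a finite number.
function scoreDefensive(level, corroborationCount) {
  const severity = SEVERITY_SCORES[level] ?? 0;
  return Math.round(
    severity * SCORE_WEIGHTS.severity +
      corroborationPoints(corroborationCount) * SCORE_WEIGHTS.corroboration
  );
}
```

For every known level the two functions agree, which is what a parity test can assert; they differ only on an unknown level, where the strict version yields NaN and the defensive one a finite score.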

P3:
- Narrow loadEnv in scripts to only UPSTASH_REDIS_REST_* (was setting
  any UPPER_UNDERSCORE env var from .env.local).
- Escape markdown specials in rating-sheet.md title embeds.
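
The narrowed loadEnv behaviour might look roughly like the sketch below. It is an assumption-laden illustration: the function name `pickUpstashEnv` and the quote handling are hypothetical, and the real loadEnv in the scripts may parse .env.local differently.

```javascript
// Parse .env-style text and return only the UPSTASH_REDIS_REST_* entries.
// Every other key in the file is ignored instead of being exported.
function pickUpstashEnv(envText) {
  const picked = {};
  for (const rawLine of envText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (!line || line.startsWith('#')) continue; // blank line or comment

    const eq = line.indexOf('=');
    if (eq === -1) continue; // not a KEY=value line

    const key = line.slice(0, eq).trim();
    if (!key.startsWith('UPSTASH_REDIS_REST_')) continue;

    // Strip one pair of matching surrounding quotes, if present.
    const value = line.slice(eq + 1).trim().replace(/^(['"])(.*)\1$/, '$2');
    picked[key] = value;
  }
  return picked;
}
```

Filtering by prefix means an unrelated secret sitting in .env.local is never copied into the process environment of a reporting script.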

Pre-existing activation blockers NOT fixed here (tracked as todos 196,
203): `/api/notify` accepts arbitrary importanceScore from authenticated
Pro users, and notification channel bodies don't escape mrkdwn/Discord
markup. Both must close before `IMPORTANCE_SCORE_LIVE=1`.

Net: -614 lines (more deleted than added). 26 regression assertions pass.
npm run typecheck, typecheck:api, test:data all pass.

🤖 Generated with Claude Sonnet 4.6 via Claude Code + Compound Engineering v2.49.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(scoring): mirror source-tiers.{json,cjs} into scripts/shared/

The `scripts-shared-mirror` enforcement in tests/edge-functions.test.mjs
requires every *.json and *.cjs in shared/ to have a byte-identical copy
in scripts/shared/ (Railway rootDirectory=scripts deploy bundle cannot
reach repo-root shared/). Last commit added shared/source-tiers.{json,cjs}
without mirroring them. CI caught it.
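
The mirror rule amounts to a byte-for-byte comparison per file. The sketch below is not the test in tests/edge-functions.test.mjs; it is a simplified stand-in that assumes the two directories have already been read into plain objects mapping file name to contents, and `findMirrorDrift` is a hypothetical name.

```javascript
// sourceFiles / mirrorFiles: plain objects mapping file name -> Buffer or string.
// Returns a list of human-readable problems; an empty list means no drift.
function findMirrorDrift(sourceFiles, mirrorFiles) {
  const problems = [];
  for (const [name, contents] of Object.entries(sourceFiles)) {
    // Only .json and .cjs files are subject to the mirror rule.
    if (!/\.(json|cjs)$/.test(name)) continue;

    if (!Object.prototype.hasOwnProperty.call(mirrorFiles, name)) {
      problems.push(`${name}: missing from mirror`);
      continue;
    }
    const a = Buffer.from(contents);
    const b = Buffer.from(mirrorFiles[name]);
    if (!a.equals(b)) problems.push(`${name}: contents differ`);
  }
  return problems;
}
```

Comparing raw bytes rather than parsed JSON is deliberately strict: even a trailing-newline difference counts as drift.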

* fix(scoring): revert webhook envelope to v1 + log shadow-log pipeline failures

Two P1/P2 review findings on PR #3069:

P1: Bumping webhook envelope to version: '2' was a unilateral breaking
change — the other webhook producers (proactive-intelligence.mjs:407,
seed-digest-notifications.mjs:736) still emit v1, so the same endpoint
would receive mixed envelope versions per event type. A consumer
validating `version === '1'` would break specifically on realtime
rss_alert deliveries while proactive_brief and digest events kept
working. Revert to '1' and document why — `corroborationCount` is an
additive payload field, backwards-compatible for typical consumers;
strict consumers using `additionalProperties: false` should be handled
via a coordinated version bump across all producers in a separate PR.
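
The two consumer styles discussed above can be contrasted in a few lines. The envelope shape here is hypothetical: only `version`, the `rss_alert` event type, and `corroborationCount` come from this message, while the other field names and both validator functions are invented for illustration.

```javascript
// Payload keys a hypothetical strict consumer knew about before this PR.
const KNOWN_PAYLOAD_KEYS = new Set(['title', 'importanceScore']);

// Typical consumer: checks the version and ignores fields it does not know.
function tolerantAccepts(envelope) {
  return envelope.version === '1';
}

// Strict consumer: behaves like a schema with additionalProperties: false.
function strictAccepts(envelope) {
  if (envelope.version !== '1') return false;
  return Object.keys(envelope.payload).every((key) => KNOWN_PAYLOAD_KEYS.has(key));
}

const v1WithNewField = {
  version: '1',
  eventType: 'rss_alert',
  payload: { title: 'Example headline', importanceScore: 72, corroborationCount: 3 },
};
const v2Envelope = { ...v1WithNewField, version: '2' };

console.log(tolerantAccepts(v1WithNewField)); // true: the additive field is harmless
console.log(tolerantAccepts(v2Envelope));     // false: the unilateral bump breaks it
console.log(strictAccepts(v1WithNewField));   // false: the unknown key is rejected
```

Keeping version '1' protects the tolerant consumer; only the strict one is affected by the new field, which is the case deferred to a coordinated bump.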

P2: The new shadow-log /pipeline write swallowed all errors silently
(no resp.ok check, no per-command error inspection), a regression from
the old upstashRest() path which at least logged non-2xx. Since the
48h recalibration cycle depends on shadow:score-log:v2 filling with
clean data, a bad auth token or malformed pipeline body would leave
operators staring at an empty ZSET with no signal. Now logs HTTP
failures and per-command pipeline errors.
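
A rough sketch of the batched write and its error inspection follows. It assumes the sorted-set score is the event timestamp in milliseconds (which time-based pruning implies but this message does not state), and that the response follows the documented Upstash pipeline shape: an array with one `{ result }` or `{ error }` entry per command. The function names, retention window, and TTL parameters are hypothetical.

```javascript
// Build the three commands that go into one /pipeline POST body.
function buildShadowLogPipeline(key, member, nowMs, retentionMs, ttlSeconds) {
  return [
    ['ZADD', key, String(nowMs), member],                           // append the new entry
    ['ZREMRANGEBYSCORE', key, '-inf', String(nowMs - retentionMs)], // prune old entries
    ['EXPIRE', key, String(ttlSeconds)],                            // belt-and-suspenders TTL
  ];
}

// Turn an HTTP outcome plus parsed response body into a list of loggable errors.
// An empty list means every command succeeded.
function pipelineErrors(httpOk, httpStatus, body) {
  if (!httpOk) return [`HTTP ${httpStatus}`];
  if (!Array.isArray(body)) return ['unexpected response shape'];
  return body
    .map((entry, i) => (entry && entry.error ? `command ${i}: ${entry.error}` : null))
    .filter(Boolean);
}
```

In use, the command array would be sent as the JSON body of a POST to the REST URL's /pipeline path with a Bearer token, and anything returned by `pipelineErrors` would be logged rather than swallowed.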

* docs(scoring): fix stale v1 references + clarify two-copy source-tiers mirror

Two follow-up review findings on PR #3069:

P2 — Source-tier "single source of truth" comments were outdated.
PR #3069 ships TWO JSON copies (shared/source-tiers.json for Vercel edge +
main relay container, scripts/shared/source-tiers.json for Railway services
using rootDirectory=scripts). Comments in server/_shared/source-tiers.ts
and scripts/ais-relay.cjs now explicitly document the mirror setup and
point at the two tests that enforce byte-identity: the existing
scripts-shared-mirror test (tests/edge-functions.test.mjs:37-48) and a
new explicit cross-check in tests/importance-score-parity.test.mjs.
Adding the assertion here is belt-and-suspenders: if edge-functions.test.mjs
is ever narrowed, the parity test still catches drift.

P3 — Stale v1 references in shared metadata/docs. The actual writer
moved to shadow:score-log:v2 in notification-relay.cjs, but
server/_shared/cache-keys.ts:23-31 still documented v1 and exported the
v1 string. No runtime impact (the export has zero importers — relay
uses its own local const) but misleading. Updated the doc block to
explain both v1 (legacy, self-pruning) and v2 (current), bumped the
constant to v2, and added a comment that notification-relay.cjs is the
owner. Header comment in scripts/shadow-score-report.mjs now documents
the SHADOW_SCORE_KEY override path too.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-13 21:53:21 +04:00

1 commit