Files
worldmonitor/scripts/lib
Elie Habib 8ea4c8f163 feat(digest-dedup): replayable per-story input log (opt-in, no behaviour change) (#3330)
* feat(digest-dedup): replayable per-story input log (opt-in, no behaviour change)

Ship the measurement layer before picking any recall-lift strategy.

Why: the current dedup path embeds titles only, so brief-wire headlines
that share a real event but drop the geographic anchor (e.g. "Alleged
Coup: defendant arrives in court" vs "Trial opens after Nigeria charges
six over 2025 coup plot") can slip past the 0.60 cosine threshold. To
tune recall without regressing precision we need a replayable per-tick
dataset — one record per story with the exact fields any downstream
candidate (title+slug, LLM-canonicalise, text-embedding-3-large, cross-
encoder re-rank, etc.) would need to score.

This PR ships ONLY the log. Zero behaviour change:
  - Opt-in via DIGEST_DEDUP_REPLAY_LOG=1 (default OFF).
  - Writer is best-effort: all errors swallowed + warned, never affects
    digest delivery. No throw path.
  - Records include hash, originalIndex, isRep, clusterId, raw +
    normalised title, link, severity/score/mentions/phase/sources,
    embeddingCacheKey, hasEmbedding sidecar flag, and the tick's config
    snapshot (mode, clustering, cosineThreshold, topicThreshold, veto).
  - clusterId derives from rep.mergedHashes (already set by
    materializeCluster) so the orchestrator is untouched.
  - Storage: Upstash list keyed by {variant}:{lang}:{sensitivity}:{date}
    with 30-day EXPIRE. Date suffix caps per-key growth; retention
    covers the labelling cadence + cross-candidate comparison window.
  - Env flag is '1'-only (fail-closed on typos, same pattern as
    DIGEST_DEDUP_MODE).

Activation path (post-merge): flip DIGEST_DEDUP_REPLAY_LOG=1 on the
seed-digest-notifications Railway service. Watch one cron tick for the
RPUSH + EXPIRE pair (or a single warn line if creds/upstream flake),
then leave running for at least one week to accumulate calibration data.

Tests: 21 unit tests covering flag parsing, key shape + sanitisation,
record field correctness (isRep, clusterId, embeddingCacheKey,
hasEmbedding, tickConfig), pipeline null/throw handling, malformed
input. Existing 77 dedup tests unchanged and still green.

* fix(digest-dedup): capture topicGroupingEnabled in replay tickConfig

Review catch (PR #3330): the tickConfig snapshot omitted
topicGroupingEnabled even though readOrchestratorConfig returns it and
the digest's post-dedup topic ordering gates on it. A tick run with
DIGEST_DEDUP_TOPIC_GROUPING=0 serialised identically to a default
tick, making those runs non-replayable for the calibration work this
log is meant to enable.

Add topicGroupingEnabled to the recorded tickConfig. One-line schema
fix + regression test asserting topic-grouping-off ticks serialise
distinctly from default.

22/22 tests pass.

* fix(digest-dedup): await replay-log write to survive explicit process.exit

Review catch (PR #3330): the fire-and-forget `void writeReplayLog(...)`
call could be dropped on the explicit-exit paths — the brief-compose
failure gate at line 1539 and main().catch at line 1545 both call
process.exit(1). Unlike natural exit, process.exit does not drain
in-flight promises, so the last N ticks' replay records could be
silently lost on runs where measurement fidelity matters most.

Fix: await the writeReplayLog call. Safe because:
  - writeReplayLog returns synchronously when the flag is off
    (replayLogEnabled check is the first thing it does)
  - It has a top-level try/catch that always returns a result object
  - The Upstash pipeline call has a 10s timeout ceiling
  - buildDigest already awaits many Upstash calls (dedup, compose,
    render) so one more is not a hot-path concern

Comment block added above the call explains why the await is
deliberate — so a future refactor doesn't revert it to void thinking
it's a leftover.

No test change: existing writeReplayLog unit tests already cover the
disabled / empty / success / error paths. The fix is a single-keyword
change in a caller that was already guaranteed-safe by the callee's
contract.

* refactor(digest-dedup): address Greptile P2 review comments on replay log

Three non-blocking polish items from the automated review, bundled
because they all touch the same new module and none change behaviour.

1. tsMs captured BEFORE deduplicateStories (seed-digest-notifications.mjs).
   Previously sampled after dedup returned, so briefTickId reflected
   dedup-completion time rather than tick-start. For downstream readers
   the natural reading of "briefTickId" is when the tick began
   processing; moved the Date.now() call to match that expectation.
   Drift is maybe 100ms-2s on cold-cache embed calls — small, but
   moving it is free.

2. buildReplayLogKey emptiness check now strips ':' and '-' in addition
   to '_'. A pathological ruleId of ':::' previously passed through
   verbatim, producing keys like `digest:replay-log:v1::::2026-04-23`
   that confuse redis-cli's namespace tooling (SCAN / KEYS / tab
   completion). The new guard falls back to "unknown" on any input
   that's all separators. Added a regression test covering the
   ':::' / '---' / '___' / mixed cases.

3. tickConfig is now a per-record shallow copy instead of a shared
   reference. Storage is unaffected (writeReplayLog serialises each
   record via JSON.stringify independently) but an in-memory consumer
   that mutated one record's tickConfig for experimentation would have
   silently affected all other records in the same batch. Added a
   regression test asserting mutation doesn't leak across records.

Tests: 24/24 pass (22 prior + 2 new regression). Typecheck + lint clean.
2026-04-23 11:50:19 +04:00
..