Elie Habib 621ac8d300 feat(brief): topic-threshold sweep + quality dashboard + labeled pairs (#3390)
* feat(brief): topic-threshold sweep + daily quality dashboard + labeled pairs

Adds the "are we getting better" measurement infrastructure for the
brief topic-grouping pipeline. Three artifacts:

1. scripts/data/brief-adjacency-pairs.json — labeled "should-cluster"
   and "should-separate" pairs from real production briefs (12 pairs,
   7 cluster + 5 separate). Append-only labeled corpus.

2. scripts/sweep-topic-thresholds.mjs — pulls the per-tick replay log
   captured by writeReplayLog, reconstructs each tick's reps + cached
   embeddings, re-runs single-link clustering at multiple cosine
   thresholds, and outputs a markdown table with pair_recall,
   false_adjacency, topic_count, multi-member share, and a composite
   quality_score per threshold. Picks the highest-scoring as the
   recommendation.

3. scripts/brief-quality-report.mjs — daily quality dashboard. Pulls
   the latest tick, computes metrics at the active threshold, prints
   which labeled pairs were violated. Run before each config change;
   compare deltas; revert if quality_score drops.

Both scripts mirror the production slice (score floor + top-N) before
clustering so metrics reflect what users actually receive.
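The clustering step the sweep re-runs can be sketched roughly as below. This is an illustrative union-find single-link implementation, assuming unit-normalized embedding vectors; the production singleLinkCluster may differ in detail.

```javascript
// Dot product equals cosine similarity when both vectors are L2-normalized.
function cosine(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

// Single-link via union-find: any pair at or above `threshold`
// merges their clusters (transitively).
function singleLinkClusterSketch(vectors, threshold) {
  const parent = vectors.map((_, i) => i);
  const find = (x) => (parent[x] === x ? x : (parent[x] = find(parent[x])));
  const union = (a, b) => { parent[find(a)] = find(b); };
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      if (cosine(vectors[i], vectors[j]) >= threshold) union(i, j);
    }
  }
  const clusters = new Map();
  vectors.forEach((_, i) => {
    const root = find(i);
    if (!clusters.has(root)) clusters.set(root, []);
    clusters.get(root).push(i);
  });
  return [...clusters.values()];
}
```

Because merging is transitive, lowering the threshold can only keep or grow clusters, which is why topic_count falls monotonically across the sweep.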

First sweep result against 2026-04-24 production replay records:

  threshold | quality | recall | false-adj
     0.30   |  0.649  | 100.0% | 100.0%
     0.32   |  0.705  | 100.0% |  75.0%
     0.35   |  0.825  | 100.0% |  33.3%
     0.38   |  0.815  | 100.0% |  33.3%
     0.40   |  0.815  | 100.0% |  33.3%
     0.42   |  0.895  | 100.0% |   8.3%  
     0.45   |  0.535  |  36.4% |   0.0%  ← current production

Recommended env flip: DIGEST_DEDUP_TOPIC_THRESHOLD=0.42 — lifts
pair_recall from 36% to 100% while introducing only one false-adjacency
case (1 of 12 separate pairs).
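The pair_recall / false_adjacency numbers above can be computed along these lines (a sketch; function and field names are assumed, not lifted from the scripts). A pair is only scorable when both members made it into the sliced tick.

```javascript
// `clusterOf` maps item id -> cluster id for one tick at one threshold.
// `shouldCluster` / `shouldSeparate` are [idA, idB] pairs from the
// labeled corpus.
function pairMetrics(clusterOf, shouldCluster, shouldSeparate) {
  const together = ([a, b]) => clusterOf.get(a) === clusterOf.get(b);
  const scorable = ([a, b]) => clusterOf.has(a) && clusterOf.has(b);
  const sc = shouldCluster.filter(scorable);
  const ss = shouldSeparate.filter(scorable);
  return {
    // share of should-cluster pairs that landed in the same topic
    pair_recall: sc.length ? sc.filter(together).length / sc.length : null,
    // share of should-separate pairs wrongly merged
    false_adjacency: ss.length ? ss.filter(together).length / ss.length : null,
  };
}
```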

* fix(brief-quality): reviewer feedback — cap-aware metrics + env-readable + missing-embed survival

Addresses 6 of 8 review comments on PR #3390:

B. Drop the redundant groupTopicsPostDedup call. singleLinkCluster IS
   the partition algorithm production uses internally; the second pass
   was paying full cosine work per threshold per tick only to read
   .error.

C. Score floor + topN + cap now read from production env
   (DIGEST_SCORE_MIN, DIGEST_MAX_ITEMS, DIGEST_MAX_STORIES_PER_USER)
   with documented defaults. CLI flags --score-floor / --top-n /
   --cap (--max-stories) override.
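The precedence described in item C (CLI flag beats env var beats documented default) can be sketched as below; the helper name and the sample default are illustrative, not taken from the scripts.

```javascript
// Resolve a numeric setting: CLI flag > env var > fallback default.
function resolveNumber(flagValue, envName, fallback) {
  if (flagValue !== undefined && flagValue !== null) return Number(flagValue);
  const fromEnv = process.env[envName];
  if (fromEnv !== undefined && fromEnv !== '') return Number(fromEnv);
  return fallback;
}

// Usage sketch (0.1 is a placeholder, not the documented default):
// const scoreFloor = resolveNumber(args['score-floor'], 'DIGEST_SCORE_MIN', 0.1);
```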

D. Filter out reps with missing embeddings instead of returning null
   for the whole tick. Skip the tick only if fewer than 5 reps
   survive. The drop count is reported in Coverage.
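Item D's survival logic amounts to something like the following sketch (shape of `reps` and the helper name are assumptions; the MIN_SURVIVING_REPS value of 5 is from the text above):

```javascript
const MIN_SURVIVING_REPS = 5;

// Keep only reps whose embedding is cached; bail on the tick only
// when too few survive, reporting how many were dropped either way.
function filterReps(reps, embeddingByHash) {
  const surviving = reps.filter((r) => embeddingByHash.has(r.hash));
  const dropped = reps.length - surviving.length;
  if (surviving.length < MIN_SURVIVING_REPS) {
    return { ok: false, dropped, reason: 'too few reps with embeddings' };
  }
  return { ok: true, dropped, reps: surviving };
}
```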

E. Removed dead local cosine() in both files.

F. JSON metadata moved from underscore-prefixed top-level keys into a
   nested `meta: {}` object.

H. Recommendation output now names the Railway service explicitly
   so copy-paste can't go to the wrong service.

Adds visible-window pair-recall: scores cluster correctness on what
the user actually sees post-MAX_STORIES_PER_USER truncation, in
addition to partition correctness on the full 30-rep sliced set.
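The visible-window restriction is essentially: a labeled pair only counts when both members survive the per-user cap. A minimal sketch (names assumed):

```javascript
// Restrict labeled pairs to the post-cap visible window: only pairs
// whose both members rank inside the top `cap` stories are scored.
function visiblePairs(rankedIds, cap, pairs) {
  const visible = new Set(rankedIds.slice(0, cap));
  return pairs.filter(([a, b]) => visible.has(a) && visible.has(b));
}
```

Pairs straddling the cut are excluded from visible-window metrics rather than counted as violations, which is what makes the cap itself a tunable that can hurt or help quality.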

Visible-window finding (against 2026-04-24 production replay):

  threshold=0.45 cap=12 → visible_quality 0.916
  threshold=0.45 cap=16 → visible_quality 0.716  ← cap bump HURTS
  threshold=0.42 cap=12 → visible_quality 0.845
  threshold=0.42 cap=16 → visible_quality 0.845

PR #3389's cap bump 12 → 16 is NOT evidence-justified at the current
0.45 threshold. Positions 13-16 dilute without helping adjacency.
PR #3389 will be revised separately to keep cap=12 default but add
env-tunability.

Skipping G (helper extraction) per reviewer guidance — defer until a
third tool justifies the abstraction.

* fix(brief-quality): reviewer round 2 — single-star, cap=12 default, error path surfaced

Three Greptile review comments on PR #3390:

P1 — the sweep's best-row marker tagged every running-best row instead
of only the global best. The global best is now computed in a first
pass and the table rendered in a second; only the single best row is
starred.
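The P1 fix reduces to a two-pass render along these lines (row shape and formatting are illustrative):

```javascript
// Pass 1: find the single global best by quality_score.
// Pass 2: render rows, starring only that one row — not every
// running best encountered during iteration.
function renderRows(rows) {
  let best = 0;
  rows.forEach((r, i) => { if (r.quality > rows[best].quality) best = i; });
  return rows.map((r, i) =>
    `${r.threshold.toFixed(2)} | ${r.quality.toFixed(3)}${i === best ? ' *' : ''}`);
}
```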

P2 — sweep MAX_STORIES_DEFAULT was 16 (assumed PR #3389 would land
the bump). PR #3389 was revised after evidence to keep cap at 12;
default reverted here too. Local runs without DIGEST_MAX_STORIES_PER_USER
now evaluate the correct production-equivalent visible window.

P2 — brief-quality-report's main() gated `scoreReplay` on
`embeddingByHash.size === reps.length`, defeating the missing-embed
survival logic inside scoreReplay (which already filters and falls
back to MIN_SURVIVING_REPS). Removed the outer gate; renderReport's
existing ⚠️ error path now surfaces the diagnostic when too few
embeddings survive instead of silently omitting the section.

Re-running the sweep with the corrected cap=12 default produces a
substantially different recommendation than the original commit
message claimed:

  threshold | visible_quality (cap=12)
     0.30   |   0.649
     0.35   |   0.625
     0.40   |   0.615
     0.42   |   0.845
     0.45   |   0.916    ← current production IS the local optimum

The original commit's "lower threshold to 0.42" recommendation was
an artifact of the cap=16 default. At the actual production cap (12),
the labeled corpus says the current 0.45 threshold is best. PR
description will be updated separately.

The 'shadowed `items`' Greptile mention refers to two `items`
declarations in DIFFERENT function scopes (`redisLrangeAll` and
`scoreOneTick`); not a real shadowing — skipped.
2026-04-25 12:08:15 +04:00