mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
* feat(brief): topic-threshold sweep + daily quality dashboard + labeled pairs
Adds the "are we getting better" measurement infrastructure for the
brief topic-grouping pipeline. Three artifacts:
1. scripts/data/brief-adjacency-pairs.json — labeled "should-cluster"
and "should-separate" pairs from real production briefs (12 pairs,
7 cluster + 5 separate). Append-only labeled corpus.
2. scripts/sweep-topic-thresholds.mjs — pulls the per-tick replay log
captured by writeReplayLog, reconstructs each tick's reps + cached
embeddings, re-runs single-link clustering at multiple cosine
thresholds, and outputs a markdown table with pair_recall,
false_adjacency, topic_count, multi-member share, and a composite
quality_score per threshold. Picks the highest-scoring as the
recommendation.
3. scripts/brief-quality-report.mjs — daily quality dashboard. Pulls
the latest tick, computes metrics at the active threshold, prints
which labeled pairs were violated. Run before each config change;
compare deltas; revert if quality_score drops.
Both scripts mirror the production slice (score floor + top-N) before
clustering so metrics reflect what users actually receive.
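The single-link clustering step the sweep re-runs per threshold can be sketched as below. This is an illustrative reimplementation (union-find over pairwise cosine similarity), not the production `singleLinkCluster`; only the threshold semantics come from the commit text.

```javascript
// Illustrative single-link clustering over cached embeddings.
// Two reps join the same topic if ANY chain of pairs meets the threshold.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function singleLinkCluster(embeddings, threshold) {
  // Union-find: start with every rep in its own cluster.
  const parent = embeddings.map((_, i) => i);
  const find = (i) => (parent[i] === i ? i : (parent[i] = find(parent[i])));
  const union = (i, j) => { parent[find(i)] = find(j); };
  for (let i = 0; i < embeddings.length; i++) {
    for (let j = i + 1; j < embeddings.length; j++) {
      if (cosine(embeddings[i], embeddings[j]) >= threshold) union(i, j);
    }
  }
  // Group indices by their root.
  const groups = new Map();
  embeddings.forEach((_, i) => {
    const root = find(i);
    if (!groups.has(root)) groups.set(root, []);
    groups.get(root).push(i);
  });
  return [...groups.values()];
}
```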
First sweep result against 2026-04-24 production replay records:
threshold | quality | recall | false-adj
0.30 | 0.649 | 100.0% | 100.0%
0.32 | 0.705 | 100.0% | 75.0%
0.35 | 0.825 | 100.0% | 33.3%
0.38 | 0.815 | 100.0% | 33.3%
0.40 | 0.815 | 100.0% | 33.3%
0.42 | 0.895 | 100.0% | 8.3% ⭐
0.45 | 0.535 | 36.4% | 0.0% ← current production
Recommended env flip: DIGEST_DEDUP_TOPIC_THRESHOLD=0.42 — lifts
pair_recall from 36% to 100% while introducing only one false-adjacency
case (1 of 12 separate pairs).
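The pair_recall and false_adjacency metrics follow directly from the labeled-pair definitions: recall is the share of should-cluster pairs landing in one topic, false adjacency the share of should-separate pairs that do. A minimal sketch, assuming pairs are id tuples and clusters are id arrays (field shapes are assumptions, not the production schema):

```javascript
// Score a clustering against the labeled corpus.
function scorePairs(clusters, clusterPairs, separatePairs) {
  const clusterOf = new Map();
  clusters.forEach((members, idx) =>
    members.forEach((id) => clusterOf.set(id, idx)));
  // A pair "lands together" only when both members exist and share a cluster.
  const together = ([a, b]) =>
    clusterOf.has(a) && clusterOf.has(b) && clusterOf.get(a) === clusterOf.get(b);
  const recalled = clusterPairs.filter(together).length;
  const falseAdj = separatePairs.filter(together).length;
  return {
    pair_recall: clusterPairs.length ? recalled / clusterPairs.length : 1,
    false_adjacency: separatePairs.length ? falseAdj / separatePairs.length : 0,
  };
}
```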
* fix(brief-quality): reviewer feedback — cap-aware metrics + env-readable + missing-embed survival
Addresses 6 of 8 review comments on PR #3390:
B. Drop redundant groupTopicsPostDedup call. singleLinkCluster IS the
   partition algorithm production uses internally; the second pass was
   paying cosine work per threshold per tick only to read `.error`.
C. Score floor + topN + cap now read from production env
(DIGEST_SCORE_MIN, DIGEST_MAX_ITEMS, DIGEST_MAX_STORIES_PER_USER)
with documented defaults. CLI flags --score-floor / --top-n /
--cap (--max-stories) override.
D. Filter reps with missing embeddings instead of returning null on
the whole tick. Skip only if fewer than 5 survive. Drop count
reported in Coverage.
E. Removed dead local cosine() in both files.
F. JSON metadata moved from underscore-prefixed top-level keys into a
nested `meta: {}` object.
H. Recommendation output now names the Railway service explicitly
so copy-paste can't go to the wrong service.
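The env-with-CLI-override resolution from item C can be sketched as follows. The env var and flag names come from the commit; the helper, defaults (top-N 30 per the "full 30-rep sliced set" note, cap 12 per the later round-2 fix, floor 0), and the omitted `--max-stories` alias handling are assumptions.

```javascript
// Resolve one numeric knob: CLI flag wins, then env var, then default.
function readKnob(argv, env, flag, envName, fallback) {
  const i = argv.indexOf(flag);
  if (i !== -1 && argv[i + 1] !== undefined) return Number(argv[i + 1]);
  if (env[envName] !== undefined) return Number(env[envName]);
  return fallback;
}

function resolveConfig(argv = process.argv.slice(2), env = process.env) {
  return {
    scoreFloor: readKnob(argv, env, "--score-floor", "DIGEST_SCORE_MIN", 0),
    topN: readKnob(argv, env, "--top-n", "DIGEST_MAX_ITEMS", 30),
    cap: readKnob(argv, env, "--cap", "DIGEST_MAX_STORIES_PER_USER", 12),
  };
}
```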
Adds visible-window pair-recall: scores cluster correctness on what
the user actually sees post-MAX_STORIES_PER_USER truncation, in
addition to partition correctness on the full 30-rep sliced set.
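The visible-window restriction described above can be sketched as: truncate the ranked list to the cap, then keep only labeled pairs whose both members survive. Helper names are illustrative, not production code.

```javascript
// The set of item ids the user actually sees after cap truncation.
function visibleWindow(rankedIds, cap) {
  return new Set(rankedIds.slice(0, cap));
}

// A labeled pair only counts toward visible metrics when both members survive.
function visiblePairs(pairs, visible) {
  return pairs.filter(([a, b]) => visible.has(a) && visible.has(b));
}
```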
Visible-window finding (against 2026-04-24 production replay):
threshold=0.45 cap=12 → visible_quality 0.916
threshold=0.45 cap=16 → visible_quality 0.716 ← cap bump HURTS
threshold=0.42 cap=12 → visible_quality 0.845
threshold=0.42 cap=16 → visible_quality 0.845
PR #3389's cap bump 12 → 16 is NOT evidence-justified at the current
0.45 threshold. Positions 13-16 dilute without helping adjacency.
PR #3389 will be revised separately to keep cap=12 default but add
env-tunability.
Skipping G (helper extraction) per reviewer guidance — defer until a
third tool justifies the abstraction.
* fix(brief-quality): reviewer round 2 — single-star, cap=12 default, error path surfaced
Three Greptile review comments on PR #3390:
P1 — sweep ⭐ marker tagged every running-best row instead of only
the global best. Compute the global best in a first pass, render
in a second; only the single best row is starred.
P2 — sweep MAX_STORIES_DEFAULT was 16 (assumed PR #3389 would land
the bump). PR #3389 was revised after evidence to keep cap at 12;
default reverted here too. Local runs without DIGEST_MAX_STORIES_PER_USER
now evaluate the correct production-equivalent visible window.
P3 — brief-quality-report's main() gated `scoreReplay` on
`embeddingByHash.size === reps.length`, defeating the missing-embed
survival logic inside scoreReplay (which already filters and falls
back to MIN_SURVIVING_REPS). Removed the outer gate; renderReport's
existing ⚠️ error path now surfaces the diagnostic when too few
embeddings survive instead of silently omitting the section.
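The missing-embed survival path referenced above can be sketched as below. The 5-rep floor comes from the earlier commit ("skip only if fewer than 5 survive"); the function name, rep shape, and return shape are assumptions.

```javascript
const MIN_SURVIVING_REPS = 5;

// Keep reps that have a cached embedding; fail the tick with a visible
// diagnostic (not a silent omission) only when too few survive.
function filterReps(reps, embeddingByHash) {
  const surviving = reps.filter((r) => embeddingByHash.has(r.hash));
  const dropped = reps.length - surviving.length;
  if (surviving.length < MIN_SURVIVING_REPS) {
    return { error: `only ${surviving.length} reps have embeddings`, dropped };
  }
  return { reps: surviving, dropped };
}
```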
Re-running the sweep with the corrected cap=12 default produces a
substantially different recommendation than the original commit
message claimed:
threshold | visible_quality (cap=12)
0.30 | 0.649
0.35 | 0.625
0.40 | 0.615
0.42 | 0.845
0.45 | 0.916 ⭐ ← current production IS the local optimum
The original commit's "lower threshold to 0.42" recommendation was
an artifact of the cap=16 default. At the actual production cap (12),
the labeled corpus says the current 0.45 threshold is best. PR
description will be updated separately.
The 'shadowed `items`' Greptile mention refers to two `items`
declarations in DIFFERENT function scopes (`redisLrangeAll` and
`scoreOneTick`); not a real shadowing — skipped.
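The P1 single-star fix can be sketched as a two-pass render: locate the global best first, then emit rows, starring exactly one. Row shape and formatting are illustrative.

```javascript
// Pass 1: find the global-best quality_score. Pass 2: render rows,
// starring only that single row (not every running best).
function renderSweepTable(rows) {
  let bestIdx = 0;
  rows.forEach((row, i) => {
    if (row.quality > rows[bestIdx].quality) bestIdx = i;
  });
  return rows.map((row, i) =>
    `${row.threshold.toFixed(2)} | ${row.quality.toFixed(3)}${i === bestIdx ? " ⭐" : ""}`);
}
```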