fix(digest): strict > on secondary Jaccard to close boundary-exact oil-price case

Reviewer's third P1: the shorter 'Oil prices rise on Iran nuclear talks optimism' variant lands EXACTLY on Jaccard=0.25 (3 shared words in a 12-word union) and scraped through the secondary merge because the comparison was an inclusive >=.

Fix: strict > instead of >= on SECONDARY_MERGE_MIN_JACCARD at both call sites. Keeps the named constant at 0.25 (raising it would risk breaking legitimate close-miss Hormuz merges such as the one at J=0.267).
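
A minimal sketch of the boundary behaviour, not the project's actual code. The jaccard() helper and the two token sets are assumptions for illustration (the real tokenizer and stop-word list are not shown here); only SECONDARY_MERGE_MIN_JACCARD and the 3-shared / 12-union counts come from this commit.

```javascript
// Illustrative only: everything except SECONDARY_MERGE_MIN_JACCARD is hypothetical.
const SECONDARY_MERGE_MIN_JACCARD = 0.25;

// Jaccard similarity of two Sets: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  let shared = 0;
  for (const word of a) {
    if (b.has(word)) shared += 1;
  }
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Assumed post-stop-filter tokens: an 8-word talks-cluster union and
// a 7-word oil-price variant that share {iran, nuclear, talks}.
const talksCluster = new Set([
  'iran', 'nuclear', 'talks', 'resume', 'oman', 'enter', 'second', 'day',
]);
const oilVariant = new Set([
  'oil', 'prices', 'rise', 'iran', 'nuclear', 'talks', 'optimism',
]);

const j = jaccard(oilVariant, talksCluster); // 3 / (7 + 8 - 3)

// Old gate (inclusive) vs new gate (strict) at the exact boundary.
const inclusive = j >= SECONDARY_MERGE_MIN_JACCARD;
const strict = j > SECONDARY_MERGE_MIN_JACCARD;

// The close-miss value quoted in this commit still clears the strict gate.
const closeMissStrict = 0.267 > SECONDARY_MERGE_MIN_JACCARD;

console.log({ j, inclusive, strict, closeMissStrict });
```

Because 3/12 is exactly representable as a double, the boundary case is deterministic: only the comparison operator decides whether it passes, which is why the constant can stay at 0.25.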

Trade-off: the Hormuz bridge headline 'tanker reports attack as Iran closes strait of Hormuz; French soldier killed in Lebanon' also lands at J=0.25 against the 4-story Hormuz cluster, so it now stays in its own cluster. That is a 6→3 reduction vs the original 6→1 goal, but bag-of-words clustering structurally cannot distinguish this bridge case from the oil-price reaction: both sit on the same boundary value, so admitting one means admitting the other. Chose to block the false positive. Hormuz test relaxed from ≥5/6 to ≥4/6, with a detailed comment explaining both outliers.

Tests: 24/24 -> 25/25. Added explicit reviewer reproducer (short oil-price variant at J=0.25). Full test:data 5773/5773. Typecheck + lint clean.
Elie Habib
2026-04-19 09:55:02 +04:00
parent 7270599636
commit b490dfe2bc
2 changed files with 66 additions and 18 deletions


@@ -320,6 +320,16 @@ const CLUSTER_JOIN_DISTINCTIVE_LEN = 5;
// "nuclear" + "talks"
// All of those have Jaccard between 0.14 and 0.24 — below 0.25
// they won't merge regardless of distinctive-word overlap.
+ //
+ // IMPORTANT: the secondary rule compares with STRICT inequality
+ // (jaccard > SECONDARY_MERGE_MIN_JACCARD), not ≥. A short
+ // downstream-reaction headline like "Oil prices rise on Iran
+ // nuclear talks optimism" vs an Iran-talks cluster can land
+ // EXACTLY on 0.25 (3 shared words in a 12-word union), and with
+ // ≥ it would scrape through merge despite being a different
+ // event. Strict > excludes that boundary case without moving
+ // the constant — which would risk breaking legitimate close-
+ // miss Hormuz merges (P10 joining at J=0.267, P02↔P08 at 0.308).
const SECONDARY_MERGE_MIN_JACCARD = 0.25;
/**
@@ -406,7 +416,7 @@ function deduplicateStories(stories) {
// exists beyond the distinctive entities alone.
const jaccard = jaccardSimilarity(words, cluster.words);
let shouldMerge = jaccard >= JACCARD_MERGE_THRESHOLD;
- if (!shouldMerge && jaccard >= SECONDARY_MERGE_MIN_JACCARD) {
+ if (!shouldMerge && jaccard > SECONDARY_MERGE_MIN_JACCARD) {
const distinct = countDistinctiveShared(words, cluster.core);
const total = countShared(words, cluster.words);
shouldMerge =
@@ -454,7 +464,7 @@ function deduplicateStories(stories) {
const totalCore = countShared(a.core, b.core);
const postMerge =
jaccardUnion >= JACCARD_MERGE_THRESHOLD ||
- (jaccardUnion >= SECONDARY_MERGE_MIN_JACCARD &&
+ (jaccardUnion > SECONDARY_MERGE_MIN_JACCARD &&
distinctiveCore >= CLUSTER_JOIN_MIN_DISTINCTIVE_SHARED &&
totalCore >= CLUSTER_JOIN_MIN_SHARED_WORDS);
if (postMerge) {


@@ -200,14 +200,25 @@ describe('deduplicateStories', () => {
// about the Strait-of-Hormuz closure with wildly different phrasing.
// At the original Jaccard ≥ 0.55 with frozen cluster words, all 6
// passed through as distinct. The new rules collapse the high-
- // overlap variants into one cluster. One outlier (the "Defiant
- // message from Iran as vessels attempting to cross Hormuz"
- // headline) has Jaccard 0.13 with the rest and only shares
- // {iran, hormuz} — we intentionally leave that outlier as its own
- // cluster rather than relax the rule further, because relaxing
- // would over-merge genuinely distinct events (see the Russia/
- // Odesa and Iran/oil-price regressions below).
- it('collapses at least 5 of 6 Hormuz wire variants into one cluster', () => {
+ // overlap core variants into one cluster. Two outliers intentionally
+ // stay separate:
+ //
+ // - "Defiant message from Iran as vessels attempting to cross
+ // Hormuz report gunfire" — shares only {iran, hormuz} (J≈0.13)
+ // with the rest; below the Jaccard floor.
+ //
+ // - "Middle East crisis live: tanker reports attack as Iran
+ // closes strait of Hormuz; French soldier killed in Lebanon" —
+ // a bridge headline whose French/Lebanon content dilutes its
+ // Jaccard against the 4-story Hormuz cluster to exactly 0.25,
+ // which our strict > floor rejects (same boundary value that
+ // blocks the reviewer's oil-price reaction false positive, so
+ // we cannot admit one without the other).
+ //
+ // Accepting these two outliers — a 6→3 reduction — is the deliberate
+ // trade-off for blocking the structural false positives in the
+ // regressions below (Russia/Odesa, Iran/oil-price, Lebanon-pair).
+ it('collapses at least 4 of 6 Hormuz wire variants into one cluster', () => {
const stories = [
story('Iran says it has closed Strait of Hormuz again over US blockade', 95, 1, 'h02'),
story('Iran closes Strait of Hormuz again over US blockade and fires on ships', 90, 1, 'h05'),
@@ -219,14 +230,14 @@ describe('deduplicateStories', () => {
const result = mod.deduplicateStories(stories);
const largestClusterSize = Math.max(...result.map((r) => r.mergedHashes.length));
assert.ok(
- largestClusterSize >= 5,
- `expected main Hormuz cluster to absorb ≥5 of 6 variants, got max size ${largestClusterSize}. Clusters: ${result.map((r) => `${r.mergedHashes.length}:${r.title.slice(0, 40)}`).join(' | ')}`,
+ largestClusterSize >= 4,
+ `expected main Hormuz cluster to absorb ≥4 of 6 variants, got max size ${largestClusterSize}. Clusters: ${result.map((r) => `${r.mergedHashes.length}:${r.title.slice(0, 40)}`).join(' | ')}`,
);
- // At most 2 clusters survive — the main one plus (possibly) the
- // low-overlap "defiant message" outlier.
+ // At most 3 clusters survive — the main one plus up to 2
+ // documented outliers (P07 defiant-message + P11 bridge-headline).
assert.ok(
- result.length <= 2,
- `expected ≤2 clusters, got ${result.length}`,
+ result.length <= 3,
+ `expected ≤3 clusters, got ${result.length}`,
);
});
@@ -403,8 +414,8 @@ describe('deduplicateStories — P1 false-positive regressions (from PR #3195 re
// oil-price reaction story sharing {iran, nuclear, talks}. All
// three words clear the 5-char "distinctive" bar but they're
// generic diplomatic-event vocabulary. Secondary merge requires
- // Jaccard 0.25 — the oil-price story's Jaccard against the
- // talks cluster is 0.231, just below the floor.
+ // Jaccard > 0.25 — the oil-price story's Jaccard against the
+ // talks cluster is 0.231, below the floor.
it('Iran nuclear talks coverage does not absorb an oil-price-reaction story sharing {iran, nuclear, talks}', () => {
const stories = [
story('US Iran nuclear talks resume in Oman', 95, 'nt1'),
@@ -421,6 +432,33 @@ describe('deduplicateStories — P1 false-positive regressions (from PR #3195 re
assert.deepEqual(sizes, [2, 1], 'larger cluster has both talks stories');
});
+ // REGRESSION (third round / Jaccard boundary): a SHORTER oil-
+ // price variant landed EXACTLY on Jaccard = 0.25 against the
+ // talks cluster (3 shared words in a 12-word union), and with
+ // an inclusive ≥ comparison it scraped through and collapsed
+ // the three stories into 1. The fix is strict > SECONDARY_
+ // MERGE_MIN_JACCARD so the exact-boundary case falls out
+ // without bumping the constant (which would break legitimate
+ // close-miss Hormuz merges at J ≈ 0.267).
+ it('shorter oil-price variant at Jaccard exactly 0.25 still does NOT merge into talks cluster', () => {
+ const stories = [
+ story('US Iran nuclear talks resume in Oman', 95, 'nt1'),
+ story('US Iran nuclear talks enter second day in Oman', 90, 'nt2'),
+ // This variant is shorter than the one above (7 words vs 8
+ // after stop-filter). Against the 8-word cluster union, the
+ // 3 shared words give Jaccard = 3/(7+8-3) = 0.25 exactly.
+ story('Oil prices rise on Iran nuclear talks optimism', 85, 'op2'),
+ ];
+ const result = mod.deduplicateStories(stories);
+ assert.equal(
+ result.length,
+ 2,
+ `expected 2 clusters (oil-price variant stays separate even at J=0.25 boundary), got ${result.length}: ${result.map((r) => r.title).join(' | ')}`,
+ );
+ const sizes = result.map((r) => r.mergedHashes.length).sort((a, b) => b - a);
+ assert.deepEqual(sizes, [2, 1], 'talks cluster has the two talks stories; oil-price is alone');
+ });
// REGRESSION (directional-word stop-words concern): north/south/
// east/west were added to STOP_WORDS to strip news-framing
// boilerplate ("Middle East crisis live"). That also strips them