worldmonitor/tests/brief-dedup-embedding.test.mjs
Elie Habib 305dc5ef36 feat(digest-dedup): Phase A — embedding-based dedup scaffolding (no-op) (#3200)
* feat(digest-dedup): Phase A — embedding-based dedup scaffolding (no-op)

Replaces the inline Jaccard story-dedup in seed-digest-notifications
with an orchestrator that can run Jaccard, shadow, or full embedding
modes. Ships with DIGEST_DEDUP_MODE=jaccard as the default so
production behaviour is unchanged until Phase C shadow + Phase D flip.

New modules (scripts/lib/):
- brief-dedup-consts.mjs       tunables + cache prefix + __constants bag
- brief-dedup-jaccard.mjs      verbatim 0.55-threshold extract (fallback)
- entity-gazetteer.mjs         cities/regions gazetteer + common-caps
- brief-embedding.mjs          OpenRouter /embeddings client with Upstash
                               cache, all-or-nothing timeout, cosineSimilarity
- brief-dedup-embed.mjs        complete-link clustering + entity veto (pure)
- brief-dedup.mjs              orchestrator, env read at call entry,
                               shadow archive, structured log line
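For orientation, the cosine step the embedding client exposes can be sketched as follows. This is a minimal stand-in, not the module's actual implementation — the real `cosineSimilarity` in brief-embedding.mjs may differ in details such as input validation:

```javascript
// Minimal cosine similarity over two equal-length number arrays.
// Returns 0 for a zero-magnitude vector instead of NaN.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}
```

The tests below construct unit vectors so that pairwise cosines are exact — e.g. `[0.95, sqrt(1 - 0.95²), 0]` has cosine 0.95 to `[1, 0, 0]`.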

Operator tools (scripts/tools/):
- calibrate-dedup-threshold.mjs  offline calibration runner + histogram
- golden-pair-validator.mjs      live-embedder drift detector (nightly CI)
- shadow-sample.mjs              Sample A/B CSV emitter over SCAN archive

Tests:
- brief-dedup-jaccard.test.mjs    migrated from regex-harness to direct
                                   import plus orchestrator parity tests (22)
- brief-dedup-embedding.test.mjs  9 plan scenarios incl. 10-permutation
                                   property test, complete-link non-chain (21)
- brief-dedup-golden.test.mjs     20-pair mocked canary (21)

Workflows:
- .github/workflows/dedup-golden-pairs.yml  nightly live-embedder canary
                                             (07:17 UTC), opens issue on drift

Deviation from plan: the shouldVeto("Iran closes Hormuz", "Tehran
shuts Hormuz") case can't return true under a single coherent
classification (country-in-A vs capital-in-B sit on different sides
of the actor/location boundary). Gazetteer follows the plan's
"countries are actors" intent; the test is updated to assert false
with a comment pointing at the irreducible capital-country
coreference limitation.

Verification:
- npm run test:data          5825/5825 pass
- tests/edge-functions        171/171 pass
- typecheck + typecheck:api  clean
- biome check on new files    clean
- lint:md                     0 errors

Phase B (calibration), Phase C (shadow), and Phase D (flip) are
subsequent PRs.

* refactor(digest-dedup): address review findings 193-199

Fresh-eyes review found 3 P1s, 3 P2s, and a P3 bundle across
kieran-typescript, security-sentinel, performance-oracle, architecture-
strategist, and code-simplicity reviewers. Fixes below; all 64 dedup
tests + 5825 data tests + 171 edge-function tests still green.

P1 #193 - dedup regex + redis pipeline duplication
- Extract defaultRedisPipeline into scripts/lib/_upstash-pipeline.mjs;
  both orchestrator and embedding client import from there.
- normalizeForEmbedding now delegates to stripSourceSuffix from the
  Jaccard module so the outlet allow-list is single-sourced.

P1 #194 - embedding timeout floor + negative-budget path
- callEmbeddingsApi throws EmbeddingTimeoutError when timeoutMs<=0
  instead of opening a doomed 250ms fetch.
- Removed Math.max(250, ...) floor that let wall-clock cap overshoot.
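The fail-fast shape described here can be sketched as below. `callEmbeddingsApi` and `EmbeddingTimeoutError` are names from this PR, but the body is an assumed minimal reconstruction — the real client also builds the OpenRouter request and parses the response:

```javascript
// Assumed sketch: refuse to open a fetch whose wall-clock budget is
// already spent, instead of starting a request doomed to be aborted.
class EmbeddingTimeoutError extends Error {
  constructor(message = 'embedding call exceeded wall-clock budget') {
    super(message);
    this.name = 'EmbeddingTimeoutError';
  }
}

async function callEmbeddingsApi(titles, { timeoutMs, fetchImpl }) {
  // Negative or zero budget: fail fast with the same error type the
  // orchestrator already maps to a Jaccard fallback.
  if (timeoutMs <= 0) throw new EmbeddingTimeoutError();
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetchImpl('/embeddings', {
      method: 'POST',
      body: JSON.stringify({ input: titles }),
      signal: controller.signal,
    });
  } finally {
    clearTimeout(timer);
  }
}
```

With no floor like `Math.max(250, ...)`, the caller's remaining budget is the only clock, so the overall wall-clock cap cannot be overshot.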

P1 #195 - dead env getters
- Deleted getMode / isRemoteEmbedEnabled / isEntityVetoEnabled /
  getCosineThreshold / getWallClockMs from brief-dedup-consts.mjs
  (zero callers; orchestrator reimplements inline).

P2 #196 - orchestrator cleanup bundle
- Removed re-exports at bottom of brief-dedup.mjs.
- Extracted materializeCluster into brief-dedup-jaccard.mjs; both
  the fallback and orchestrator use the shared helper.
- Deleted clusterWithEntityVeto wrapper; orchestrator inlines the
  vetoFn wiring at the single call site.
- Shadow mode now runs Jaccard exactly once per tick (was twice).
- Fallback warn line carries reason=ErrorName so operators can
  filter timeout vs provider vs shape errors.
- Invalid DIGEST_DEDUP_MODE values emit a warn once per run (vs
  silently falling to jaccard).

P2 #197 - workflow + shadow-sample hardening
- dedup-golden-pairs.yml body composition no longer relies on a
  heredoc that would command-substitute validator stdout. Switched
  to printf with sanitised LOG_TAIL (printable ASCII only) and
  --body-file so crafted fixture text cannot escape into the runner.
- shadow-sample.mjs Upstash helper enforces a hardcoded command
  allowlist (SCAN | GET | EXISTS).
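A minimal sketch of such a hardcoded allowlist guard — the helper name `assertAllowedCommand` is hypothetical; the real shadow-sample.mjs helper may be shaped differently:

```javascript
// Hypothetical guard: only read-side Upstash commands may be issued
// by the sampling tool, checked before any request is built.
const ALLOWED_COMMANDS = new Set(['SCAN', 'GET', 'EXISTS']);

function assertAllowedCommand(command) {
  const verb = String(command).toUpperCase();
  if (!ALLOWED_COMMANDS.has(verb)) {
    throw new Error(`upstash command not allowlisted: ${verb}`);
  }
  return verb;
}
```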

P2 #198 - test + observability polish
- Scenarios 2 and 3 deep-equal returned clusters against the Jaccard
  expected shape, not just length. Also assert the reason= field.

P3 #199 - nits
- Removed __constants test-bag; jaccard tests use named imports.
- Renamed deps.apiKey to deps._apiKey in embedding client.
- Added @pre JSDoc on diffClustersByHash about unique-hash contract.
- Deferred: mocked golden-pair test removal, gazetteer JSON migration,
  scripts/tools AGENTS.md doc note.

Todos 193-199 moved from pending to complete.

Verification:
- npm run test:data            5825/5825 pass
- tests/edge-functions          171/171 pass
- typecheck + typecheck:api    clean
- biome check on changed files clean

* fix(digest-dedup): address Greptile P2 findings on PR #3200

1. brief-embedding.mjs: wrap fetch lookup as
   `(...args) => globalThis.fetch(...args)` instead of aliasing bare
   `fetch`. Aliasing captures the binding at module-load time, so
   later instrumentation / Edge-runtime shims don't see the wrapper —
   same class of bug as the banned `fetch.bind(globalThis)` pattern
   flagged in AGENTS.md.
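The binding difference can be demonstrated in isolation — this is a self-contained illustration, not code from brief-embedding.mjs:

```javascript
// An alias captures the `fetch` binding at module-load time; the
// arrow wrapper re-reads globalThis.fetch on every call, so a shim
// installed later is still seen.
const aliasedFetch = globalThis.fetch;                       // frozen at load time
const wrappedFetch = (...args) => globalThis.fetch(...args); // late-bound

// Simulate an instrumentation shim installed after module load.
const originalFetch = globalThis.fetch;
globalThis.fetch = async () => 'instrumented';
// wrappedFetch() now reaches the shim; aliasedFetch still holds the
// load-time binding.
```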

2. dedup-golden-pairs.yml: `gh issue create --label "..." || true`
   silently swallowed the failure when any of dedup/canary/p1 labels
   didn't pre-exist, breaking the drift alert channel while leaving
   the job red in the Actions UI. Switched to repeated `--label`
   flags + `--create-label` so any missing label is auto-created on
   first drift, and dropped the `|| true` so a legitimate failure
   (network / auth) surfaces instead of hiding.

Both fixes are P2-style per Greptile (confidence 5/5, no P0/P1);
applied pre-merge so the nightly canary is usable from day one.

* fix(digest-dedup): two P1s found on PR #3200

P1 — canary classifier must match production
Nightly golden-pair validator was checking a hardcoded threshold
(default 0.60) and always applied the entity veto, while the actual
dedup path at runtime reads DIGEST_DEDUP_COSINE_THRESHOLD and
DIGEST_DEDUP_ENTITY_VETO_ENABLED from env at every call. A Phase
C/D env flip could make the canary green while prod was wrong or
red while prod was healthy, defeating the whole point of a drift
detector.

Fix:
- golden-pair-validator.mjs now calls readOrchestratorConfig(process.env)
  — the same helper the orchestrator uses — so any classifier knob
  added later is picked up automatically. The threshold and veto-
  enabled flags are sourced from env by default; a --threshold CLI
  flag still overrides for manual calibration sweeps.
- dedup-golden-pairs.yml sources DIGEST_DEDUP_COSINE_THRESHOLD and
  DIGEST_DEDUP_ENTITY_VETO_ENABLED from GitHub repo variables (vars.*),
  which operators must keep in lockstep with Railway. The
  workflow_dispatch threshold input now defaults to empty; the
  scheduled canary always uses the production-parity config.
- Validator log line prints the effective config + source so nightly
  output makes the classifier visible.

P1 — shadow archive writes were fail-open
`defaultRedisPipeline()` returns null on timeout / auth / HTTP
failure. `writeShadowArchive()` only had a try/catch, so the null
result was silently treated as success. A Phase C rollout could
log clean "mode=shadow … disagreements=X" lines every tick while
the Upstash archive received zero writes — and Sample B labelling
would then find no batches, silently killing calibration.

Fix:
- writeShadowArchive now inspects the pipeline return. null result,
  non-array response, per-command {error}, or a cell without
  {result: "OK"} all return {ok: false, reason}.
- Orchestrator emits a warn line with the failure reason, and the
  structured log line carries archive_write=ok|failed so operators
  can grep for failed ticks.
- Regression test in brief-dedup-embedding.test.mjs simulates the
  null-pipeline contract and asserts both the warn and the structured
  field land.
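The inspection logic described above can be sketched as a pure helper. `checkPipelineResult` is a hypothetical name — the real check lives inside writeShadowArchive — but the failure shapes match the ones listed:

```javascript
// Assumed sketch: map every failure shape of the pipeline result to
// {ok: false, reason} instead of treating null as silent success.
function checkPipelineResult(result) {
  if (result === null) return { ok: false, reason: 'null-pipeline' };
  if (!Array.isArray(result)) return { ok: false, reason: 'non-array-response' };
  for (const cell of result) {
    if (cell && cell.error) return { ok: false, reason: `command-error:${cell.error}` };
    if (!cell || cell.result !== 'OK') return { ok: false, reason: 'unexpected-cell' };
  }
  return { ok: true };
}
```

The orchestrator can then log `archive_write=ok|failed` from the returned `ok` flag and surface `reason` in the warn line.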

Verification:
- test:data           5825/5825 pass
- dedup suites         65/65   pass (new: archive-fail regression)
- typecheck + api     clean
- biome check         clean on changed files

* fix(digest-dedup): two more P1s found on PR #3200

P1 — canary must also honour DIGEST_DEDUP_MODE + REMOTE_EMBED_ENABLED
The prior round fixed the threshold/veto knobs but left the canary
running embeddings regardless of whether production could actually
reach the embed path. If Railway has DIGEST_DEDUP_MODE=jaccard or
DIGEST_DEDUP_REMOTE_EMBED_ENABLED=0, production never calls the
classifier, so a drift signal is meaningless — or worse, a live
OpenRouter issue flags the canary while prod is obliviously fine.

Fix:
- golden-pair-validator.mjs reads mode + remoteEmbedEnabled from the
  same readOrchestratorConfig() helper the orchestrator uses. When
  either says "embed path inactive in prod", the validator logs an
  explicit skip line and exits 0. The nightly workflow then shows
  green, which is the correct signal ("nothing to drift against").
- A --force CLI flag remains for manual dispatch during staged
  rollouts.
- dedup-golden-pairs.yml sources DIGEST_DEDUP_MODE and
  DIGEST_DEDUP_REMOTE_EMBED_ENABLED from GitHub repo variables
  alongside the threshold and veto-enabled knobs, so all four
  classifier gates stay in lockstep with Railway.
- Validator log line now prints mode + remoteEmbedEnabled so the
  canary output surfaces which classifier it validated.

P1 — shadow-sample Sample A was biased by SCAN order
enumerate-and-dedup added every seen pair to a dedup key BEFORE
filtering by agreement. If the same pair appeared in an agreeing
batch first and a disagreeing batch later, the disagreeing
occurrence was silently dropped. SCAN order is unspecified, so
Sample A could omit real disagreement pairs.

Fix:
- Extracted the enumeration into a pure `enumeratePairs(archives, mode)`
  export so the logic is testable. Mode filter runs BEFORE the dedup
  check: agreeing pairs are skipped entirely under
  --mode disagreements, so any later disagreeing occurrence can
  still claim the dedup slot.
- Added tests/brief-dedup-shadow-sample.test.mjs with 5 regression
  cases: agreement-then-disagreement, reversed order (symmetry),
  always-agreed omission, population enumeration, cross-batch dedup.
- isMain guard added so importing the module for tests does not
  kick off the CLI scan path.
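The filter-before-dedup ordering can be sketched as follows; the pair and batch shapes here are assumptions for illustration, not the real archive schema:

```javascript
// Assumed sketch of an order-independent enumeration: the mode filter
// runs BEFORE the dedup check, so a later disagreeing occurrence of a
// pair can still claim the dedup slot even if an agreeing occurrence
// was scanned first.
function enumeratePairs(archives, mode) {
  const seen = new Set();
  const out = [];
  for (const batch of archives) {
    for (const pair of batch.pairs) {
      if (mode === 'disagreements' && pair.agreed) continue; // filter first
      const key = [pair.a, pair.b].sort().join('|');         // order-insensitive key
      if (seen.has(key)) continue;                           // dedup second
      seen.add(key);
      out.push(pair);
    }
  }
  return out;
}
```

With the original order (dedup first), the agreeing occurrence would have claimed the key and the later disagreeing one would be dropped — exactly the SCAN-order bias described.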

Verification:
- test:data           5825/5825 pass
- dedup suites         70/70   pass (5 new shadow-sample regressions)
- typecheck + api     clean
- biome check         clean on changed files

Operator follow-up before Phase C:
Set all FOUR dedup repo variables in GitHub alongside Railway:
  DIGEST_DEDUP_MODE, DIGEST_DEDUP_REMOTE_EMBED_ENABLED,
  DIGEST_DEDUP_COSINE_THRESHOLD, DIGEST_DEDUP_ENTITY_VETO_ENABLED

* refactor(digest-dedup): Railway is the single source of truth for dedup config

Fair user pushback: asking operators to set four DIGEST_DEDUP_*
values in BOTH Railway (where the cron runs) AND GitHub repo
variables (where the canary runs) is architectural debt. Two
copies of the same truth will always drift.

Solution: the digest cron publishes its resolved config to Upstash
on every tick under brief:dedup:config:v1 (2h TTL). The nightly
golden-pair canary reads that key instead of env vars. Railway
stays the sole source of truth; no parallel repo variables to
maintain. A missing/expired key signals "cron hasn't run" and
the canary skips with exit 0 — better than validating against
hardcoded defaults that might diverge from prod.

Changes:
- brief-dedup-consts.mjs: new ACTIVE_CONFIG_KEY + TTL constants.
- brief-dedup.mjs: new publishActiveConfig() fires at the start of
  every deduplicateStories() call (before the mode short-circuit,
  so jaccard ticks also publish a "mode=jaccard" signal the canary
  can read). Fire-and-forget; archive-write error semantics still
  apply if the operator wants stricter tracking.
- golden-pair-validator.mjs: removed readOrchestratorConfig(env)
  path. Now calls fetchActiveConfigFromUpstash() and either
  validates against that config, skips when the embed path is
  inactive, or skips when the key is missing (with --force
  override for manual dispatch).
- dedup-golden-pairs.yml: dropped the four DIGEST_DEDUP_* env lines
  and the corresponding repo-variable dependency. Only the three
  Upstash + OpenRouter secrets remain.
- tests: two new regressions assert config is published on every
  tick (shadow AND jaccard modes) with the right shape + TTL.

Operator onboarding now takes one action: set the four
DIGEST_DEDUP_* variables on the Railway seed-digest-notifications
service. Nothing to set in GitHub beyond the existing
OPENROUTER_API_KEY / UPSTASH_* secrets.

Verification:
- test:data           5825/5825 pass
- dedup suites         72/72   pass (2 new config-publish regressions)
- typecheck + api     clean
- biome check         clean on changed files

* refactor(digest-dedup): ship embed directly, drop phases/canary/shadow

User feedback: "i dont need multiple phases and shit, we go directly
to embed". Fair. Ripping out the overengineering I accumulated:

DELETED
- .github/workflows/dedup-golden-pairs.yml (nightly canary)
- scripts/tools/golden-pair-validator.mjs
- scripts/tools/shadow-sample.mjs
- scripts/tools/calibrate-dedup-threshold.mjs
- tests/fixtures/brief-dedup-golden-pairs.json
- tests/brief-dedup-golden.test.mjs
- tests/brief-dedup-shadow-sample.test.mjs

SIMPLIFIED
- brief-dedup.mjs: removed shadow mode, publishActiveConfig,
  writeShadowArchive, diffClustersByHash, jaccardRepsToClusterHashes,
  and the DIGEST_DEDUP_REMOTE_EMBED_ENABLED knob. MODE is now
  binary: `embed` (default) or `jaccard` (instant kill switch).
- brief-dedup-consts.mjs: dropped SHADOW_ARCHIVE_*, ACTIVE_CONFIG_*.
- Default flipped: DIGEST_DEDUP_MODE unset = embed (prod path).
  Railway deploy with OPENROUTER_API_KEY set = embeddings live on
  next cron tick. Set MODE=jaccard on Railway to revert instantly.

Orchestrator still falls back to Jaccard on any embed-path failure
(timeout, provider outage, missing API key, bad response). Fallback
warn carries reason=<ErrorName>. The cron never fails because
embeddings flaked. All 64 dedup tests + 5825 data tests still green.

Net diff: -1,407 lines.

Operator single action: set OPENROUTER_API_KEY on Railway's
seed-digest-notifications service (already present) and ship. No
GH Actions, no shadow archives, no labelling sprints. If the 0.60
threshold turns out wrong, tune DIGEST_DEDUP_COSINE_THRESHOLD on
Railway — takes effect on next tick, no redeploy.

* fix(digest-dedup): multi-word location phrases in the entity veto

Extractor was whitespace-tokenising and only single-token matching
against LOCATION_GAZETTEER, silently making every multi-word entry
unreachable:

  extractEntities("Houthis strike ship in Red Sea")
    → { locations: [], actors: ['houthis','red','sea'] }   ✗
  shouldVeto("Houthis strike ship in Red Sea",
             "US escorts convoy in Red Sea")  → false       ✗

With MODE=embed as the default, that turned off the main
anti-overmerge safety rail for bodies of water, regions, and
compound city names — exactly the P07-Hormuz / Houthis-Red-Sea
headlines the veto was designed to cover.

Fix: greedy longest-phrase scan with a sliding window. At each
token position try the longest multi-word phrase first (down to
2), require first AND last tokens to be capitalised (so lowercase
prose like "the middle east" doesn't falsely match while headline
"Middle East" does), lowercase connectors in between are fine
("Strait of Hormuz" → phrase "strait of hormuz" ✓). Falls back to
single-token lookup when no multi-word phrase fits.

Now:
  extractEntities("Houthis strike ship in Red Sea")
    → { locations: ['red sea'], actors: ['houthis'] }       ✓
  shouldVeto(Red-Sea-Houthis, Red-Sea-US) → true             ✓

Complexity still O(N · MAX_PHRASE_LEN) — MAX_PHRASE_LEN is 4
(longest gazetteer entry: "ho chi minh city"), so this is
effectively O(N).
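The scan can be sketched as below — `findLocationPhrases` and the three-entry `GAZETTEER` are illustrative stand-ins for the real extractor and gazetteer:

```javascript
// Greedy longest-phrase scan (sketch). At each token position, try
// the longest window first (down to 2 tokens); first AND last tokens
// must be capitalised, lowercase connectors in between are fine.
// Falls back to single-token lookup when no multi-word phrase fits.
const GAZETTEER = new Set(['red sea', 'strait of hormuz', 'tokyo']); // stand-in
const MAX_PHRASE_LEN = 4;

function findLocationPhrases(title) {
  const tokens = title.split(/\s+/).filter(Boolean);
  const isCap = (t) => /^[A-Z]/.test(t);
  const found = [];
  let i = 0;
  while (i < tokens.length) {
    let matched = false;
    for (let len = Math.min(MAX_PHRASE_LEN, tokens.length - i); len >= 2; len--) {
      const window = tokens.slice(i, i + len);
      if (!isCap(window[0]) || !isCap(window[len - 1])) continue;
      const phrase = window.join(' ').toLowerCase();
      if (GAZETTEER.has(phrase)) {
        found.push(phrase);
        i += len; // consume the whole phrase
        matched = true;
        break;
      }
    }
    if (!matched) {
      const single = tokens[i].toLowerCase();
      if (isCap(tokens[i]) && GAZETTEER.has(single)) found.push(single);
      i += 1;
    }
  }
  return found;
}
```

The capitalisation rule is what lets headline "Red Sea" match while lowercase prose like "the middle east" does not, and "Strait of Hormuz" still matches despite the lowercase "of" connector.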

Added 5 regression tests covering Red Sea, South China Sea,
Strait of Hormuz (lowercase-connector case), Abu Dhabi, and
New York, plus the Houthis-vs-US veto reproducer from the P1.
All 5825 data tests + 45 dedup tests green; lint + typecheck clean.
2026-04-19 13:49:48 +04:00

557 lines · 22 KiB · JavaScript

/**
 * Embedding-dedup integration tests against a deterministic stub
 * embedder — no network. Covers the scenarios enumerated in
 * docs/plans/2026-04-19-001-feat-embedding-based-story-dedup-plan.md:
 *
 * 1. Happy path
 * 2. Cold-cache timeout → Jaccard fallback
 * 3. Provider outage → Jaccard fallback
 * 5. Entity veto fires
 * 6. Complete-link non-chaining
 * 7. Cluster-level fixture
 * 9. Permutation-invariance property test
 *
 * Scenarios 4 (shadow mode) and 8 (remote-embed kill switch) were
 * removed along with the phased rollout and the nightly golden-pair
 * canary. MODE=jaccard is now the only rollback path; it is covered
 * in tests/brief-dedup-jaccard.test.mjs.
 *
 * Run: node --test tests/brief-dedup-embedding.test.mjs
 */
import { describe, it } from 'node:test';
import assert from 'node:assert/strict';
import { deduplicateStories } from '../scripts/lib/brief-dedup.mjs';
import { deduplicateStoriesJaccard } from '../scripts/lib/brief-dedup-jaccard.mjs';
import {
  EmbeddingProviderError,
  EmbeddingTimeoutError,
  cosineSimilarity,
  normalizeForEmbedding,
} from '../scripts/lib/brief-embedding.mjs';
import {
  completeLinkCluster,
  extractEntities,
  shouldVeto,
} from '../scripts/lib/brief-dedup-embed.mjs';
// ── Fixture helpers ───────────────────────────────────────────────────────────

function story(title, score = 10, mentions = 1, hash = undefined) {
  return {
    title,
    currentScore: score,
    mentionCount: mentions,
    sources: [],
    severity: 'critical',
    hash: hash ?? `h-${title.slice(0, 16).replace(/\W+/g, '-')}`,
  };
}

// Orchestrator env that turns on the embed path without shadow-archive
// dependencies.
const EMBED_MODE = { DIGEST_DEDUP_MODE: 'embed', DIGEST_DEDUP_COSINE_THRESHOLD: '0.5' };
/**
 * Build a stub embedBatch that looks up each normalised title in a
 * provided map. Captures call count for assertion-based tests. A
 * title missing from the map throws immediately, so a bad fixture
 * surfaces as a loud test failure rather than a silently-zero
 * similarity.
 */
function stubEmbedder(vectorByNormalizedTitle) {
  const calls = [];
  async function embedBatch(normalizedTitles) {
    calls.push(normalizedTitles.slice());
    return normalizedTitles.map((t) => {
      const v = vectorByNormalizedTitle.get(t);
      if (!v) throw new Error(`stubEmbedder: no vector for "${t}"`);
      return v;
    });
  }
  return { embedBatch, calls };
}
function noopPipeline() {
  return null;
}

/**
 * Captures log lines emitted by the orchestrator so tests can assert
 * on observability output without swallowing real console output.
 */
function lineCollector() {
  const lines = [];
  return {
    lines,
    log: (line) => lines.push({ level: 'log', line }),
    warn: (line) => lines.push({ level: 'warn', line }),
  };
}
// ── Scenario 1 — Happy path ───────────────────────────────────────────────────
describe('Scenario 1 — happy path: embed clusters near-duplicates', () => {
  it('merges two near-duplicate stories into one cluster when embed mode is on', async () => {
    const titles = [
      'iran closes strait of hormuz',
      'iran shuts strait of hormuz',
      'myanmar coup leader elected president',
    ];
    // Near-parallel vectors for 0/1 (cos ≈ 0.95), orthogonal for 2.
    const vecByTitle = new Map([
      [titles[0], [1, 0, 0]],
      [titles[1], [0.95, Math.sqrt(1 - 0.95 * 0.95), 0]],
      [titles[2], [0, 0, 1]],
    ]);
    const embedder = stubEmbedder(vecByTitle);
    const collector = lineCollector();
    const stories = [
      story('Iran closes Strait of Hormuz', 90, 1, 'h0'),
      story('Iran shuts Strait of Hormuz', 85, 1, 'h1'),
      story('Myanmar coup leader elected president', 80, 1, 'h2'),
    ];
    const out = await deduplicateStories(stories, {
      env: EMBED_MODE,
      embedBatch: embedder.embedBatch,
      redisPipeline: noopPipeline,
      ...collector,
    });
    assert.equal(embedder.calls.length, 1, 'exactly one batched embedBatch call');
    assert.equal(out.length, 2, 'two clusters (merged pair + singleton)');
    const merged = out.find((c) => c.mergedHashes.length === 2);
    assert.ok(merged, 'one cluster contains the two Hormuz variants');
    assert.deepEqual(new Set(merged.mergedHashes), new Set(['h0', 'h1']));
    assert.equal(merged.mentionCount, 2);
    const singleton = out.find((c) => c.mergedHashes.length === 1);
    assert.ok(singleton);
    assert.equal(singleton.mergedHashes[0], 'h2');
    // Structured log line emitted.
    assert.ok(collector.lines.some((l) => l.line.includes('mode=embed')));
    assert.ok(collector.lines.some((l) => l.line.includes('fallback=false')));
  });
});
// ── Scenario 2 — timeout ──────────────────────────────────────────────────────
describe('Scenario 2 — cold-cache timeout collapses to Jaccard', () => {
  it('EmbeddingTimeoutError falls back to Jaccard for the whole batch', async () => {
    const throwingEmbedder = async () => {
      throw new EmbeddingTimeoutError();
    };
    const stories = [
      story('Iran closes Strait of Hormuz', 90, 1, 'h0'),
      story('Iran shuts Strait of Hormuz', 85, 1, 'h1'),
    ];
    const collector = lineCollector();
    const out = await deduplicateStories(stories, {
      env: EMBED_MODE,
      embedBatch: throwingEmbedder,
      redisPipeline: noopPipeline,
      ...collector,
    });
    // Jaccard output is the ground truth under fallback — deep-equal
    // cluster shape, not just length, so a regression that preserves
    // count but changes membership or representative can't slip.
    const expected = deduplicateStoriesJaccard(stories);
    assert.equal(out.length, expected.length);
    for (let i = 0; i < out.length; i++) {
      assert.equal(out[i].hash, expected[i].hash);
      assert.deepEqual(out[i].mergedHashes, expected[i].mergedHashes);
      assert.equal(out[i].mentionCount, expected[i].mentionCount);
    }
    // Fallback warn line must carry a filterable reason= field.
    const fallbackWarn = collector.lines.find(
      (l) => l.level === 'warn' && l.line.includes('falling back to Jaccard'),
    );
    assert.ok(fallbackWarn, 'warn line on fallback');
    assert.match(fallbackWarn.line, /reason=EmbeddingTimeoutError\b/);
  });
});
// ── Scenario 3 — provider outage ──────────────────────────────────────────────
describe('Scenario 3 — provider outage collapses to Jaccard', () => {
  it('EmbeddingProviderError (HTTP 503) falls back', async () => {
    const throwingEmbedder = async () => {
      throw new EmbeddingProviderError('OpenRouter returned HTTP 503', { status: 503 });
    };
    const stories = [story('a', 10, 1, 'a1'), story('b', 10, 1, 'b1')];
    const collector = lineCollector();
    const out = await deduplicateStories(stories, {
      env: EMBED_MODE,
      embedBatch: throwingEmbedder,
      redisPipeline: noopPipeline,
      ...collector,
    });
    const expected = deduplicateStoriesJaccard(stories);
    assert.equal(out.length, expected.length);
    for (let i = 0; i < out.length; i++) {
      assert.equal(out[i].hash, expected[i].hash);
      assert.deepEqual(out[i].mergedHashes, expected[i].mergedHashes);
      assert.equal(out[i].mentionCount, expected[i].mentionCount);
    }
    const fallbackWarn = collector.lines.find((l) => l.level === 'warn');
    assert.ok(fallbackWarn, 'warn line on fallback');
    assert.match(fallbackWarn.line, /reason=EmbeddingProviderError\b/);
  });
});
// ── Scenario 4 / 8 — shadow mode and remote-embed kill switch were
// removed when the rollout was simplified to "ship embed directly".
// MODE=jaccard is the only rollback path; covered in
// tests/brief-dedup-jaccard.test.mjs.
// ── Scenario 5 — entity veto ──────────────────────────────────────────────────
describe('Scenario 5 — entity veto blocks same-location, different-actor merges', () => {
  it('shouldVeto fires on canonical Biden/Xi vs Biden/Putin case', () => {
    assert.equal(
      shouldVeto('Biden meets Xi in Tokyo', 'Biden meets Putin in Tokyo'),
      true,
    );
  });
  it('defers to cosine on Iran/Tehran + Hormuz (documented heuristic limitation)', () => {
    // Capital-country coreference is not resolved in v1. The plan's
    // original spec claimed the veto would fire here via "unique
    // actors {Iran} vs {Tehran}", but the classification rule is:
    //   - Iran → actor (country, not in gazetteer)
    //   - Tehran → location (capital city IS in the gazetteer)
    //   - Hormuz → location
    // With the two anchors on different sides of the actor/location
    // boundary, there's no symmetric "unique actor on each side"
    // signal and the veto can't conclude. Behaviour falls through
    // to cosine — which on real text may merge (false positive)
    // or split (false negative) depending on wording. Accepted for
    // v1 as the documented limitation; a name-normaliser is the
    // future fix.
    assert.equal(
      shouldVeto('Iran closes Hormuz', 'Tehran shuts Hormuz'),
      false,
    );
  });
  it('shouldVeto does NOT fire when actors fully match', () => {
    assert.equal(shouldVeto('Trump meets Xi', 'Trump Xi summit'), false);
  });
  it('shouldVeto defers to cosine when proper-noun sets are empty on both sides', () => {
    assert.equal(shouldVeto('the meeting concludes', 'the meeting ends'), false);
  });
  it('veto blocks cluster admission end-to-end', async () => {
    // High cosine (0.99) but disagreeing actors → veto fires and
    // the stories stay in separate clusters.
    const stories = [
      story('Biden meets Xi in Tokyo', 90, 1, 'xi'),
      story('Biden meets Putin in Tokyo', 85, 1, 'putin'),
    ];
    const vecByTitle = new Map([
      [normalizeForEmbedding(stories[0].title), [1, 0, 0]],
      [normalizeForEmbedding(stories[1].title), [0.99, Math.sqrt(1 - 0.99 * 0.99), 0]],
    ]);
    const embedder = stubEmbedder(vecByTitle);
    const out = await deduplicateStories(stories, {
      env: EMBED_MODE,
      embedBatch: embedder.embedBatch,
      redisPipeline: noopPipeline,
    });
    assert.equal(out.length, 2, 'veto keeps the two titles in separate clusters');
  });
  it('DIGEST_DEDUP_ENTITY_VETO_ENABLED=0 disables the veto at runtime', async () => {
    const stories = [
      story('Biden meets Xi in Tokyo', 90, 1, 'xi'),
      story('Biden meets Putin in Tokyo', 85, 1, 'putin'),
    ];
    const vecByTitle = new Map([
      [normalizeForEmbedding(stories[0].title), [1, 0, 0]],
      [normalizeForEmbedding(stories[1].title), [0.99, Math.sqrt(1 - 0.99 * 0.99), 0]],
    ]);
    const embedder = stubEmbedder(vecByTitle);
    const out = await deduplicateStories(stories, {
      env: { ...EMBED_MODE, DIGEST_DEDUP_ENTITY_VETO_ENABLED: '0' },
      embedBatch: embedder.embedBatch,
      redisPipeline: noopPipeline,
    });
    assert.equal(out.length, 1, 'without the veto, cosine alone merges the two titles');
  });
});
// ── Scenario 6 — complete-link non-chaining ───────────────────────────────────
describe('Scenario 6 — complete-link blocks transitive chaining', () => {
  it('A~B=0.65, B~C=0.65, A~C=0.30 → {A,B} and {C}, NOT {A,B,C}', () => {
    // Constructed so pairwise cosines are exact (see plan for derivation).
    const a = [1, 0, 0, 0];
    const b = [0.65, Math.sqrt(1 - 0.65 * 0.65), 0, 0];
    // c must satisfy: a·c = 0.30, b·c = 0.65, |c| = 1.
    // Solving: cx = 0.30; cy = (0.65 - 0.65*0.30)/sqrt(1 - 0.4225) = 0.4550/0.7599 ≈ 0.599;
    // cz = sqrt(1 - 0.09 - 0.359) ≈ sqrt(0.551) ≈ 0.742
    const cx = 0.3;
    const cy = (0.65 - 0.65 * 0.3) / Math.sqrt(1 - 0.65 * 0.65);
    const cz = Math.sqrt(1 - cx * cx - cy * cy);
    const c = [cx, cy, cz, 0];
    // Sanity-check the construction so a regression in the derivation
    // can't mask a real bug.
    assert.ok(Math.abs(cosineSimilarity(a, b) - 0.65) < 1e-6);
    assert.ok(Math.abs(cosineSimilarity(b, c) - 0.65) < 1e-6);
    assert.ok(Math.abs(cosineSimilarity(a, c) - 0.3) < 1e-6);
    const items = [
      { title: 'A', embedding: a },
      { title: 'B', embedding: b },
      { title: 'C', embedding: c },
    ];
    const { clusters } = completeLinkCluster(items, { cosineThreshold: 0.5 });
    // {A,B} should be one cluster, {C} separate — not {A,B,C}.
    assert.equal(clusters.length, 2);
    const abCluster = clusters.find((cl) => cl.length === 2);
    const cCluster = clusters.find((cl) => cl.length === 1);
    assert.ok(abCluster && cCluster, 'two clusters: the A+B pair and the C singleton');
    assert.ok(abCluster.includes(0) && abCluster.includes(1));
    assert.ok(cCluster.includes(2));
  });
});
// ── Scenario 7 — cluster-level fixture ────────────────────────────────────────
describe('Scenario 7 — cluster-level fixture', () => {
  it('10-story fixture clusters into the expected shape', async () => {
    // Four real wire-headline clusters plus two singletons = 6 clusters.
    // Vectors are hand-crafted so only intended-cluster pairs clear 0.5.
    const e1 = [1, 0, 0, 0, 0, 0];
    const e2 = [0, 1, 0, 0, 0, 0];
    const e3 = [0, 0, 1, 0, 0, 0];
    const e4 = [0, 0, 0, 1, 0, 0];
    const e5 = [0, 0, 0, 0, 1, 0];
    const e6 = [0, 0, 0, 0, 0, 1];
    function near(axis, epsilon = 0.03) {
      // Scaled copy of `axis`: same direction, so its cosine to the
      // basis vector is exactly 1 — only the norm shrinks slightly.
      return axis.map((v) => v * (1 - epsilon));
    }
    const fixtures = [
      { title: 'Iran closes Strait of Hormuz', hash: 'a1', v: e1, expectCluster: 'A' },
      { title: 'Iran shuts Strait of Hormuz', hash: 'a2', v: near(e1), expectCluster: 'A' },
      { title: 'US fighter jet downed over Iran', hash: 'b1', v: e2, expectCluster: 'B' },
      { title: 'American aircraft shot down in Iran', hash: 'b2', v: near(e2), expectCluster: 'B' },
      { title: 'Myanmar coup leader sworn in', hash: 'c1', v: e3, expectCluster: 'C' },
      { title: 'Myanmar junta chief takes office', hash: 'c2', v: near(e3), expectCluster: 'C' },
      { title: 'Brent crude tops $140', hash: 'd1', v: e4, expectCluster: 'D' },
      { title: 'Oil price surges past $140', hash: 'd2', v: near(e4), expectCluster: 'D' },
      { title: 'Singleton 1', hash: 's1', v: e5, expectCluster: 'E' },
      { title: 'Singleton 2', hash: 's2', v: e6, expectCluster: 'F' },
    ];
    const stories = fixtures.map((f, i) => story(f.title, 100 - i, 1, f.hash));
    const vecByTitle = new Map(
      fixtures.map((f) => [normalizeForEmbedding(f.title), f.v]),
    );
    const embedder = stubEmbedder(vecByTitle);
    const out = await deduplicateStories(stories, {
      env: EMBED_MODE,
      embedBatch: embedder.embedBatch,
      redisPipeline: noopPipeline,
    });
    // 6 clusters total: 4 pairs + 2 singletons.
    assert.equal(out.length, 6);
    // Each expected pair's hashes should land in the same cluster.
    const pairs = [['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2'], ['d1', 'd2']];
    for (const [x, y] of pairs) {
      const cluster = out.find((c) => c.mergedHashes.includes(x));
      assert.ok(cluster?.mergedHashes.includes(y), `${x} and ${y} should cluster together`);
    }
    // Singletons stay alone.
    const s1 = out.find((c) => c.mergedHashes.includes('s1'));
    const s2 = out.find((c) => c.mergedHashes.includes('s2'));
    assert.equal(s1.mergedHashes.length, 1);
    assert.equal(s2.mergedHashes.length, 1);
  });
});
// ── Scenario 9 — permutation-invariance property test ────────────────────────
describe('Scenario 9 — permutation-invariance', () => {
it('10 random input orders of the same 15-story set produce identical clusters', async () => {
// Construct 15 stories in 5 clusters of 3. Each cluster shares a
// near-unit basis vector; clusters are pairwise orthogonal.
const N_CLUSTERS = 5;
const PER_CLUSTER = 3;
const fixtures = [];
for (let c = 0; c < N_CLUSTERS; c++) {
const basis = Array.from({ length: N_CLUSTERS }, (_, i) => (i === c ? 1 : 0));
for (let k = 0; k < PER_CLUSTER; k++) {
const jitter = basis.map((v, i) => (i === c ? v - k * 0.002 : v));
fixtures.push({
title: `Cluster ${c} item ${k}`,
hash: `c${c}-k${k}`,
v: jitter,
score: 100 - (c * PER_CLUSTER + k),
});
}
}
const stories = fixtures.map((f) => story(f.title, f.score, 1, f.hash));
const vecByTitle = new Map(
fixtures.map((f) => [normalizeForEmbedding(f.title), f.v]),
);
function sigFor(out) {
  // Canonical representation: each cluster as a sorted, comma-joined
  // hash list; the overall list sorted and pipe-joined.
  return out
    .map((c) => [...c.mergedHashes].sort().join(','))
    .sort()
    .join('|');
}
// Baseline run on the canonical input order.
const baseline = await deduplicateStories(stories, {
env: EMBED_MODE,
embedBatch: stubEmbedder(vecByTitle).embedBatch,
redisPipeline: noopPipeline,
});
const baselineSig = sigFor(baseline);
// Ten random permutations — each must produce the IDENTICAL cluster set.
let seed = 42;
function rand() {
  // Deterministic LCG. Math.imul keeps the 32-bit multiply exact; the plain
  // `seed * 1103515245` overflows Number.MAX_SAFE_INTEGER and loses low bits.
  seed = (Math.imul(seed, 1103515245) + 12345) & 0x7fffffff;
  return seed / 0x7fffffff;
}
for (let run = 0; run < 10; run++) {
const shuffled = [...stories];
for (let i = shuffled.length - 1; i > 0; i--) {
const j = Math.floor(rand() * (i + 1));
[shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
}
const out = await deduplicateStories(shuffled, {
env: EMBED_MODE,
embedBatch: stubEmbedder(vecByTitle).embedBatch,
redisPipeline: noopPipeline,
});
assert.equal(
sigFor(out),
baselineSig,
`permutation ${run} produced a different cluster set`,
);
}
});
});
// ── Entity extraction unit tests ──────────────────────────────────────────────
describe('extractEntities', () => {
it('classifies country name as actor, strait as location', () => {
// Per plan intent: countries are geopolitical actors ("Iran does X"),
// physical geography is the venue.
const { locations, actors } = extractEntities('Iran closes Strait of Hormuz');
assert.ok(actors.includes('iran'));
// The full phrase "strait of hormuz" is in the gazetteer, so the
// multi-word match should win over the single-token fallback "hormuz";
// accept either form so the test survives gazetteer edits.
assert.ok(
  locations.includes('strait of hormuz') || locations.includes('hormuz'),
  'hormuz location must be detected (as phrase or single token)',
);
assert.ok(!locations.includes('iran'));
});
it('classifies city as location, person as actor', () => {
const { locations, actors } = extractEntities('Biden meets Xi in Tokyo');
assert.ok(locations.includes('tokyo'));
assert.ok(actors.includes('biden'));
assert.ok(actors.includes('xi'));
});
it('skips common capitalized sentence-starters', () => {
const { locations, actors } = extractEntities('The meeting begins');
assert.equal(locations.length, 0);
assert.equal(actors.length, 0);
});
it('keeps sentence-start proper nouns', () => {
const { actors } = extractEntities('Trump to visit Japan');
assert.ok(actors.includes('trump'));
// Japan is a country → actor, not location
assert.ok(actors.includes('japan'));
});
// Regression: multi-word gazetteer entries are matched as whole
// phrases. An earlier implementation split on whitespace and only
// checked single tokens, so "Red Sea", "South China Sea", "New York",
// etc. silently fell through to the actor bucket and disabled the
// veto for a whole class of real headlines.
it('matches multi-word location: Red Sea', () => {
const { locations, actors } = extractEntities('Houthis strike ship in Red Sea');
assert.ok(locations.includes('red sea'));
assert.ok(!actors.includes('red'));
assert.ok(!actors.includes('sea'));
assert.ok(actors.includes('houthis'));
});
it('matches multi-word location: South China Sea', () => {
const { locations } = extractEntities('Tensions flare in South China Sea');
assert.ok(locations.includes('south china sea'));
});
it('matches multi-word location with lowercase connector: Strait of Hormuz', () => {
const { locations } = extractEntities('Iran closes Strait of Hormuz');
assert.ok(locations.includes('strait of hormuz'));
});
it('matches multi-word city: Abu Dhabi', () => {
const { locations } = extractEntities('Summit held in Abu Dhabi');
assert.ok(locations.includes('abu dhabi'));
});
it('matches multi-word city: New York', () => {
const { locations } = extractEntities('UN meeting in New York');
assert.ok(locations.includes('new york'));
});
// Veto end-to-end: reproducer from the P1 finding. Two Red-Sea
// headlines share a location and disagree on the actor — veto
// MUST fire (otherwise the main anti-overmerge guard is off for
// bodies-of-water / region headlines).
it('shouldVeto: Houthis vs US on Red Sea — location phrase match fires the veto', () => {
assert.equal(
shouldVeto('Houthis strike ship in Red Sea', 'US escorts convoy in Red Sea'),
true,
);
});
});
// ── Cosine helper ─────────────────────────────────────────────────────────────
describe('cosineSimilarity', () => {
it('returns 1 for identical vectors', () => {
  // Allow a tiny tolerance: the norm round-trips through Math.sqrt, so
  // demanding exact equality over-constrains the floating-point result.
  assert.ok(Math.abs(cosineSimilarity([1, 2, 3], [1, 2, 3]) - 1) < 1e-12);
});
it('returns 0 for orthogonal vectors', () => {
assert.equal(cosineSimilarity([1, 0], [0, 1]), 0);
});
it('handles a zero vector without throwing', () => {
assert.equal(cosineSimilarity([0, 0], [1, 1]), 0);
});
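// Extra case (a sketch — assumes cosineSimilarity is the usual
// dot / (|a| * |b|) form): parallel vectors whose components are powers
// of two keep every intermediate value exact in IEEE 754 (dot = 16,
// norms 2 and 8), so strict equality is safe here even though float
// cosines are generally only approximate.
it('returns 1 for parallel scaled vectors', () => {
  assert.equal(cosineSimilarity([2, 0, 0], [8, 0, 0]), 1);
});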
});