1 Commit

Author SHA1 Message Date
Elie Habib
9a0b0ccef8 fix(geo): tokenization-based keyword matching to prevent false positives (#503)
* fix(geo): tokenization-based keyword matching to prevent false positives

Replace String.includes() with tokenization-based Set.has() matching
across the geo-tagging pipeline. Prevents false positives like "assad"
matching inside "ambassador" and "hts" matching inside "rights".

- Add src/utils/keyword-match.ts as single source of truth
- Decompose possessives and hyphenated words ("Assad's" yields the token "assad")
- Support multi-word phrase matching ("white house" must appear as contiguous tokens)
- Remove false-positive-prone DC keywords ('house', 'us ')
- Update 9 consumer files across geo-hub, map, CII, and asset systems
- Add 44 tests covering false positives, true positives, edge cases

Co-authored-by: karim <mirakijka@gmail.com>
Fixes #324
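
For illustration, a minimal sketch of the tokenization approach described
above. The function names (tokenize, matchesKeyword) and the ASCII-only
splitting rule are assumptions made for this example; the real
src/utils/keyword-match.ts may be structured differently.

```typescript
// Hypothetical helpers, not the actual exports of keyword-match.ts.

// Lowercase the text, drop possessive "'s", then split on every character
// that is not an ASCII letter or digit (this also breaks hyphenated words).
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/['’]s\b/g, "") // "assad's" -> "assad"
    .split(/[^a-z0-9]+/) // "anti-assad" -> "anti", "assad"
    .filter((token) => token.length > 0);
}

// Single-word keywords use Set.has() on whole tokens.
// Multi-word keywords must appear as a contiguous run of tokens.
function matchesKeyword(text: string, keyword: string): boolean {
  const tokens = tokenize(text);
  const parts = tokenize(keyword);
  if (parts.length === 0) return false;
  if (parts.length === 1) return new Set(tokens).has(parts[0]);
  for (let i = 0; i + parts.length <= tokens.length; i++) {
    if (parts.every((part, j) => tokens[i + j] === part)) return true;
  }
  return false;
}

const headline = "The ambassador defended human rights";
console.log(headline.toLowerCase().includes("assad")); // true (the old false positive)
console.log(matchesKeyword(headline, "assad")); // false
console.log(matchesKeyword("Assad's forces advanced", "assad")); // true
console.log(matchesKeyword("White House statement", "white house")); // true
```

Matching on whole tokens is what removes the substring hits: "ambassador"
is a single token and is never equal to "assad".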

* fix(geo): add inflection suffix matching + fix test imports

Address code review feedback:

P1a: Add suffix-aware matching for plurals and demonyms so existing
keyword lists don't regress (houthi→houthis, ukraine→ukrainian,
iran→iranian, israel→israeli, russia→russian, taiwan→taiwanese).
Uses curated suffix list + e-dropping rule to avoid false positives.
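
A rough sketch of how a curated suffix list plus an e-dropping rule could
cover the examples above. The suffix list and the name matchesWithSuffix
are assumptions chosen to fit those six examples, not the list shipped in
the module.

```typescript
// Illustrative suffix list (assumption); the curated list in the module may differ.
const SINGULAR_SUFFIXES = ["s", "n", "i", "an", "ian", "ese"];

// A token matches a keyword if it equals the keyword, or equals a stem of the
// keyword followed by one listed suffix. Stems are the keyword itself and,
// when the keyword ends in "e", the keyword with that "e" dropped.
function matchesWithSuffix(token: string, keyword: string): boolean {
  if (token === keyword) return true;
  const stems = keyword.endsWith("e")
    ? [keyword, keyword.slice(0, -1)]
    : [keyword];
  return stems.some((stem) =>
    SINGULAR_SUFFIXES.some((suffix) => token === stem + suffix),
  );
}

console.log(matchesWithSuffix("houthis", "houthi")); // true
console.log(matchesWithSuffix("ukrainian", "ukraine")); // true ("ukrain" + "ian")
console.log(matchesWithSuffix("taiwanese", "taiwan")); // true
console.log(matchesWithSuffix("ambassador", "assad")); // false
```

Because the token must equal stem plus suffix exactly, the check stays a
whole-token comparison and does not reintroduce substring matching.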

P1b: Expand conflictTopics arrays in DeckGLMap and Map with demonym
forms so "Iranian senate..." correctly registers as a conflict topic.

P2: Replace inline test functions with real module import via tsx.
Tests now exercise the production keyword-match.ts directly.

* fix: wire geo-keyword tests into test:data command

The .mts test file wasn't covered by `node --test tests/*.test.mjs`.
Add `npx tsx --test tests/*.test.mts` so test:data runs both suites.

* fix: cross-platform test:data + pin tsx in devDependencies

- Use tsx as test runner for both .mjs and .mts (single invocation)
- Remove the `;` separator, which breaks on Windows cmd.exe
- Add tsx to devDependencies so it works in offline/CI environments

* fix(geo): multi-word demonym matching + short-keyword suffix guard

- Add wordMatches() for suffix-aware phrase matching so "South Korean"
  matches keyword "south korea" and "North Korean" matches "north korea"
- Add MIN_SUFFIX_KEYWORD_LEN=4 guard so short keywords like "ai", "us",
  "hts" match exactly only (prevents "ais"→"ai", "uses"→"us" false positives)
- Add 5 new tests covering both fixes (58 total, all passing)
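
A hedged sketch of how the two fixes might fit together. Only the names
wordMatches and MIN_SUFFIX_KEYWORD_LEN (and its value 4) come from this
commit; the signatures, the suffix list, the phraseMatches helper, and the
choice to apply suffix matching to every word of a phrase are assumptions.

```typescript
const MIN_SUFFIX_KEYWORD_LEN = 4; // value stated in the commit message
const BASE_SUFFIXES = ["s", "n", "i", "an", "ian", "ese"]; // illustrative (assumption)

// Exact match always counts. Suffix matching is attempted only when the
// keyword has at least MIN_SUFFIX_KEYWORD_LEN characters.
function wordMatches(token: string, keyword: string): boolean {
  if (token === keyword) return true;
  if (keyword.length < MIN_SUFFIX_KEYWORD_LEN) return false;
  return BASE_SUFFIXES.some((suffix) => token === keyword + suffix);
}

// Hypothetical helper: a phrase matches when each of its words matches the
// token at the same offset in some contiguous window of the token list.
function phraseMatches(tokens: string[], phrase: string): boolean {
  const parts = phrase.split(" ").filter((p) => p.length > 0);
  if (parts.length === 0) return false;
  for (let i = 0; i + parts.length <= tokens.length; i++) {
    if (parts.every((part, j) => wordMatches(tokens[i + j], part))) return true;
  }
  return false;
}

console.log(phraseMatches(["south", "korean", "officials"], "south korea")); // true
console.log(wordMatches("ais", "ai")); // false (keyword shorter than 4)
console.log(wordMatches("uses", "us")); // false (keyword shorter than 4)
console.log(wordMatches("iranian", "iran")); // true (keyword length is 4)
```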

* fix(geo): support plural demonyms in keyword matching

Add compound suffixes (ians, eans, ans, ns, is) to handle plural
demonym forms like "Iranians"→"iran", "Ukrainians"→"ukraine",
"Russians"→"russia", "Israelis"→"israel". Adds 5 new tests (63 total).
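
As a quick check of the compound suffixes listed above, a small sketch
that tries each one against the four examples. The suffix list is taken
from this message; the function name matchesPluralDemonym and the reuse of
the e-dropping rule are assumptions.

```typescript
const COMPOUND_SUFFIXES = ["ians", "eans", "ans", "ns", "is"]; // from the commit message

// Hypothetical helper: token equals a keyword stem plus one compound suffix.
// The stem is the keyword, or the keyword without a trailing "e".
function matchesPluralDemonym(token: string, keyword: string): boolean {
  const stems = keyword.endsWith("e")
    ? [keyword, keyword.slice(0, -1)]
    : [keyword];
  return stems.some((stem) =>
    COMPOUND_SUFFIXES.some((suffix) => token === stem + suffix),
  );
}

const examples: Array<[string, string]> = [
  ["iranians", "iran"], // "iran" + "ians"
  ["ukrainians", "ukraine"], // "ukrain" + "ians"
  ["russians", "russia"], // "russia" + "ns"
  ["israelis", "israel"], // "israel" + "is"
];
for (const [token, keyword] of examples) {
  console.log(token, keyword, matchesPluralDemonym(token, keyword)); // true for each pair
}
```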

---------

Co-authored-by: karim <mirakijka@gmail.com>
2026-02-28 10:58:53 +04:00