worldmonitor/scripts
Elie Habib 93304c5c25 feat(intelligence): add countryCode geo-attribution to topStories (#2051) (#2094)
* feat(intelligence): add countryCode geo-attribution to topStories (#2051)

* fix(geo-extract): filter EU as supranational, add unigram stopwords, type countryCode in ServerInsightStory

- Map 'eu'/'europe' to 'XX' (a supranational marker for which the extractor returns null) instead of 'EU', which is not a valid ISO2 code and would be silently ignored by the downstream CII scorer
- Add a UNIGRAM_STOPWORDS set for single-word entries in country-names.json that have a high false-positive rate: chad/jordan/georgia/niger/guinea/mali/peru. In English headlines these match too frequently as person names and US state names; their country meanings are covered by unambiguous aliases (nigerian, georgian context via bigrams, etc.)
- Add countryCode: string | null and pubDate: string to the ServerInsightStory TypeScript interface to match what seed-insights.mjs now writes to Redis
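
The first two points above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `lookupUnigram`, the `SUPRANATIONAL` constant, and the few `ALIAS_MAP` entries are hypothetical stand-ins, and the real data in country-names.json is not reproduced here.

```javascript
// Hypothetical sketch; names and entries are illustrative assumptions.
const SUPRANATIONAL = "XX"; // marker for entities that have no usable ISO2 code

// Tiny stand-in for the real alias data (assumed shape: lowercase alias -> code).
const ALIAS_MAP = new Map([
  ["eu", SUPRANATIONAL],
  ["europe", SUPRANATIONAL],
  ["iran", "IR"],
  ["chad", "TD"],
  ["jordan", "JO"],
]);

// Single words that collide with person names or US state names in headlines.
const UNIGRAM_STOPWORDS = new Set([
  "chad", "jordan", "georgia", "niger", "guinea", "mali", "peru",
]);

// Resolve one token to a country code, or null when it should not count.
function lookupUnigram(token) {
  const key = token.toLowerCase();
  if (UNIGRAM_STOPWORDS.has(key)) return null; // too ambiguous as a bare word
  const code = ALIAS_MAP.get(key);
  if (code === undefined || code === SUPRANATIONAL) return null;
  return code;
}
```

With these assumed tables, `lookupUnigram("Iran")` yields `"IR"`, while `lookupUnigram("EU")` and `lookupUnigram("Jordan")` both yield `null`, so neither reaches the downstream scorer.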

* fix(geo-extract): add 'us' to UNIGRAM_STOPWORDS to prevent pronoun false positives

'us' as a bare word matches almost every English headline ("give us",
"tells us", etc.). US coverage is preserved via the 'washington' and
'american' aliases in ALIAS_MAP.

* fix(geo-extract): fix US abbreviation, bigram punctuation, and scan ordering

Three issues:
1. The 'us' stopword also suppressed uppercase US (the country). Fix:
   pre-process \bUS\b → 'United States' before lowercasing; remove 'us'
   from stopwords.

2. Bigram matching used raw tokens so 'West Bank,' and 'Tel Aviv:' missed
   their alias entries. Fix: strip punctuation from each token before
   forming the bigram key.

3. The two-pass scan (all bigrams, then all unigrams) meant the
   'United States' bigram fired before earlier unigrams such as 'Iran'
   in 'Iran blames US'. Fix: single left-to-right scan with local
   longest-match (bigram at i before unigram at i), preserving
   first-mention document order.
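
The three fixes can be pictured together in one small sketch. It is an assumption-laden illustration rather than the actual geo-extract code: `extractCountryCode`, `cleanToken`, and the `BIGRAMS`/`UNIGRAMS` tables (including their code values) are hypothetical, and the sketch assumes the extractor returns only the first country mentioned.

```javascript
// Hypothetical alias tables; the real entries are not shown in this commit message.
const BIGRAMS = new Map([
  ["united states", "US"],
  ["west bank", "PS"],
  ["tel aviv", "IL"],
]);
const UNIGRAMS = new Map([
  ["iran", "IR"],
  ["washington", "US"],
]);

// Fix 2: strip leading/trailing punctuation so 'Bank,' and 'Aviv:' still match.
const cleanToken = (t) =>
  t.replace(/^[^\p{L}]+|[^\p{L}]+$/gu, "").toLowerCase();

function extractCountryCode(headline) {
  // Fix 1: expand uppercase US before lowercasing; lowercase 'us' is untouched.
  const expanded = headline.replace(/\bUS\b/g, "United States");
  const tokens = expanded.split(/\s+/).map(cleanToken).filter(Boolean);

  // Fix 3: one left-to-right pass; at each position try the bigram first,
  // then the unigram, so the earliest mention in the headline wins.
  for (let i = 0; i < tokens.length; i++) {
    if (i + 1 < tokens.length) {
      const bigramHit = BIGRAMS.get(`${tokens[i]} ${tokens[i + 1]}`);
      if (bigramHit) return bigramHit;
    }
    const unigramHit = UNIGRAMS.get(tokens[i]);
    if (unigramHit) return unigramHit;
  }
  return null;
}
```

Under these assumptions, `extractCountryCode("Iran blames US for strike")` returns `"IR"` because 'Iran' is reached before the expanded 'United States' bigram, and `extractCountryCode("Give us a break")` returns `null` because the lowercase pronoun is never expanded.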
2026-03-23 16:32:34 +04:00