mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-05-11 17:46:20 +02:00
* feat(intelligence): add countryCode geo-attribution to topStories (#2051)

* fix(geo-extract): filter EU as supranational, add unigram stopwords, type countryCode in ServerInsightStory

  - Map 'eu'/'europe' to 'XX' (supranational marker, returns null) instead of
    'EU', which is not a valid ISO2 code and would be silently ignored by the
    downstream CII scorer
  - Add UNIGRAM_STOPWORDS set for high-false-positive single-word entries in
    country-names.json: chad/jordan/georgia/niger/guinea/mali/peru. These match
    too frequently as person names and US state names in English headlines;
    their country meanings are covered by unambiguous aliases (nigerian,
    georgian context via bigrams, etc.)
  - Add countryCode: string | null and pubDate: string to the
    ServerInsightStory TypeScript interface to match what seed-insights.mjs
    now writes to Redis

* fix(geo-extract): add 'us' to UNIGRAM_STOPWORDS to prevent pronoun false positives

  'us' as a bare word matches almost every English headline ("give us",
  "tells us", etc.). US coverage is preserved via the 'washington' and
  'american' aliases in ALIAS_MAP.

* fix(geo-extract): fix US abbreviation, bigram punctuation, and scan ordering

  Three issues:

  1. The 'us' stopword suppressed uppercase US (the country).
     Fix: pre-process \bUS\b → 'United States' before lowercasing; remove 'us'
     from the stopwords.
  2. Bigram matching used raw tokens, so 'West Bank,' and 'Tel Aviv:' missed
     their alias entries.
     Fix: strip punctuation from each token before forming the bigram key.
  3. The two-pass scan (all bigrams, then all unigrams) meant the
     'United States' bigram fired before earlier unigrams such as 'Iran' in
     'Iran blames US'.
     Fix: single left-to-right scan with local longest-match (bigram at i
     before unigram at i), preserving first-mention document order.
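
The behaviour described by the final fix can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function name `extractCountryCode`, the helper `normalizeToken`, the trimmed alias table, and the specific codes assigned to 'west bank' and 'tel aviv' are all assumptions made for the example. Only the names `ALIAS_MAP` and `UNIGRAM_STOPWORDS`, the 'XX' marker, and the listed stopwords come from the commit message. The sketch also assumes that hitting the 'XX' marker ends the scan with null; the real module may instead keep scanning.

```typescript
// Hypothetical, heavily trimmed alias table. The real entries come from
// country-names.json and ALIAS_MAP; the codes below are illustrative only.
const ALIAS_MAP = new Map<string, string>([
  ["united states", "US"],
  ["american", "US"],
  ["washington", "US"],
  ["iran", "IR"],
  ["west bank", "PS"], // assumed code
  ["tel aviv", "IL"], // assumed code
  ["nigerian", "NG"],
  ["jordan", "JO"],
  ["chad", "TD"],
  ["eu", "XX"], // supranational marker, never a real ISO2 code
  ["europe", "XX"],
]);

// Single words that collide with person names or US state names.
const UNIGRAM_STOPWORDS = new Set<string>([
  "chad", "jordan", "georgia", "niger", "guinea", "mali", "peru",
]);

const SUPRANATIONAL = "XX";

// Lowercase a token and strip punctuation so "Bank," and "Aviv:" still match.
// ASCII-only for brevity; a real implementation would likely keep accented letters.
function normalizeToken(token: string): string {
  return token.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Map the supranational marker to null; pass real codes through unchanged.
function resolve(code: string): string | null {
  return code === SUPRANATIONAL ? null : code;
}

function extractCountryCode(title: string): string | null {
  // Expand the uppercase abbreviation BEFORE lowercasing, so the pronoun
  // "us" never needs to be looked up at all.
  const expanded = title.replace(/\bUS\b/g, "United States");
  const tokens = expanded
    .split(/\s+/)
    .map(normalizeToken)
    .filter((t) => t.length > 0);

  // Single left-to-right scan: at each position try the bigram first
  // (local longest match), then the unigram. The first hit wins, which
  // preserves first-mention order.
  for (let i = 0; i < tokens.length; i++) {
    if (i + 1 < tokens.length) {
      const bigramCode = ALIAS_MAP.get(`${tokens[i]} ${tokens[i + 1]}`);
      if (bigramCode !== undefined) return resolve(bigramCode);
    }
    if (UNIGRAM_STOPWORDS.has(tokens[i])) continue;
    const unigramCode = ALIAS_MAP.get(tokens[i]);
    if (unigramCode !== undefined) return resolve(unigramCode);
  }
  return null;
}
```

With this ordering, 'Iran blames US' is attributed to Iran because the unigram at position 0 is reached before the 'united states' bigram at position 2, whereas a bigram-first two-pass scan would have returned the United States. A `Map` is used for the lookup rather than a plain object so that headline words such as 'constructor' cannot hit inherited object properties.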