worldmonitor/scripts/seed-research.mjs
Elie Habib 485d416065 feat(seeds): Railway seed scripts for all unseeded Vercel RPC endpoints (#1599)
* feat(seeds): add Railway seed scripts for economic and trade endpoints

Two new seed scripts to eliminate Vercel edge external API calls:

seed-economy.mjs:
- EIA energy prices (WTI, Brent) -> economic:energy:v1:all
- EIA energy capacity (Solar, Wind, Coal) -> economic:capacity:v1:COL,SUN,WND:20
- FRED series (10 series) -> economic:fred:v1:<id>:120
- Macro signals (Yahoo, Alternative.me, Mempool) -> economic:macro-signals:v1

seed-supply-chain-trade.mjs:
- Shipping rates (FRED) -> supply_chain:shipping:v2
- Trade barriers (WTO tariff gap) -> trade:barriers:v1:tariff-gap:50
- Trade restrictions (WTO MFN overview) -> trade:restrictions:v1:tariff-overview:50
- Trade flows (WTO, 15 major reporters) -> trade:flows:v1:<reporter>:000:10
- Tariff trends (WTO, 15 major reporters) -> trade:tariffs:v1:<reporter>:all:10

Cache keys match handler patterns exactly so cachedFetchJson finds
pre-seeded data and avoids live external API calls from Vercel edge.
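These cache keys are plain string templates, so the seed side can mirror them with tiny builder functions; a sketch with illustrative names (the real handlers compute their keys internally, and the example arguments are hypothetical):

```javascript
// Illustrative key builders mirroring the cache-key patterns listed above.
// Function names and example arguments are hypothetical, not from the repo.
const fredKey = (seriesId, limit) => `economic:fred:v1:${seriesId}:${limit}`;
const tradeFlowsKey = (reporter) => `trade:flows:v1:${reporter}:000:10`;
const shippingKey = () => 'supply_chain:shipping:v2';
```

Centralizing key construction like this (shared between handler and seed script) is one way to guarantee the byte-for-byte match that cachedFetchJson lookups require.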

* feat(seeds): add seed-aviation.mjs for airport ops and aviation news

Seeds 2 aviation endpoints with predictable default params:
- getAirportOpsSummary (AviationStack + NOTAM) -> aviation:ops-summary:v1:CDG,ESB,FRA,IST,LHR,SAW
- listAviationNews (9 RSS feeds, 24h window) -> aviation:news::24:v1

NOT seeded (inherently on-demand, user-specific inputs):
- getFlightStatus: specific flight number lookup
- trackAircraft: bounding-box or icao24 queries
- listAirportFlights: arbitrary airport+direction+limit combos
- getCarrierOps: depends on listAirportFlights with variable params

* feat(seeds): add seed-conflict-intel.mjs for ACLED, HAPI, and PizzINT

Seeds 3 conflict/intelligence endpoints with predictable default params:
- listAcledEvents (all countries, last 30 days) -> conflict:acled:v1:all:0:0
- getHumanitarianSummary (20 top conflict countries) -> conflict:humanitarian:v1:<CC>
- getPizzintStatus (base + GDELT variants) -> intel:pizzint:v1:base, intel:pizzint:v1:gdelt

NOT seeded (inherently on-demand, LLM or user-specific inputs):
- classifyEvent: per-headline LLM classification
- deductSituation: per-query LLM deduction
- getCountryIntelBrief: per-country LLM brief with context hash
- getCountryFacts: per-country REST Countries + Wikidata + Wikipedia
- searchGdeltDocuments: per-query GDELT search

Requires: ACLED_EMAIL, ACLED_KEY, UPSTASH_REDIS_REST_URL/TOKEN

* feat(seeds): add seed-research.mjs for arXiv, HN, tech events, trending repos

Seeds 4 research endpoints:
- listArxivPapers (cs.AI, cs.CL, cs.CR) -> research:arxiv:v1:<cat>::50
- listHackernewsItems (top, best feeds) -> research:hackernews:v1:<feed>:30
- listTechEvents (Techmeme ICS + dev.events RSS) -> research:tech-events:v1
- listTrendingRepos (python, javascript, typescript) -> research:trending:v1:<lang>:daily:50

The tech events key is also seeded by the relay; this script provides backup
hydration and ensures the key is warm even if the relay hasn't run yet.

Requires: UPSTASH_REDIS_REST_URL/TOKEN

* feat(seeds): add seed-military-maritime-news.mjs for USNI and nav warnings

Seeds 2 endpoints with predictable default params:
- USNI Fleet Report (WordPress JSON API) -> usni-fleet:sebuf:v1 + stale backup
- Navigational Warnings (NGA broadcast, all areas) -> maritime:navwarnings:v1:all

NOT seeded (inherently on-demand):
- getAircraftDetails/batch: per-icao24 Wingbits lookup
- listMilitaryFlights: bounding-box query (quantized 1-degree grid)
- getVesselSnapshot: in-memory cache, reads from relay /ais-snapshot
- listFeedDigest: per-feed-URL RSS caching (hundreds of feeds, relay proxied)
- summarizeArticle: per-article LLM summarization

Requires: UPSTASH_REDIS_REST_URL/TOKEN

* feat(seeds): add seed-infra.mjs warm-ping for service statuses and cable health

Uses warm-ping pattern (calls Vercel RPC from Railway) because:
- list-service-statuses: 30 status page parsers with 8 custom formats
- get-cable-health: NGA text analysis with cable name matching + proximity
Replicating this logic in a standalone script is fragile and duplicative.

NOT seeded (on-demand):
- search-imagery: per-bbox/datetime STAC query
- get-giving-summary: hardcoded baselines, no external fetches
- get-webcam-image: per-webcamId Windy API lookup

* fix(seeds): move secondary key writes before process.exit, fix data shapes

Critical bugs found in code review:

1. runSeed() calls process.exit(0) after primary key write, so .then()
   callbacks were dead code. All secondary keys (FRED, macro signals,
   trade data, HAPI summaries, pizzint, HN, trending, etc.) were NEVER
   written. Fix: move writeExtraKey calls inside fetchAll() before return.

2. FRED cache key used :120 suffix but handler default is :0 (req.limit||0).
   Fixed to :0 so seed matches handler cache key for default requests.

3. USNI and nav warnings seed parsers produced wrong data shapes vs handler
   (different field names, missing fields). Converted to warm-ping pattern
   (like seed-infra.mjs) to avoid shape divergence.
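The warm-ping idea can be sketched in a few lines: rather than re-deriving the handler's data shape in the seed script, call the deployed RPC endpoint so the handler itself computes the payload and performs its own cache write. A minimal sketch under assumed names (the helper and its error handling are illustrative, not the repo's actual code):

```javascript
// Warm-ping sketch: hit the deployed endpoint so the handler populates its
// own cache entry; the seed script never touches the data shape.
async function warmPing(baseUrl, path, timeoutMs = 30_000) {
  const resp = await fetch(`${baseUrl}${path}`, {
    signal: AbortSignal.timeout(timeoutMs),
  });
  if (!resp.ok) throw new Error(`warm-ping ${path}: HTTP ${resp.status}`);
  // Body is returned only for logging; the cache write happens server-side.
  return resp.json();
}
```

The tradeoff is a live invocation per ping instead of a pure Redis write, which is acceptable here because shape divergence is the costlier failure mode.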

* fix(seeds): reduce GDELT 429 rate limiting in seed-gdelt-intel

Problems observed in logs: every topic fetch hits 429, runs take 3-5 min, and
the 4th run failed fatally after 12 min of cascading retries.

Fixes:
- Increase inter-topic delay: 12s -> 20s (GDELT needs longer cooldown)
- Increase initial backoff: 10s -> 20s, with 15s increments per retry
- Graceful degradation: exhausted retries return empty topic instead of
  throwing (prevents withRetry from restarting ALL topics from scratch)
- Align TTL with health.js: 3600s -> 7200s (matches maxStaleMin:120)
- Validation allows partial success (3/6 topics minimum)

Cron interval should also be increased from 30min to 2h on Railway
to match the new 2h TTL.
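The retry policy described above can be sketched as a small helper (names and the injectable `sleep` parameter are illustrative, not the repo's actual code): start at a 20s backoff, add 15s per retry, and on exhaustion return an empty topic instead of throwing so one rate-limited topic cannot force every topic to restart.

```javascript
// Per-topic retry with linear backoff and graceful degradation (sketch).
// sleep is injectable so tests can run without real delays.
async function fetchTopicWithBackoff(fetchTopic, {
  retries = 3,
  baseDelayMs = 20_000,
  stepMs = 15_000,
  sleep = (ms) => new Promise((r) => setTimeout(r, ms)),
} = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fetchTopic();
    } catch (err) {
      if (attempt === retries) {
        console.warn(`topic exhausted retries: ${err.message}`);
        return []; // graceful degradation: empty topic, not a thrown error
      }
      await sleep(baseDelayMs + stepMs * attempt); // 20s, 35s, 50s, ...
    }
  }
}
```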

* fix(seeds): 4 bugs from review - ACLED auth, NOTAM key, infra precedence, curated events

P1: ACLED auth used wrong endpoint (api/acled/token) and env vars (ACLED_KEY).
Fixed to match server/acled-auth.ts: ACLED_EMAIL+ACLED_PASSWORD via /oauth/token,
with ACLED_ACCESS_TOKEN static fallback.

P1: Aviation NOTAM key was aviation:notam-closures:v1, handler reads
aviation:notam:closures:v2. Fixed key to match _shared.ts.

P2: Infra warm-ping had operator precedence bug in nullish coalescing:
(a ?? b) ? c : d instead of a ?? (b ? c : d). Added parens.
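The pitfall is that `??` binds tighter than the conditional operator, so the unparenthesized form hands the whole coalesced value to the ternary. A self-contained illustration with made-up values (not the actual seed-infra expression):

```javascript
// Intended: use override if present, otherwise pick based on hasData.
const pick = (override, hasData) => override ?? (hasData ? 'fresh' : 'stale');

// Buggy precedence: the ternary tests (override ?? hasData) as one boolean,
// so a truthy override is discarded and replaced by 'fresh'.
const buggy = (override, hasData) => (override ?? hasData) ? 'fresh' : 'stale';
```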

P2: Research seed missed curated conferences that the handler appends
(CURATED_EVENTS in list-tech-events.ts). Added same curated events so
seeded data matches what the handler would produce.

* fix(seeds): add seed-meta freshness metadata for all secondary keys

Added writeExtraKeyWithMeta() to _seed-utils.mjs that writes both the
data key and a seed-meta:<key> freshness metadata entry. All secondary
key writes in seed scripts now use this helper so health.js can track
freshness for: energy capacity, FRED series, macro signals, trade
barriers/restrictions/flows/tariffs, aviation news, HAPI summaries,
PizzINT, arXiv categories, HN feeds, tech events, trending repos.

Previously only the primary key per script got seed-meta (via runSeed),
leaving secondary keys operationally invisible to health monitoring.

* fix(seeds): align seed-meta keys with health.js conventions

P1: writeExtraKeyWithMeta wrote seed-meta:<full-cache-key> (e.g.,
seed-meta:economic:macro-signals:v1), but health.js expects normalized
names without version suffixes (seed-meta:economic:macro-signals).
Fixed by stripping trailing :v\d+ from key. Added metaKeyOverride
param for cases needing explicit control.
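The normalization described above is a one-regex operation; a sketch of what the fixed helper plausibly does (the function name mirrors the commit, but this body is illustrative):

```javascript
// Strip a trailing :v<digits> version suffix so seed-meta names match the
// normalized form health.js expects; metaKeyOverride is the escape hatch for
// keys whose version segment is not trailing (e.g. arXiv's :v1:<cat>::50).
function seedMetaKey(cacheKey, metaKeyOverride) {
  const normalized = metaKeyOverride ?? cacheKey.replace(/:v\d+$/, '');
  return `seed-meta:${normalized}`;
}
```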

P1: shipping seed used runSeed('supply-chain', 'shipping-trade', ...)
producing seed-meta:supply-chain:shipping-trade, but health.js expects
seed-meta:supply_chain:shipping. Fixed domain/resource to match.

* fix(seeds): only write seed-meta after successful data key write

writeExtraKey() now returns false on failure. writeExtraKeyWithMeta()
skips seed-meta write when the data write fails, preventing false-positive
health reports for keys like macro-signals and tech-events.
2026-03-15 00:37:31 +04:00


#!/usr/bin/env node
/**
 * Seed research data to Redis for 4 research endpoints:
 * - listArxivPapers (cs.AI default category)
 * - listHackernewsItems (top feed)
 * - listTechEvents (Techmeme ICS + dev.events RSS) — relay also seeds this
 * - listTrendingRepos (python, javascript, typescript daily)
 */
import { loadEnvFile, CHROME_UA, runSeed, writeExtraKeyWithMeta, sleep } from './_seed-utils.mjs';
loadEnvFile(import.meta.url);
const ARXIV_TTL = 3600;
const HN_TTL = 600;
const TECH_EVENTS_TTL = 21600;
const TRENDING_TTL = 3600;
// ─── arXiv Papers ───
async function fetchArxivPapers() {
  const categories = ['cs.AI', 'cs.CL', 'cs.CR'];
  const results = {};
  for (const cat of categories) {
    const url = `https://export.arxiv.org/api/query?search_query=cat:${cat}&start=0&max_results=50`;
    const resp = await fetch(url, {
      headers: { Accept: 'application/xml', 'User-Agent': CHROME_UA },
      signal: AbortSignal.timeout(15_000),
    });
    if (!resp.ok) { console.warn(` arXiv ${cat}: HTTP ${resp.status}`); continue; }
    const xml = await resp.text();
    // Simple XML parse for arXiv entries
    const papers = [];
    const entryBlocks = xml.split('<entry>').slice(1);
    for (const block of entryBlocks) {
      const id = (block.match(/<id>([\s\S]*?)<\/id>/)?.[1] || '').trim().split('/').pop() || '';
      const title = (block.match(/<title>([\s\S]*?)<\/title>/)?.[1] || '').trim().replace(/\s+/g, ' ');
      const summary = (block.match(/<summary>([\s\S]*?)<\/summary>/)?.[1] || '').trim().replace(/\s+/g, ' ');
      const published = block.match(/<published>([\s\S]*?)<\/published>/)?.[1]?.trim() || '';
      const publishedAt = published ? new Date(published).getTime() : 0;
      const urlMatch = block.match(/<link[^>]*rel="alternate"[^>]*href="([^"]+)"/);
      const paperUrl = urlMatch?.[1] || `https://arxiv.org/abs/${id}`;
      const authors = [];
      const authorMatches = block.matchAll(/<author>\s*<name>([\s\S]*?)<\/name>/g);
      for (const m of authorMatches) authors.push(m[1].trim());
      const cats = [];
      const catMatches = block.matchAll(/<category[^>]*term="([^"]+)"/g);
      for (const m of catMatches) cats.push(m[1]);
      if (title && id) papers.push({ id, title, summary, authors, categories: cats, publishedAt, url: paperUrl });
    }
    const cacheKey = `research:arxiv:v1:${cat}::50`;
    if (papers.length > 0) {
      results[cacheKey] = { papers, pagination: undefined };
    }
    console.log(` arXiv ${cat}: ${papers.length} papers`);
    await sleep(3000); // arXiv rate limit: 1 req/3s
  }
  return results;
}
// ─── Hacker News ───
async function fetchHackerNews() {
  const feeds = ['top', 'best'];
  const results = {};
  for (const feed of feeds) {
    const idsResp = await fetch(`https://hacker-news.firebaseio.com/v0/${feed}stories.json`, {
      headers: { 'User-Agent': CHROME_UA },
      signal: AbortSignal.timeout(10_000),
    });
    if (!idsResp.ok) { console.warn(` HN ${feed}: HTTP ${idsResp.status}`); continue; }
    const allIds = await idsResp.json();
    if (!Array.isArray(allIds)) continue;
    const ids = allIds.slice(0, 30);
    const items = [];
    for (let i = 0; i < ids.length; i += 10) {
      const batch = ids.slice(i, i + 10);
      const batchResults = await Promise.all(
        batch.map(async (id) => {
          try {
            const res = await fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`, {
              headers: { 'User-Agent': CHROME_UA },
              signal: AbortSignal.timeout(5_000),
            });
            if (!res.ok) return null;
            const raw = await res.json();
            if (!raw || raw.type !== 'story') return null;
            return {
              id: raw.id || 0, title: raw.title || '', url: raw.url || '',
              score: raw.score || 0, commentCount: raw.descendants || 0,
              by: raw.by || '', submittedAt: (raw.time || 0) * 1000,
            };
          } catch { return null; }
        }),
      );
      items.push(...batchResults.filter(Boolean));
    }
    const cacheKey = `research:hackernews:v1:${feed}:30`;
    if (items.length > 0) {
      results[cacheKey] = { items, pagination: undefined };
    }
    console.log(` HN ${feed}: ${items.length} stories`);
  }
  return results;
}
// ─── Tech Events (Techmeme ICS + dev.events RSS) ───
async function fetchTechEvents() {
  const ICS_URL = 'https://www.techmeme.com/newsy_events.ics';
  const RSS_URL = 'https://dev.events/rss.xml';
  const events = [];
  // Techmeme ICS
  try {
    const resp = await fetch(ICS_URL, {
      headers: { 'User-Agent': CHROME_UA },
      signal: AbortSignal.timeout(8_000),
    });
    if (resp.ok) {
      const ics = await resp.text();
      const blocks = ics.split('BEGIN:VEVENT').slice(1);
      for (const block of blocks) {
        const summary = block.match(/SUMMARY:(.+)/)?.[1]?.trim() || '';
        const location = block.match(/LOCATION:(.+)/)?.[1]?.trim() || '';
        const dtstart = block.match(/DTSTART;VALUE=DATE:(\d+)/)?.[1] || '';
        const dtend = block.match(/DTEND;VALUE=DATE:(\d+)/)?.[1] || dtstart;
        const url = block.match(/URL:(.+)/)?.[1]?.trim() || '';
        const uid = block.match(/UID:(.+)/)?.[1]?.trim() || '';
        if (!summary || !dtstart) continue;
        let type = 'other';
        if (summary.startsWith('Earnings:')) type = 'earnings';
        else if (summary.startsWith('IPO')) type = 'ipo';
        else if (location) type = 'conference';
        events.push({
          id: uid, title: summary, type, location,
          startDate: `${dtstart.slice(0, 4)}-${dtstart.slice(4, 6)}-${dtstart.slice(6, 8)}`,
          endDate: `${dtend.slice(0, 4)}-${dtend.slice(4, 6)}-${dtend.slice(6, 8)}`,
          url, source: 'techmeme', description: '',
        });
      }
      console.log(` Techmeme ICS: ${events.length} events`);
    }
  } catch (e) { console.warn(` Techmeme ICS: ${e.message}`); }
  // dev.events RSS
  const icsCount = events.length; // events collected so far, to report RSS additions below
  try {
    const resp = await fetch(RSS_URL, {
      headers: { 'User-Agent': CHROME_UA, Accept: 'application/rss+xml, text/xml, */*' },
      signal: AbortSignal.timeout(8_000),
    });
    if (resp.ok) {
      const rss = await resp.text();
      const items = rss.matchAll(/<item>([\s\S]*?)<\/item>/g);
      const today = new Date().toISOString().split('T')[0];
      for (const m of items) {
        const block = m[1];
        // Try CDATA-wrapped fields first, then plain-text variants
        const title = (block.match(/<title><!\[CDATA\[(.*?)\]\]><\/title>/)?.[1] ||
          block.match(/<title>(.*?)<\/title>/)?.[1] || '').trim();
        const link = block.match(/<link>(.*?)<\/link>/)?.[1]?.trim() || '';
        const desc = (block.match(/<description><!\[CDATA\[([\s\S]*?)\]\]><\/description>/)?.[1] ||
          block.match(/<description>([\s\S]*?)<\/description>/)?.[1] || '').trim();
        const guid = block.match(/<guid[^>]*>(.*?)<\/guid>/)?.[1]?.trim() || '';
        if (!title) continue;
        const dateMatch = desc.match(/on\s+(\w+\s+\d{1,2},?\s+\d{4})/i);
        let startDate = null;
        if (dateMatch) {
          const p = new Date(dateMatch[1]);
          if (!isNaN(p.getTime())) startDate = p.toISOString().split('T')[0];
        }
        if (!startDate || startDate < today) continue;
        events.push({
          id: guid || `dev-${title.slice(0, 20)}`, title, type: 'conference',
          location: '', startDate, endDate: startDate, url: link,
          source: 'dev.events', description: '',
        });
      }
      console.log(` dev.events RSS: ${events.length - icsCount} events`);
    }
  } catch (e) { console.warn(` dev.events RSS: ${e.message}`); }
  // Curated major conferences (must match list-tech-events.ts CURATED_EVENTS)
  const now = new Date();
  now.setHours(0, 0, 0, 0);
  const CURATED = [
    { id: 'gitex-global-2026', title: 'GITEX Global 2026', type: 'conference', location: 'Dubai World Trade Centre, Dubai',
      coords: { lat: 25.2285, lng: 55.2867, country: 'UAE', original: 'Dubai World Trade Centre, Dubai', virtual: false },
      startDate: '2026-12-07', endDate: '2026-12-11', url: 'https://www.gitex.com', source: 'curated', description: "World's largest tech & startup show" },
    { id: 'token2049-dubai-2026', title: 'TOKEN2049 Dubai 2026', type: 'conference', location: 'Dubai, UAE',
      coords: { lat: 25.2048, lng: 55.2708, country: 'UAE', original: 'Dubai, UAE', virtual: false },
      startDate: '2026-04-29', endDate: '2026-04-30', url: 'https://www.token2049.com', source: 'curated', description: 'Premier crypto event in Dubai' },
    { id: 'collision-2026', title: 'Collision 2026', type: 'conference', location: 'Toronto, Canada',
      coords: { lat: 43.6532, lng: -79.3832, country: 'Canada', original: 'Toronto, Canada', virtual: false },
      startDate: '2026-06-22', endDate: '2026-06-25', url: 'https://collisionconf.com', source: 'curated', description: "North America's fastest growing tech conference" },
    { id: 'web-summit-2026', title: 'Web Summit 2026', type: 'conference', location: 'Lisbon, Portugal',
      coords: { lat: 38.7223, lng: -9.1393, country: 'Portugal', original: 'Lisbon, Portugal', virtual: false },
      startDate: '2026-11-02', endDate: '2026-11-05', url: 'https://websummit.com', source: 'curated', description: "The world's premier tech conference" },
  ];
  for (const c of CURATED) { if (new Date(c.startDate) >= now) events.push(c); }
  // Deduplicate by normalized title + start year, then sort chronologically
  const seen = new Set();
  const deduped = events.filter(e => {
    const key = e.title.toLowerCase().replace(/[^a-z0-9]/g, '').slice(0, 30) + e.startDate.slice(0, 4);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  }).sort((a, b) => a.startDate.localeCompare(b.startDate));
  console.log(` Tech events total: ${deduped.length} (deduplicated)`);
  return {
    success: true, count: deduped.length,
    conferenceCount: deduped.filter(e => e.type === 'conference').length,
    mappableCount: 0, lastUpdated: new Date().toISOString(),
    events: deduped, error: '',
  };
}
// ─── Trending Repos ───
async function fetchTrendingRepos() {
  const languages = ['python', 'javascript', 'typescript'];
  const results = {};
  for (const lang of languages) {
    try {
      let data;
      const primaryUrl = `https://api.gitterapp.com/repositories?language=${lang}&since=daily`;
      const resp = await fetch(primaryUrl, {
        headers: { Accept: 'application/json', 'User-Agent': CHROME_UA },
        signal: AbortSignal.timeout(10_000),
      });
      if (resp.ok) {
        data = await resp.json();
      } else {
        const fallback = await fetch(`https://gh-trending-api.herokuapp.com/repositories/${lang}?since=daily`, {
          headers: { Accept: 'application/json', 'User-Agent': CHROME_UA },
          signal: AbortSignal.timeout(10_000),
        });
        if (fallback.ok) data = await fallback.json();
      }
      if (!Array.isArray(data)) { console.warn(` Trending ${lang}: not an array`); continue; }
      const repos = data.slice(0, 50).map(r => ({
        fullName: `${r.author}/${r.name}`, description: r.description || '',
        language: r.language || '', stars: r.stars || 0,
        starsToday: r.currentPeriodStars || 0, forks: r.forks || 0,
        url: r.url || `https://github.com/${r.author}/${r.name}`,
      }));
      const cacheKey = `research:trending:v1:${lang}:daily:50`;
      if (repos.length > 0) results[cacheKey] = { repos, pagination: undefined };
      console.log(` Trending ${lang}: ${repos.length} repos`);
      await sleep(500);
    } catch (e) {
      console.warn(` Trending ${lang}: ${e.message}`);
    }
  }
  return results;
}
// ─── Main ───
async function fetchAll() {
  const [arxiv, hn, techEvents, trending] = await Promise.allSettled([
    fetchArxivPapers(),
    fetchHackerNews(),
    fetchTechEvents(),
    fetchTrendingRepos(),
  ]);
  const allData = {
    arxiv: arxiv.status === 'fulfilled' ? arxiv.value : null,
    hn: hn.status === 'fulfilled' ? hn.value : null,
    techEvents: techEvents.status === 'fulfilled' ? techEvents.value : null,
    trending: trending.status === 'fulfilled' ? trending.value : null,
  };
  if (!allData.arxiv && !allData.hn && !allData.trending) throw new Error('All research fetches failed');
  // Write secondary keys BEFORE returning (runSeed calls process.exit after the primary write)
  if (allData.arxiv) {
    for (const [key, data] of Object.entries(allData.arxiv)) {
      if (key === 'research:arxiv:v1:cs.AI::50') continue; // primary key: written by runSeed
      await writeExtraKeyWithMeta(key, data, ARXIV_TTL, data.papers?.length ?? 0);
    }
  }
  if (allData.hn) {
    for (const [key, data] of Object.entries(allData.hn)) {
      await writeExtraKeyWithMeta(key, data, HN_TTL, data.items?.length ?? 0);
    }
  }
  if (allData.techEvents?.events?.length > 0) {
    await writeExtraKeyWithMeta('research:tech-events:v1', allData.techEvents, TECH_EVENTS_TTL, allData.techEvents.events.length);
  }
  if (allData.trending) {
    for (const [key, data] of Object.entries(allData.trending)) {
      await writeExtraKeyWithMeta(key, data, TRENDING_TTL, data.repos?.length ?? 0);
    }
  }
  const primary = allData.arxiv?.['research:arxiv:v1:cs.AI::50'];
  return primary || { papers: [], pagination: undefined };
}
function validate(data) {
  return data?.papers?.length > 0;
}
runSeed('research', 'arxiv-hn-trending', 'research:arxiv:v1:cs.AI::50', fetchAll, {
  validateFn: validate,
  ttlSeconds: ARXIV_TTL,
  sourceVersion: 'arxiv-hn-gitter',
}).catch((err) => {
  console.error('FATAL:', err.message || err);
  process.exit(1);
});