Files
worldmonitor/scripts/seed-research.mjs
Elie Habib 044598346e feat(seed-contract): PR 2a — runSeed envelope dual-write + 91 seeders migrated (#3097)
* feat(seed-contract): PR 2a — runSeed envelope dual-write + 91 seeders migrated

Opt-in contract path in runSeed: when opts.declareRecords is provided, write
{_seed, data} envelope to the canonical key alongside legacy seed-meta:*
(dual-write). State machine: OK / OK_ZERO / RETRY with zeroIsValid opt.
declareRecords throws or returns non-integer → hard fail (contract violation).
extraKeys[*] entries support a per-key declareRecords; each extra key writes
its own envelope. Legacy seeders (no declareRecords) are entirely unchanged.
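
A minimal sketch of the envelope shape and state machine described above (the helper names resolveState/buildEnvelope and the exact _seed fields are illustrative, not the actual runSeed internals):

```javascript
// Illustrative model of the {_seed, data} envelope and the OK / OK_ZERO / RETRY
// state machine with the zeroIsValid opt. Non-integer counts are a hard fail.
function resolveState(recordCount, { zeroIsValid = false } = {}) {
  if (!Number.isInteger(recordCount)) {
    throw new Error('contract violation: declareRecords must return an integer');
  }
  if (recordCount > 0) return 'OK';
  return zeroIsValid ? 'OK_ZERO' : 'RETRY';
}

function buildEnvelope(data, recordCount, opts = {}) {
  return {
    _seed: {
      state: resolveState(recordCount, opts),
      recordCount,
      schemaVersion: opts.schemaVersion ?? 1,
      fetchedAt: new Date().toISOString(),
    },
    data,
  };
}
```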

Migrated all 91 scripts/seed-*.mjs to contract mode. Each exports
declareRecords returning the canonical record count, and passes
schemaVersion: 1 + maxStaleMin (matched to api/health.js SEED_META, or 2.5x
interval where no registry entry exists). Contract conformance reports 84/86
seeders with full descriptor (2 pre-existing warnings).

Legacy seed-meta keys still written so unmigrated readers keep working;
follow-up slices flip health.js + readers to envelope-first.

Tests: 61/61 PR 1 tests still pass.

Next slices for PR 2:
- api/health.js registry collapse + 15 seed-bundle-*.mjs canonicalKey wiring
- reader migration (mcp, resilience, aviation, displacement, regional-snapshot)
- direct writers — ais-relay.cjs, consumer-prices-core publish.ts
- public-boundary stripSeedEnvelope + test migration

Plan: docs/plans/2026-04-14-002-fix-runseed-zero-record-lockout-plan.md

* fix(seed-contract): unwrap envelopes in internal cross-seed readers

After PR 2a enveloped 91 canonical keys as {_seed, data}, every script-side
reader that returned the raw parsed JSON started silently handing callers the
envelope instead of the bare payload. WoW baselines (bigmac, grocery-basket,
fear-greed) saw undefined .countries / .composite; seed-climate-anomalies saw
undefined .normals from climate:zone-normals:v1; seed-thermal-escalation saw
undefined .fireDetections from wildfire:fires:v1; seed-forecasts' ~40-key
pipeline batch returned envelopes for every input.

Fix: route every script-side reader through unwrapEnvelope(...).data. Legacy
bare-shape values pass through unchanged (unwrapEnvelope returns
{_seed: null, data: raw} for any non-envelope shape).
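
The pass-through contract can be sketched as a minimal model of unwrapEnvelope (not the actual source module):

```javascript
// Minimal model of unwrapEnvelope: real envelopes are unwrapped; any other
// shape (legacy bare objects, arrays, null) passes through unchanged as
// { _seed: null, data: raw }, so readers can always take .data.
function unwrapEnvelope(raw) {
  const isEnvelope =
    raw !== null &&
    typeof raw === 'object' &&
    !Array.isArray(raw) &&
    '_seed' in raw &&
    'data' in raw;
  return isEnvelope ? { _seed: raw._seed, data: raw.data } : { _seed: null, data: raw };
}
```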

Changed:
- scripts/_seed-utils.mjs: import unwrapEnvelope; redisGet, readSeedSnapshot,
  verifySeedKey all unwrap. Exported new readCanonicalValue() helper for
  cross-seed consumers.
- 18 seed-*.mjs scripts with local redisGet-style helpers or inline fetch
  patched to unwrap via the envelope source module (subagent sweep).
- scripts/seed-forecasts.mjs pipeline batch: parse() unwraps each result.
- scripts/seed-energy-spine.mjs redisMget: unwraps each result.

Tests:
- tests/seed-utils-envelope-reads.test.mjs: 7 new cases covering envelope
  + legacy + null paths for readSeedSnapshot and verifySeedKey.
- Full seed suite: 67/67 pass (was 61, +6 new).

Addresses both of the user's P1 findings on PR #3097.

* feat(seed-contract): envelope-aware reads in server + api helpers

Every RPC and public-boundary reader now automatically strips _seed from
contract-mode canonical keys. Legacy bare-shape values pass through unchanged
(unwrapEnvelope no-ops on non-envelope shapes).

Changed helpers (one-place fix — unblocks ~60 call sites):
- server/_shared/redis.ts: getRawJson, getCachedJson, getCachedJsonBatch
  unwrap by default. cachedFetchJson inherits via getCachedJson.
- api/_upstash-json.js: readJsonFromUpstash unwraps (covers api/mcp.ts
  tool responses + all its canonical-key reads).
- api/bootstrap.js: getCachedJsonBatch unwraps (public-boundary —
  clients never see envelope metadata).

Left intentionally unchanged:
- api/health.js / api/seed-health.js: read only seed-meta:* keys which
  remain bare-shape during dual-write. unwrapEnvelope already imported at
  the meta-read boundary (PR 1) as a defensive no-op.

Tests: 67/67 seed tests pass. typecheck + typecheck:api clean.

This is the blast-radius fix the PR #3097 review called out — external
readers that would otherwise see {_seed, data} after the writer side
migrated.

* fix(test): strip export keyword in vm.runInContext'd seed source

cross-source-signals-regulatory.test.mjs loads scripts/seed-cross-source-signals.mjs
via vm.runInContext, which cannot parse ESM `export` syntax. PR 2a added
`export function declareRecords` to every seeder, which broke this test's
static-analysis approach.

Fix: strip the `export` keyword from the declareRecords line in the
preprocessed source string so the function body still evaluates as a plain
declaration.
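
A sketch of that preprocessing step (the regex and helper name are illustrative; the test's actual transform may differ):

```javascript
// Strip the ESM `export` keyword from function declarations so the source
// string can be evaluated as a plain script (e.g. via vm.runInContext,
// which cannot parse ESM syntax). The function body is left intact.
function stripExportKeyword(source) {
  return source.replace(/^export\s+(?=function\s)/gm, '');
}
```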

Full test:data suite: 5307/5307 pass. typecheck + typecheck:api clean.

* feat(seed-contract): consumer-prices publish.ts writes envelopes

Wrap the 5 canonical keys written by consumer-prices-core/src/jobs/publish.ts
(overview, movers:7d/30d, freshness, categories:7d/30d/90d, retailer-spread,
basket-series) in {_seed, data} envelopes. Legacy seed-meta:<key> writes
preserved for dual-write.

Inlined a buildEnvelope helper (10 lines) rather than taking a cross-package
dependency — consumer-prices-core is a standalone npm package. Documented the
four-file parity contract (mjs source, ts mirror, js edge mirror, this copy).

Contract fields: sourceVersion='consumer-prices-core-publish-v1', schemaVersion=1,
state='OK' (recordCount>0) or 'OK_ZERO' (legitimate zero).

Typecheck: no new errors in publish.ts.

* fix(seed-contract): 3 more server-side readers unwrap envelopes

Found during final audit:

- server/worldmonitor/resilience/v1/_shared.ts: resilience score reader
  parsed cached GetResilienceScoreResponse raw. Contract-mode seed-resilience-scores
  now envelopes those keys.
- server/worldmonitor/resilience/v1/get-resilience-ranking.ts: p05/p95
  interval lookup parsed raw from seed-resilience-scores' extra-key path.
- server/worldmonitor/infrastructure/v1/_shared.ts: mgetJson() used for
  count-source keys (wildfire:fires:v1, news:insights:v1) which are both
  contract-mode now.

All three now unwrap via server/_shared/seed-envelope. Legacy shapes pass
through unchanged.

Typecheck clean.

* feat(seed-contract): ais-relay.cjs direct writes produce envelopes

32 canonical-key write sites in scripts/ais-relay.cjs now produce {_seed, data}
envelopes. Inlined buildEnvelope() (CJS module can't require ESM source) +
envelopeWrite(key, data, ttlSeconds, meta) wrapper. Enveloped keys span market
bootstrap, aviation, cyber-threats, theater-posture, weather-alerts, economic
spending/fred/worldbank, tech-events, corridor-risk, usni-fleet, shipping-stress,
social:reddit, wsb-tickers, pizzint, product-catalog, chokepoint transits,
ucdp-events, satellites, oref.

Left bare (not seeded data keys): seed-meta:* (dual-write legacy),
classifyCacheKey LLM cache, notam:prev-closed-state internal state,
wm:notif:scan-dedup flags.

Updated tests/ucdp-seed-resilience.test.mjs regex to accept both upstashSet
(pre-contract) and envelopeWrite (post-contract) call patterns.

* feat(seed-contract): 15 bundle files add canonicalKey for envelope gate

54 bundle sections across 12 files now declare canonicalKey alongside the
existing seedMetaKey. _bundle-runner.mjs (from PR 1) prefers canonicalKey
when both are present — gates section runs on envelope._seed.fetchedAt
read directly from the data key, eliminating the meta-outlives-data class
of bugs.
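
The envelope-first freshness gate can be sketched like this (isSectionFresh is a hypothetical name; the actual _bundle-runner.mjs logic may differ):

```javascript
// Gate a bundle section on the envelope's own fetchedAt, read directly from
// the data key, instead of a separately written seed-meta key — so stale
// meta can never outlive (or misrepresent) the data it describes.
function isSectionFresh(raw, maxStaleMin, now = Date.now()) {
  const fetchedAt = raw?._seed?.fetchedAt;
  if (!fetchedAt) return false; // no envelope → treat as stale, run the section
  const ageMin = (now - new Date(fetchedAt).getTime()) / 60_000;
  return ageMin <= maxStaleMin;
}
```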

Files touched:
- climate (5), derived-signals (2), ecb-eu (3), energy-sources (6),
  health (2), imf-extended (4), macro (10), market-backup (9),
  portwatch (4), relay-backup (2), resilience-recovery (5), static-ref (2)

Skipped (14 sections, 3 whole bundles): multi-key writers, dynamic
templated keys (displacement year-scoped), or non-runSeed orchestrators
(regional brief cron, resilience-scores' 222-country publish, validation/
benchmark scripts). These continue to use seedMetaKey or their own gate.

seedMetaKey preserved everywhere — dual-write. _bundle-runner.mjs falls
back to legacy when canonicalKey is absent.

All 15 bundles pass node --check. test:data: 5307/5307. typecheck:all: clean.

* fix(seed-contract): 4 PR #3097 review P1s — transform/declareRecords mismatches + envelope leaks

Addresses both P1 findings and the extra-key seed-meta leak surfaced in review:

1. runSeed helper-level invariant: seed-meta:* keys NEVER envelope.
   scripts/_seed-utils.mjs exports shouldEnvelopeKey(key) — returns false for
   any key starting with 'seed-meta:'. Both atomicPublish (canonical) and
   writeExtraKey (extras) gate the envelope wrap through this helper. Fixes
   seed-iea-oil-stocks' ANALYSIS_META_EXTRA_KEY silently getting enveloped,
   which broke health.js parsing the value as bare {fetchedAt, recordCount}.
   Also defends against any future manual writeExtraKey(..., envelopeMeta)
   call that happens to target a seed-meta:* key.
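
The invariant reduces to a one-line predicate, consistent with the description above (the real export lives in scripts/_seed-utils.mjs):

```javascript
// seed-meta:* keys must stay bare-shape during dual-write; every other key
// (canonical data keys, extra keys) gets the {_seed, data} envelope wrap.
function shouldEnvelopeKey(key) {
  return !String(key).startsWith('seed-meta:');
}
```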

2. seed-token-panels canonical + extras fixed.
   publishTransform returns data.defi (the defi panel itself, shape {tokens}).
   Old declareRecords counted data.defi.tokens + data.ai.tokens + data.other.tokens
   on the transformed payload → 0 → RETRY path → canonical market:defi-tokens:v1
   never wrote, and because runSeed returned before the extraKeys loop,
   market:ai-tokens:v1 + market:other-tokens:v1 stayed stale too.
   New: declareRecords counts data.tokens on the transformed shape. AI_KEY +
   OTHER_KEY extras reuse the same function (transforms return structurally
   identical panels). Added isMain guard so test imports don't fire runSeed.

3. api/product-catalog.js cached reader unwraps envelope.
   ais-relay.cjs now envelopes product-catalog:v2 via envelopeWrite(). The
   edge reader did raw JSON.parse(result) and returned {_seed, data} to
   clients, breaking the cached path. Fix: import unwrapEnvelope from
   ./_seed-envelope.js, apply after JSON.parse. One site — :238-241 is
   downstream of getFromCache(), so the single reader fix covers both.

4. Regression lock tests/seed-contract-transform-regressions.test.mjs (11 cases):
   - shouldEnvelopeKey invariant: seed-meta:* false, canonical true
   - Token-panels declareRecords works on transformed shape (canonical + both extras)
   - Explicit repro of pre-fix buggy signature returning 0 — guards against revert
   - resolveRecordCount accepts 0, rejects non-integer
   - Product-catalog envelope unwrap returns bare shape; legacy passes through

Verification:
- npm run test:data → 5318/5318 pass (was 5307; +11 new regression tests)
- npm run typecheck:all → clean
- node --check on every modified script

iea-oil-stocks canonical declareRecords was NOT broken (user confirmed during
review — buildIndex preserves .members); only its ANALYSIS_META_EXTRA_KEY
was affected, now covered generically by commit 1's helper invariant.

* fix(seed-contract): seed-token-panels validateFn also runs on post-transform shape

Review finding: fixing declareRecords wasn't sufficient — atomicPublish() runs
validateFn(publishData) on the transformed payload too. seed-token-panels'
validate() checked data.defi/.ai/.other on the transformed {tokens} shape,
returned false, and runSeed took the early skipped-write branch (before even
reaching the declareRecords RETRY logic). Net effect: same as before the
declareRecords fix — canonical + both extras stayed stale.

Fix: validate() now checks the canonical defi panel directly (Array.isArray
(data?.tokens) && has at least one t.price > 0). AI/OTHER panels are validated
implicitly by their own extraKey declareRecords on write.
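
The fixed check reduces to the predicate described above (validateTokenPanel is an illustrative name for seed-token-panels' validate):

```javascript
// Validate the post-transform panel shape {tokens: [...]}: accept only when
// at least one token carries a positive price. Rejects empty, missing, and
// all-zero-price panels — and the pre-fix {defi, ai, other} check would have
// returned false here, triggering the early skipped-write branch.
function validateTokenPanel(data) {
  return Array.isArray(data?.tokens) && data.tokens.some((t) => t?.price > 0);
}
```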

Audited the other 9 seeders with publishTransform (bls-series, bis-extended,
bis-data, gdelt-intel, trade-flows, iea-oil-stocks, jodi-gas, sanctions-pressure,
forecasts): all their validateFns correctly target the post-transform shape.
Only token-panels regressed.

Added 4 regression tests (tests/seed-contract-transform-regressions.test.mjs):
- validate accepts transformed panel with priced tokens
- validate rejects all-zero-price tokens
- validate rejects empty/missing tokens
- Explicit pre-fix repro (buggy old signature fails on transformed shape)

Verification:
- npm run test:data → 5322/5322 pass (was 5318; +4 new)
- npm run typecheck:all → clean
- node --check clean

* feat(seed-contract): add /api/seed-contract-probe validation endpoint

Single machine-readable gate for 'is PR #3097 working in production'.
Replaces the curl/jq ritual with one authenticated edge call that returns
HTTP 200 ok:true or 503 + failing check list.

What it validates:
- 8 canonical keys have {_seed, data} envelopes with required data fields
  and minRecords floors (fsi-eu, zone-normals, 3 token panels + minRecords
  guard against token-panels RETRY regression, product-catalog, wildfire,
  earthquakes).
- 2 seed-meta:* keys remain BARE (shouldEnvelopeKey invariant; guards
  against iea-oil-stocks ANALYSIS_META_EXTRA_KEY-class regressions).
- /api/product-catalog + /api/bootstrap responses contain no '_seed' leak.

Auth: x-probe-secret header must match RELAY_SHARED_SECRET (reuses existing
Vercel↔Railway internal trust boundary).

Probe logic is exported (checkProbe, checkPublicBoundary, DEFAULT_PROBES) for
hermetic testing. tests/seed-contract-probe.test.mjs covers every branch:
envelope pass/fail on field/records/shape, bare pass/fail on shape/field,
missing/malformed JSON, Redis non-2xx, boundary seed-leak detection,
DEFAULT_PROBES sanity (seed-meta invariant present, token-panels minRecords
guard present).

Usage:
  curl -H "x-probe-secret: $RELAY_SHARED_SECRET" \
       https://api.worldmonitor.app/api/seed-contract-probe

PR 3 will extend the probe with a stricter mode that asserts seed-meta:*
keys are GONE (not just bare) once legacy dual-write is removed.

Verification:
- tests/seed-contract-probe.test.mjs → 15/15 pass
- npm run test:data → 5338/5338 (was 5322; +16 new incl. conformance)
- npm run typecheck:all → clean

* fix(seed-contract): tighten probe — minRecords on AI/OTHER + cache-path source header

Review P2 findings: the probe's stated guards were weaker than advertised.

1. market:ai-tokens:v1 + market:other-tokens:v1 probes claimed to guard the
   token-panels extra-key RETRY regression but only checked shape='envelope'
   + dataHas:['tokens']. If an extra-key declareRecords regressed to 0, both
   probes would still pass because checkProbe() only inspects _seed.recordCount
   when minRecords is set. Now both enforce minRecords: 1.

2. /api/product-catalog boundary check only asserted no '_seed' leak — which
   is also true for the static fallback path. A broken cached reader
   (getFromCache returning null or throwing) could serve fallback silently
   and still pass this probe. Now:
   - api/product-catalog.js emits X-Product-Catalog-Source: cache|dodo|fallback
     on the response (the json() helper gained an optional source param wired
     to each of the three branches).
   - checkPublicBoundary declaratively requires that header's value match
     'cache' for /api/product-catalog, so a fallback-serve fails the probe
     with reason 'source:fallback!=cache' or 'source:missing!=cache'.
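
The declarative boundary check can be sketched as follows (checkBoundary and its option shape are illustrative, not the actual checkPublicBoundary signature):

```javascript
// Declarative public-boundary probe: fail on any '_seed' leak in the body
// (leak takes precedence), then on a required source header mismatch, e.g.
// a fallback-serve on /api/product-catalog instead of the cached path.
function checkBoundary(body, headers, { requireSourceHeader } = {}) {
  if (body.includes('_seed')) return { ok: false, reason: 'seed-leak' };
  if (requireSourceHeader) {
    const { name, value } = requireSourceHeader;
    const actual = headers[name.toLowerCase()];
    if (actual !== value) {
      return { ok: false, reason: `source:${actual ?? 'missing'}!=${value}` };
    }
  }
  return { ok: true };
}
```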

Test updates (tests/seed-contract-probe.test.mjs):
- Boundary check reworked to use a BOUNDARY_CHECKS config with optional
  requireSourceHeader per endpoint.
- New cases: served-from-cache passes, served-from-fallback fails with source
  mismatch, missing header fails, seed-leak still takes precedence, bad
  status fails.
- Token-panels sanity test now asserts minRecords≥1 on all 3 panels.

Verification:
- tests/seed-contract-probe.test.mjs → 17/17 pass (was 15, +2 net)
- npm run test:data → 5340/5340
- npm run typecheck:all → clean
2026-04-15 09:16:27 +04:00

354 lines
15 KiB
JavaScript
Executable File

#!/usr/bin/env node
/**
 * Seed research data to Redis for 4 research endpoints:
 * - listArxivPapers (cs.AI default category)
 * - listHackernewsItems (top feed)
 * - listTechEvents (Techmeme ICS + dev.events RSS) — relay also seeds this
 * - listTrendingRepos (python, javascript, typescript daily)
 */
import { loadEnvFile, CHROME_UA, runSeed, writeExtraKeyWithMeta, sleep } from './_seed-utils.mjs';

loadEnvFile(import.meta.url);

const ARXIV_TTL = 3600;
const HN_TTL = 600;
const TECH_EVENTS_TTL = 28800; // 8h — outlives maxStaleMin:480 for health buffer
const TRENDING_TTL = 3600;
// ─── arXiv Papers ───
async function fetchArxivPapers() {
  const categories = ['cs.AI', 'cs.CL', 'cs.CR'];
  const results = {};
  for (const cat of categories) {
    const url = `https://export.arxiv.org/api/query?search_query=cat:${cat}&start=0&max_results=50`;
    const resp = await fetch(url, {
      headers: { Accept: 'application/xml', 'User-Agent': CHROME_UA },
      signal: AbortSignal.timeout(15_000),
    });
    if (!resp.ok) { console.warn(` arXiv ${cat}: HTTP ${resp.status}`); continue; }
    const xml = await resp.text();
    // Simple XML parse for arXiv entries
    const papers = [];
    const entryBlocks = xml.split('<entry>').slice(1);
    for (const block of entryBlocks) {
      const id = (block.match(/<id>([\s\S]*?)<\/id>/)?.[1] || '').trim().split('/').pop() || '';
      const title = (block.match(/<title>([\s\S]*?)<\/title>/)?.[1] || '').trim().replace(/\s+/g, ' ');
      const summary = (block.match(/<summary>([\s\S]*?)<\/summary>/)?.[1] || '').trim().replace(/\s+/g, ' ');
      const published = block.match(/<published>([\s\S]*?)<\/published>/)?.[1]?.trim() || '';
      const publishedAt = published ? new Date(published).getTime() : 0;
      const urlMatch = block.match(/<link[^>]*rel="alternate"[^>]*href="([^"]+)"/);
      const paperUrl = urlMatch?.[1] || `https://arxiv.org/abs/${id}`;
      const authors = [];
      const authorMatches = block.matchAll(/<author>\s*<name>([\s\S]*?)<\/name>/g);
      for (const m of authorMatches) authors.push(m[1].trim());
      const cats = [];
      const catMatches = block.matchAll(/<category[^>]*term="([^"]+)"/g);
      for (const m of catMatches) cats.push(m[1]);
      if (title && id) papers.push({ id, title, summary, authors, categories: cats, publishedAt, url: paperUrl });
    }
    const cacheKey = `research:arxiv:v1:${cat}::50`;
    if (papers.length > 0) {
      results[cacheKey] = { papers, pagination: undefined };
    }
    console.log(` arXiv ${cat}: ${papers.length} papers`);
    await sleep(3000); // arXiv rate limit: 1 req/3s
  }
  return results;
}
// ─── Hacker News ───
async function fetchHackerNews() {
  const feeds = ['top', 'best'];
  const results = {};
  for (const feed of feeds) {
    const idsResp = await fetch(`https://hacker-news.firebaseio.com/v0/${feed}stories.json`, {
      headers: { 'User-Agent': CHROME_UA },
      signal: AbortSignal.timeout(10_000),
    });
    if (!idsResp.ok) { console.warn(` HN ${feed}: HTTP ${idsResp.status}`); continue; }
    const allIds = await idsResp.json();
    if (!Array.isArray(allIds)) continue;
    const ids = allIds.slice(0, 30);
    const items = [];
    for (let i = 0; i < ids.length; i += 10) {
      const batch = ids.slice(i, i + 10);
      const batchResults = await Promise.all(
        batch.map(async (id) => {
          try {
            const res = await fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`, {
              headers: { 'User-Agent': CHROME_UA },
              signal: AbortSignal.timeout(5_000),
            });
            if (!res.ok) return null;
            const raw = await res.json();
            if (!raw || raw.type !== 'story') return null;
            return {
              id: raw.id || 0, title: raw.title || '', url: raw.url || '',
              score: raw.score || 0, commentCount: raw.descendants || 0,
              by: raw.by || '', submittedAt: (raw.time || 0) * 1000,
            };
          } catch { return null; }
        }),
      );
      items.push(...batchResults.filter(Boolean));
    }
    const cacheKey = `research:hackernews:v1:${feed}:30`;
    if (items.length > 0) {
      results[cacheKey] = { items, pagination: undefined };
    }
    console.log(` HN ${feed}: ${items.length} stories`);
  }
  return results;
}
// ─── Tech Events (Techmeme ICS + dev.events RSS) ───
async function fetchTechEvents() {
  const ICS_URL = 'https://www.techmeme.com/newsy_events.ics';
  const RSS_URL = 'https://dev.events/rss.xml';
  const events = [];
  // Techmeme ICS
  try {
    const resp = await fetch(ICS_URL, {
      headers: { 'User-Agent': CHROME_UA },
      signal: AbortSignal.timeout(8_000),
    });
    if (resp.ok) {
      const ics = await resp.text();
      const blocks = ics.split('BEGIN:VEVENT').slice(1);
      for (const block of blocks) {
        const summary = block.match(/SUMMARY:(.+)/)?.[1]?.trim() || '';
        const location = block.match(/LOCATION:(.+)/)?.[1]?.trim() || '';
        const dtstart = block.match(/DTSTART;VALUE=DATE:(\d+)/)?.[1] || '';
        const dtend = block.match(/DTEND;VALUE=DATE:(\d+)/)?.[1] || dtstart;
        const url = block.match(/URL:(.+)/)?.[1]?.trim() || '';
        const uid = block.match(/UID:(.+)/)?.[1]?.trim() || '';
        if (!summary || !dtstart) continue;
        let type = 'other';
        if (summary.startsWith('Earnings:')) type = 'earnings';
        else if (summary.startsWith('IPO')) type = 'ipo';
        else if (location) type = 'conference';
        events.push({
          id: uid, title: summary, type, location,
          startDate: `${dtstart.slice(0, 4)}-${dtstart.slice(4, 6)}-${dtstart.slice(6, 8)}`,
          endDate: `${dtend.slice(0, 4)}-${dtend.slice(4, 6)}-${dtend.slice(6, 8)}`,
          url, source: 'techmeme', description: '',
        });
      }
      console.log(` Techmeme ICS: ${events.length} events`);
    }
  } catch (e) { console.warn(` Techmeme ICS: ${e.message}`); }
  // dev.events RSS
  const rssCount = events.length;
  try {
    const resp = await fetch(RSS_URL, {
      headers: { 'User-Agent': CHROME_UA, Accept: 'application/rss+xml, text/xml, */*' },
      signal: AbortSignal.timeout(8_000),
    });
    if (resp.ok) {
      const rss = await resp.text();
      const items = rss.matchAll(/<item>([\s\S]*?)<\/item>/g);
      const today = new Date().toISOString().split('T')[0];
      for (const m of items) {
        const block = m[1];
        const title = (block.match(/<title><!\[CDATA\[(.*?)\]\]><\/title>|<title>(.*?)<\/title>/)?.[1] ||
          block.match(/<title>(.*?)<\/title>/)?.[1] || '').trim();
        const link = block.match(/<link>(.*?)<\/link>/)?.[1]?.trim() || '';
        const desc = (block.match(/<description><!\[CDATA\[([\s\S]*?)\]\]><\/description>/)?.[1] ||
          block.match(/<description>([\s\S]*?)<\/description>/)?.[1] || '').trim();
        const guid = block.match(/<guid[^>]*>(.*?)<\/guid>/)?.[1]?.trim() || '';
        if (!title) continue;
        const dateMatch = desc.match(/on\s+(\w+\s+\d{1,2},?\s+\d{4})/i);
        let startDate = null;
        if (dateMatch) { const p = new Date(dateMatch[1]); if (!Number.isNaN(p.getTime())) startDate = p.toISOString().split('T')[0]; }
        if (!startDate || startDate < today) continue;
        events.push({
          id: guid || `dev-${title.slice(0, 20)}`, title, type: 'conference',
          location: '', startDate, endDate: startDate, url: link,
          source: 'dev.events', description: '',
        });
      }
      console.log(` dev.events RSS: ${events.length - rssCount} events`);
    }
  } catch (e) { console.warn(` dev.events RSS: ${e.message}`); }
  // Curated major conferences (must match list-tech-events.ts CURATED_EVENTS)
  const now = new Date();
  now.setHours(0, 0, 0, 0);
  const CURATED = [
    { id: 'gitex-global-2026', title: 'GITEX Global 2026', type: 'conference', location: 'Dubai World Trade Centre, Dubai',
      coords: { lat: 25.2285, lng: 55.2867, country: 'UAE', original: 'Dubai World Trade Centre, Dubai', virtual: false },
      startDate: '2026-12-07', endDate: '2026-12-11', url: 'https://www.gitex.com', source: 'curated', description: "World's largest tech & startup show" },
    { id: 'token2049-dubai-2026', title: 'TOKEN2049 Dubai 2026', type: 'conference', location: 'Dubai, UAE',
      coords: { lat: 25.2048, lng: 55.2708, country: 'UAE', original: 'Dubai, UAE', virtual: false },
      startDate: '2026-04-29', endDate: '2026-04-30', url: 'https://www.token2049.com', source: 'curated', description: 'Premier crypto event in Dubai' },
    { id: 'collision-2026', title: 'Collision 2026', type: 'conference', location: 'Toronto, Canada',
      coords: { lat: 43.6532, lng: -79.3832, country: 'Canada', original: 'Toronto, Canada', virtual: false },
      startDate: '2026-06-22', endDate: '2026-06-25', url: 'https://collisionconf.com', source: 'curated', description: "North America's fastest growing tech conference" },
    { id: 'web-summit-2026', title: 'Web Summit 2026', type: 'conference', location: 'Lisbon, Portugal',
      coords: { lat: 38.7223, lng: -9.1393, country: 'Portugal', original: 'Lisbon, Portugal', virtual: false },
      startDate: '2026-11-02', endDate: '2026-11-05', url: 'https://websummit.com', source: 'curated', description: "The world's premier tech conference" },
  ];
  for (const c of CURATED) { if (new Date(c.startDate) >= now) events.push(c); }
  // Deduplicate
  const seen = new Set();
  const deduped = events.filter(e => {
    const key = e.title.toLowerCase().replace(/[^a-z0-9]/g, '').slice(0, 30) + e.startDate.slice(0, 4);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  }).sort((a, b) => a.startDate.localeCompare(b.startDate));
  console.log(` Tech events total: ${deduped.length} (deduplicated)`);
  return {
    success: true, count: deduped.length,
    conferenceCount: deduped.filter(e => e.type === 'conference').length,
    mappableCount: 0, lastUpdated: new Date().toISOString(),
    events: deduped, error: '',
  };
}
// ─── Trending Repos ───
const OSSINSIGHT_LANG_MAP = { python: 'Python', javascript: 'JavaScript', typescript: 'TypeScript' };

async function fetchTrendingFromOSSInsight(lang) {
  const ossLang = OSSINSIGHT_LANG_MAP[lang] || lang;
  const resp = await fetch(
    `https://api.ossinsight.io/v1/trends/repos/?language=${ossLang}&period=past_24_hours`,
    {
      headers: { Accept: 'application/json', 'User-Agent': CHROME_UA },
      signal: AbortSignal.timeout(10_000),
    },
  );
  if (!resp.ok) return null;
  const json = await resp.json();
  const rows = json?.data?.rows;
  if (!Array.isArray(rows)) return null;
  return rows.slice(0, 50).map(r => ({
    fullName: r.repo_name || '', description: r.description || '',
    language: r.primary_language || lang, stars: r.stars || 0,
    starsToday: 0, forks: r.forks || 0,
    url: r.repo_name ? `https://github.com/${r.repo_name}` : '',
  }));
}
async function fetchTrendingFromGitHubSearch(lang) {
  const since = new Date(Date.now() - 7 * 86400_000).toISOString().slice(0, 10);
  const resp = await fetch(
    `https://api.github.com/search/repositories?q=language:${lang}+created:>${since}&sort=stars&order=desc&per_page=50`,
    {
      headers: { Accept: 'application/vnd.github+json', 'User-Agent': CHROME_UA },
      signal: AbortSignal.timeout(10_000),
    },
  );
  if (!resp.ok) return null;
  const data = await resp.json();
  if (!Array.isArray(data?.items)) return null;
  return data.items.map(r => ({
    fullName: r.full_name, description: r.description || '',
    language: r.language || '', stars: r.stargazers_count || 0,
    starsToday: 0, forks: r.forks_count || 0,
    url: r.html_url,
  }));
}
async function fetchTrendingRepos() {
  const languages = ['python', 'javascript', 'typescript'];
  const results = {};
  for (const lang of languages) {
    try {
      let repos = await fetchTrendingFromOSSInsight(lang);
      if (!repos) repos = await fetchTrendingFromGitHubSearch(lang);
      if (!repos || repos.length === 0) { console.warn(` Trending ${lang}: no data from any source`); continue; }
      const cacheKey = `research:trending:v1:${lang}:daily:50`;
      results[cacheKey] = { repos, pagination: undefined };
      console.log(` Trending ${lang}: ${repos.length} repos`);
      await sleep(500);
    } catch (e) {
      console.warn(` Trending ${lang}: ${e.message}`);
    }
  }
  return results;
}
// ─── Main ───
let allData = null;

async function fetchAll() {
  const [arxiv, hn, techEvents, trending] = await Promise.allSettled([
    fetchArxivPapers(),
    fetchHackerNews(),
    fetchTechEvents(),
    fetchTrendingRepos(),
  ]);
  allData = {
    arxiv: arxiv.status === 'fulfilled' ? arxiv.value : null,
    hn: hn.status === 'fulfilled' ? hn.value : null,
    techEvents: techEvents.status === 'fulfilled' ? techEvents.value : null,
    trending: trending.status === 'fulfilled' ? trending.value : null,
  };
  if (arxiv.status === 'rejected') console.warn(` arXiv failed: ${arxiv.reason?.message || arxiv.reason}`);
  if (hn.status === 'rejected') console.warn(` HN failed: ${hn.reason?.message || hn.reason}`);
  if (techEvents.status === 'rejected') console.warn(` TechEvents failed: ${techEvents.reason?.message || techEvents.reason}`);
  if (trending.status === 'rejected') console.warn(` Trending failed: ${trending.reason?.message || trending.reason}`);
  if (!allData.arxiv && !allData.hn && !allData.trending) throw new Error('All research fetches failed');
  // Write secondary keys BEFORE returning (runSeed calls process.exit after primary write)
  if (allData.arxiv) {
    for (const [key, data] of Object.entries(allData.arxiv)) {
      if (key === 'research:arxiv:v1:cs.AI::50') continue;
      await writeExtraKeyWithMeta(key, data, ARXIV_TTL, data.papers?.length ?? 0);
    }
  }
  if (allData.hn) { for (const [key, data] of Object.entries(allData.hn)) await writeExtraKeyWithMeta(key, data, HN_TTL, data.items?.length ?? 0); }
  if (allData.techEvents?.events?.length > 0) await writeExtraKeyWithMeta('research:tech-events:v1', allData.techEvents, TECH_EVENTS_TTL, allData.techEvents.events.length);
  if (allData.trending) { for (const [key, data] of Object.entries(allData.trending)) await writeExtraKeyWithMeta(key, data, TRENDING_TTL, data.repos?.length ?? 0); }
  const primaryKey = allData.arxiv?.['research:arxiv:v1:cs.AI::50'];
  return primaryKey || { papers: [], pagination: undefined };
}

function validate(data) {
  return data?.papers?.length > 0;
}

export function declareRecords(data) {
  return data?.papers?.length ?? 0;
}
runSeed('research', 'arxiv-hn-trending', 'research:arxiv:v1:cs.AI::50', fetchAll, {
  validateFn: validate,
  ttlSeconds: ARXIV_TTL,
  sourceVersion: 'arxiv-hn-gitter',
  declareRecords,
  schemaVersion: 1,
  maxStaleMin: 150,
}).catch((err) => {
  const _cause = err.cause ? ` (cause: ${err.cause.message || err.cause.code || err.cause})` : '';
  console.error('FATAL:', (err.message || err) + _cause);
  process.exit(1);
});