mirror of
https://github.com/koala73/worldmonitor.git
synced 2026-04-25 17:14:57 +02:00
* feat(_gdelt-fetch): curl proxy multi-retry helper for the per-IP-throttled GDELT API

  GDELT (api.gdeltproject.org) is a free public API with strict per-IP
  throttling. seed-gdelt-intel currently has no proxy fallback — Railway
  egress IPs hit 429 storms and the seeder degrades. Probed 2026-04-16:
  Decodo curl egress against GDELT gives ~40% success per attempt
  (session-rotates IPs per call). The helper retries up to 5×; expected
  overall success ~92% (1 - 0.6^5).

  PROXY STRATEGY — CURL ONLY WITH MULTI-RETRY

  Differs from _yahoo-fetch.mjs (single proxy attempt) and
  _open-meteo-archive.mjs (CONNECT + curl cascade):
  - Curl-only: CONNECT not yet probed cleanly against GDELT.
  - Multi-retry on the curl leg: the proxy IS the rotation mechanism
    (each call → different egress IP), so successive attempts probe
    different IPs in the throttle pool.
  - Distinguishes retryable (HTTP 429/503 from upstream) from
    non-retryable (parse failure, auth, network) — bails immediately on
    non-retryable to avoid 5× of wasted log noise.

  The direct loop uses a LONGER backoff base (10s) than Yahoo's 5s —
  GDELT's throttle window is wider than Yahoo's, so quick retries
  usually re-hit the same throttle.
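The expected-success arithmetic can be sanity-checked directly. This assumes attempts are independent, which per-call session rotation only approximates:

```javascript
// Per-attempt success is ~40%, so each attempt fails with probability
// ~0.6; failing all 5 independent attempts has probability 0.6^5, and
// overall success is the complement.
const perAttemptFail = 0.6;
const attempts = 5;
const overallSuccess = 1 - perAttemptFail ** attempts;
console.log(overallSuccess.toFixed(4)); // 0.9222
```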
  Tests (tests/gdelt-fetch.test.mjs, 13 cases — every learning from
  PR #3118 + #3119 + #3120 baked in):
  - Production defaults: curl resolver/fetcher reference equality
  - Production defaults: NO CONNECT leg (regression guard for unverified path)
  - 200 OK passthrough
  - 429 with no proxy → throws with HTTP 429 in message
  - Retry-After parsed (DI _sleep capture asserts 7000ms, not retryBaseMs)
  - Retry-After absent → linear backoff retryBaseMs (paired branch test)
  - **Proxy multi-retry: 4× HTTP 429 then 5th succeeds → returns data**
    (asserts 5 proxy calls + 4 inter-proxy backoffs of proxyRetryBaseMs)
  - **Proxy non-retryable (parse failure) bails after 1 attempt**
    (does NOT burn all proxyMaxAttempts on a structural failure)
  - **Proxy retryable + non-retryable mix: retries on 429, bails on parse**
  - Thrown fetch error on final retry → proxy multi-retry runs (P1 guard)
  - All proxy attempts fail → throws with 'X/N attempts' in message + cause
  - Malformed JSON does NOT emit succeeded log before throw (P2 guard)
  - parseRetryAfterMs unit

  Verification:
  - tests/gdelt-fetch.test.mjs → 13/13 pass
  - node --check scripts/_gdelt-fetch.mjs → clean

  Phase 1 of 2. Seeder migration follows.

* feat(seed-gdelt-intel): migrate to _gdelt-fetch helper

  Replaces direct fetch + ad-hoc retry in seed-gdelt-intel with the new
  fetchGdeltJson helper. Each topic call now gets: 3 direct retries
  (10/20/40s backoff) → 5 curl proxy attempts via Decodo
  session-rotating egress.

  Specific changes:
  - import fetchGdeltJson from _gdelt-fetch.mjs
  - fetchTopicArticles: replace the fetch+retry+throw block with a
    single await fetchGdeltJson(url, { label: topic.id })
  - fetchTopicTimeline: same — best-effort try/catch returns [] on any
    failure (preserved). The helper still attempts proxy fallback before
    throwing, so a 429-throttled IP doesn't kill the timeline.
  - fetchWithRetry: collapsed from an outer 3-retry loop with
    60/120/240s backoff (which would have multiplied to 24 attempts per
    topic on top of the helper's 8) to a thin wrapper that translates
    exhaustion into the {exhausted, articles:[]} shape the caller uses
    to drive the POST_EXHAUST_DELAY_MS cooldown.
  - Drop the CHROME_UA import (no longer used directly; the helper
    handles it).

  The helper's exhausted-throw includes the 'HTTP 429' substring when
  429 was the upstream signal, so the existing is429 detection in
  fetchWithRetry continues to work without modification.

  Verification:
  - node --check scripts/seed-gdelt-intel.mjs → clean
  - npm run typecheck:all → clean
  - npm run test:data → 5382/5382 (was 5363; +13 from the helper, +6
    from prior PR work)

  Phase 2 of 2.

* fix(_gdelt-fetch): proxy timeouts/network errors RETRY (rotates Decodo session)

  P1 from PR #3122 review: the probed Decodo curl egress against GDELT
  (2026-04-16) gave 200/200/429/TIMEOUT/429 — TIMEOUT is part of the
  normal transient mix that the multi-retry design exists to absorb.
  Pre-fix logic only retried on substring 'HTTP 429'/'HTTP 503'
  matches, so a curl exec timeout (a Node Error with no .status, not a
  SyntaxError) bailed on the first attempt. The PR's headline 'expected
  ~92% success with 5 attempts' was therefore not actually achievable
  for one of the exact failure modes that motivated the design.

  Reframed the proxy retryability decision around what we CAN reliably
  discriminate from the curl error shape:
  - curlErr.status is a number → retry only if 429/503 (curlFetch
    attaches .status only when curl returned a clean HTTP status)
  - curlErr instanceof SyntaxError → bail (parse failure is structural)
  - otherwise → RETRY (timeout, ECONNRESET, DNS, curl exec failure,
    CONNECT tunnel failure — all transient; rotating the Decodo session
    usually clears them)

  P2 from the same review: tests covered HTTP-status proxy retries and
  parse failures but never the timeout/thrown-error class.
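The reframed decision table can be sketched as a standalone predicate. The names here are illustrative, not the helper's actual internals, and the `.status` contract is the one the commit attributes to curlFetch:

```javascript
const RETRYABLE_STATUSES = new Set([429, 503]);

// Sketch: classify a proxy-leg error as retryable or not, per the
// three rules described in the commit message.
function isProxyErrorRetryable(err) {
  // Clean HTTP status from upstream → retry only on 429/503.
  if (typeof err?.status === 'number') return RETRYABLE_STATUSES.has(err.status);
  // Parse failure is structural, not transient → bail.
  if (err instanceof SyntaxError) return false;
  // Timeout / ECONNRESET / DNS / curl exec failure → assume transient.
  return true;
}
```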
  Added 3 tests:
  - proxy timeout (no .status) RETRIES → asserts proxyCalls=2 after a
    first-attempt ETIMEDOUT then second-attempt success
  - proxy ECONNRESET (no .status) RETRIES → same pattern
  - proxy HTTP 4xx with .status (e.g. 401 auth) does NOT retry → bails
    after 1 attempt

  Existing tests still pass — they use an 'HTTP 429' Error WITHOUT
  .status, which now flows through the 'else: assume transient' branch
  and still retries. Only differences: the regex parsing is gone and
  curlFetch's .status property is the canonical signal.

  Verification:
  - tests/gdelt-fetch.test.mjs: 16/16 (was 13, +3)
  - npm run test:data: 5385/5385 (+3)
  - npm run typecheck:all: clean

  Followup commit on PR #3122.

* fix(seed-gdelt-intel): timeline calls fast-fail (maxRetries:0, proxyMaxAttempts:0)

  P1 from PR #3122 review: fetchTopicTimeline is best-effort (returns
  [] on any failure), but the migration routed it through fetchGdeltJson
  with the helper's article-fetch defaults: 3 direct retries (10/20/40s
  backoff = ~70s) + 5 proxy attempts (5s base = ~20s) = ~90s worst case
  per call. Called 2× per topic × 6 topics = 12 calls = up to ~18
  minutes of blocking on data the seeder discards on failure.
  Pre-helper code did a single direct fetch with no retry. A real
  operational regression under exactly the GDELT 429 storm conditions
  this PR is meant to absorb.

  Fix:
  1. seed-gdelt-intel.mjs:fetchTopicTimeline now passes maxRetries:0,
     proxyMaxAttempts:0 — single direct attempt, no proxy, throws on
     first failure → caught, returns []. Matches pre-helper timing
     exactly. Article fetches keep the full retry budget; only
     timelines fast-fail.
  2. _gdelt-fetch.mjs gate: skip the proxy block entirely when
     proxyMaxAttempts <= 0. Pre-fix, the 'trying proxy (curl) up to 0×'
     log line would still emit even though the for loop runs zero
     times, producing a misleading line that the proxy was attempted
     when it wasn't.
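The best-effort shape in fix 1 can be sketched caller-side. Here `fetchJson` is an injected stand-in for fetchGdeltJson so the sketch is self-contained, and the function name is illustrative, not the seeder's real body:

```javascript
// Sketch: best-effort timeline fetch — one direct attempt, proxy block
// bypassed, [] on any failure (the seeder discards timeline failures).
async function fetchTimelineBestEffort(fetchJson, url, topicId) {
  try {
    return await fetchJson(url, {
      label: topicId,
      maxRetries: 0,       // single direct attempt, no direct retries
      proxyMaxAttempts: 0, // proxy block fully bypassed
    });
  } catch {
    return [];
  }
}
```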
  Tests (2 new):
  - maxRetries:0 + proxyMaxAttempts:0 → asserts directCalls=1,
    proxyCalls=0 even though _curlProxyResolver returns a valid auth
    string (the proxy block must be fully bypassed).
  - proxyMaxAttempts:0 → captures console.log and asserts no 'trying
    proxy' line emitted (no misleading 'up to 0×' line).

  Verification:
  - tests/gdelt-fetch.test.mjs: 18/18 (was 16, +2)
  - npm run test:data: 5387/5387 (+2)
  - npm run typecheck:all: clean

  Followup commit on PR #3122.

* fix(gdelt): direct parse-failure reaches proxy + timeline budget tweak + JSDoc accuracy

  3 Greptile P2s on PR #3122:

  P2a — _gdelt-fetch.mjs:112: `resp.json()` was called outside the
  try/catch that guards fetch(). A 200 OK with an HTML/garbage body
  (WAF challenge, partial response, gzip mismatch) would throw
  SyntaxError and escape the helper entirely — the proxy fallback never
  ran. The proxy leg already parsed inside its own catch; this makes
  the direct leg symmetric. New regression test: direct 200 OK with
  malformed JSON must reach the proxy and recover.

  P2b — seed-gdelt-intel.mjs timeline budget bumped from 0/0 to 0/2.
  Best-effort timelines still fast-fail on direct 429 (no direct
  retries) but get 2 proxy attempts via Decodo session rotation before
  returning []. Worst case: ~25s/call × 12 calls = ~5 min ceiling under
  heavy throttling vs ~3 min with 0/0. Tradeoff: a small additional
  time budget for a real chance to recover timeline data via proxy IP
  rotation. Articles still keep the full retry budget.

  P2c — JSDoc said 'Linear proxy backoff base' but the implementation
  uses a flat constant (proxyRetryBaseMs, line 156). Linear growth
  would not help here because Decodo rotates the session IP per call —
  the next attempt's success is independent of the previous wait. The
  doc now reads 'Fixed (constant, NOT linear) backoff' with the
  rationale.

  Verification:
  - tests/gdelt-fetch.test.mjs: 19/19 pass (was 18, +1)
  - npm run test:data: 5388/5388 (+1)
  - npm run typecheck:all: clean

  Followup commit on PR #3122.
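The P2a symmetry can be illustrated with a minimal guard. The helper name is hypothetical; the real fix wraps `resp.json()` inside the direct leg's try/catch:

```javascript
// Sketch: parse inside the guard so a 200 OK with a garbage body is
// handed to a fallback path (e.g. the proxy leg) instead of escaping
// the caller as an unhandled SyntaxError.
function parseOrFallback(body, fallback) {
  try {
    return JSON.parse(body);
  } catch (parseErr) {
    return fallback(parseErr);
  }
}
```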
* test(gdelt): clarify helper-API vs seeder-mirror tests + add 0/2 lock

  Reviewer feedback on PR #3122 conflated two test classes:
  - Helper-API tests (lock the helper's contract for arbitrary callers
    using budget knobs like 0/0 — independent of any specific seeder)
  - Seeder-mirror tests (lock the budget the actual production caller
    in seed-gdelt-intel.mjs uses)

  Pre-fix, the test file only had the 0/0 helper-API tests, with a
  section header that read 'Best-effort caller budgets (fast-fail)' —
  ambiguous about whether 0/0 was the helper API contract or the
  seeder's choice. The reviewer assumed the seeder still used 0/0
  because the tests locked it, but seed-gdelt-intel.mjs:97-98 actually
  uses 0/2 (per the prior P2b fix).

  Fixes:
  1. The section header for the 0/0 tests now explicitly says these are
     helper-API tests and notes that seed-gdelt-intel uses 0/2 (not
     0/0). Eliminates the conflation.
  2. New 'Seeder-mirror: 0/2' section with 2 tests that lock the
     seeder's actual choice end-to-end:
     - 0/2 with first proxy attempt 429 + second succeeds → returns
       data (asserts directCalls=1, proxyCalls=2)
     - 0/2 with both proxy attempts failing → throws exhausted with
       '2/2 attempts' in the message (asserts the budget propagates to
       the error message correctly)

  These tests would catch any future regression where the seeder's 0/2
  choice gets reverted to 0/0 OR where the helper stops honoring the
  proxyMaxAttempts override.

  Verification:
  - tests/gdelt-fetch.test.mjs: 21/21 (was 19, +2)
  - npm run test:data: 5390/5390 (+2)
  - npm run typecheck:all: clean

  Followup commit on PR #3122.
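The contract those seeder-mirror tests lock can be sketched generically. `proxyMultiRetry` and `attemptFetch` are illustrative stand-ins, not the helper's real internals:

```javascript
// Sketch: a bounded multi-retry loop whose exhaustion error carries the
// 'N/N attempts' budget, so tests can assert the budget propagates to
// the error message and the cause chain.
async function proxyMultiRetry(attemptFetch, maxAttempts) {
  let lastErr = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { data: await attemptFetch(attempt), attempts: attempt };
    } catch (err) {
      lastErr = err; // keep the most recent failure for the cause chain
    }
  }
  throw new Error(`exhausted after ${maxAttempts}/${maxAttempts} attempts`, { cause: lastErr });
}
```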
221 lines
10 KiB
JavaScript
// GDELT API fetch helper with curl-only Decodo proxy fallback + multi-retry.
//
// GDELT (api.gdeltproject.org) is a public free API with strict per-IP
// throttling (HTTP 429). Railway egress IPs share a small pool and hit
// 429 storms. seed-gdelt-intel currently has no proxy fallback.
//
// PROXY STRATEGY — CURL-ONLY WITH MULTI-RETRY
//
// Probed 2026-04-16:
//   api.gdeltproject.org via direct (residential): 200
//   api.gdeltproject.org via Decodo curl (5 attempts): 200/200/429/timeout/429
//     = 2/5 success (~40%)
//   api.gdeltproject.org via Decodo CONNECT: not probed cleanly
//     (proxy URL format issue)
//
// Decodo's curl egress is session-rotating: each call may get a different
// IP from the pool. Some IPs are throttled by GDELT, others are not. ~40%
// per-attempt success rate; 5 attempts gives expected success ~92%
// (1 - 0.6^5 = 0.922).
//
// CONNECT path is omitted for now: not yet probed cleanly against GDELT,
// and adding an unverified leg costs time on each call. If Yahoo's
// pattern holds (CONNECT → 404 from blocked egress IPs), CONNECT for
// GDELT may behave the same. Add only after a clean Railway probe.
//
// Direct retry uses LONGER backoff than Yahoo's 5s base — GDELT's
// per-IP throttle window is wider, so quick retries usually re-hit the
// same throttle.

import { CHROME_UA, sleep, resolveProxy, curlFetch } from './_seed-utils.mjs';

const RETRYABLE_STATUSES = new Set([429, 503]);
const MAX_RETRY_AFTER_MS = 60_000;

/**
 * Production defaults. Exported so tests can lock the wiring at the
 * helper level. Mixing these up — e.g. swapping in resolveProxyForConnect
 * — would route through an egress pool that has not been verified
 * against GDELT.
 */
export const _PROXY_DEFAULTS = Object.freeze({
  curlProxyResolver: resolveProxy,
  curlFetcher: curlFetch,
});

export function parseRetryAfterMs(value) {
  if (!value) return null;
  const seconds = Number(value);
  if (Number.isFinite(seconds) && seconds > 0) {
    return Math.min(seconds * 1000, MAX_RETRY_AFTER_MS);
  }
  const retryAt = Date.parse(value);
  if (Number.isFinite(retryAt)) {
    return Math.min(Math.max(retryAt - Date.now(), 1000), MAX_RETRY_AFTER_MS);
  }
  return null;
}

/**
 * Fetch JSON from a GDELT API endpoint with retry + proxy multi-retry.
 *
 * @param {string} url - GDELT API URL (typically
 *   `https://api.gdeltproject.org/api/v2/...?query=...&format=json`).
 * @param {object} [opts]
 * @param {string} [opts.label] - Symbol or label for log lines (default 'unknown').
 * @param {number} [opts.timeoutMs] - Per-attempt timeout (default 15_000 — GDELT can be slow).
 * @param {number} [opts.maxRetries] - Direct retries (default 3 → 4 attempts total).
 * @param {number} [opts.retryBaseMs] - Linear direct backoff base (default 10_000 — GDELT throttle window is wider than Yahoo's).
 * @param {number} [opts.proxyMaxAttempts] - Curl proxy attempts (default 5 — Decodo rotates session per call).
 * @param {number} [opts.proxyRetryBaseMs] - Fixed (constant, NOT linear) backoff between proxy attempts (default 5_000). Constant because Decodo rotates the session IP per call — exponential growth wouldn't help; the next attempt's success is independent of the previous attempt's wait.
 * @returns {Promise<unknown>} Parsed JSON. Throws on exhaustion.
 */
export async function fetchGdeltJson(url, opts = {}) {
  const {
    label = 'unknown',
    timeoutMs = 15_000,
    maxRetries = 3,
    retryBaseMs = 10_000,
    proxyMaxAttempts = 5,
    proxyRetryBaseMs = 5_000,
    // Test hooks. Production callers leave unset and get _PROXY_DEFAULTS.
    // `_sleep` lets tests assert backoff values without sleeping in real
    // time. Mirrors the seam pattern from PR #3120's _yahoo-fetch.mjs.
    _curlProxyResolver = _PROXY_DEFAULTS.curlProxyResolver,
    _proxyCurlFetcher = _PROXY_DEFAULTS.curlFetcher,
    _sleep = sleep,
  } = opts;

  let lastDirectError = null;

  // ─── Direct retry loop ───
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    let resp;
    try {
      resp = await fetch(url, {
        headers: { 'User-Agent': CHROME_UA },
        signal: AbortSignal.timeout(timeoutMs),
      });
    } catch (err) {
      lastDirectError = err;
      if (attempt < maxRetries) {
        const retryMs = retryBaseMs * (attempt + 1);
        console.warn(` [GDELT] ${label} ${err?.message ?? err}; retrying in ${Math.round(retryMs / 1000)}s (${attempt + 1}/${maxRetries})`);
        await _sleep(retryMs);
        continue;
      }
      // Final direct attempt threw — fall through to proxy. NEVER throw
      // here (PR #3118 review: throwing bypasses the proxy path).
      break;
    }

    if (resp.ok) {
      // Guard the parse: a 200 OK with HTML/garbage body (WAF challenge,
      // partial response, gzip mismatch) would otherwise throw SyntaxError
      // and escape the helper entirely, bypassing the proxy fallback. The
      // proxy leg already parses inside its own catch — make the direct
      // leg symmetric.
      try {
        return await resp.json();
      } catch (parseErr) {
        lastDirectError = parseErr;
        break;
      }
    }

    lastDirectError = new Error(`HTTP ${resp.status}`);

    if (RETRYABLE_STATUSES.has(resp.status) && attempt < maxRetries) {
      const retryAfter = parseRetryAfterMs(resp.headers.get('retry-after'));
      const retryMs = retryAfter ?? retryBaseMs * (attempt + 1);
      console.warn(` [GDELT] ${label} ${resp.status} — waiting ${Math.round(retryMs / 1000)}s (${attempt + 1}/${maxRetries})`);
      await _sleep(retryMs);
      continue;
    }

    break;
  }

  // ─── Curl proxy multi-retry loop ───
  // Decodo's session-rotating egress gives a different IP per call. GDELT
  // throttles ~60% of attempts, so we retry until one IP isn't throttled.
  // Retry on retryable upstream status (429/503) and on transient
  // network-shaped errors (timeout, ECONNRESET, curl exec failure —
  // rotating the Decodo session usually clears them); bail immediately
  // on structural failures (parse failure, non-retryable HTTP status)
  // since repeated attempts won't help.
  const curlProxyAuth = _curlProxyResolver();
  let lastProxyError = null;
  let proxyAttemptsRun = 0;
  // Skip the proxy block entirely when the caller opted out via
  // proxyMaxAttempts:0 (best-effort callers that want fast-fail —
  // e.g. fetchTopicTimeline in seed-gdelt-intel which discards failures).
  // Avoids both the wasted log line and the no-op for loop.
  if (curlProxyAuth && proxyMaxAttempts > 0) {
    console.log(` [GDELT] direct exhausted on ${label} (${lastDirectError?.message ?? 'unknown'}); trying proxy (curl) up to ${proxyMaxAttempts}× (Decodo session-rotates per call)`);
    for (let attempt = 1; attempt <= proxyMaxAttempts; attempt++) {
      proxyAttemptsRun = attempt;
      try {
        // _proxyCurlFetcher (curlFetch / execFileSync) is sync today; wrap
        // with await Promise.resolve so a future async refactor silently
        // keeps working (Greptile P2 from PR #3119).
        const text = await Promise.resolve(_proxyCurlFetcher(url, curlProxyAuth, { 'User-Agent': CHROME_UA, Accept: 'application/json' }));
        // Parse BEFORE logging success so a malformed response doesn't
        // emit a contradictory "succeeded" log + then throw (Greptile P2
        // from PR #3120).
        const parsed = JSON.parse(text);
        console.log(` [GDELT] proxy (curl) succeeded for ${label} on attempt ${attempt}/${proxyMaxAttempts}`);
        return parsed;
      } catch (curlErr) {
        lastProxyError = curlErr;
        // Decide whether retrying this proxy call is worthwhile.
        //
        // Probed Decodo curl egress against GDELT (2026-04-16) gave
        // 200 / 200 / 429 / TIMEOUT / 429 over 5 attempts. The TIMEOUT
        // is part of the normal transient mix — rotating to another
        // Decodo session usually clears it. The pre-fix logic only
        // retried on `HTTP 429`/`503` substring matches, so a timeout
        // bailed on the first attempt and defeated the multi-retry
        // design. Reframed:
        //
        //   curlErr.status = number        → retry only if 429/503
        //   curlErr instanceof SyntaxError → bail (parse failure is
        //                                    structural, not transient)
        //   otherwise (timeout, ECONNRESET, DNS, curl exec failure,
        //     CONNECT tunnel failure)      → RETRY (next Decodo
        //                                    session likely different)
        //
        // curlFetch attaches `.status` only when curl succeeded but the
        // upstream returned non-2xx, so this property reliably
        // discriminates the HTTP-status case from network/timeout cases.
        const status = curlErr?.status;
        const isParseFailure = curlErr instanceof SyntaxError;
        let isRetryable;
        if (typeof status === 'number') {
          isRetryable = RETRYABLE_STATUSES.has(status);
        } else if (isParseFailure) {
          isRetryable = false;
        } else {
          // Network / timeout / curl exec error — assume transient.
          isRetryable = true;
        }
        if (attempt < proxyMaxAttempts && isRetryable) {
          const retryMs = proxyRetryBaseMs;
          console.warn(` [GDELT] proxy (curl) attempt ${attempt}/${proxyMaxAttempts} failed: ${curlErr?.message ?? curlErr}; retrying in ${Math.round(retryMs / 1000)}s`);
          await _sleep(retryMs);
          continue;
        }
        // Non-retryable (parse failure, HTTP 4xx other than 429) OR last
        // attempt — give up, throw exhausted with both errors.
        console.warn(` [GDELT] proxy (curl) attempt ${attempt}/${proxyMaxAttempts} failed${isRetryable ? ' (last attempt)' : ' (non-retryable)'}: ${curlErr?.message ?? curlErr}`);
        break;
      }
    }
  }

  throw new Error(
    `GDELT retries exhausted for ${label}` +
    (lastDirectError ? ` (last direct: ${lastDirectError.message})` : '') +
    (lastProxyError ? ` (last proxy: ${lastProxyError.message} after ${proxyAttemptsRun}/${proxyMaxAttempts} attempts)` : ''),
    lastDirectError ? { cause: lastDirectError } : (lastProxyError ? { cause: lastProxyError } : undefined),
  );
}