Files
Sebastien Melki 1db05e6caa feat(usage): per-request Axiom telemetry pipeline (gateway + upstream attribution) (#3403)
* feat(gateway): thread Vercel Edge ctx through createDomainGateway (#3381)

PR-0 of the Axiom usage-telemetry stack. Pure infra change: no telemetry
emission yet, only the signature plumbing required for ctx.waitUntil to
exist on the hot path.

- createDomainGateway returns (req, ctx) instead of (req)
- rewriteToSebuf propagates ctx to its target gateway
- 5 alias callsites updated to pass ctx through
- ~30 [rpc].ts callsites unchanged (export default createDomainGateway(...))

Pattern reference: api/notification-channels.ts:166.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(usage): pure UsageIdentity resolver + Axiom emit primitives (#3381)

server/_shared/usage-identity.ts
- buildUsageIdentity: pure function, consumes already-resolved gateway state.
- Static ENTERPRISE_KEY_TO_CUSTOMER map (explicit, reviewable in code).
- Does not re-verify JWTs or re-validate API keys.

server/_shared/usage.ts
- buildRequestEvent / buildUpstreamEvent: allowlisted-primitive builders only.
  Never accept Request/Response — additive field leaks become structurally
  impossible.
- emitUsageEvents → ctx.waitUntil(sendToAxiom). Direct fetch, 1.5s timeout,
  no retry, gated by USAGE_TELEMETRY=1 and AXIOM_API_TOKEN.
- Sliding-window circuit breaker (5% over 5min, min 20 samples). Trips with
  one structured console.error; subsequent drops are 1%-sampled console.warn.
- Header derivers reuse Vercel/CF headers for request_id, region, country,
  reqBytes; ua_hash null unless USAGE_UA_PEPPER is set (no stable
  fingerprinting).
- Dev-only x-usage-telemetry response header for 2-second debugging.

server/_shared/auth-session.ts
- New resolveClerkSession returning { userId, orgId } in one JWT verify so
  customer_id can be Clerk org id without a second pass. resolveSessionUserId
  kept as back-compat wrapper.

No emission wiring yet — that lands in the next commit (gateway request
event + 403 + 429).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(gateway): emit Axiom request events on every return path (#3381)

Wires the request-event side of the Axiom usage-telemetry stack. Behind
USAGE_TELEMETRY=1 — no-op when the env var is unset.

Emit points (each builds identity from accumulated gateway state):
- origin_403 disallowed origin → reason=origin_403
- API access subscription required (403)
- legacy bearer 401 / 403 / 401-without-bearer
- entitlement check fail-through
- endpoint rate-limit 429 → reason=rate_limit_429
- global rate-limit 429 → reason=rate_limit_429
- 405 method not allowed
- 404 not found
- 304 etag match (resolved cache tier)
- 200 GET with body (resolved cache tier, real res_bytes)
- streaming / non-GET-200 final return (res_bytes best-effort)

Identity inputs (UsageIdentityInput):
- sessionUserId / clerkOrgId from new resolveClerkSession (one JWT verify)
- isUserApiKey + userApiKeyCustomerRef from validateUserApiKey result
- enterpriseApiKey when keyCheck.valid + non-wm_ wmKey present
- widgetKey from x-widget-key header (best-effort)
- tier captured opportunistically from existing getEntitlements calls

Header derivers reuse Vercel/CF metadata (x-vercel-id, x-vercel-ip-country,
cf-ipcountry, content-length, sentry-trace) — no new geo lookup, no new
crypto on the hot path. ua_hash null unless USAGE_UA_PEPPER is set.

Dev-only x-usage-telemetry response header (ok | degraded | off) attached
on the response paths for 2-second debugging in non-production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(usage): upstream events via implicit request scope (#3381)

Closes the upstream-attribution side of the Axiom usage-telemetry stack
without requiring leaf-handler changes (per koala's review).

server/_shared/usage.ts
- AsyncLocalStorage-backed UsageScope: gateway sets it once per request,
  fetch helpers read from it lazily. Defensive import — if the runtime
  rejects node:async_hooks, scope helpers degrade to no-ops and the
  request event is unaffected.
- runWithUsageScope(scope, fn) / getUsageScope() exports.

server/gateway.ts
- Wraps matchedHandler in runWithUsageScope({ ctx, requestId, customerId,
  route, tier }) so deep fetchers can attribute upstream calls without
  threading state through every handler signature.

server/_shared/redis.ts
- cachedFetchJsonWithMeta accepts opts.usage = { provider, operation? }.
  Only the provider label is required to opt in — request_id / customer_id
  / route / tier flow implicitly from UsageScope.
- Emits on the fresh path only (cache hits don't emit; the inbound
  request event already records cache_status).
- cache_status correctly distinguishes 'miss' vs 'neg-sentinel' by
  construction, matching NEG_SENTINEL handling.
- Telemetry never throws — failures are swallowed in the lazy-import
  catch, sink itself short-circuits on USAGE_TELEMETRY=0.

server/_shared/fetch-json.ts
- New optional { provider, operation } in FetchJsonOptions. Same
  opt-in-by-provider model as cachedFetchJsonWithMeta. Auto-derives host
  from URL. Reads body via .text() so response_bytes is recorded
  (best-effort; chunked responses still report 0).

Net result: any handler that uses fetchJson or cachedFetchJsonWithMeta
gets full per-customer upstream attribution by adding two fields to the
options bag. No signature changes anywhere else.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gateway): address round-1 codex feedback on usage telemetry

- ctx is now optional on the createDomainGateway handler signature so
  direct callers (tests, non-Vercel paths) no longer crash on emit
- legacy premium bearer-token routes (resilience, shipping-v2) propagate
  session.userId into the usage accumulator so successful requests are
  attributed instead of emitting as anon
- after checkEntitlement allows a tier-gated route, re-read entitlements
  (Redis-cached + in-flight coalesced) to populate usage.tier so
  analyze-stock & co. emit the correct tier rather than 0
- domain extraction now skips a leading vN segment, so /api/v2/shipping/*
  records domain="shipping" instead of "v2"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(usage): assert telemetry payload + identity resolver + operator guide

- tests/usage-telemetry-emission.test.mts stubs globalThis.fetch to
  capture the Axiom ingest POST body and asserts the four review-flagged
  fields end-to-end through the gateway: domain on /api/v2/<svc>/* (was
  "v2"), customer_id on legacy premium bearer success (was null/anon),
  tier on entitlement-gated success via the Convex fallback path (was 0),
  plus a ctx-optional regression guard
- server/__tests__/usage-identity.test.ts unit-tests the pure
  buildUsageIdentity() resolver across every auth_kind branch, tier
  coercion, and the secret-handling invariant (raw enterprise key never
  lands in any output field)
- docs/architecture/usage-telemetry.md is the operator + dev guide:
  field reference, architecture, configuration, failure modes, local
  workflow, eight Axiom APL recipes, and runbooks for adding fields /
  new gateway return paths

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(usage): make recorder.settled robust to nested waitUntil

Promise.all(pending) snapshotted the array at call time, missing the
inner ctx.waitUntil(sendToAxiom(...)) that emitUsageEvents pushes after
the outer drain begins. Tests passed only because the fetch spy resolved
in an earlier microtask tick. Replace with a quiescence loop so the
helper survives any future async in the emit path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: trigger preview

* fix(usage): address koala #3403 review — collapse nested waitUntil, widget-key validation, neg-sentinel status, auth_* reasons

P1
- Collapse nested ctx.waitUntil at all 3 emit sites (gateway.ts emitRequest,
  fetch-json.ts, redis.ts emitUpstreamFromHook). Export sendToAxiom and call
  it directly inside the outer waitUntil so Edge runtimes don't drop the
  delivery promise after the response phase.
- Validate X-Widget-Key against WIDGET_AGENT_KEY before populating usage.widgetKey
  so unauthenticated callers can't spoof per-customer attribution.

P2
- Emit on OPTIONS preflight (new 'preflight' RequestReason).
- Gate cachedFetchJsonWithMeta upstreamStatus=200 on result != null so the
  neg-sentinel branch no longer reports as a successful upstream call.
- Extend RequestReason with auth_401/auth_403/tier_403 and replace
  reason:'ok' on every auth/tier-rejection emit path.
- Replace 32-bit FNV-1a with a two-round XOR-folded 64-bit variant in
  hashKeySync (collision space matters once widget-key adoption grows).

Verification
- tests/usage-telemetry-emission.test.mts — 6/6
- tests/premium-stock-gateway.test.mts + tests/gateway-cdn-origin-policy.test.mts — 15/15
- npx vitest run server/__tests__/usage-identity.test.ts — 13/13
- npx tsc --noEmit clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: trigger preview rebuild for AXIOM_API_TOKEN

* chore(usage): note Axiom region in ingest URL comment

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* debug(usage): unconditional logs in sendToAxiom for preview troubleshooting

Temporary — to be reverted once Axiom delivery is confirmed working in preview.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(usage): add 'live' cache tier + revert preview debug logs

- Sync UsageCacheTier with the local CacheTier in gateway.ts (main added
  'live' in PR #3402 — synthetic merge with main was failing typecheck:api).
- Revert temporary unconditional debug logs in sendToAxiom now that Axiom
  delivery is verified end-to-end on preview (event landed with all fields
  populated, including the new auth_401 reason from the koala #3403 fix).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 18:10:51 +03:00
..
2026-03-20 12:37:24 +04:00