fix(referral): stop /api/referral/me 503s on prod homepage (#3186)

* fix(referral): make /api/referral/me non-blocking to stop prod 503s

Reported in prod: every PRO homepage load was logging 'GET /api/referral/me 503' to Sentry. Root cause: a prior review required the Convex binding to block the response (rationale: don't hand users a dead share link). That turned any flaky relay call into a homepage-wide 503 for the 5-minute client cache window — every PRO user, every page reload.

Fix: dispatch registerReferralCodeInConvex via ctx.waitUntil. Response returns 200 + code + shareUrl unconditionally. Binding failures log a warning but never surface as 503. The mutation is idempotent; the next /api/referral/me fetch retries. The /pro?ref=<code> signup side reads userReferralCodes at conversion time, so a missed binding degrades to missed attribution (partial), never to blocked homepage (total).

The BRIEF_URL_SIGNING_SECRET-missing 503 path is unchanged — that's a genuine misconfig, not a flake.

Handler signature now takes ctx with waitUntil, matching api/notification-channels.ts and api/discord/oauth/callback.ts.

Regression test flipped: brief-referral-code.test.mjs previously enforced the blocking shape; now enforces the non-blocking shape + handler signature + explicit does-not-503-on-binding-failure assertion. 14/14 referral tests pass. Typecheck clean, 5706/5706 test:data, lint exit 0.

* fix(referral): narrow err in non-blocking catch instead of unsafe cast

Greptile P2 on #3186. The (err as Error).message cast was safe today (registerReferralCodeInConvex only throws Error instances) but would silently log 'undefined' if a future path ever threw a non-Error value. Swapped to instanceof narrow + String(err) fallback.
This commit is contained in:
Elie Habib
2026-04-18 23:32:48 +04:00
committed by GitHub
parent 3c47c1b222
commit d37ffb375e
2 changed files with 84 additions and 35 deletions

View File

@@ -3,21 +3,26 @@
*
* GET /api/referral/me
* Bearer-auth via Clerk JWT.
* -> 200 { code, shareUrl, invitedCount, convertedCount }
* -> 200 { code, shareUrl }
* -> 401 on missing/invalid bearer
* -> 503 if BRIEF_URL_SIGNING_SECRET is not configured (we reuse
* it as the HMAC secret for referral codes — see handler body).
*
* `code` is a deterministic 8-char hash of the Clerk userId (stable
* for the life of the account). `invitedCount` is the number of
* `registrations` rows that used this user's code as `referredBy`.
* `convertedCount` is the subset of those that later produced a PRO
* subscription — omitted in the MVP because subscription tracking
* lives in a different table and the join isn't needed for the
* share-button UX.
* for the life of the account).
*
* Stats are privacy-safe: the route returns counts only, never the
* referred users' emails or identities.
*
* Convex binding is fire-and-forget via ctx.waitUntil (see handler).
* An earlier iteration blocked on the binding and returned 503 on
* any failure — that turned a single flaky Convex call into a
* homepage-wide 503 outage for every PRO user (all homepage loads
* fetch this within the 5-minute client cache window). The mutation
* is idempotent; the next fetch re-attempts, and a receiver's
* signup at /pro?ref=<code> only needs the binding to have landed
* SOMETIME before that receiver completes signup, not on every
* share-button mount. Missed attribution beats homepage 503.
*/
export const config = { runtime: 'edge' };
@@ -36,13 +41,16 @@ const PUBLIC_BASE =
* Bind the Clerk-derived share code to the userId in Convex so that
* future /pro?ref=<code> signups can actually credit the sharer.
*
* BLOCKING (no longer fire-and-forget). If the binding can't be
* persisted the endpoint returns 503 instead of a dead share link —
* handing a user a code the waitlist mutation will never resolve is
* worse than a retryable error. The Convex mutation is idempotent,
* so client retries are safe.
* Fire-and-forget via the caller's ctx.waitUntil — never blocks the
* 200 response on this path. The mutation is idempotent, so the next
* /api/referral/me fetch (or signup-side lookup) re-attempts. A
* missed binding degrades to "receiver's signup isn't attributed"
* which is strictly less bad than the prior behaviour of 503'ing
* every PRO homepage load while Convex is slow or misconfigured.
*
* Throws on any failure; caller translates to 503.
* Resolves on success. Does NOT throw on failure — the caller relies
* on waitUntil to catch + log so a background failure can't surface
* as an unhandled rejection.
*/
async function registerReferralCodeInConvex(userId: string, code: string): Promise<void> {
const convexSite =
@@ -67,7 +75,10 @@ async function registerReferralCodeInConvex(userId: string, code: string): Promi
}
}
export default async function handler(req: Request): Promise<Response> {
export default async function handler(
req: Request,
ctx: { waitUntil: (p: Promise<unknown>) => void },
): Promise<Response> {
if (isDisallowedOrigin(req)) {
return jsonResponse({ error: 'Origin not allowed' }, 403);
}
@@ -109,18 +120,26 @@ export default async function handler(req: Request): Promise<Response> {
return jsonResponse({ error: 'service_unavailable' }, 503, cors);
}
// Bind the code to the userId in Convex BEFORE returning the share
// URL. If the binding can't be persisted (Convex outage, missing
// env, non-2xx) we must NOT hand the user a link that the waitlist
// mutation can never resolve — a dead share link is worse than a
// retryable error. The mutation is idempotent so client retries
// are safe.
try {
await registerReferralCodeInConvex(session.userId, code);
} catch (err) {
console.error('[api/referral/me] binding failed:', (err as Error).message);
return jsonResponse({ error: 'service_unavailable' }, 503, cors);
}
// Bind the code to the userId in Convex in the background so future
// /pro?ref=<code> signups can credit the sharer. FIRE-AND-FORGET
// via ctx.waitUntil — the response doesn't wait, and a binding
// failure (Convex outage, bad env, non-2xx, timeout) logs a warning
// but never turns into a 503. See module docstring for the
// rationale; an earlier blocking design caused homepage-wide
// outages on every flake. The mutation is idempotent so the next
// request retries.
ctx.waitUntil(
registerReferralCodeInConvex(session.userId, code).catch((err: unknown) => {
// Narrow rather than cast — a future path that throws a
// non-Error value must not turn this warning into "failed:
// undefined". The helper today only throws Error instances,
// so the instanceof branch is the common path.
console.warn(
'[api/referral/me] binding failed (non-blocking):',
err instanceof Error ? err.message : String(err),
);
}),
);
// No invite/conversion count is returned on the response. The
// waitlist path (userReferralCredits) now credits correctly, but

View File

@@ -114,21 +114,51 @@ describe('referral attribution resolves Clerk codes (waitlist path)', () => {
assert.match(src, /\.index\("by_referrer_email",\s*\["referrerUserId",\s*"refereeEmail"\]\)/);
});
it('/api/referral/me blocks on the binding + returns 503 on failure (not a dead share link)', async () => {
it('/api/referral/me fires the Convex binding non-blocking and never 503s on binding failure', async () => {
const { readFileSync } = await import('node:fs');
const { fileURLToPath } = await import('node:url');
const { dirname, resolve } = await import('node:path');
const __d = dirname(fileURLToPath(import.meta.url));
const src = readFileSync(resolve(__d, '../api/referral/me.ts'), 'utf-8');
assert.match(src, /registerReferralCodeInConvex/, 'helper must exist');
// REGRESSION: the earlier head used ctx.waitUntil(...) which
// returned a 200 + share link to the user even if the binding
// silently failed (missing env, non-2xx, network). That handed
// users dead links. Binding must block the response.
assert.match(src, /await registerReferralCodeInConvex/, 'binding must be awaited, not fire-and-forget');
assert.doesNotMatch(src, /ctx\.waitUntil\(registerReferralCodeInConvex/, 'must NOT use waitUntil for the binding');
// Failure path must return 503 rather than a 200 with a dead link.
assert.match(src, /binding failed[\s\S]{0,200}service_unavailable[\s\S]{0,50}503/, 'binding failure returns 503');
// CURRENT CONTRACT: binding is fire-and-forget via ctx.waitUntil.
//
// An earlier iteration (the "await + 503 on failure" shape this
// test used to enforce) turned a flaky Convex relay call into a
// homepage-wide 503 for every PRO user — the homepage fetches
// this endpoint on mount, so one bad Convex response broke the
// 5-minute cache window for everyone. The mutation is idempotent
// and the /pro?ref=<code> signup side re-reads at conversion
// time, so a missed binding degrades to missed attribution
// rather than outright breakage.
assert.match(
src,
/ctx\.waitUntil\(\s*registerReferralCodeInConvex/,
'binding must be dispatched via ctx.waitUntil (non-blocking)',
);
// Handler must accept a second ctx arg with waitUntil — matches
// the notification-channels + discord-oauth handler shapes.
assert.match(
src,
/export\s+default\s+async\s+function\s+handler\s*\(\s*req:\s*Request,\s*ctx:\s*\{\s*waitUntil:/,
'handler signature must take ctx with waitUntil',
);
// MUST NOT 503 on a binding failure — the whole point of the
// non-blocking shape. Handler must not mention "binding failed"
// anywhere near a 503 response code.
assert.doesNotMatch(
src,
/binding failed[\s\S]{0,200}service_unavailable[\s\S]{0,50}503/,
'binding failure must not return 503',
);
// BRIEF_URL_SIGNING_SECRET missing still legitimately 503s
// (different codepath; we can't mint a code without the secret).
// That's intentional, and unrelated to the Convex binding.
assert.match(
src,
/BRIEF_URL_SIGNING_SECRET is not configured[\s\S]{0,200}503/,
'missing signing secret still 503s',
);
assert.match(src, /\/relay\/register-referral-code/, 'must POST to the Convex HTTP action');
});