fix: stop Windows Max-plan drain loop — terminate session on restart-guard trip, tighten RestartGuard, catch OAuth expiry

Discord report: Windows 11 user's Max-plan usage was being consumed in the background. Root cause is a regression introduced alongside the v12.3.3 RestartGuard: (1) when the guard trips in the HTTP-layer crash-recovery path, pending messages were left in 'pending' state and auto-replayed by processPendingQueues on the next daemon start; (2) any single successful message cleared the restart-window decay, so a 9-of-10 failure regime never tripped the guard; (3) OAuth-token expiry errors are not in the unrecoverable list and loop forever on Max-plan subscriptions. SessionRoutes.ts: on restart-guard trip, abandon pending messages and remove the session (mirrors WorkerService.terminateSession which is private). Pattern copied from the sibling block at SessionRoutes.ts:123. RestartGuard.ts: require 5 consecutive recordSuccess() calls before the window-decay path becomes eligible; add a terminal lifetime cap of 50 total restarts that never resets. Public API unchanged. worker-service.ts: extract unrecoverablePatterns into a testable helper module and add OAuth/OpenRouter-auth patterns: 'OAuth token expired', 'token has been revoked', 'Unauthorized', 'OpenRouter API error: 401', 'OpenRouter API error: 403'. Bare '401' deliberately not added (too broad). Investigation and phased plan docs included for reviewers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 17:15:04 +02:00 · 2026-04-20 20:02:18 -07:00
parent 49ab404c08
commit a172c6e4ca
6 changed files with 722 additions and 31 deletions
--- a/INVESTIGATION-windows-max-plan-drain.md
+++ b/INVESTIGATION-windows-max-plan-drain.md
@@ -0,0 +1,163 @@
+# Investigation: Windows 11 Max-Plan Usage Drain (v12.3.3+ regression)
+
+**Branch:** `investigate/windows-infinite-loop-usage-drain`
+**Date:** 2026-04-20
+**Reporter:** Discord user (Win11 + Claude Code CLI + claude-mem latest)
+**Symptoms:**
+- "Going into a loop with Claude"
+- "Consuming entire Max plan usage in the background"
+- "Continuous failed Python hooks" in the morning after updating
+- Issue stops only after uninstalling claude-mem
+
+## TL;DR
+
+Between v12.3.2 (last "safe") and v12.3.7 (current), the flat 3-restart cap on
+SDK generator crashes was replaced with a time-windowed `RestartGuard` that
+**resets its decay window on any successful message**. On Windows, where the
+MCP loopback and Claude-executable resolution have additional failure modes
+(observation 71051 — *"MCP loopback failure causes 91.6% session failure rate"*),
+the worker enters a slow-drip crash loop that never trips the new guard, and
+burns the user's Max-plan OAuth token on each restart's Claude Agent SDK call.
+
+## Root-cause chain
+
+1. **Auth path** — `src/shared/EnvManager.ts:215` builds an isolated env for
+   every SDK subprocess spawn. When the user has no `ANTHROPIC_API_KEY` in
+   `~/.claude-mem/.env`, line 255 passes through `CLAUDE_CODE_OAUTH_TOKEN` from
+   the parent Claude Code session. **Every worker-driven Claude call is billed
+   against the user's Max subscription.** That is by design, but it's the
+   blast-radius multiplier for every other bug in the chain.
+
+2. **Automatic replay on every worker start** —
+   `src/services/worker-service.ts:592` calls `processPendingQueues(50)` during
+   worker initialization. It re-spawns an SDK subprocess for every session
+   with `status IN ('pending', 'processing')` messages
+   (`PendingMessageStore.getSessionsWithPendingMessages()`, line 447). A single
+   accumulated backlog from a previous failed run becomes a fresh storm of
+   Claude calls on the next daemon restart.
+
+3. **RestartGuard is too permissive for slow-drip failures** —
+   `src/services/worker/RestartGuard.ts`:
+   - `MAX_WINDOWED_RESTARTS = 10` restarts per 60 s
+   - `DECAY_AFTER_SUCCESS_MS = 5 min` — on *any* `recordSuccess()` call, the
+     restart timestamp array is wiped
+   - `ResponseProcessor.ts:211` calls `recordSuccess()` after *any* batch where
+     messages were confirmed
+   If 1-of-11 SDK invocations succeeds, the window never fills, the decay
+   clears history, and restarts continue indefinitely. At the observed 91.6 %
+   MCP failure rate this is exactly the regime the user is in.
+
+4. **Two divergent crash-recovery paths** —
+   - `src/services/worker-service.ts:822-857` — on restart-guard trip calls
+     `terminateSession()` which calls `pendingStore.markAllSessionMessagesAbandoned()`
+     (PendingMessageStore.ts:293). Correct behavior.
+   - `src/services/worker/http/routes/SessionRoutes.ts:318-330` — on restart-
+     guard trip only `session.abortController.abort()`. **Messages remain in
+     `pending` state** (explicitly acknowledged in the log message on line 325).
+     Next worker startup's `processPendingQueues()` grabs them again, starting
+     the loop over.
+
+5. **Exponential-backoff ceiling amplifies damage** —
+   `SessionRoutes.ts:348` and `worker-service.ts` both cap backoff at 8 s after
+   4+ restarts. Steady state is ~7 restarts/minute ≈ 10 000 SDK invocations/day
+   before the guard trips — if it ever does (see #3).
+
+6. **OAuth-expiry has no special handling** — the `unrecoverablePatterns` list
+   (`worker-service.ts:713-727`) matches on `'Invalid API key'`, `'API_KEY_INVALID'`,
+   `'API key expired'`, `'API key not valid'`. None of these match OAuth-token
+   failures. An expired/revoked `CLAUDE_CODE_OAUTH_TOKEN` produces errors that
+   the worker treats as transient and retries. Observation 55605 records PR
+   #1180 as a prior "OAuth Token Expiry Infinite Retry Loop" fix — the same
+   class of bug has re-surfaced against a new token type.
+
+## "Failed Python hooks"
+
+The user's wording is a misattribution. claude-mem's hooks are TypeScript
+compiled to `plugin/scripts/*-hook.cjs` and run via Bun/Node. However:
+
+- `uv` (Python toolchain) is installed for ChromaDB, and the `ChromaMcpManager`
+  spawns a Python process for vector sync.
+- When Chroma sync fails, errors surface in the hook's stderr alongside the
+  real Bun/Node hook failure.
+- On Windows the message the user most likely saw was the Chroma uv/Python
+  subprocess failing, conflated with the hook wrapper's own failure output.
+
+It is **not** that hooks themselves are Python. The underlying bug is the
+worker's SDK retry loop.
+
+## Windows-specific amplifiers
+
+- `SDKAgent.ts:466-473` — `where claude.cmd` resolution is tried first on Win32.
+  Any environment where `PATHEXT` or `where` behaves oddly (mingw, Git Bash
+  with stripped PATH) returns an error that is caught silently and falls
+  through to the generic "Claude executable not found". That string IS in
+  `unrecoverablePatterns`, so it should abort cleanly — but only if the SDK
+  surfaces it. In practice, transient subprocess spawn races on Windows surface
+  as generic errors that **don't** match the list, and then restart.
+- `worker-spawner.ts:39` — the Windows spawn-cooldown lock (2 min) only
+  suppresses *daemon* spawn attempts, not the SDK subprocess spawn-storm
+  described here.
+- `hook-constants.ts:30-34` — Windows gets a hook-timeout multiplier. Longer
+  hook windows = more time for the crash loop to run per session.
+
+## What changed vs v12.3.2
+
+```
+src/services/worker/RestartGuard.ts              | 70 ++++++ (NEW)
+src/services/worker/http/routes/SessionRoutes.ts | 24 +/-    (MAX=3 → RestartGuard)
+src/services/worker-service.ts                   | 28 +/-    (MAX=3 → RestartGuard)
+src/services/sqlite/PendingMessageStore.ts       | 19 ++     (clearFailed → clearFailedOlderThan(1h))
+```
+
+The regression is squarely in the restart-guard swap. The old flat counter
+would have tripped after 3 crashes and stopped the SDK spawn, regardless of
+whether anything eventually succeeded.
+
+## Recommended fixes (in priority order)
+
+1. **SessionRoutes.ts restart-guard trip must call `terminateSession`** (or
+   `markAllSessionMessagesAbandoned`) — mirror the behavior in
+   `worker-service.ts:837`. Today it explicitly leaves messages pending, which
+   guarantees re-replay on daemon restart.
+
+2. **Tighten RestartGuard decay** — either
+   - require N consecutive successes before decay, not a single one, or
+   - track a separate failure rate; if fail-rate > 50 % over the window, trip
+     regardless of `recordSuccess()` calls.
+
+3. **Add OAuth-expiry to `unrecoverablePatterns`** — common SDK error strings
+   from expired OAuth tokens (`Unauthorized`, `OAuth token expired`,
+   `token has been revoked`, 401 responses) should be treated the same as
+   `'Invalid API key'`.
+
+4. **Cap absolute restart count per session-lifetime** — RestartGuard caps per
+   window but has no absolute ceiling. A hard cap (e.g. 50 restarts regardless
+   of window) protects users from the decay-loop regime.
+
+5. **Kill-switch** — a `CLAUDE_MEM_PAUSE_WORKER` setting the user can flip
+   without uninstalling, so the next Discord user isn't forced to uninstall
+   to stop the bleeding. Hook entrypoints would short-circuit if set.
+
+6. **Telemetry on worker startup** — emit a warning if
+   `processPendingQueues()` finds > N orphaned sessions or > M orphaned
+   messages. Today the auto-recovery is silent for backlogs of any size.
+
+## Files to touch for the fix
+
+- `src/services/worker/http/routes/SessionRoutes.ts:318-330` — call
+  `terminateSession` instead of bare `abort()`.
+- `src/services/worker/RestartGuard.ts` — stricter decay semantics + absolute
+  cap.
+- `src/services/worker-service.ts:713-727` — extend `unrecoverablePatterns`.
+- `src/shared/SettingsDefaultsManager.ts` — `CLAUDE_MEM_PAUSE_WORKER` flag.
+
+## What the user should do *right now*
+
+Until a fix ships, the Discord user's mitigation (uninstall) is correct. As
+a less-drastic workaround:
+
+1. Stop the worker: `curl -X POST http://localhost:37777/api/shutdown` (or
+   kill the `bun` process in Task Manager).
+2. Delete `~/.claude-mem/claude-mem.db-wal` and empty the `pending_messages`
+   table via sqlite3 to break any stored replay queue.
+3. Remove the plugin from `~/.claude.json` until the fix ships.
--- a/PLAN-windows-max-plan-drain-fix.md
+++ b/PLAN-windows-max-plan-drain-fix.md
@@ -0,0 +1,413 @@
+# Plan: Fix Windows Max-Plan Drain (v12.3.3+ regression)
+
+Context: `INVESTIGATION-windows-max-plan-drain.md` in repo root. Branch:
+`investigate/windows-infinite-loop-usage-drain`. Test runner: `bun test`.
+
+## Phase 0 — Documentation Discovery (COMPLETED)
+
+Findings consolidated from three discovery subagents:
+
+### Codebase APIs available (verified)
+
+- **`SessionRoutes` already has** `this.sessionManager`, `this.eventBroadcaster`,
+  `this.workerService` (`src/services/worker/http/routes/SessionRoutes.ts:34-49`).
+- **`WorkerService.terminateSession` is `private`** (`worker-service.ts:964`) —
+  cannot call from SessionRoutes. Must replicate the three-step pattern
+  inline. Reference implementation: `worker-service.ts:964-976`.
+- **Abandonment pattern already used in SessionRoutes** at
+  `SessionRoutes.ts:123-125` — `pendingStore.markAllSessionMessagesAbandoned()`
+  + `sessionManager.removeSessionImmediate()`. Copy this pattern.
+- **`SessionManager.removeSessionImmediate`** is public
+  (`SessionManager.ts:453-468`) and fires `onSessionDeletedCallback` →
+  `broadcastProcessingStatus()` automatically.
+- **`SessionEventBroadcaster.broadcastSessionCompleted`** exists
+  (`events/SessionEventBroadcaster.ts:73-82`) — `worker-service.ts:951`
+  does call it after `terminateSession`, so mirror that for parity.
+
+### `RestartGuard` facts
+
+- Full impl is only 70 lines (`src/services/worker/RestartGuard.ts`).
+- **No existing tests** for RestartGuard.
+- Public surface: `recordRestart(): boolean`, `recordSuccess(): void`,
+  getters `restartsInWindow`, `windowMs`, `maxRestarts`.
+- Call sites (must remain compatible):
+  - `SessionRoutes.ts:314-315, 322-324, 336-337`
+  - `worker-service.ts:824-825, 832-834, 854`
+  - `ResponseProcessor.ts:211` — `recordSuccess()` called only after a batch is
+    confirmed to storage.
+- Field: `ActiveSession.restartGuard?: RestartGuard` (`worker-types.ts:39`).
+
+### `unrecoverablePatterns` facts
+
+- Location: `worker-service.ts:713-727`. Match is
+  `unrecoverablePatterns.some(p => errorMessage.includes(p))`
+  — simple substring on `(error as Error)?.message`.
+- **DO NOT add bare `'401'`** — matches request IDs, log lines, etc. Use
+  agent-prefixed forms instead.
+- Gap found: **OpenRouter 401 is not currently caught**
+  (`OpenRouterAgent.ts:458`). Mirror the Gemini pattern for consistency.
+- Existing test file that covers transient/terminal classification:
+  `tests/worker/agents/fallback-error-handler.test.ts` (line 75-77 asserts
+  `'401 Unauthorized'` is terminal). No direct tests exist for
+  `unrecoverablePatterns` — add them.
+
+### Anti-patterns to avoid
+
+- ❌ Making `terminateSession` public and calling it from SessionRoutes — couples
+  the HTTP layer to WorkerService internals further; the three-step pattern is
+  already public API surface.
+- ❌ Adding bare `'401'` to `unrecoverablePatterns` — too broad.
+- ❌ Refactoring RestartGuard public API — call sites depend on getters.
+- ❌ Creating new test framework config — project already uses `bun test`.
+
+---
+
+## Phase 1 — Fix SessionRoutes restart-guard trip (stops the drip)
+
+**Goal:** When the windowed restart guard trips in the HTTP-layer crash-recovery
+path, abandon pending messages instead of leaving them in `'pending'` state, so
+the next worker startup's `processPendingQueues()` does not replay them.
+
+**Target file:** `src/services/worker/http/routes/SessionRoutes.ts`
+
+**Doc reference (pattern to copy, verbatim except for the reason string):**
+
+1. `SessionRoutes.ts:123-125` — existing abandonment pattern in the same class.
+2. `worker-service.ts:964-976` — the private `terminateSession` body to
+   replicate (logger message + reason + broadcastSessionCompleted call).
+3. `worker-service.ts:951` — confirms `broadcastSessionCompleted` is called
+   after abandonment in the sibling code path.
+
+### Task 1.1 — Replace `abort()`-only trip with full abandonment
+
+At `SessionRoutes.ts:326-329`, replace:
+
+```ts
+// Don't restart - abort to prevent further API calls
+session.abortController.abort();
+return;
+```
+
+With (copy-adapt from the pattern at SessionRoutes.ts:123-125 + the
+log shape from worker-service.ts:968-972):
+
+```ts
+// Restart guard tripped — abandon pending messages so the next worker
+// startup's processPendingQueues() does NOT replay them, and kill the
+// subprocess. Mirrors WorkerService.terminateSession() semantics.
+// (Cannot call terminateSession directly — it's private in WorkerService.)
+session.abortController.abort();
+const pendingStore = this.sessionManager.getPendingMessageStore();
+const abandoned = pendingStore.markAllSessionMessagesAbandoned(sessionDbId);
+logger.info('SYSTEM', 'Session terminated', {
+  sessionId: sessionDbId,
+  reason: 'max_restarts_exceeded',
+  abandonedMessages: abandoned
+});
+this.sessionManager.removeSessionImmediate(sessionDbId);
+this.eventBroadcaster.broadcastSessionCompleted(sessionDbId);
+return;
+```
+
+Also update the log at `SessionRoutes.ts:319-326`: change the `action:` string
+from `'Messages remain in pending state.'` to
+`'Pending messages abandoned to prevent replay loop.'`.
+
+### Verification
+
+- `grep -n 'Messages remain in pending state' src/` → no matches.
+- `grep -n 'markAllSessionMessagesAbandoned' src/services/worker/http/routes/SessionRoutes.ts`
+  → should now find TWO call sites (existing L124 + new one in restart-trip).
+- Manual trace: after trip, `pending_messages.status` for that session is
+  `'failed'` (or whatever `markAllSessionMessagesAbandoned` sets it to —
+  confirm by reading `PendingMessageStore.ts:293`).
+
+### Anti-pattern guard
+
+- Don't make `terminateSession` public. Replicate inline — the three calls are
+  already public API.
+- Don't skip `abortController.abort()` before abandonment — the SDK subprocess
+  must be killed or it will keep consuming Claude.
+
+---
+
+## Phase 2 — Tighten RestartGuard (stops the loop from getting far enough to drip)
+
+**Goal:** Require N=5 consecutive `recordSuccess()` calls before clearing the
+restart history, and add an absolute lifetime cap of 50 restarts that, once
+tripped, never resets.
+
+**Target file:** `src/services/worker/RestartGuard.ts`
+
+**Doc reference:**
+
+- Current impl quoted in full in Phase 0. Consumers at:
+  - `SessionRoutes.ts:314-315, 322-324, 336-337`
+  - `worker-service.ts:824-825, 832-834, 854`
+  - `ResponseProcessor.ts:211`
+- Public surface must remain compatible: `recordRestart(): boolean`,
+  `recordSuccess(): void`, getters `restartsInWindow`, `windowMs`,
+  `maxRestarts`.
+
+### Task 2.1 — Add constants and private fields
+
+Add to top of file:
+
+```ts
+const REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY = 5;
+const ABSOLUTE_LIFETIME_RESTART_CAP = 50;
+```
+
+Add to class:
+
+```ts
+private consecutiveSuccessCount = 0;
+private totalRestartsAllTime = 0;
+```
+
+### Task 2.2 — Rewrite `recordRestart()` semantics
+
+Behavior:
+
+1. Increment `totalRestartsAllTime`.
+2. If `totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP`, return `false`
+   immediately (never resets — terminal).
+3. Any restart resets `consecutiveSuccessCount` to 0 (a failure interrupts the
+   success streak).
+4. Existing decay check: only fire if
+   `lastSuccessfulProcessing !== null` **AND** we had
+   `consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY` the
+   last time `recordSuccess` was called **AND** 5 min elapsed. To make the
+   logic clean: only gate decay on an existing flag `decayEligible` that is
+   set to `true` by `recordSuccess` once the streak threshold is reached.
+5. Otherwise preserve current window-based logic.
+
+Simplest structure:
+
+```ts
+recordRestart(): boolean {
+  this.totalRestartsAllTime += 1;
+  this.consecutiveSuccessCount = 0;  // streak broken
+  if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) {
+    return false;  // terminal — lifetime cap reached
+  }
+
+  const now = Date.now();
+  if (this.decayEligible
+      && this.lastSuccessfulProcessing !== null
+      && now - this.lastSuccessfulProcessing >= DECAY_AFTER_SUCCESS_MS) {
+    this.restartTimestamps = [];
+    this.lastSuccessfulProcessing = null;
+    this.decayEligible = false;
+  }
+
+  this.restartTimestamps = this.restartTimestamps.filter(
+    ts => now - ts < RESTART_WINDOW_MS
+  );
+  this.restartTimestamps.push(now);
+  return this.restartTimestamps.length <= MAX_WINDOWED_RESTARTS;
+}
+```
+
+### Task 2.3 — Update `recordSuccess()` to require N consecutive calls
+
+```ts
+recordSuccess(): void {
+  this.consecutiveSuccessCount += 1;
+  this.lastSuccessfulProcessing = Date.now();
+  if (this.consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY) {
+    this.decayEligible = true;
+  }
+}
+```
+
+Add private field `decayEligible = false`.
+
+### Task 2.4 — Add introspection getters for logging (optional but useful)
+
+```ts
+get totalRestarts(): number { return this.totalRestartsAllTime; }
+get lifetimeCap(): number { return ABSOLUTE_LIFETIME_RESTART_CAP; }
+```
+
+### Task 2.5 — Update log statements at trip sites to include new fields
+
+- `SessionRoutes.ts:319-326`: add `totalRestarts: session.restartGuard.totalRestarts`
+  and `lifetimeCap: session.restartGuard.lifetimeCap` to the error log payload.
+- `worker-service.ts:828-835`: same additions.
+
+### Verification
+
+- `bun test tests/worker/RestartGuard.test.ts` (new file — see Phase 4).
+- Behavior check: 49 failed restarts → still allowed; 50th → blocked; 51st →
+  blocked (lifetime cap persists).
+- Behavior check: 4 successes then restart → full window still counted.
+  5th success then restart 5 min later → window cleared.
+
+### Anti-pattern guard
+
+- Don't make `recordSuccess` reset `totalRestartsAllTime` — the lifetime cap is
+  meant to be terminal.
+- Don't silently change the meaning of the existing getters — add new ones.
+
+---
+
+## Phase 3 — Extend `unrecoverablePatterns` (handle OAuth expiry)
+
+**Goal:** When Max-plan OAuth tokens expire/are revoked, treat the SDK error as
+unrecoverable so no restart is attempted.
+
+**Target file:** `src/services/worker-service.ts:713-727`
+
+**Doc reference:**
+
+- Current array + match logic quoted in Phase 0 discovery.
+- `OpenRouterAgent.ts:458` — error throw format is
+  `` `OpenRouter API error: ${response.status} - ${errorText}` ``.
+- `tests/worker/agents/fallback-error-handler.test.ts:75-77` — existing
+  assertion `'401 Unauthorized'` is terminal, confirming it is safe to treat
+  as unrecoverable.
+
+### Task 3.1 — Add OAuth / OpenRouter auth strings
+
+Add these entries (in a clear grouping comment) to the array at
+`worker-service.ts:713-727`:
+
+```ts
+// OAuth / subscription-token expiry (Max plan users) — matches SDK
+// subprocess error messages when the inherited CLAUDE_CODE_OAUTH_TOKEN
+// is no longer valid.
+'OAuth token expired',
+'token has been revoked',
+'Unauthorized',
+// Parallel to 'Gemini API error: 401' — catches OpenRouter OAuth failures.
+'OpenRouter API error: 401',
+'OpenRouter API error: 403',
+```
+
+**Do not add bare `'401'`** — too broad. Anti-pattern confirmed by discovery.
+
+### Verification
+
+- `grep -n "'Unauthorized'" src/services/worker-service.ts` → should match.
+- Run `bun test tests/worker/worker-service-unrecoverable.test.ts` (new file —
+  see Phase 4) — assertions cover each new pattern.
+- Run `bun test tests/worker/agents/fallback-error-handler.test.ts` — must
+  still pass (no regressions).
+
+### Anti-pattern guard
+
+- No regex or `startsWith` — keep substring match semantics to avoid breaking
+  existing entries.
+- Don't unpack `error.code` or nested JSON bodies in this phase — discovery
+  confirmed all current agent error throws embed the status/message in
+  `.message`.
+
+---
+
+## Phase 4 — Tests (no tests existed for RestartGuard or unrecoverablePatterns)
+
+**Goal:** Prevent regression on all three fixes.
+
+**Test framework:** `bun test` (see `package.json` scripts).
+
+### Task 4.1 — Create `tests/worker/RestartGuard.test.ts`
+
+Test suites (doc: `src/services/worker/RestartGuard.ts` as modified in Phase 2):
+
+- `recordRestart respects window`: push 10 restarts in <60s → allowed; 11th →
+  blocked.
+- `recordSuccess requires N consecutive before decay`: 4 successes then
+  restart 6 min later → window still populated; 5 successes then restart 6 min
+  later → window cleared.
+- `restart breaks success streak`: 3 successes → 1 restart → 4 more successes
+  (total 7 successes with gap) → streak counter is 4 at end, not 7 (decay
+  NOT yet eligible).
+- `lifetime cap is terminal`: 50 restarts → OK; 51st → blocked even if window
+  is empty; recordSuccess cannot un-block it.
+- `getters return expected values`: `totalRestarts`, `lifetimeCap`,
+  `restartsInWindow`, `maxRestarts`, `windowMs`.
+
+Use `Date.now` mocking via Bun's `spyOn(Date, 'now')` or an injected clock.
+Check project-local patterns first — if other tests in `tests/worker/` already
+use a specific clock helper, copy that pattern. Otherwise straight `spyOn`.
+
+### Task 4.2 — Create `tests/worker/worker-service-unrecoverable.test.ts`
+
+Focused unit test on the `.some(p => errorMessage.includes(p))` matcher for
+each new pattern. If extracting `unrecoverablePatterns` into a standalone
+exported helper makes testing easier, do that as part of this task (small
+refactor — pull the array + predicate into a named export at the top of
+`worker-service.ts` and import it in the `.catch` block). Verify:
+
+- Each of `'OAuth token expired'`, `'token has been revoked'`,
+  `'Unauthorized'`, `'OpenRouter API error: 401'`,
+  `'OpenRouter API error: 403'` matches a realistic error message.
+- Bare `'401'` by itself (e.g., `"request-id-401xyz"`) does NOT match — this
+  test locks in the decision not to add bare `'401'`.
+- All PRE-EXISTING patterns still match realistic messages (no regressions).
+
+### Task 4.3 — Add SessionRoutes restart-trip integration test
+
+If a test file already exists for `SessionRoutes` in `tests/worker/http/` or
+`tests/server/`, extend it. Otherwise create
+`tests/worker/http/SessionRoutes.restart-trip.test.ts`:
+
+- Set up a session with pending messages and a `RestartGuard` pre-loaded near
+  the window cap.
+- Trigger the restart path (mock `pendingStore.getPendingCount` > 0).
+- Assert: `pendingStore.markAllSessionMessagesAbandoned` was called with the
+  session ID; `sessionManager.removeSessionImmediate` was called;
+  `eventBroadcaster.broadcastSessionCompleted` was called.
+
+If SessionRoutes has no existing test harness, keep scope small: lift the
+restart-trip branch into a private helper method first, export it for
+testing, and assert on the helper. Don't build a full HTTP harness for this.
+
+### Verification
+
+- `bun test tests/worker/RestartGuard.test.ts` — green.
+- `bun test tests/worker/worker-service-unrecoverable.test.ts` — green.
+- `bun test tests/worker/http/SessionRoutes.restart-trip.test.ts` — green.
+- `bun test` (full suite) — no regressions.
+
+### Anti-pattern guard
+
+- Don't stub Claude Agent SDK in these tests — they are pure unit tests.
+- Don't test behavior that would require spinning up the real worker daemon.
+
+---
+
+## Phase 5 — Final Verification
+
+Run the full validation sequence:
+
+1. `bun run build` — builds cleanly (no TS errors).
+2. `bun test` — full suite green.
+3. `grep -rn 'Messages remain in pending state' src/` — no matches (the phrase
+   is gone from the codebase).
+4. `grep -n "'OAuth token expired'" src/services/worker-service.ts` — matches.
+5. `grep -n 'ABSOLUTE_LIFETIME_RESTART_CAP\|REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY' src/services/worker/RestartGuard.ts`
+   — both match.
+6. Sanity read of the diff — each of the three changes is isolated and local,
+   no refactor creep.
+7. Update `CHANGELOG.md` — actually, per repo CLAUDE.md the changelog is
+   auto-generated. **Do not edit.**
+8. Stage & commit on the existing branch
+   `investigate/windows-infinite-loop-usage-drain`. Commit message:
+   `fix: stop Max-plan drain loop on Windows (RestartGuard + SessionRoutes + OAuth patterns)`.
+
+### Anti-pattern guard
+
+- Do not touch `CHANGELOG.md` (auto-generated per CLAUDE.md).
+- Do not bump version here — release flow handles it.
+- Do not merge to main — the user wants a PR.
+
+---
+
+## Execution note
+
+Each implementation phase (1, 2, 3, 4) is self-contained and cites the exact
+files/lines discovered in Phase 0. Phases 1, 2, 3 can run in parallel;
+Phase 4 depends on Phase 2 (RestartGuard API shape) and Phase 3 (optional
+helper export from worker-service). Phase 5 depends on all prior phases.
--- a/src/services/worker-service.ts
+++ b/src/services/worker-service.ts
@@ -29,6 +29,7 @@ import { sanitizeEnv } from '../supervisor/env-sanitizer.js';
 // transitively pulls in the SQLite database layer via ChromaSync/DatabaseManager.
 import { ensureWorkerStarted as ensureWorkerStartedShared } from './worker-spawner.js';
 import { RestartGuard } from './worker/RestartGuard.js';
+import { isUnrecoverableError } from './worker/unrecoverable-patterns.js';

 // Re-export for backward compatibility — canonical implementation in shared/plugin-state.ts
 export { isPluginDisabledInClaudeSettings } from '../shared/plugin-state.js';
@@ -709,23 +710,10 @@ export class WorkerService {
        const errorMessage = (error as Error)?.message || '';

        // Detect unrecoverable errors that should NOT trigger restart
-        // These errors will fail immediately on retry, causing infinite loops
-        const unrecoverablePatterns = [
-          'Claude executable not found',
-          'CLAUDE_CODE_PATH',
-          'ENOENT',
-          'spawn',
-          'Invalid API key',
-          'API_KEY_INVALID',
-          'API key expired',
-          'API key not valid',
-          'PERMISSION_DENIED',
-          'Gemini API error: 400',
-          'Gemini API error: 401',
-          'Gemini API error: 403',
-          'FOREIGN KEY constraint failed',
-        ];
-        if (unrecoverablePatterns.some(pattern => errorMessage.includes(pattern))) {
+        // These errors will fail immediately on retry, causing infinite loops.
+        // Pattern list lives in ./worker/unrecoverable-patterns.ts so the matcher
+        // is unit-testable without importing this full module.
+        if (isUnrecoverableError(errorMessage)) {
          hadUnrecoverableError = true;
          this.lastAiInteraction = {
            timestamp: Date.now(),
@@ -831,7 +819,9 @@ export class WorkerService {
              pendingCount,
              restartsInWindow: session.restartGuard.restartsInWindow,
              windowMs: session.restartGuard.windowMs,
-              maxRestarts: session.restartGuard.maxRestarts
+              maxRestarts: session.restartGuard.maxRestarts,
+              totalRestarts: session.restartGuard.totalRestarts,
+              lifetimeCap: session.restartGuard.lifetimeCap
            });
            session.consecutiveRestarts = 0;
            this.terminateSession(session.sessionDbId, 'max_restarts_exceeded');
--- a/src/services/worker/RestartGuard.ts
+++ b/src/services/worker/RestartGuard.ts
@@ -3,28 +3,51 @@
 * Prevents tight-loop restarts (bug) while allowing legitimate occasional restarts
 * over long sessions. Replaces the flat consecutiveRestarts counter that stranded
 * pending messages after just 3 restarts over any timeframe (#2053).
+ *
+ * Additional guards (Phase 2 of windows-max-plan-drain-fix):
+ *   - Decay of windowed history only fires after N consecutive successes, so a
+ *     single fluky success cannot clear the runaway-loop detector.
+ *   - Absolute lifetime cap: once total restarts cross the cap, the guard trips
+ *     permanently and cannot be reset by later successes.
 */

 const RESTART_WINDOW_MS = 60_000;      // Only count restarts within last 60 seconds
 const MAX_WINDOWED_RESTARTS = 10;      // 10 restarts in 60s = runaway loop
 const DECAY_AFTER_SUCCESS_MS = 5 * 60_000; // Clear history after 5min of uninterrupted success
+const REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY = 5;
+const ABSOLUTE_LIFETIME_RESTART_CAP = 50;

 export class RestartGuard {
  private restartTimestamps: number[] = [];
  private lastSuccessfulProcessing: number | null = null;
+  private consecutiveSuccessCount = 0;
+  private totalRestartsAllTime = 0;
+  private decayEligible = false;

  /**
   * Record a restart and check if the guard should trip.
   * @returns true if the restart is ALLOWED, false if it should be BLOCKED
   */
  recordRestart(): boolean {
+    this.totalRestartsAllTime += 1;
+    this.consecutiveSuccessCount = 0; // streak broken by any restart
+
+    // Terminal: lifetime cap reached — never resets, even if successes follow.
+    if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) {
+      return false;
+    }
+
    const now = Date.now();

-    // Decay: clear history only after real success + 5min of uninterrupted success
-    if (this.lastSuccessfulProcessing !== null
+    // Decay: only fires if we accumulated the required consecutive successes
+    // AND 5min has elapsed since the last success. One-off successes cannot
+    // clear the windowed-restart history.
+    if (this.decayEligible
+        && this.lastSuccessfulProcessing !== null
        && now - this.lastSuccessfulProcessing >= DECAY_AFTER_SUCCESS_MS) {
      this.restartTimestamps = [];
      this.lastSuccessfulProcessing = null;
+      this.decayEligible = false;
    }

    // Prune old timestamps outside the window
@@ -41,9 +64,15 @@ export class RestartGuard {

  /**
   * Call when a message is successfully processed to update the success timestamp.
+   * Requires REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY consecutive calls before
+   * the restart-window decay path can fire.
   */
  recordSuccess(): void {
+    this.consecutiveSuccessCount += 1;
    this.lastSuccessfulProcessing = Date.now();
+    if (this.consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY) {
+      this.decayEligible = true;
+    }
  }

  /**
@@ -67,4 +96,19 @@ export class RestartGuard {
  get maxRestarts(): number {
    return MAX_WINDOWED_RESTARTS;
  }
+
+  /**
+   * Total restarts counted for the lifetime of this guard (for logging).
+   * Never decreases; compared against `lifetimeCap` to decide terminal trip.
+   */
+  get totalRestarts(): number {
+    return this.totalRestartsAllTime;
+  }
+
+  /**
+   * Absolute lifetime restart cap (for logging).
+   */
+  get lifetimeCap(): number {
+    return ABSOLUTE_LIFETIME_RESTART_CAP;
+  }
 }
--- a/src/services/worker/http/routes/SessionRoutes.ts
+++ b/src/services/worker/http/routes/SessionRoutes.ts
@@ -97,6 +97,49 @@ export class SessionRoutes extends BaseRouteHandler {
  private static readonly STALE_GENERATOR_THRESHOLD_MS = 30_000; // 30 seconds (#1099)
  private static readonly MAX_SESSION_WALL_CLOCK_MS = 4 * 60 * 60 * 1000; // 4 hours (#1590)

+  /**
+   * Handle the restart-guard-tripped branch of crash recovery.
+   *
+   * When the windowed restart guard (or lifetime cap) trips, we must:
+   *   1. abort the SDK subprocess so it stops consuming Claude
+   *   2. mark pending messages abandoned so the next worker startup's
+   *      processPendingQueues() does NOT replay them (drain loop fix)
+   *   3. remove the in-memory session and broadcast completion so the UI
+   *      reflects the terminal state
+   *
+   * Mirrors WorkerService.terminateSession() semantics — cannot call that
+   * method directly because it's private. Extracted into a helper so it's
+   * unit-testable without standing up a full HTTP harness (see
+   * tests/worker/http/SessionRoutes.restart-trip.test.ts).
+   */
+  private handleRestartGuardTripped(
+    sessionDbId: number,
+    pendingCount: number,
+    session: { abortController: AbortController; restartGuard?: RestartGuard }
+  ): void {
+    const restartGuard = session.restartGuard;
+    logger.error('SESSION', `CRITICAL: Restart guard tripped — too many restarts in window, stopping to prevent runaway costs`, {
+      sessionId: sessionDbId,
+      pendingCount,
+      restartsInWindow: restartGuard?.restartsInWindow,
+      windowMs: restartGuard?.windowMs,
+      maxRestarts: restartGuard?.maxRestarts,
+      totalRestarts: restartGuard?.totalRestarts,
+      lifetimeCap: restartGuard?.lifetimeCap,
+      action: 'Pending messages abandoned to prevent replay loop.'
+    });
+    session.abortController.abort();
+    const pendingStore = this.sessionManager.getPendingMessageStore();
+    const abandoned = pendingStore.markAllSessionMessagesAbandoned(sessionDbId);
+    logger.info('SYSTEM', 'Session terminated', {
+      sessionId: sessionDbId,
+      reason: 'max_restarts_exceeded',
+      abandonedMessages: abandoned
+    });
+    this.sessionManager.removeSessionImmediate(sessionDbId);
+    this.eventBroadcaster.broadcastSessionCompleted(sessionDbId);
+  }
+
  private ensureGeneratorRunning(sessionDbId: number, source: string): void {
    const session = this.sessionManager.getSession(sessionDbId);
    if (!session) return;
@@ -316,16 +359,7 @@ export class SessionRoutes extends BaseRouteHandler {
            session.consecutiveRestarts = (session.consecutiveRestarts || 0) + 1; // Keep for logging

            if (!restartAllowed) {
-              logger.error('SESSION', `CRITICAL: Restart guard tripped — too many restarts in window, stopping to prevent runaway costs`, {
-                sessionId: sessionDbId,
-                pendingCount,
-                restartsInWindow: session.restartGuard.restartsInWindow,
-                windowMs: session.restartGuard.windowMs,
-                maxRestarts: session.restartGuard.maxRestarts,
-                action: 'Generator will NOT restart. Check logs for root cause. Messages remain in pending state.'
-              });
-              // Don't restart - abort to prevent further API calls
-              session.abortController.abort();
+              this.handleRestartGuardTripped(sessionDbId, pendingCount, session);
              return;
            }

@@ -334,7 +368,9 @@ export class SessionRoutes extends BaseRouteHandler {
              pendingCount,
              consecutiveRestarts: session.consecutiveRestarts,
              restartsInWindow: session.restartGuard!.restartsInWindow,
-              maxRestarts: session.restartGuard!.maxRestarts
+              maxRestarts: session.restartGuard!.maxRestarts,
+              totalRestarts: session.restartGuard!.totalRestarts,
+              lifetimeCap: session.restartGuard!.lifetimeCap
            });

            // Abort OLD controller before replacing to prevent child process leaks
--- a/src/services/worker/unrecoverable-patterns.ts
+++ b/src/services/worker/unrecoverable-patterns.ts
@@ -0,0 +1,45 @@
+/**
+ * Unrecoverable error patterns
+ *
+ * Error messages matching any of these substrings should NOT trigger a restart
+ * of the SDK generator. Retrying them will fail identically and drain API
+ * budget / Max-plan usage (see windows-max-plan-drain-fix plan, Phase 3).
+ *
+ * Matching is a plain substring check via `String.prototype.includes`, so
+ * patterns must be specific enough to avoid false positives (e.g. bare `'401'`
+ * would match request IDs and is forbidden — use agent-prefixed forms).
+ */
+
+export const UNRECOVERABLE_ERROR_PATTERNS: readonly string[] = [
+  'Claude executable not found',
+  'CLAUDE_CODE_PATH',
+  'ENOENT',
+  'spawn',
+  'Invalid API key',
+  'API_KEY_INVALID',
+  'API key expired',
+  'API key not valid',
+  'PERMISSION_DENIED',
+  'Gemini API error: 400',
+  'Gemini API error: 401',
+  'Gemini API error: 403',
+  // OAuth / subscription-token expiry (Max plan users) — matches SDK
+  // subprocess error messages when the inherited CLAUDE_CODE_OAUTH_TOKEN
+  // is no longer valid.
+  'OAuth token expired',
+  'token has been revoked',
+  'Unauthorized',
+  // Parallel to 'Gemini API error: 401' — catches OpenRouter OAuth failures.
+  'OpenRouter API error: 401',
+  'OpenRouter API error: 403',
+  'FOREIGN KEY constraint failed',
+];
+
+/**
+ * Returns true if the given error message matches any unrecoverable pattern.
+ * Accepts an empty / non-string input and returns false in that case.
+ */
+export function isUnrecoverableError(errorMessage: string | null | undefined): boolean {
+  if (!errorMessage) return false;
+  return UNRECOVERABLE_ERROR_PATTERNS.some(pattern => errorMessage.includes(pattern));
+}