fix: stop Windows Max-plan drain loop — terminate session on restart-guard trip, tighten RestartGuard, catch OAuth expiry

Discord report: Windows 11 user's Max-plan usage was being consumed in the
background. Root cause is a regression introduced alongside the v12.3.3
RestartGuard: (1) when the guard trips in the HTTP-layer crash-recovery
path, pending messages were left in 'pending' state and auto-replayed by
processPendingQueues on the next daemon start; (2) any single successful
message cleared the restart-window decay, so a 9-of-10 failure regime
never tripped the guard; (3) OAuth-token expiry errors are not in the
unrecoverable list and loop forever on Max-plan subscriptions.

SessionRoutes.ts: on restart-guard trip, abandon pending messages and
remove the session (mirrors WorkerService.terminateSession which is
private). Pattern copied from the sibling block at SessionRoutes.ts:123.

RestartGuard.ts: require 5 consecutive recordSuccess() calls before the
window-decay path becomes eligible; add a terminal lifetime cap of 50
total restarts that never resets. Public API unchanged.

worker-service.ts: extract unrecoverablePatterns into a testable helper
module and add OAuth/OpenRouter-auth patterns: 'OAuth token expired',
'token has been revoked', 'Unauthorized', 'OpenRouter API error: 401',
'OpenRouter API error: 403'. Bare '401' deliberately not added (too broad).

Investigation and phased plan docs included for reviewers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Alex Newman
2026-04-20 20:02:18 -07:00
parent 49ab404c08
commit a172c6e4ca
6 changed files with 722 additions and 31 deletions

View File

@@ -0,0 +1,163 @@
# Investigation: Windows 11 Max-Plan Usage Drain (v12.3.3+ regression)
**Branch:** `investigate/windows-infinite-loop-usage-drain`
**Date:** 2026-04-20
**Reporter:** Discord user (Win11 + Claude Code CLI + claude-mem latest)
**Symptoms:**
- "Going into a loop with Claude"
- "Consuming entire Max plan usage in the background"
- "Continuous failed Python hooks" in the morning after updating
- Issue stops only after uninstalling claude-mem
## TL;DR
Between v12.3.2 (last "safe") and v12.3.7 (current), the flat 3-restart cap on
SDK generator crashes was replaced with a time-windowed `RestartGuard` that
**resets its decay window on any successful message**. On Windows, where the
MCP loopback and Claude-executable resolution have additional failure modes
(observation 71051 — *"MCP loopback failure causes 91.6% session failure rate"*),
the worker enters a slow-drip crash loop that never trips the new guard, and
burns the user's Max-plan OAuth token on each restart's Claude Agent SDK call.
## Root-cause chain
1. **Auth path**`src/shared/EnvManager.ts:215` builds an isolated env for
every SDK subprocess spawn. When the user has no `ANTHROPIC_API_KEY` in
`~/.claude-mem/.env`, line 255 passes through `CLAUDE_CODE_OAUTH_TOKEN` from
the parent Claude Code session. **Every worker-driven Claude call is billed
against the user's Max subscription.** That is by design, but it's the
blast-radius multiplier for every other bug in the chain.
2. **Automatic replay on every worker start**
`src/services/worker-service.ts:592` calls `processPendingQueues(50)` during
worker initialization. It re-spawns an SDK subprocess for every session
with `status IN ('pending', 'processing')` messages
(`PendingMessageStore.getSessionsWithPendingMessages()`, line 447). A single
accumulated backlog from a previous failed run becomes a fresh storm of
Claude calls on the next daemon restart.
3. **RestartGuard is too permissive for slow-drip failures**
`src/services/worker/RestartGuard.ts`:
- `MAX_WINDOWED_RESTARTS = 10` restarts per 60 s
- `DECAY_AFTER_SUCCESS_MS = 5 min` — on *any* `recordSuccess()` call, the
restart timestamp array is wiped
- `ResponseProcessor.ts:211` calls `recordSuccess()` after *any* batch where
messages were confirmed
If 1-of-11 SDK invocations succeeds, the window never fills, the decay
clears history, and restarts continue indefinitely. At the observed 91.6 %
MCP failure rate this is exactly the regime the user is in.
4. **Two divergent crash-recovery paths**
- `src/services/worker-service.ts:822-857` — on restart-guard trip calls
`terminateSession()` which calls `pendingStore.markAllSessionMessagesAbandoned()`
(PendingMessageStore.ts:293). Correct behavior.
- `src/services/worker/http/routes/SessionRoutes.ts:318-330` — on restart-
guard trip only `session.abortController.abort()`. **Messages remain in
`pending` state** (explicitly acknowledged in the log message on line 325).
Next worker startup's `processPendingQueues()` grabs them again, starting
the loop over.
5. **Exponential-backoff ceiling amplifies damage**
`SessionRoutes.ts:348` and `worker-service.ts` both cap backoff at 8 s after
4+ restarts. Steady state is ~7 restarts/minute ≈ 10 000 SDK invocations/day
before the guard trips — if it ever does (see #3).
6. **OAuth-expiry has no special handling** — the `unrecoverablePatterns` list
(`worker-service.ts:713-727`) matches on `'Invalid API key'`, `'API_KEY_INVALID'`,
`'API key expired'`, `'API key not valid'`. None of these match OAuth-token
failures. An expired/revoked `CLAUDE_CODE_OAUTH_TOKEN` produces errors that
the worker treats as transient and retries. Observation 55605 records PR
#1180 as a prior "OAuth Token Expiry Infinite Retry Loop" fix — the same
class of bug has re-surfaced against a new token type.
## "Failed Python hooks"
The user's wording is a misattribution. claude-mem's hooks are TypeScript
compiled to `plugin/scripts/*-hook.cjs` and run via Bun/Node. However:
- `uv` (Python toolchain) is installed for ChromaDB, and the `ChromaMcpManager`
spawns a Python process for vector sync.
- When Chroma sync fails, errors surface in the hook's stderr alongside the
real Bun/Node hook failure.
- On Windows the message the user most likely saw was the Chroma uv/Python
subprocess failing, conflated with the hook wrapper's own failure output.
It is **not** that hooks themselves are Python. The underlying bug is the
worker's SDK retry loop.
## Windows-specific amplifiers
- `SDKAgent.ts:466-473``where claude.cmd` resolution is tried first on Win32.
Any environment where `PATHEXT` or `where` behaves oddly (mingw, Git Bash
with stripped PATH) returns an error that is caught silently and falls
through to the generic "Claude executable not found". That string IS in
`unrecoverablePatterns`, so it should abort cleanly — but only if the SDK
surfaces it. In practice, transient subprocess spawn races on Windows surface
as generic errors that **don't** match the list, and then restart.
- `worker-spawner.ts:39` — the Windows spawn-cooldown lock (2 min) only
suppresses *daemon* spawn attempts, not the SDK subprocess spawn-storm
described here.
- `hook-constants.ts:30-34` — Windows gets a hook-timeout multiplier. Longer
hook windows = more time for the crash loop to run per session.
## What changed vs v12.3.2
```
src/services/worker/RestartGuard.ts | 70 ++++++ (NEW)
src/services/worker/http/routes/SessionRoutes.ts | 24 +/- (MAX=3 → RestartGuard)
src/services/worker-service.ts | 28 +/- (MAX=3 → RestartGuard)
src/services/sqlite/PendingMessageStore.ts | 19 ++ (clearFailed → clearFailedOlderThan(1h))
```
The regression is squarely in the restart-guard swap. The old flat counter
would have tripped after 3 crashes and stopped the SDK spawn, regardless of
whether anything eventually succeeded.
## Recommended fixes (in priority order)
1. **SessionRoutes.ts restart-guard trip must call `terminateSession`** (or
`markAllSessionMessagesAbandoned`) — mirror the behavior in
`worker-service.ts:837`. Today it explicitly leaves messages pending, which
guarantees re-replay on daemon restart.
2. **Tighten RestartGuard decay** — either
- require N consecutive successes before decay, not a single one, or
- track a separate failure rate; if fail-rate > 50 % over the window, trip
regardless of `recordSuccess()` calls.
3. **Add OAuth-expiry to `unrecoverablePatterns`** — common SDK error strings
from expired OAuth tokens (`Unauthorized`, `OAuth token expired`,
`token has been revoked`, 401 responses) should be treated the same as
`'Invalid API key'`.
4. **Cap absolute restart count per session-lifetime** — RestartGuard caps per
window but has no absolute ceiling. A hard cap (e.g. 50 restarts regardless
of window) protects users from the decay-loop regime.
5. **Kill-switch** — a `CLAUDE_MEM_PAUSE_WORKER` setting the user can flip
without uninstalling, so the next Discord user isn't forced to uninstall
to stop the bleeding. Hook entrypoints would short-circuit if set.
6. **Telemetry on worker startup** — emit a warning if
`processPendingQueues()` finds > N orphaned sessions or > M orphaned
messages. Today the auto-recovery is silent for backlogs of any size.
## Files to touch for the fix
- `src/services/worker/http/routes/SessionRoutes.ts:318-330` — call
`terminateSession` instead of bare `abort()`.
- `src/services/worker/RestartGuard.ts` — stricter decay semantics + absolute
cap.
- `src/services/worker-service.ts:713-727` — extend `unrecoverablePatterns`.
- `src/shared/SettingsDefaultsManager.ts``CLAUDE_MEM_PAUSE_WORKER` flag.
## What the user should do *right now*
Until a fix ships, the Discord user's mitigation (uninstall) is correct. As
a less-drastic workaround:
1. Stop the worker: `curl -X POST http://localhost:37777/api/shutdown` (or
kill the `bun` process in Task Manager).
2. Delete `~/.claude-mem/claude-mem.db-wal` and empty the `pending_messages`
table via sqlite3 to break any stored replay queue.
3. Remove the plugin from `~/.claude.json` until the fix ships.

View File

@@ -0,0 +1,413 @@
# Plan: Fix Windows Max-Plan Drain (v12.3.3+ regression)
Context: `INVESTIGATION-windows-max-plan-drain.md` in repo root. Branch:
`investigate/windows-infinite-loop-usage-drain`. Test runner: `bun test`.
## Phase 0 — Documentation Discovery (COMPLETED)
Findings consolidated from three discovery subagents:
### Codebase APIs available (verified)
- **`SessionRoutes` already has** `this.sessionManager`, `this.eventBroadcaster`,
`this.workerService` (`src/services/worker/http/routes/SessionRoutes.ts:34-49`).
- **`WorkerService.terminateSession` is `private`** (`worker-service.ts:964`) —
cannot call from SessionRoutes. Must replicate the three-step pattern
inline. Reference implementation: `worker-service.ts:964-976`.
- **Abandonment pattern already used in SessionRoutes** at
`SessionRoutes.ts:123-125``pendingStore.markAllSessionMessagesAbandoned()`
+ `sessionManager.removeSessionImmediate()`. Copy this pattern.
- **`SessionManager.removeSessionImmediate`** is public
(`SessionManager.ts:453-468`) and fires `onSessionDeletedCallback`
`broadcastProcessingStatus()` automatically.
- **`SessionEventBroadcaster.broadcastSessionCompleted`** exists
(`events/SessionEventBroadcaster.ts:73-82`) — `worker-service.ts:951`
does call it after `terminateSession`, so mirror that for parity.
### `RestartGuard` facts
- Full impl is only 70 lines (`src/services/worker/RestartGuard.ts`).
- **No existing tests** for RestartGuard.
- Public surface: `recordRestart(): boolean`, `recordSuccess(): void`,
getters `restartsInWindow`, `windowMs`, `maxRestarts`.
- Call sites (must remain compatible):
- `SessionRoutes.ts:314-315, 322-324, 336-337`
- `worker-service.ts:824-825, 832-834, 854`
- `ResponseProcessor.ts:211``recordSuccess()` called only after a batch is
confirmed to storage.
- Field: `ActiveSession.restartGuard?: RestartGuard` (`worker-types.ts:39`).
### `unrecoverablePatterns` facts
- Location: `worker-service.ts:713-727`. Match is
`unrecoverablePatterns.some(p => errorMessage.includes(p))`
— simple substring on `(error as Error)?.message`.
- **DO NOT add bare `'401'`** — matches request IDs, log lines, etc. Use
agent-prefixed forms instead.
- Gap found: **OpenRouter 401 is not currently caught**
(`OpenRouterAgent.ts:458`). Mirror the Gemini pattern for consistency.
- Existing test file that covers transient/terminal classification:
`tests/worker/agents/fallback-error-handler.test.ts` (line 75-77 asserts
`'401 Unauthorized'` is terminal). No direct tests exist for
`unrecoverablePatterns` — add them.
### Anti-patterns to avoid
- ❌ Making `terminateSession` public and calling it from SessionRoutes — couples
the HTTP layer to WorkerService internals further; the three-step pattern is
already public API surface.
- ❌ Adding bare `'401'` to `unrecoverablePatterns` — too broad.
- ❌ Refactoring RestartGuard public API — call sites depend on getters.
- ❌ Creating new test framework config — project already uses `bun test`.
---
## Phase 1 — Fix SessionRoutes restart-guard trip (stops the drip)
**Goal:** When the windowed restart guard trips in the HTTP-layer crash-recovery
path, abandon pending messages instead of leaving them in `'pending'` state, so
the next worker startup's `processPendingQueues()` does not replay them.
**Target file:** `src/services/worker/http/routes/SessionRoutes.ts`
**Doc reference (pattern to copy, verbatim except for the reason string):**
1. `SessionRoutes.ts:123-125` — existing abandonment pattern in the same class.
2. `worker-service.ts:964-976` — the private `terminateSession` body to
replicate (logger message + reason + broadcastSessionCompleted call).
3. `worker-service.ts:951` — confirms `broadcastSessionCompleted` is called
after abandonment in the sibling code path.
### Task 1.1 — Replace `abort()`-only trip with full abandonment
At `SessionRoutes.ts:326-329`, replace:
```ts
// Don't restart - abort to prevent further API calls
session.abortController.abort();
return;
```
With (copy-adapt from the pattern at SessionRoutes.ts:123-125 + the
log shape from worker-service.ts:968-972):
```ts
// Restart guard tripped — abandon pending messages so the next worker
// startup's processPendingQueues() does NOT replay them, and kill the
// subprocess. Mirrors WorkerService.terminateSession() semantics.
// (Cannot call terminateSession directly — it's private in WorkerService.)
session.abortController.abort();
const pendingStore = this.sessionManager.getPendingMessageStore();
const abandoned = pendingStore.markAllSessionMessagesAbandoned(sessionDbId);
logger.info('SYSTEM', 'Session terminated', {
sessionId: sessionDbId,
reason: 'max_restarts_exceeded',
abandonedMessages: abandoned
});
this.sessionManager.removeSessionImmediate(sessionDbId);
this.eventBroadcaster.broadcastSessionCompleted(sessionDbId);
return;
```
Also update the log at `SessionRoutes.ts:319-326`: change the `action:` string
from `'Messages remain in pending state.'` to
`'Pending messages abandoned to prevent replay loop.'`.
### Verification
- `grep -n 'Messages remain in pending state' src/` → no matches.
- `grep -n 'markAllSessionMessagesAbandoned' src/services/worker/http/routes/SessionRoutes.ts`
→ should now find TWO call sites (existing L124 + new one in restart-trip).
- Manual trace: after trip, `pending_messages.status` for that session is
`'failed'` (or whatever `markAllSessionMessagesAbandoned` sets it to —
confirm by reading `PendingMessageStore.ts:293`).
### Anti-pattern guard
- Don't make `terminateSession` public. Replicate inline — the three calls are
already public API.
- Don't skip `abortController.abort()` before abandonment — the SDK subprocess
must be killed or it will keep consuming Claude.
---
## Phase 2 — Tighten RestartGuard (stops the loop from getting far enough to drip)
**Goal:** Require N=5 consecutive `recordSuccess()` calls before clearing the
restart history, and add an absolute lifetime cap of 50 restarts that, once
tripped, never resets.
**Target file:** `src/services/worker/RestartGuard.ts`
**Doc reference:**
- Current impl quoted in full in Phase 0. Consumers at:
- `SessionRoutes.ts:314-315, 322-324, 336-337`
- `worker-service.ts:824-825, 832-834, 854`
- `ResponseProcessor.ts:211`
- Public surface must remain compatible: `recordRestart(): boolean`,
`recordSuccess(): void`, getters `restartsInWindow`, `windowMs`,
`maxRestarts`.
### Task 2.1 — Add constants and private fields
Add to top of file:
```ts
const REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY = 5;
const ABSOLUTE_LIFETIME_RESTART_CAP = 50;
```
Add to class:
```ts
private consecutiveSuccessCount = 0;
private totalRestartsAllTime = 0;
```
### Task 2.2 — Rewrite `recordRestart()` semantics
Behavior:
1. Increment `totalRestartsAllTime`.
2. If `totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP`, return `false`
immediately (never resets — terminal).
3. Any restart resets `consecutiveSuccessCount` to 0 (a failure interrupts the
success streak).
4. Existing decay check: only fire if
`lastSuccessfulProcessing !== null` **AND** we had
`consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY` the
last time `recordSuccess` was called **AND** 5 min elapsed. To make the
logic clean: only gate decay on an existing flag `decayEligible` that is
set to `true` by `recordSuccess` once the streak threshold is reached.
5. Otherwise preserve current window-based logic.
Simplest structure:
```ts
recordRestart(): boolean {
this.totalRestartsAllTime += 1;
this.consecutiveSuccessCount = 0; // streak broken
if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) {
return false; // terminal — lifetime cap reached
}
const now = Date.now();
if (this.decayEligible
&& this.lastSuccessfulProcessing !== null
&& now - this.lastSuccessfulProcessing >= DECAY_AFTER_SUCCESS_MS) {
this.restartTimestamps = [];
this.lastSuccessfulProcessing = null;
this.decayEligible = false;
}
this.restartTimestamps = this.restartTimestamps.filter(
ts => now - ts < RESTART_WINDOW_MS
);
this.restartTimestamps.push(now);
return this.restartTimestamps.length <= MAX_WINDOWED_RESTARTS;
}
```
### Task 2.3 — Update `recordSuccess()` to require N consecutive calls
```ts
recordSuccess(): void {
this.consecutiveSuccessCount += 1;
this.lastSuccessfulProcessing = Date.now();
if (this.consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY) {
this.decayEligible = true;
}
}
```
Add private field `decayEligible = false`.
### Task 2.4 — Add introspection getters for logging (optional but useful)
```ts
get totalRestarts(): number { return this.totalRestartsAllTime; }
get lifetimeCap(): number { return ABSOLUTE_LIFETIME_RESTART_CAP; }
```
### Task 2.5 — Update log statements at trip sites to include new fields
- `SessionRoutes.ts:319-326`: add `totalRestarts: session.restartGuard.totalRestarts`
and `lifetimeCap: session.restartGuard.lifetimeCap` to the error log payload.
- `worker-service.ts:828-835`: same additions.
### Verification
- `bun test tests/worker/RestartGuard.test.ts` (new file — see Phase 4).
- Behavior check: 49 failed restarts → still allowed; 50th → blocked; 51st →
blocked (lifetime cap persists).
- Behavior check: 4 successes then restart → full window still counted.
5th success then restart 5 min later → window cleared.
### Anti-pattern guard
- Don't make `recordSuccess` reset `totalRestartsAllTime` — the lifetime cap is
meant to be terminal.
- Don't silently change the meaning of the existing getters — add new ones.
---
## Phase 3 — Extend `unrecoverablePatterns` (handle OAuth expiry)
**Goal:** When Max-plan OAuth tokens expire/are revoked, treat the SDK error as
unrecoverable so no restart is attempted.
**Target file:** `src/services/worker-service.ts:713-727`
**Doc reference:**
- Current array + match logic quoted in Phase 0 discovery.
- `OpenRouterAgent.ts:458` — error throw format is
`` `OpenRouter API error: ${response.status} - ${errorText}` ``.
- `tests/worker/agents/fallback-error-handler.test.ts:75-77` — existing
assertion `'401 Unauthorized'` is terminal, confirming it is safe to treat
as unrecoverable.
### Task 3.1 — Add OAuth / OpenRouter auth strings
Add these entries (in a clear grouping comment) to the array at
`worker-service.ts:713-727`:
```ts
// OAuth / subscription-token expiry (Max plan users) — matches SDK
// subprocess error messages when the inherited CLAUDE_CODE_OAUTH_TOKEN
// is no longer valid.
'OAuth token expired',
'token has been revoked',
'Unauthorized',
// Parallel to 'Gemini API error: 401' — catches OpenRouter OAuth failures.
'OpenRouter API error: 401',
'OpenRouter API error: 403',
```
**Do not add bare `'401'`** — too broad. Anti-pattern confirmed by discovery.
### Verification
- `grep -n "'Unauthorized'" src/services/worker-service.ts` → should match.
- Run `bun test tests/worker/worker-service-unrecoverable.test.ts` (new file —
see Phase 4) — assertions cover each new pattern.
- Run `bun test tests/worker/agents/fallback-error-handler.test.ts` — must
still pass (no regressions).
### Anti-pattern guard
- No regex or `startsWith` — keep substring match semantics to avoid breaking
existing entries.
- Don't unpack `error.code` or nested JSON bodies in this phase — discovery
confirmed all current agent error throws embed the status/message in
`.message`.
---
## Phase 4 — Tests (no tests existed for RestartGuard or unrecoverablePatterns)
**Goal:** Prevent regression on all three fixes.
**Test framework:** `bun test` (see `package.json` scripts).
### Task 4.1 — Create `tests/worker/RestartGuard.test.ts`
Test suites (doc: `src/services/worker/RestartGuard.ts` as modified in Phase 2):
- `recordRestart respects window`: push 10 restarts in <60s → allowed; 11th →
blocked.
- `recordSuccess requires N consecutive before decay`: 4 successes then
restart 6 min later → window still populated; 5 successes then restart 6 min
later → window cleared.
- `restart breaks success streak`: 3 successes → 1 restart → 4 more successes
(total 7 successes with gap) → streak counter is 4 at end, not 7 (decay
NOT yet eligible).
- `lifetime cap is terminal`: 50 restarts → OK; 51st → blocked even if window
is empty; recordSuccess cannot un-block it.
- `getters return expected values`: `totalRestarts`, `lifetimeCap`,
`restartsInWindow`, `maxRestarts`, `windowMs`.
Use `Date.now` mocking via Bun's `spyOn(Date, 'now')` or an injected clock.
Check project-local patterns first — if other tests in `tests/worker/` already
use a specific clock helper, copy that pattern. Otherwise straight `spyOn`.
### Task 4.2 — Create `tests/worker/worker-service-unrecoverable.test.ts`
Focused unit test on the `.some(p => errorMessage.includes(p))` matcher for
each new pattern. If extracting `unrecoverablePatterns` into a standalone
exported helper makes testing easier, do that as part of this task (small
refactor — pull the array + predicate into a named export at the top of
`worker-service.ts` and import it in the `.catch` block). Verify:
- Each of `'OAuth token expired'`, `'token has been revoked'`,
`'Unauthorized'`, `'OpenRouter API error: 401'`,
`'OpenRouter API error: 403'` matches a realistic error message.
- Bare `'401'` by itself (e.g., `"request-id-401xyz"`) does NOT match — this
test locks in the decision not to add bare `'401'`.
- All PRE-EXISTING patterns still match realistic messages (no regressions).
### Task 4.3 — Add SessionRoutes restart-trip integration test
If a test file already exists for `SessionRoutes` in `tests/worker/http/` or
`tests/server/`, extend it. Otherwise create
`tests/worker/http/SessionRoutes.restart-trip.test.ts`:
- Set up a session with pending messages and a `RestartGuard` pre-loaded near
the window cap.
- Trigger the restart path (mock `pendingStore.getPendingCount` > 0).
- Assert: `pendingStore.markAllSessionMessagesAbandoned` was called with the
session ID; `sessionManager.removeSessionImmediate` was called;
`eventBroadcaster.broadcastSessionCompleted` was called.
If SessionRoutes has no existing test harness, keep scope small: lift the
restart-trip branch into a private helper method first, export it for
testing, and assert on the helper. Don't build a full HTTP harness for this.
### Verification
- `bun test tests/worker/RestartGuard.test.ts` — green.
- `bun test tests/worker/worker-service-unrecoverable.test.ts` — green.
- `bun test tests/worker/http/SessionRoutes.restart-trip.test.ts` — green.
- `bun test` (full suite) — no regressions.
### Anti-pattern guard
- Don't stub Claude Agent SDK in these tests — they are pure unit tests.
- Don't test behavior that would require spinning up the real worker daemon.
---
## Phase 5 — Final Verification
Run the full validation sequence:
1. `bun run build` — builds cleanly (no TS errors).
2. `bun test` — full suite green.
3. `grep -rn 'Messages remain in pending state' src/` — no matches (the phrase
is gone from the codebase).
4. `grep -n "'OAuth token expired'" src/services/worker-service.ts` — matches.
5. `grep -n 'ABSOLUTE_LIFETIME_RESTART_CAP\|REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY' src/services/worker/RestartGuard.ts`
— both match.
6. Sanity read of the diff — each of the three changes is isolated and local,
no refactor creep.
7. Update `CHANGELOG.md` — actually, per repo CLAUDE.md the changelog is
auto-generated. **Do not edit.**
8. Stage & commit on the existing branch
`investigate/windows-infinite-loop-usage-drain`. Commit message:
`fix: stop Max-plan drain loop on Windows (RestartGuard + SessionRoutes + OAuth patterns)`.
### Anti-pattern guard
- Do not touch `CHANGELOG.md` (auto-generated per CLAUDE.md).
- Do not bump version here — release flow handles it.
- Do not merge to main — the user wants a PR.
---
## Execution note
Each implementation phase (1, 2, 3, 4) is self-contained and cites the exact
files/lines discovered in Phase 0. Phases 1, 2, 3 can run in parallel;
Phase 4 depends on Phase 2 (RestartGuard API shape) and Phase 3 (optional
helper export from worker-service). Phase 5 depends on all prior phases.

View File

@@ -29,6 +29,7 @@ import { sanitizeEnv } from '../supervisor/env-sanitizer.js';
// transitively pulls in the SQLite database layer via ChromaSync/DatabaseManager.
import { ensureWorkerStarted as ensureWorkerStartedShared } from './worker-spawner.js';
import { RestartGuard } from './worker/RestartGuard.js';
import { isUnrecoverableError } from './worker/unrecoverable-patterns.js';
// Re-export for backward compatibility — canonical implementation in shared/plugin-state.ts
export { isPluginDisabledInClaudeSettings } from '../shared/plugin-state.js';
@@ -709,23 +710,10 @@ export class WorkerService {
const errorMessage = (error as Error)?.message || '';
// Detect unrecoverable errors that should NOT trigger restart
// These errors will fail immediately on retry, causing infinite loops
const unrecoverablePatterns = [
'Claude executable not found',
'CLAUDE_CODE_PATH',
'ENOENT',
'spawn',
'Invalid API key',
'API_KEY_INVALID',
'API key expired',
'API key not valid',
'PERMISSION_DENIED',
'Gemini API error: 400',
'Gemini API error: 401',
'Gemini API error: 403',
'FOREIGN KEY constraint failed',
];
if (unrecoverablePatterns.some(pattern => errorMessage.includes(pattern))) {
// These errors will fail immediately on retry, causing infinite loops.
// Pattern list lives in ./worker/unrecoverable-patterns.ts so the matcher
// is unit-testable without importing this full module.
if (isUnrecoverableError(errorMessage)) {
hadUnrecoverableError = true;
this.lastAiInteraction = {
timestamp: Date.now(),
@@ -831,7 +819,9 @@ export class WorkerService {
pendingCount,
restartsInWindow: session.restartGuard.restartsInWindow,
windowMs: session.restartGuard.windowMs,
maxRestarts: session.restartGuard.maxRestarts
maxRestarts: session.restartGuard.maxRestarts,
totalRestarts: session.restartGuard.totalRestarts,
lifetimeCap: session.restartGuard.lifetimeCap
});
session.consecutiveRestarts = 0;
this.terminateSession(session.sessionDbId, 'max_restarts_exceeded');

View File

@@ -3,28 +3,51 @@
* Prevents tight-loop restarts (bug) while allowing legitimate occasional restarts
* over long sessions. Replaces the flat consecutiveRestarts counter that stranded
* pending messages after just 3 restarts over any timeframe (#2053).
*
* Additional guards (Phase 2 of windows-max-plan-drain-fix):
* - Decay of windowed history only fires after N consecutive successes, so a
* single fluky success cannot clear the runaway-loop detector.
* - Absolute lifetime cap: once total restarts cross the cap, the guard trips
* permanently and cannot be reset by later successes.
*/
const RESTART_WINDOW_MS = 60_000; // Only count restarts within last 60 seconds
const MAX_WINDOWED_RESTARTS = 10; // 10 restarts in 60s = runaway loop
const DECAY_AFTER_SUCCESS_MS = 5 * 60_000; // Clear history after 5min of uninterrupted success
const REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY = 5;
const ABSOLUTE_LIFETIME_RESTART_CAP = 50;
export class RestartGuard {
private restartTimestamps: number[] = [];
private lastSuccessfulProcessing: number | null = null;
private consecutiveSuccessCount = 0;
private totalRestartsAllTime = 0;
private decayEligible = false;
/**
* Record a restart and check if the guard should trip.
* @returns true if the restart is ALLOWED, false if it should be BLOCKED
*/
recordRestart(): boolean {
this.totalRestartsAllTime += 1;
this.consecutiveSuccessCount = 0; // streak broken by any restart
// Terminal: lifetime cap reached — never resets, even if successes follow.
if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) {
return false;
}
const now = Date.now();
// Decay: clear history only after real success + 5min of uninterrupted success
if (this.lastSuccessfulProcessing !== null
// Decay: only fires if we accumulated the required consecutive successes
// AND 5min has elapsed since the last success. One-off successes cannot
// clear the windowed-restart history.
if (this.decayEligible
&& this.lastSuccessfulProcessing !== null
&& now - this.lastSuccessfulProcessing >= DECAY_AFTER_SUCCESS_MS) {
this.restartTimestamps = [];
this.lastSuccessfulProcessing = null;
this.decayEligible = false;
}
// Prune old timestamps outside the window
@@ -41,9 +64,15 @@ export class RestartGuard {
/**
* Call when a message is successfully processed to update the success timestamp.
* Requires REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY consecutive calls before
* the restart-window decay path can fire.
*/
recordSuccess(): void {
this.consecutiveSuccessCount += 1;
this.lastSuccessfulProcessing = Date.now();
if (this.consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY) {
this.decayEligible = true;
}
}
/**
@@ -67,4 +96,19 @@ export class RestartGuard {
get maxRestarts(): number {
return MAX_WINDOWED_RESTARTS;
}
/**
* Total restarts counted for the lifetime of this guard (for logging).
* Never decreases; compared against `lifetimeCap` to decide terminal trip.
*/
get totalRestarts(): number {
return this.totalRestartsAllTime;
}
/**
* Absolute lifetime restart cap (for logging).
*/
get lifetimeCap(): number {
return ABSOLUTE_LIFETIME_RESTART_CAP;
}
}

View File

@@ -97,6 +97,49 @@ export class SessionRoutes extends BaseRouteHandler {
private static readonly STALE_GENERATOR_THRESHOLD_MS = 30_000; // 30 seconds (#1099)
private static readonly MAX_SESSION_WALL_CLOCK_MS = 4 * 60 * 60 * 1000; // 4 hours (#1590)
/**
* Handle the restart-guard-tripped branch of crash recovery.
*
* When the windowed restart guard (or lifetime cap) trips, we must:
* 1. abort the SDK subprocess so it stops consuming Claude
* 2. mark pending messages abandoned so the next worker startup's
* processPendingQueues() does NOT replay them (drain loop fix)
* 3. remove the in-memory session and broadcast completion so the UI
* reflects the terminal state
*
* Mirrors WorkerService.terminateSession() semantics — cannot call that
* method directly because it's private. Extracted into a helper so it's
* unit-testable without standing up a full HTTP harness (see
* tests/worker/http/SessionRoutes.restart-trip.test.ts).
*/
private handleRestartGuardTripped(
sessionDbId: number,
pendingCount: number,
session: { abortController: AbortController; restartGuard?: RestartGuard }
): void {
const restartGuard = session.restartGuard;
logger.error('SESSION', `CRITICAL: Restart guard tripped — too many restarts in window, stopping to prevent runaway costs`, {
sessionId: sessionDbId,
pendingCount,
restartsInWindow: restartGuard?.restartsInWindow,
windowMs: restartGuard?.windowMs,
maxRestarts: restartGuard?.maxRestarts,
totalRestarts: restartGuard?.totalRestarts,
lifetimeCap: restartGuard?.lifetimeCap,
action: 'Pending messages abandoned to prevent replay loop.'
});
session.abortController.abort();
const pendingStore = this.sessionManager.getPendingMessageStore();
const abandoned = pendingStore.markAllSessionMessagesAbandoned(sessionDbId);
logger.info('SYSTEM', 'Session terminated', {
sessionId: sessionDbId,
reason: 'max_restarts_exceeded',
abandonedMessages: abandoned
});
this.sessionManager.removeSessionImmediate(sessionDbId);
this.eventBroadcaster.broadcastSessionCompleted(sessionDbId);
}
private ensureGeneratorRunning(sessionDbId: number, source: string): void {
const session = this.sessionManager.getSession(sessionDbId);
if (!session) return;
@@ -316,16 +359,7 @@ export class SessionRoutes extends BaseRouteHandler {
session.consecutiveRestarts = (session.consecutiveRestarts || 0) + 1; // Keep for logging
if (!restartAllowed) {
logger.error('SESSION', `CRITICAL: Restart guard tripped — too many restarts in window, stopping to prevent runaway costs`, {
sessionId: sessionDbId,
pendingCount,
restartsInWindow: session.restartGuard.restartsInWindow,
windowMs: session.restartGuard.windowMs,
maxRestarts: session.restartGuard.maxRestarts,
action: 'Generator will NOT restart. Check logs for root cause. Messages remain in pending state.'
});
// Don't restart - abort to prevent further API calls
session.abortController.abort();
this.handleRestartGuardTripped(sessionDbId, pendingCount, session);
return;
}
@@ -334,7 +368,9 @@ export class SessionRoutes extends BaseRouteHandler {
pendingCount,
consecutiveRestarts: session.consecutiveRestarts,
restartsInWindow: session.restartGuard!.restartsInWindow,
maxRestarts: session.restartGuard!.maxRestarts
maxRestarts: session.restartGuard!.maxRestarts,
totalRestarts: session.restartGuard!.totalRestarts,
lifetimeCap: session.restartGuard!.lifetimeCap
});
// Abort OLD controller before replacing to prevent child process leaks

View File

@@ -0,0 +1,45 @@
/**
* Unrecoverable error patterns
*
* Error messages matching any of these substrings should NOT trigger a restart
* of the SDK generator. Retrying them will fail identically and drain API
* budget / Max-plan usage (see windows-max-plan-drain-fix plan, Phase 3).
*
* Matching is a plain substring check via `String.prototype.includes`, so
* patterns must be specific enough to avoid false positives (e.g. bare `'401'`
* would match request IDs and is forbidden — use agent-prefixed forms).
*/
export const UNRECOVERABLE_ERROR_PATTERNS: readonly string[] = [
'Claude executable not found',
'CLAUDE_CODE_PATH',
'ENOENT',
'spawn',
'Invalid API key',
'API_KEY_INVALID',
'API key expired',
'API key not valid',
'PERMISSION_DENIED',
'Gemini API error: 400',
'Gemini API error: 401',
'Gemini API error: 403',
// OAuth / subscription-token expiry (Max plan users) — matches SDK
// subprocess error messages when the inherited CLAUDE_CODE_OAUTH_TOKEN
// is no longer valid.
'OAuth token expired',
'token has been revoked',
'Unauthorized',
// Parallel to 'Gemini API error: 401' — catches OpenRouter OAuth failures.
'OpenRouter API error: 401',
'OpenRouter API error: 403',
'FOREIGN KEY constraint failed',
];
/**
* Returns true if the given error message matches any unrecoverable pattern.
* Accepts an empty / non-string input and returns false in that case.
*/
export function isUnrecoverableError(errorMessage: string | null | undefined): boolean {
if (!errorMessage) return false;
return UNRECOVERABLE_ERROR_PATTERNS.some(pattern => errorMessage.includes(pattern));
}