mirror of
https://github.com/thedotmack/claude-mem
synced 2026-04-25 17:15:04 +02:00
fix: stop Windows Max-plan drain loop — terminate session on restart-guard trip, tighten RestartGuard, catch OAuth expiry
Discord report: Windows 11 user's Max-plan usage was being consumed in the background. Root cause is a regression introduced alongside the v12.3.3 RestartGuard: (1) when the guard trips in the HTTP-layer crash-recovery path, pending messages were left in 'pending' state and auto-replayed by processPendingQueues on the next daemon start; (2) any single successful message cleared the restart-window decay, so a 9-of-10 failure regime never tripped the guard; (3) OAuth-token expiry errors are not in the unrecoverable list and loop forever on Max-plan subscriptions. SessionRoutes.ts: on restart-guard trip, abandon pending messages and remove the session (mirrors WorkerService.terminateSession which is private). Pattern copied from the sibling block at SessionRoutes.ts:123. RestartGuard.ts: require 5 consecutive recordSuccess() calls before the window-decay path becomes eligible; add a terminal lifetime cap of 50 total restarts that never resets. Public API unchanged. worker-service.ts: extract unrecoverablePatterns into a testable helper module and add OAuth/OpenRouter-auth patterns: 'OAuth token expired', 'token has been revoked', 'Unauthorized', 'OpenRouter API error: 401', 'OpenRouter API error: 403'. Bare '401' deliberately not added (too broad). Investigation and phased plan docs included for reviewers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
163
INVESTIGATION-windows-max-plan-drain.md
Normal file
163
INVESTIGATION-windows-max-plan-drain.md
Normal file
@@ -0,0 +1,163 @@
|
||||
# Investigation: Windows 11 Max-Plan Usage Drain (v12.3.3+ regression)
|
||||
|
||||
**Branch:** `investigate/windows-infinite-loop-usage-drain`
|
||||
**Date:** 2026-04-20
|
||||
**Reporter:** Discord user (Win11 + Claude Code CLI + claude-mem latest)
|
||||
**Symptoms:**
|
||||
- "Going into a loop with Claude"
|
||||
- "Consuming entire Max plan usage in the background"
|
||||
- "Continuous failed Python hooks" in the morning after updating
|
||||
- Issue stops only after uninstalling claude-mem
|
||||
|
||||
## TL;DR
|
||||
|
||||
Between v12.3.2 (last "safe") and v12.3.7 (current), the flat 3-restart cap on
|
||||
SDK generator crashes was replaced with a time-windowed `RestartGuard` that
|
||||
**resets its decay window on any successful message**. On Windows, where the
|
||||
MCP loopback and Claude-executable resolution have additional failure modes
|
||||
(observation 71051 — *"MCP loopback failure causes 91.6% session failure rate"*),
|
||||
the worker enters a slow-drip crash loop that never trips the new guard, and
|
||||
burns the user's Max-plan OAuth token on each restart's Claude Agent SDK call.
|
||||
|
||||
## Root-cause chain
|
||||
|
||||
1. **Auth path** — `src/shared/EnvManager.ts:215` builds an isolated env for
|
||||
every SDK subprocess spawn. When the user has no `ANTHROPIC_API_KEY` in
|
||||
`~/.claude-mem/.env`, line 255 passes through `CLAUDE_CODE_OAUTH_TOKEN` from
|
||||
the parent Claude Code session. **Every worker-driven Claude call is billed
|
||||
against the user's Max subscription.** That is by design, but it's the
|
||||
blast-radius multiplier for every other bug in the chain.
|
||||
|
||||
2. **Automatic replay on every worker start** —
|
||||
`src/services/worker-service.ts:592` calls `processPendingQueues(50)` during
|
||||
worker initialization. It re-spawns an SDK subprocess for every session
|
||||
with `status IN ('pending', 'processing')` messages
|
||||
(`PendingMessageStore.getSessionsWithPendingMessages()`, line 447). A single
|
||||
accumulated backlog from a previous failed run becomes a fresh storm of
|
||||
Claude calls on the next daemon restart.
|
||||
|
||||
3. **RestartGuard is too permissive for slow-drip failures** —
|
||||
`src/services/worker/RestartGuard.ts`:
|
||||
- `MAX_WINDOWED_RESTARTS = 10` restarts per 60 s
|
||||
- `DECAY_AFTER_SUCCESS_MS = 5 min` — on *any* `recordSuccess()` call, the
|
||||
restart timestamp array is wiped
|
||||
- `ResponseProcessor.ts:211` calls `recordSuccess()` after *any* batch where
|
||||
messages were confirmed
|
||||
If 1-of-11 SDK invocations succeeds, the window never fills, the decay
|
||||
clears history, and restarts continue indefinitely. At the observed 91.6 %
|
||||
MCP failure rate this is exactly the regime the user is in.
|
||||
|
||||
4. **Two divergent crash-recovery paths** —
|
||||
- `src/services/worker-service.ts:822-857` — on restart-guard trip calls
|
||||
`terminateSession()` which calls `pendingStore.markAllSessionMessagesAbandoned()`
|
||||
(PendingMessageStore.ts:293). Correct behavior.
|
||||
- `src/services/worker/http/routes/SessionRoutes.ts:318-330` — on restart-
|
||||
guard trip only `session.abortController.abort()`. **Messages remain in
|
||||
`pending` state** (explicitly acknowledged in the log message on line 325).
|
||||
Next worker startup's `processPendingQueues()` grabs them again, starting
|
||||
the loop over.
|
||||
|
||||
5. **Exponential-backoff ceiling amplifies damage** —
|
||||
`SessionRoutes.ts:348` and `worker-service.ts` both cap backoff at 8 s after
|
||||
4+ restarts. Steady state is ~7 restarts/minute ≈ 10 000 SDK invocations/day
|
||||
before the guard trips — if it ever does (see #3).
|
||||
|
||||
6. **OAuth-expiry has no special handling** — the `unrecoverablePatterns` list
|
||||
(`worker-service.ts:713-727`) matches on `'Invalid API key'`, `'API_KEY_INVALID'`,
|
||||
`'API key expired'`, `'API key not valid'`. None of these match OAuth-token
|
||||
failures. An expired/revoked `CLAUDE_CODE_OAUTH_TOKEN` produces errors that
|
||||
the worker treats as transient and retries. Observation 55605 records PR
|
||||
#1180 as a prior "OAuth Token Expiry Infinite Retry Loop" fix — the same
|
||||
class of bug has re-surfaced against a new token type.
|
||||
|
||||
## "Failed Python hooks"
|
||||
|
||||
The user's wording is a misattribution. claude-mem's hooks are TypeScript
|
||||
compiled to `plugin/scripts/*-hook.cjs` and run via Bun/Node. However:
|
||||
|
||||
- `uv` (Python toolchain) is installed for ChromaDB, and the `ChromaMcpManager`
|
||||
spawns a Python process for vector sync.
|
||||
- When Chroma sync fails, errors surface in the hook's stderr alongside the
|
||||
real Bun/Node hook failure.
|
||||
- On Windows the message the user most likely saw was the Chroma uv/Python
|
||||
subprocess failing, conflated with the hook wrapper's own failure output.
|
||||
|
||||
It is **not** that hooks themselves are Python. The underlying bug is the
|
||||
worker's SDK retry loop.
|
||||
|
||||
## Windows-specific amplifiers
|
||||
|
||||
- `SDKAgent.ts:466-473` — `where claude.cmd` resolution is tried first on Win32.
|
||||
Any environment where `PATHEXT` or `where` behaves oddly (mingw, Git Bash
|
||||
with stripped PATH) returns an error that is caught silently and falls
|
||||
through to the generic "Claude executable not found". That string IS in
|
||||
`unrecoverablePatterns`, so it should abort cleanly — but only if the SDK
|
||||
surfaces it. In practice, transient subprocess spawn races on Windows surface
|
||||
as generic errors that **don't** match the list, and then restart.
|
||||
- `worker-spawner.ts:39` — the Windows spawn-cooldown lock (2 min) only
|
||||
suppresses *daemon* spawn attempts, not the SDK subprocess spawn-storm
|
||||
described here.
|
||||
- `hook-constants.ts:30-34` — Windows gets a hook-timeout multiplier. Longer
|
||||
hook windows = more time for the crash loop to run per session.
|
||||
|
||||
## What changed vs v12.3.2
|
||||
|
||||
```
|
||||
src/services/worker/RestartGuard.ts | 70 ++++++ (NEW)
|
||||
src/services/worker/http/routes/SessionRoutes.ts | 24 +/- (MAX=3 → RestartGuard)
|
||||
src/services/worker-service.ts | 28 +/- (MAX=3 → RestartGuard)
|
||||
src/services/sqlite/PendingMessageStore.ts | 19 ++ (clearFailed → clearFailedOlderThan(1h))
|
||||
```
|
||||
|
||||
The regression is squarely in the restart-guard swap. The old flat counter
|
||||
would have tripped after 3 crashes and stopped the SDK spawn, regardless of
|
||||
whether anything eventually succeeded.
|
||||
|
||||
## Recommended fixes (in priority order)
|
||||
|
||||
1. **SessionRoutes.ts restart-guard trip must call `terminateSession`** (or
|
||||
`markAllSessionMessagesAbandoned`) — mirror the behavior in
|
||||
`worker-service.ts:837`. Today it explicitly leaves messages pending, which
|
||||
guarantees re-replay on daemon restart.
|
||||
|
||||
2. **Tighten RestartGuard decay** — either
|
||||
- require N consecutive successes before decay, not a single one, or
|
||||
- track a separate failure rate; if fail-rate > 50 % over the window, trip
|
||||
regardless of `recordSuccess()` calls.
|
||||
|
||||
3. **Add OAuth-expiry to `unrecoverablePatterns`** — common SDK error strings
|
||||
from expired OAuth tokens (`Unauthorized`, `OAuth token expired`,
|
||||
`token has been revoked`, 401 responses) should be treated the same as
|
||||
`'Invalid API key'`.
|
||||
|
||||
4. **Cap absolute restart count per session-lifetime** — RestartGuard caps per
|
||||
window but has no absolute ceiling. A hard cap (e.g. 50 restarts regardless
|
||||
of window) protects users from the decay-loop regime.
|
||||
|
||||
5. **Kill-switch** — a `CLAUDE_MEM_PAUSE_WORKER` setting the user can flip
|
||||
without uninstalling, so the next Discord user isn't forced to uninstall
|
||||
to stop the bleeding. Hook entrypoints would short-circuit if set.
|
||||
|
||||
6. **Telemetry on worker startup** — emit a warning if
|
||||
`processPendingQueues()` finds > N orphaned sessions or > M orphaned
|
||||
messages. Today the auto-recovery is silent for backlogs of any size.
|
||||
|
||||
## Files to touch for the fix
|
||||
|
||||
- `src/services/worker/http/routes/SessionRoutes.ts:318-330` — call
|
||||
`terminateSession` instead of bare `abort()`.
|
||||
- `src/services/worker/RestartGuard.ts` — stricter decay semantics + absolute
|
||||
cap.
|
||||
- `src/services/worker-service.ts:713-727` — extend `unrecoverablePatterns`.
|
||||
- `src/shared/SettingsDefaultsManager.ts` — `CLAUDE_MEM_PAUSE_WORKER` flag.
|
||||
|
||||
## What the user should do *right now*
|
||||
|
||||
Until a fix ships, the Discord user's mitigation (uninstall) is correct. As
|
||||
a less-drastic workaround:
|
||||
|
||||
1. Stop the worker: `curl -X POST http://localhost:37777/api/shutdown` (or
|
||||
kill the `bun` process in Task Manager).
|
||||
2. Delete `~/.claude-mem/claude-mem.db-wal` and empty the `pending_messages`
|
||||
table via sqlite3 to break any stored replay queue.
|
||||
3. Remove the plugin from `~/.claude.json` until the fix ships.
|
||||
413
PLAN-windows-max-plan-drain-fix.md
Normal file
413
PLAN-windows-max-plan-drain-fix.md
Normal file
@@ -0,0 +1,413 @@
|
||||
# Plan: Fix Windows Max-Plan Drain (v12.3.3+ regression)
|
||||
|
||||
Context: `INVESTIGATION-windows-max-plan-drain.md` in repo root. Branch:
|
||||
`investigate/windows-infinite-loop-usage-drain`. Test runner: `bun test`.
|
||||
|
||||
## Phase 0 — Documentation Discovery (COMPLETED)
|
||||
|
||||
Findings consolidated from three discovery subagents:
|
||||
|
||||
### Codebase APIs available (verified)
|
||||
|
||||
- **`SessionRoutes` already has** `this.sessionManager`, `this.eventBroadcaster`,
|
||||
`this.workerService` (`src/services/worker/http/routes/SessionRoutes.ts:34-49`).
|
||||
- **`WorkerService.terminateSession` is `private`** (`worker-service.ts:964`) —
|
||||
cannot call from SessionRoutes. Must replicate the three-step pattern
|
||||
inline. Reference implementation: `worker-service.ts:964-976`.
|
||||
- **Abandonment pattern already used in SessionRoutes** at
|
||||
`SessionRoutes.ts:123-125` — `pendingStore.markAllSessionMessagesAbandoned()`
|
||||
+ `sessionManager.removeSessionImmediate()`. Copy this pattern.
|
||||
- **`SessionManager.removeSessionImmediate`** is public
|
||||
(`SessionManager.ts:453-468`) and fires `onSessionDeletedCallback` →
|
||||
`broadcastProcessingStatus()` automatically.
|
||||
- **`SessionEventBroadcaster.broadcastSessionCompleted`** exists
|
||||
(`events/SessionEventBroadcaster.ts:73-82`) — `worker-service.ts:951`
|
||||
does call it after `terminateSession`, so mirror that for parity.
|
||||
|
||||
### `RestartGuard` facts
|
||||
|
||||
- Full impl is only 70 lines (`src/services/worker/RestartGuard.ts`).
|
||||
- **No existing tests** for RestartGuard.
|
||||
- Public surface: `recordRestart(): boolean`, `recordSuccess(): void`,
|
||||
getters `restartsInWindow`, `windowMs`, `maxRestarts`.
|
||||
- Call sites (must remain compatible):
|
||||
- `SessionRoutes.ts:314-315, 322-324, 336-337`
|
||||
- `worker-service.ts:824-825, 832-834, 854`
|
||||
- `ResponseProcessor.ts:211` — `recordSuccess()` called only after a batch is
|
||||
confirmed to storage.
|
||||
- Field: `ActiveSession.restartGuard?: RestartGuard` (`worker-types.ts:39`).
|
||||
|
||||
### `unrecoverablePatterns` facts
|
||||
|
||||
- Location: `worker-service.ts:713-727`. Match is
|
||||
`unrecoverablePatterns.some(p => errorMessage.includes(p))`
|
||||
— simple substring on `(error as Error)?.message`.
|
||||
- **DO NOT add bare `'401'`** — matches request IDs, log lines, etc. Use
|
||||
agent-prefixed forms instead.
|
||||
- Gap found: **OpenRouter 401 is not currently caught**
|
||||
(`OpenRouterAgent.ts:458`). Mirror the Gemini pattern for consistency.
|
||||
- Existing test file that covers transient/terminal classification:
|
||||
`tests/worker/agents/fallback-error-handler.test.ts` (line 75-77 asserts
|
||||
`'401 Unauthorized'` is terminal). No direct tests exist for
|
||||
`unrecoverablePatterns` — add them.
|
||||
|
||||
### Anti-patterns to avoid
|
||||
|
||||
- ❌ Making `terminateSession` public and calling it from SessionRoutes — couples
|
||||
the HTTP layer to WorkerService internals further; the three-step pattern is
|
||||
already public API surface.
|
||||
- ❌ Adding bare `'401'` to `unrecoverablePatterns` — too broad.
|
||||
- ❌ Refactoring RestartGuard public API — call sites depend on getters.
|
||||
- ❌ Creating new test framework config — project already uses `bun test`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Fix SessionRoutes restart-guard trip (stops the drip)
|
||||
|
||||
**Goal:** When the windowed restart guard trips in the HTTP-layer crash-recovery
|
||||
path, abandon pending messages instead of leaving them in `'pending'` state, so
|
||||
the next worker startup's `processPendingQueues()` does not replay them.
|
||||
|
||||
**Target file:** `src/services/worker/http/routes/SessionRoutes.ts`
|
||||
|
||||
**Doc reference (pattern to copy, verbatim except for the reason string):**
|
||||
|
||||
1. `SessionRoutes.ts:123-125` — existing abandonment pattern in the same class.
|
||||
2. `worker-service.ts:964-976` — the private `terminateSession` body to
|
||||
replicate (logger message + reason + broadcastSessionCompleted call).
|
||||
3. `worker-service.ts:951` — confirms `broadcastSessionCompleted` is called
|
||||
after abandonment in the sibling code path.
|
||||
|
||||
### Task 1.1 — Replace `abort()`-only trip with full abandonment
|
||||
|
||||
At `SessionRoutes.ts:326-329`, replace:
|
||||
|
||||
```ts
|
||||
// Don't restart - abort to prevent further API calls
|
||||
session.abortController.abort();
|
||||
return;
|
||||
```
|
||||
|
||||
With (copy-adapt from the pattern at SessionRoutes.ts:123-125 + the
|
||||
log shape from worker-service.ts:968-972):
|
||||
|
||||
```ts
|
||||
// Restart guard tripped — abandon pending messages so the next worker
|
||||
// startup's processPendingQueues() does NOT replay them, and kill the
|
||||
// subprocess. Mirrors WorkerService.terminateSession() semantics.
|
||||
// (Cannot call terminateSession directly — it's private in WorkerService.)
|
||||
session.abortController.abort();
|
||||
const pendingStore = this.sessionManager.getPendingMessageStore();
|
||||
const abandoned = pendingStore.markAllSessionMessagesAbandoned(sessionDbId);
|
||||
logger.info('SYSTEM', 'Session terminated', {
|
||||
sessionId: sessionDbId,
|
||||
reason: 'max_restarts_exceeded',
|
||||
abandonedMessages: abandoned
|
||||
});
|
||||
this.sessionManager.removeSessionImmediate(sessionDbId);
|
||||
this.eventBroadcaster.broadcastSessionCompleted(sessionDbId);
|
||||
return;
|
||||
```
|
||||
|
||||
Also update the log at `SessionRoutes.ts:319-326`: change the `action:` string
|
||||
from `'Messages remain in pending state.'` to
|
||||
`'Pending messages abandoned to prevent replay loop.'`.
|
||||
|
||||
### Verification
|
||||
|
||||
- `grep -n 'Messages remain in pending state' src/` → no matches.
|
||||
- `grep -n 'markAllSessionMessagesAbandoned' src/services/worker/http/routes/SessionRoutes.ts`
|
||||
→ should now find TWO call sites (existing L124 + new one in restart-trip).
|
||||
- Manual trace: after trip, `pending_messages.status` for that session is
|
||||
`'failed'` (or whatever `markAllSessionMessagesAbandoned` sets it to —
|
||||
confirm by reading `PendingMessageStore.ts:293`).
|
||||
|
||||
### Anti-pattern guard
|
||||
|
||||
- Don't make `terminateSession` public. Replicate inline — the three calls are
|
||||
already public API.
|
||||
- Don't skip `abortController.abort()` before abandonment — the SDK subprocess
|
||||
must be killed or it will keep consuming Claude.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — Tighten RestartGuard (stops the loop from getting far enough to drip)
|
||||
|
||||
**Goal:** Require N=5 consecutive `recordSuccess()` calls before clearing the
|
||||
restart history, and add an absolute lifetime cap of 50 restarts that, once
|
||||
tripped, never resets.
|
||||
|
||||
**Target file:** `src/services/worker/RestartGuard.ts`
|
||||
|
||||
**Doc reference:**
|
||||
|
||||
- Current impl quoted in full in Phase 0. Consumers at:
|
||||
- `SessionRoutes.ts:314-315, 322-324, 336-337`
|
||||
- `worker-service.ts:824-825, 832-834, 854`
|
||||
- `ResponseProcessor.ts:211`
|
||||
- Public surface must remain compatible: `recordRestart(): boolean`,
|
||||
`recordSuccess(): void`, getters `restartsInWindow`, `windowMs`,
|
||||
`maxRestarts`.
|
||||
|
||||
### Task 2.1 — Add constants and private fields
|
||||
|
||||
Add to top of file:
|
||||
|
||||
```ts
|
||||
const REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY = 5;
|
||||
const ABSOLUTE_LIFETIME_RESTART_CAP = 50;
|
||||
```
|
||||
|
||||
Add to class:
|
||||
|
||||
```ts
|
||||
private consecutiveSuccessCount = 0;
|
||||
private totalRestartsAllTime = 0;
|
||||
```
|
||||
|
||||
### Task 2.2 — Rewrite `recordRestart()` semantics
|
||||
|
||||
Behavior:
|
||||
|
||||
1. Increment `totalRestartsAllTime`.
|
||||
2. If `totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP`, return `false`
|
||||
immediately (never resets — terminal).
|
||||
3. Any restart resets `consecutiveSuccessCount` to 0 (a failure interrupts the
|
||||
success streak).
|
||||
4. Existing decay check: only fire if
|
||||
`lastSuccessfulProcessing !== null` **AND** we had
|
||||
`consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY` the
|
||||
last time `recordSuccess` was called **AND** 5 min elapsed. To make the
|
||||
logic clean: only gate decay on an existing flag `decayEligible` that is
|
||||
set to `true` by `recordSuccess` once the streak threshold is reached.
|
||||
5. Otherwise preserve current window-based logic.
|
||||
|
||||
Simplest structure:
|
||||
|
||||
```ts
|
||||
recordRestart(): boolean {
|
||||
this.totalRestartsAllTime += 1;
|
||||
this.consecutiveSuccessCount = 0; // streak broken
|
||||
if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) {
|
||||
return false; // terminal — lifetime cap reached
|
||||
}
|
||||
|
||||
const now = Date.now();
|
||||
if (this.decayEligible
|
||||
&& this.lastSuccessfulProcessing !== null
|
||||
&& now - this.lastSuccessfulProcessing >= DECAY_AFTER_SUCCESS_MS) {
|
||||
this.restartTimestamps = [];
|
||||
this.lastSuccessfulProcessing = null;
|
||||
this.decayEligible = false;
|
||||
}
|
||||
|
||||
this.restartTimestamps = this.restartTimestamps.filter(
|
||||
ts => now - ts < RESTART_WINDOW_MS
|
||||
);
|
||||
this.restartTimestamps.push(now);
|
||||
return this.restartTimestamps.length <= MAX_WINDOWED_RESTARTS;
|
||||
}
|
||||
```
|
||||
|
||||
### Task 2.3 — Update `recordSuccess()` to require N consecutive calls
|
||||
|
||||
```ts
|
||||
recordSuccess(): void {
|
||||
this.consecutiveSuccessCount += 1;
|
||||
this.lastSuccessfulProcessing = Date.now();
|
||||
if (this.consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY) {
|
||||
this.decayEligible = true;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Add private field `decayEligible = false`.
|
||||
|
||||
### Task 2.4 — Add introspection getters for logging (optional but useful)
|
||||
|
||||
```ts
|
||||
get totalRestarts(): number { return this.totalRestartsAllTime; }
|
||||
get lifetimeCap(): number { return ABSOLUTE_LIFETIME_RESTART_CAP; }
|
||||
```
|
||||
|
||||
### Task 2.5 — Update log statements at trip sites to include new fields
|
||||
|
||||
- `SessionRoutes.ts:319-326`: add `totalRestarts: session.restartGuard.totalRestarts`
|
||||
and `lifetimeCap: session.restartGuard.lifetimeCap` to the error log payload.
|
||||
- `worker-service.ts:828-835`: same additions.
|
||||
|
||||
### Verification
|
||||
|
||||
- `bun test tests/worker/RestartGuard.test.ts` (new file — see Phase 4).
|
||||
- Behavior check: 49 failed restarts → still allowed; 50th → blocked; 51st →
|
||||
blocked (lifetime cap persists).
|
||||
- Behavior check: 4 successes then restart → full window still counted.
|
||||
5th success then restart 5 min later → window cleared.
|
||||
|
||||
### Anti-pattern guard
|
||||
|
||||
- Don't make `recordSuccess` reset `totalRestartsAllTime` — the lifetime cap is
|
||||
meant to be terminal.
|
||||
- Don't silently change the meaning of the existing getters — add new ones.
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 — Extend `unrecoverablePatterns` (handle OAuth expiry)
|
||||
|
||||
**Goal:** When Max-plan OAuth tokens expire/are revoked, treat the SDK error as
|
||||
unrecoverable so no restart is attempted.
|
||||
|
||||
**Target file:** `src/services/worker-service.ts:713-727`
|
||||
|
||||
**Doc reference:**
|
||||
|
||||
- Current array + match logic quoted in Phase 0 discovery.
|
||||
- `OpenRouterAgent.ts:458` — error throw format is
|
||||
`` `OpenRouter API error: ${response.status} - ${errorText}` ``.
|
||||
- `tests/worker/agents/fallback-error-handler.test.ts:75-77` — existing
|
||||
assertion `'401 Unauthorized'` is terminal, confirming it is safe to treat
|
||||
as unrecoverable.
|
||||
|
||||
### Task 3.1 — Add OAuth / OpenRouter auth strings
|
||||
|
||||
Add these entries (in a clear grouping comment) to the array at
|
||||
`worker-service.ts:713-727`:
|
||||
|
||||
```ts
|
||||
// OAuth / subscription-token expiry (Max plan users) — matches SDK
|
||||
// subprocess error messages when the inherited CLAUDE_CODE_OAUTH_TOKEN
|
||||
// is no longer valid.
|
||||
'OAuth token expired',
|
||||
'token has been revoked',
|
||||
'Unauthorized',
|
||||
// Parallel to 'Gemini API error: 401' — catches OpenRouter OAuth failures.
|
||||
'OpenRouter API error: 401',
|
||||
'OpenRouter API error: 403',
|
||||
```
|
||||
|
||||
**Do not add bare `'401'`** — too broad. Anti-pattern confirmed by discovery.
|
||||
|
||||
### Verification
|
||||
|
||||
- `grep -n "'Unauthorized'" src/services/worker-service.ts` → should match.
|
||||
- Run `bun test tests/worker/worker-service-unrecoverable.test.ts` (new file —
|
||||
see Phase 4) — assertions cover each new pattern.
|
||||
- Run `bun test tests/worker/agents/fallback-error-handler.test.ts` — must
|
||||
still pass (no regressions).
|
||||
|
||||
### Anti-pattern guard
|
||||
|
||||
- No regex or `startsWith` — keep substring match semantics to avoid breaking
|
||||
existing entries.
|
||||
- Don't unpack `error.code` or nested JSON bodies in this phase — discovery
|
||||
confirmed all current agent error throws embed the status/message in
|
||||
`.message`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — Tests (no tests existed for RestartGuard or unrecoverablePatterns)
|
||||
|
||||
**Goal:** Prevent regression on all three fixes.
|
||||
|
||||
**Test framework:** `bun test` (see `package.json` scripts).
|
||||
|
||||
### Task 4.1 — Create `tests/worker/RestartGuard.test.ts`
|
||||
|
||||
Test suites (doc: `src/services/worker/RestartGuard.ts` as modified in Phase 2):
|
||||
|
||||
- `recordRestart respects window`: push 10 restarts in <60s → allowed; 11th →
|
||||
blocked.
|
||||
- `recordSuccess requires N consecutive before decay`: 4 successes then
|
||||
restart 6 min later → window still populated; 5 successes then restart 6 min
|
||||
later → window cleared.
|
||||
- `restart breaks success streak`: 3 successes → 1 restart → 4 more successes
|
||||
(total 7 successes with gap) → streak counter is 4 at end, not 7 (decay
|
||||
NOT yet eligible).
|
||||
- `lifetime cap is terminal`: 50 restarts → OK; 51st → blocked even if window
|
||||
is empty; recordSuccess cannot un-block it.
|
||||
- `getters return expected values`: `totalRestarts`, `lifetimeCap`,
|
||||
`restartsInWindow`, `maxRestarts`, `windowMs`.
|
||||
|
||||
Use `Date.now` mocking via Bun's `spyOn(Date, 'now')` or an injected clock.
|
||||
Check project-local patterns first — if other tests in `tests/worker/` already
|
||||
use a specific clock helper, copy that pattern. Otherwise straight `spyOn`.
|
||||
|
||||
### Task 4.2 — Create `tests/worker/worker-service-unrecoverable.test.ts`
|
||||
|
||||
Focused unit test on the `.some(p => errorMessage.includes(p))` matcher for
|
||||
each new pattern. If extracting `unrecoverablePatterns` into a standalone
|
||||
exported helper makes testing easier, do that as part of this task (small
|
||||
refactor — pull the array + predicate into a named export at the top of
|
||||
`worker-service.ts` and import it in the `.catch` block). Verify:
|
||||
|
||||
- Each of `'OAuth token expired'`, `'token has been revoked'`,
|
||||
`'Unauthorized'`, `'OpenRouter API error: 401'`,
|
||||
`'OpenRouter API error: 403'` matches a realistic error message.
|
||||
- Bare `'401'` by itself (e.g., `"request-id-401xyz"`) does NOT match — this
|
||||
test locks in the decision not to add bare `'401'`.
|
||||
- All PRE-EXISTING patterns still match realistic messages (no regressions).
|
||||
|
||||
### Task 4.3 — Add SessionRoutes restart-trip integration test
|
||||
|
||||
If a test file already exists for `SessionRoutes` in `tests/worker/http/` or
|
||||
`tests/server/`, extend it. Otherwise create
|
||||
`tests/worker/http/SessionRoutes.restart-trip.test.ts`:
|
||||
|
||||
- Set up a session with pending messages and a `RestartGuard` pre-loaded near
|
||||
the window cap.
|
||||
- Trigger the restart path (mock `pendingStore.getPendingCount` > 0).
|
||||
- Assert: `pendingStore.markAllSessionMessagesAbandoned` was called with the
|
||||
session ID; `sessionManager.removeSessionImmediate` was called;
|
||||
`eventBroadcaster.broadcastSessionCompleted` was called.
|
||||
|
||||
If SessionRoutes has no existing test harness, keep scope small: lift the
|
||||
restart-trip branch into a private helper method first, export it for
|
||||
testing, and assert on the helper. Don't build a full HTTP harness for this.
|
||||
|
||||
### Verification
|
||||
|
||||
- `bun test tests/worker/RestartGuard.test.ts` — green.
|
||||
- `bun test tests/worker/worker-service-unrecoverable.test.ts` — green.
|
||||
- `bun test tests/worker/http/SessionRoutes.restart-trip.test.ts` — green.
|
||||
- `bun test` (full suite) — no regressions.
|
||||
|
||||
### Anti-pattern guard
|
||||
|
||||
- Don't stub Claude Agent SDK in these tests — they are pure unit tests.
|
||||
- Don't test behavior that would require spinning up the real worker daemon.
|
||||
|
||||
---
|
||||
|
||||
## Phase 5 — Final Verification
|
||||
|
||||
Run the full validation sequence:
|
||||
|
||||
1. `bun run build` — builds cleanly (no TS errors).
|
||||
2. `bun test` — full suite green.
|
||||
3. `grep -rn 'Messages remain in pending state' src/` — no matches (the phrase
|
||||
is gone from the codebase).
|
||||
4. `grep -n "'OAuth token expired'" src/services/worker-service.ts` — matches.
|
||||
5. `grep -n 'ABSOLUTE_LIFETIME_RESTART_CAP\|REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY' src/services/worker/RestartGuard.ts`
|
||||
— both match.
|
||||
6. Sanity read of the diff — each of the three changes is isolated and local,
|
||||
no refactor creep.
|
||||
7. Update `CHANGELOG.md` — actually, per repo CLAUDE.md the changelog is
|
||||
auto-generated. **Do not edit.**
|
||||
8. Stage & commit on the existing branch
|
||||
`investigate/windows-infinite-loop-usage-drain`. Commit message:
|
||||
`fix: stop Max-plan drain loop on Windows (RestartGuard + SessionRoutes + OAuth patterns)`.
|
||||
|
||||
### Anti-pattern guard
|
||||
|
||||
- Do not touch `CHANGELOG.md` (auto-generated per CLAUDE.md).
|
||||
- Do not bump version here — release flow handles it.
|
||||
- Do not merge to main — the user wants a PR.
|
||||
|
||||
---
|
||||
|
||||
## Execution note
|
||||
|
||||
Each implementation phase (1, 2, 3, 4) is self-contained and cites the exact
|
||||
files/lines discovered in Phase 0. Phases 1, 2, 3 can run in parallel;
|
||||
Phase 4 depends on Phase 2 (RestartGuard API shape) and Phase 3 (optional
|
||||
helper export from worker-service). Phase 5 depends on all prior phases.
|
||||
@@ -29,6 +29,7 @@ import { sanitizeEnv } from '../supervisor/env-sanitizer.js';
|
||||
// transitively pulls in the SQLite database layer via ChromaSync/DatabaseManager.
|
||||
import { ensureWorkerStarted as ensureWorkerStartedShared } from './worker-spawner.js';
|
||||
import { RestartGuard } from './worker/RestartGuard.js';
|
||||
import { isUnrecoverableError } from './worker/unrecoverable-patterns.js';
|
||||
|
||||
// Re-export for backward compatibility — canonical implementation in shared/plugin-state.ts
|
||||
export { isPluginDisabledInClaudeSettings } from '../shared/plugin-state.js';
|
||||
@@ -709,23 +710,10 @@ export class WorkerService {
|
||||
const errorMessage = (error as Error)?.message || '';
|
||||
|
||||
// Detect unrecoverable errors that should NOT trigger restart
|
||||
// These errors will fail immediately on retry, causing infinite loops
|
||||
const unrecoverablePatterns = [
|
||||
'Claude executable not found',
|
||||
'CLAUDE_CODE_PATH',
|
||||
'ENOENT',
|
||||
'spawn',
|
||||
'Invalid API key',
|
||||
'API_KEY_INVALID',
|
||||
'API key expired',
|
||||
'API key not valid',
|
||||
'PERMISSION_DENIED',
|
||||
'Gemini API error: 400',
|
||||
'Gemini API error: 401',
|
||||
'Gemini API error: 403',
|
||||
'FOREIGN KEY constraint failed',
|
||||
];
|
||||
if (unrecoverablePatterns.some(pattern => errorMessage.includes(pattern))) {
|
||||
// These errors will fail immediately on retry, causing infinite loops.
|
||||
// Pattern list lives in ./worker/unrecoverable-patterns.ts so the matcher
|
||||
// is unit-testable without importing this full module.
|
||||
if (isUnrecoverableError(errorMessage)) {
|
||||
hadUnrecoverableError = true;
|
||||
this.lastAiInteraction = {
|
||||
timestamp: Date.now(),
|
||||
@@ -831,7 +819,9 @@ export class WorkerService {
|
||||
pendingCount,
|
||||
restartsInWindow: session.restartGuard.restartsInWindow,
|
||||
windowMs: session.restartGuard.windowMs,
|
||||
maxRestarts: session.restartGuard.maxRestarts
|
||||
maxRestarts: session.restartGuard.maxRestarts,
|
||||
totalRestarts: session.restartGuard.totalRestarts,
|
||||
lifetimeCap: session.restartGuard.lifetimeCap
|
||||
});
|
||||
session.consecutiveRestarts = 0;
|
||||
this.terminateSession(session.sessionDbId, 'max_restarts_exceeded');
|
||||
|
||||
@@ -3,28 +3,51 @@
|
||||
* Prevents tight-loop restarts (bug) while allowing legitimate occasional restarts
|
||||
* over long sessions. Replaces the flat consecutiveRestarts counter that stranded
|
||||
* pending messages after just 3 restarts over any timeframe (#2053).
|
||||
*
|
||||
* Additional guards (Phase 2 of windows-max-plan-drain-fix):
|
||||
* - Decay of windowed history only fires after N consecutive successes, so a
|
||||
* single fluky success cannot clear the runaway-loop detector.
|
||||
* - Absolute lifetime cap: once total restarts cross the cap, the guard trips
|
||||
* permanently and cannot be reset by later successes.
|
||||
*/
|
||||
|
||||
const RESTART_WINDOW_MS = 60_000; // Only count restarts within last 60 seconds
|
||||
const MAX_WINDOWED_RESTARTS = 10; // 10 restarts in 60s = runaway loop
|
||||
const DECAY_AFTER_SUCCESS_MS = 5 * 60_000; // Clear history after 5min of uninterrupted success
|
||||
const REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY = 5;
|
||||
const ABSOLUTE_LIFETIME_RESTART_CAP = 50;
|
||||
|
||||
export class RestartGuard {
|
||||
private restartTimestamps: number[] = [];
|
||||
private lastSuccessfulProcessing: number | null = null;
|
||||
private consecutiveSuccessCount = 0;
|
||||
private totalRestartsAllTime = 0;
|
||||
private decayEligible = false;
|
||||
|
||||
/**
|
||||
* Record a restart and check if the guard should trip.
|
||||
* @returns true if the restart is ALLOWED, false if it should be BLOCKED
|
||||
*/
|
||||
recordRestart(): boolean {
|
||||
this.totalRestartsAllTime += 1;
|
||||
this.consecutiveSuccessCount = 0; // streak broken by any restart
|
||||
|
||||
// Terminal: lifetime cap reached — never resets, even if successes follow.
|
||||
if (this.totalRestartsAllTime > ABSOLUTE_LIFETIME_RESTART_CAP) {
|
||||
return false;
|
||||
}
|
||||
|
||||
const now = Date.now();
|
||||
|
||||
// Decay: clear history only after real success + 5min of uninterrupted success
|
||||
if (this.lastSuccessfulProcessing !== null
|
||||
// Decay: only fires if we accumulated the required consecutive successes
|
||||
// AND 5min has elapsed since the last success. One-off successes cannot
|
||||
// clear the windowed-restart history.
|
||||
if (this.decayEligible
|
||||
&& this.lastSuccessfulProcessing !== null
|
||||
&& now - this.lastSuccessfulProcessing >= DECAY_AFTER_SUCCESS_MS) {
|
||||
this.restartTimestamps = [];
|
||||
this.lastSuccessfulProcessing = null;
|
||||
this.decayEligible = false;
|
||||
}
|
||||
|
||||
// Prune old timestamps outside the window
|
||||
@@ -41,9 +64,15 @@ export class RestartGuard {
|
||||
|
||||
/**
|
||||
* Call when a message is successfully processed to update the success timestamp.
|
||||
* Requires REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY consecutive calls before
|
||||
* the restart-window decay path can fire.
|
||||
*/
|
||||
recordSuccess(): void {
|
||||
this.consecutiveSuccessCount += 1;
|
||||
this.lastSuccessfulProcessing = Date.now();
|
||||
if (this.consecutiveSuccessCount >= REQUIRED_CONSECUTIVE_SUCCESSES_FOR_DECAY) {
|
||||
this.decayEligible = true;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
@@ -67,4 +96,19 @@ export class RestartGuard {
|
||||
get maxRestarts(): number {
|
||||
return MAX_WINDOWED_RESTARTS;
|
||||
}
|
||||
|
||||
/**
|
||||
* Total restarts counted for the lifetime of this guard (for logging).
|
||||
* Never decreases; compared against `lifetimeCap` to decide terminal trip.
|
||||
*/
|
||||
get totalRestarts(): number {
|
||||
return this.totalRestartsAllTime;
|
||||
}
|
||||
|
||||
/**
|
||||
* Absolute lifetime restart cap (for logging).
|
||||
*/
|
||||
get lifetimeCap(): number {
|
||||
return ABSOLUTE_LIFETIME_RESTART_CAP;
|
||||
}
|
||||
}
|
||||
|
||||
@@ -97,6 +97,49 @@ export class SessionRoutes extends BaseRouteHandler {
|
||||
private static readonly STALE_GENERATOR_THRESHOLD_MS = 30_000; // 30 seconds (#1099)
|
||||
private static readonly MAX_SESSION_WALL_CLOCK_MS = 4 * 60 * 60 * 1000; // 4 hours (#1590)
|
||||
|
||||
/**
|
||||
* Handle the restart-guard-tripped branch of crash recovery.
|
||||
*
|
||||
* When the windowed restart guard (or lifetime cap) trips, we must:
|
||||
* 1. abort the SDK subprocess so it stops consuming Claude
|
||||
* 2. mark pending messages abandoned so the next worker startup's
|
||||
* processPendingQueues() does NOT replay them (drain loop fix)
|
||||
* 3. remove the in-memory session and broadcast completion so the UI
|
||||
* reflects the terminal state
|
||||
*
|
||||
* Mirrors WorkerService.terminateSession() semantics — cannot call that
|
||||
* method directly because it's private. Extracted into a helper so it's
|
||||
* unit-testable without standing up a full HTTP harness (see
|
||||
* tests/worker/http/SessionRoutes.restart-trip.test.ts).
|
||||
*/
|
||||
private handleRestartGuardTripped(
|
||||
sessionDbId: number,
|
||||
pendingCount: number,
|
||||
session: { abortController: AbortController; restartGuard?: RestartGuard }
|
||||
): void {
|
||||
const restartGuard = session.restartGuard;
|
||||
logger.error('SESSION', `CRITICAL: Restart guard tripped — too many restarts in window, stopping to prevent runaway costs`, {
|
||||
sessionId: sessionDbId,
|
||||
pendingCount,
|
||||
restartsInWindow: restartGuard?.restartsInWindow,
|
||||
windowMs: restartGuard?.windowMs,
|
||||
maxRestarts: restartGuard?.maxRestarts,
|
||||
totalRestarts: restartGuard?.totalRestarts,
|
||||
lifetimeCap: restartGuard?.lifetimeCap,
|
||||
action: 'Pending messages abandoned to prevent replay loop.'
|
||||
});
|
||||
session.abortController.abort();
|
||||
const pendingStore = this.sessionManager.getPendingMessageStore();
|
||||
const abandoned = pendingStore.markAllSessionMessagesAbandoned(sessionDbId);
|
||||
logger.info('SYSTEM', 'Session terminated', {
|
||||
sessionId: sessionDbId,
|
||||
reason: 'max_restarts_exceeded',
|
||||
abandonedMessages: abandoned
|
||||
});
|
||||
this.sessionManager.removeSessionImmediate(sessionDbId);
|
||||
this.eventBroadcaster.broadcastSessionCompleted(sessionDbId);
|
||||
}
|
||||
|
||||
private ensureGeneratorRunning(sessionDbId: number, source: string): void {
|
||||
const session = this.sessionManager.getSession(sessionDbId);
|
||||
if (!session) return;
|
||||
@@ -316,16 +359,7 @@ export class SessionRoutes extends BaseRouteHandler {
|
||||
session.consecutiveRestarts = (session.consecutiveRestarts || 0) + 1; // Keep for logging
|
||||
|
||||
if (!restartAllowed) {
|
||||
logger.error('SESSION', `CRITICAL: Restart guard tripped — too many restarts in window, stopping to prevent runaway costs`, {
|
||||
sessionId: sessionDbId,
|
||||
pendingCount,
|
||||
restartsInWindow: session.restartGuard.restartsInWindow,
|
||||
windowMs: session.restartGuard.windowMs,
|
||||
maxRestarts: session.restartGuard.maxRestarts,
|
||||
action: 'Generator will NOT restart. Check logs for root cause. Messages remain in pending state.'
|
||||
});
|
||||
// Don't restart - abort to prevent further API calls
|
||||
session.abortController.abort();
|
||||
this.handleRestartGuardTripped(sessionDbId, pendingCount, session);
|
||||
return;
|
||||
}
|
||||
|
||||
@@ -334,7 +368,9 @@ export class SessionRoutes extends BaseRouteHandler {
|
||||
pendingCount,
|
||||
consecutiveRestarts: session.consecutiveRestarts,
|
||||
restartsInWindow: session.restartGuard!.restartsInWindow,
|
||||
maxRestarts: session.restartGuard!.maxRestarts
|
||||
maxRestarts: session.restartGuard!.maxRestarts,
|
||||
totalRestarts: session.restartGuard!.totalRestarts,
|
||||
lifetimeCap: session.restartGuard!.lifetimeCap
|
||||
});
|
||||
|
||||
// Abort OLD controller before replacing to prevent child process leaks
|
||||
|
||||
45
src/services/worker/unrecoverable-patterns.ts
Normal file
45
src/services/worker/unrecoverable-patterns.ts
Normal file
@@ -0,0 +1,45 @@
|
||||
/**
|
||||
* Unrecoverable error patterns
|
||||
*
|
||||
* Error messages matching any of these substrings should NOT trigger a restart
|
||||
* of the SDK generator. Retrying them will fail identically and drain API
|
||||
* budget / Max-plan usage (see windows-max-plan-drain-fix plan, Phase 3).
|
||||
*
|
||||
* Matching is a plain substring check via `String.prototype.includes`, so
|
||||
* patterns must be specific enough to avoid false positives (e.g. bare `'401'`
|
||||
* would match request IDs and is forbidden — use agent-prefixed forms).
|
||||
*/
|
||||
|
||||
export const UNRECOVERABLE_ERROR_PATTERNS: readonly string[] = [
|
||||
'Claude executable not found',
|
||||
'CLAUDE_CODE_PATH',
|
||||
'ENOENT',
|
||||
'spawn',
|
||||
'Invalid API key',
|
||||
'API_KEY_INVALID',
|
||||
'API key expired',
|
||||
'API key not valid',
|
||||
'PERMISSION_DENIED',
|
||||
'Gemini API error: 400',
|
||||
'Gemini API error: 401',
|
||||
'Gemini API error: 403',
|
||||
// OAuth / subscription-token expiry (Max plan users) — matches SDK
|
||||
// subprocess error messages when the inherited CLAUDE_CODE_OAUTH_TOKEN
|
||||
// is no longer valid.
|
||||
'OAuth token expired',
|
||||
'token has been revoked',
|
||||
'Unauthorized',
|
||||
// Parallel to 'Gemini API error: 401' — catches OpenRouter OAuth failures.
|
||||
'OpenRouter API error: 401',
|
||||
'OpenRouter API error: 403',
|
||||
'FOREIGN KEY constraint failed',
|
||||
];
|
||||
|
||||
/**
|
||||
* Returns true if the given error message matches any unrecoverable pattern.
|
||||
* Accepts an empty / non-string input and returns false in that case.
|
||||
*/
|
||||
export function isUnrecoverableError(errorMessage: string | null | undefined): boolean {
|
||||
if (!errorMessage) return false;
|
||||
return UNRECOVERABLE_ERROR_PATTERNS.some(pattern => errorMessage.includes(pattern));
|
||||
}
|
||||
Reference in New Issue
Block a user