claude-mem/docs/reports/issue-603-worker-daemon-leaks-child-processes.md

Technical Report: Worker Daemon Child Process Leak

Issue: #603 - Bug: worker-service daemon leaks child claude processes
Author: raulk
Created: 2026-01-07
Report Version: 1.0
Severity: Critical
Priority: P0 - Immediate attention required


1. Executive Summary

The worker-service.cjs --daemon process spawns Claude subagent processes via the Claude Agent SDK, and these processes are not terminated when their tasks complete. Over normal usage (6+ hours), orphaned child processes accumulate and consume significant system memory.

Key Findings:

  • 121 orphaned claude processes accumulated over ~6 hours
  • Total memory consumption: ~44GB RSS
  • Average memory per process: ~372MB
  • Root cause: Missing child process cleanup after SDK query completion
  • The issue affects Linux systems and potentially all platforms

Recommendation: Implement explicit child process tracking and cleanup in the SDK agent lifecycle, and add process reaping on generator completion.


2. Problem Analysis

2.1 Observed Behavior

The reporter documented the following scenario:

Parent daemon process (running 7+ hours):

PID     PPID  RSS(KB)  ELAPSED   COMMAND
4118969 1     161656   07:28:16  bun ~/.claude/plugins/cache/thedotmack/claude-mem/9.0.0/scripts/worker-service.cjs --daemon

Sample of leaked children (121 total, all parented to daemon):

PID   PPID    RSS(KB)  ELAPSED   COMMAND
1927  4118969 377308   06:21:16  claude --output-format stream-json --verbose --input-format stream-json --model claude-sonnet-4-5 --disallowedTools Bash,Read,Write,Edit,Grep,Glob,WebFetch,WebSearch,Task,NotebookEdit,AskUserQuestion,TodoWrite --setting-sources --permission-mode default
2834  4118969 384716   06:20:44  claude --output-format stream-json [...]
3988  4118969 381844   06:20:15  claude --output-format stream-json --resume <session-id> [...]
5938  4118969 382816   06:19:37  claude --output-format stream-json --resume <session-id> [...]
11503 4118969 381276   06:16:12  claude --output-format stream-json --resume <session-id> [...]

2.2 Reproduction Steps

  1. Use claude-mem normally throughout a work session
  2. Run: ps -eo pid,ppid,rss,etime --no-headers | awk -v parent="$(pgrep -f worker-service.cjs | head -n 1)" '$2 == parent'
  3. Observe that the number of listed processes grows over time without bound
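
The check in step 2 can also be scripted. The sketch below is a minimal illustration, not part of claude-mem: `summarizeChildren` is a hypothetical helper that takes text in the shape produced by `ps -eo pid,ppid,rss,etime --no-headers` and reports how many processes are parented to a given PID and their combined RSS.

```typescript
// Hypothetical helper (not part of claude-mem): summarizes the children of one
// parent from text shaped like `ps -eo pid,ppid,rss,etime --no-headers` output.
interface ChildSummary {
  count: number;
  totalRssKb: number;
}

function summarizeChildren(psOutput: string, parentPid: number): ChildSummary {
  let count = 0;
  let totalRssKb = 0;
  for (const line of psOutput.split('\n')) {
    const fields = line.trim().split(/\s+/);
    if (fields.length < 3) continue; // skip blank or malformed lines
    const ppid = Number(fields[1]);
    const rssKb = Number(fields[2]);
    if (ppid === parentPid && Number.isFinite(rssKb)) {
      count += 1;
      totalRssKb += rssKb;
    }
  }
  return { count, totalRssKb };
}

// Sample rows taken from the listing in Section 2.1
const sampleRows = [
  '4118969 1       161656 07:28:16',
  '1927    4118969 377308 06:21:16',
  '2834    4118969 384716 06:20:44',
].join('\n');

const sampleSummary = summarizeChildren(sampleRows, 4118969);
console.log(sampleSummary.count, sampleSummary.totalRssKb); // prints: 2 762024
```

Logging this summary at intervals would show whether the child count and total RSS keep climbing.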

2.3 Expected Behavior

Child claude processes should terminate when their task completes, or the daemon should reap them.


3. Technical Details

3.1 Architecture Overview

The claude-mem worker service uses a modular architecture:

WorkerService (worker-service.ts)
    |
    +-- SDKAgent (SDKAgent.ts)
    |       |
    |       +-- query() from @anthropic-ai/claude-agent-sdk
    |               |
    |               +-- Spawns `claude` CLI subprocess
    |
    +-- SessionManager (SessionManager.ts)
    |       |
    |       +-- Manages active sessions
    |       +-- Event-driven message queues
    |
    +-- ProcessManager (ProcessManager.ts)
            |
            +-- Child process enumeration
            +-- Graceful shutdown cleanup

3.2 SDK Agent Child Process Spawning

The SDKAgent.startSession() method invokes the Claude Agent SDK's query() function:

// src/services/worker/SDKAgent.ts (lines 100-114)
const queryResult = query({
  prompt: messageGenerator,
  options: {
    model: modelId,
    ...(hasRealMemorySessionId && session.lastPromptNumber > 1 && { resume: session.memorySessionId }),
    disallowedTools,
    abortController: session.abortController,
    pathToClaudeCodeExecutable: claudePath
  }
});

The query() function internally spawns a claude CLI subprocess with the parameters visible in the leaked process list:

  • --output-format stream-json
  • --verbose
  • --input-format stream-json
  • --model claude-sonnet-4-5
  • --disallowedTools ...
  • --setting-sources
  • --permission-mode default

3.3 Session Lifecycle

Sessions are managed through the following flow:

  1. Initialization: SessionRoutes.handleSessionInit() creates a session and starts a generator
  2. Processing: SDKAgent.startSession() runs the query loop, processing messages from the queue
  3. Completion: Generator promise resolves, triggering cleanup in finally block

The relevant generator lifecycle code in SessionRoutes.ts (lines 137-216):

session.generatorPromise = agent.startSession(session, this.workerService)
  .catch(error => { /* error handling */ })
  .finally(() => {
    session.generatorPromise = null;
    session.currentProvider = null;
    this.workerService.broadcastProcessingStatus();

    // Crash recovery logic...
    if (!wasAborted) {
      // Check for pending work and potentially restart
    }
  });

3.4 Graceful Shutdown Implementation

The existing shutdown mechanism in GracefulShutdown.ts (lines 49-90) does handle child processes, but only during daemon shutdown:

export async function performGracefulShutdown(config: GracefulShutdownConfig): Promise<void> {
  // STEP 1: Enumerate all child processes BEFORE we start closing things
  const childPids = await getChildProcesses(process.pid);

  // ... other cleanup steps ...

  // STEP 6: Force kill any remaining child processes (Windows zombie port fix)
  if (childPids.length > 0) {
    for (const pid of childPids) {
      await forceKillProcess(pid);
    }
    await waitForProcessesExit(childPids, 5000);
  }
}

Critical Gap: This cleanup only runs when the daemon itself shuts down, not when individual SDK sessions complete.


4. Impact Assessment

4.1 Resource Consumption

Metric                              Value
Leaked processes                    121
Total RSS                           ~44GB
Average per process                 ~372MB
Accumulation rate                   ~20 processes/hour
Time to exhaustion (64GB system)    ~3 hours beyond the observed ~44GB state (~9 hours from daemon start)
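
The figures in this table follow from simple arithmetic on the reported numbers. The sketch below reproduces them under two simplifying assumptions that are not stated in the issue: growth is linear, and memory used by other processes is ignored. Under those assumptions, the ~3 hour figure corresponds to the time remaining from the observed ~44GB state; measured from daemon start, the same inputs give roughly 8.7 hours.

```typescript
// Figures reported in issue #603
const leakedProcesses = 121;
const avgRssMb = 372;
const observedHours = 6;
const systemRamGb = 64; // system size used in the table

// Derived values (linear-growth assumption, other memory use ignored)
const totalRssGb = (leakedProcesses * avgRssMb) / 1024;              // about 44.0
const processesPerHour = leakedProcesses / observedHours;            // about 20.2
const growthGbPerHour = totalRssGb / observedHours;                  // about 7.3
const hoursFromStart = systemRamGb / growthGbPerHour;                // about 8.7
const hoursRemaining = (systemRamGb - totalRssGb) / growthGbPerHour; // about 2.7

console.log({
  totalRssGb: totalRssGb.toFixed(1),
  processesPerHour: processesPerHour.toFixed(1),
  growthGbPerHour: growthGbPerHour.toFixed(1),
  hoursFromStart: hoursFromStart.toFixed(1),
  hoursRemaining: hoursRemaining.toFixed(1),
});
```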

4.2 System Effects

  1. Memory Exhaustion: Systems with limited RAM will experience OOM conditions
  2. Performance Degradation: Swap thrashing as memory fills
  3. Process Table Pollution: Maximum PID limits may be approached
  4. User Experience: System becomes unresponsive during extended sessions

4.3 Affected Platforms

  • Linux (confirmed): Ubuntu reported by issue author
  • macOS (likely): Same process spawning mechanism
  • Windows (potentially different): Uses different child process tracking

5. Root Cause Analysis

5.1 Primary Root Cause

The SDK's query() function spawns a child claude process that is not being explicitly terminated when the async iterator completes.

The SDKAgent.startSession() method:

  1. Creates an async generator via query()
  2. Iterates over messages via for await (const message of queryResult)
  3. When iteration completes (naturally or via abort), the generator resolves
  4. No explicit cleanup of the underlying child process occurs
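
To illustrate where such cleanup would normally live, the sketch below shows the general async-generator pattern in which a `finally` block releases a resource whether the consumer finishes, breaks out early, or throws. `FakeChild`, `messagesWithCleanup`, and `consume` are hypothetical stand-ins; this does not show how the SDK's query() is implemented, only the kind of cleanup hook the steps above find missing.

```typescript
// Stand-in for a resource such as a subprocess. Purely illustrative.
class FakeChild {
  killed = false;
  kill(): void {
    this.killed = true;
  }
}

// An async generator that owns a resource and releases it in `finally`.
async function* messagesWithCleanup(
  child: FakeChild,
  messages: string[],
): AsyncGenerator<string> {
  try {
    for (const message of messages) {
      yield message;
    }
  } finally {
    // Runs on normal completion, on early exit by the consumer, and on error.
    child.kill();
  }
}

// Consumes up to `stopAfter` messages, then leaves the loop early.
async function consume(child: FakeChild, stopAfter: number): Promise<string[]> {
  const seen: string[] = [];
  for await (const message of messagesWithCleanup(child, ['a', 'b', 'c'])) {
    seen.push(message);
    if (seen.length === stopAfter) break; // break calls the generator's return()
  }
  return seen;
}

const demoChild = new FakeChild();
consume(demoChild, 2).then(seen => {
  console.log(seen, demoChild.killed);
});
```

If the iterator returned by query() does not perform an equivalent step internally, the caller has to supply it, which is what the solutions in this report aim to do.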

5.2 Contributing Factors

  1. No Child Process Tracking: The codebase does not maintain a registry of spawned child processes during normal operation - only during shutdown enumeration.

  2. AbortController Not Triggering Process Kill: While sessions have an abortController, signaling abort to the SDK iterator does not guarantee the underlying claude process terminates.

  3. Generator Finally Block Missing Process Cleanup: The finally block in SessionRoutes.startGeneratorWithProvider() handles state cleanup but does not explicitly kill child processes.

  4. SDK Abstraction Hiding Process Details: The @anthropic-ai/claude-agent-sdk abstracts the subprocess management, making it difficult to access and terminate the child process directly.

5.3 Code Path Analysis

User Session Complete
        |
        v
SDKAgent.startSession() completes for-await loop
        |
        v
Generator promise resolves
        |
        v
SessionRoutes finally block executes
        |
        +-- session.generatorPromise = null
        +-- session.currentProvider = null
        +-- broadcastProcessingStatus()
        +-- Check pending work
        |
        v
[MISSING] Child process termination
        |
        v
Claude subprocess continues running (LEAKED)

6. Recommended Solutions

6.1 Solution A: SDK-Level Child Process Tracking (Preferred)

Add explicit child process tracking to the SDKAgent class:

// src/services/worker/SDKAgent.ts

export class SDKAgent {
  private activeChildProcesses: Map<number, { pid: number, sessionDbId: number }> = new Map();

  async startSession(session: ActiveSession, worker?: WorkerRef): Promise<void> {
    // Before query(), track that we're about to spawn
    const queryResult = query({...});

    // After first message, capture the PID if available
    // Note: May require SDK modification to expose PID

    try {
      for await (const message of queryResult) {
        // ... existing message handling
      }
    } finally {
      // Cleanup: Kill any child process for this session
      this.cleanupSessionProcess(session.sessionDbId);
    }
  }

  private cleanupSessionProcess(sessionDbId: number): void {
    // Find and terminate process for this session
    // Requires either SDK enhancement or platform-specific process enumeration
  }
}

Challenges: The SDK does not currently expose the child process PID.
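
Assuming a PID can eventually be obtained (from the SDK or by enumeration), the tracking side could be sketched as a small registry. Everything here is hypothetical: `ChildProcessRegistry`, `register`, and `cleanupSession` are illustrative names, and the kill function is injected so the sketch does not depend on any particular process API.

```typescript
// Hypothetical registry; names are illustrative and not part of claude-mem or the SDK.
type KillFn = (pid: number) => void;

class ChildProcessRegistry {
  private readonly pidsBySession = new Map<number, Set<number>>();
  private readonly kill: KillFn;

  constructor(kill: KillFn) {
    this.kill = kill;
  }

  // Records that `pid` was spawned on behalf of the given session.
  register(sessionDbId: number, pid: number): void {
    const pids = this.pidsBySession.get(sessionDbId) ?? new Set<number>();
    pids.add(pid);
    this.pidsBySession.set(sessionDbId, pids);
  }

  // Terminates every process recorded for the session and forgets them.
  // Returns the PIDs for which termination was attempted.
  cleanupSession(sessionDbId: number): number[] {
    const pids = Array.from(this.pidsBySession.get(sessionDbId) ?? []);
    for (const pid of pids) {
      try {
        this.kill(pid);
      } catch {
        // The process may already have exited; nothing more to do.
      }
    }
    this.pidsBySession.delete(sessionDbId);
    return pids;
  }
}
```

In the worker, the injected function could be something like `pid => process.kill(pid, 'SIGTERM')`, with `cleanupSession` called from the `finally` block shown above.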

6.2 Solution B: Session-Level Process Enumeration and Cleanup

Add process cleanup to the session completion flow:

// src/services/worker/http/routes/SessionRoutes.ts

private async startGeneratorWithProvider(session, provider, source): Promise<void> {
  const parentPid = process.pid;
  // Snapshot existing claude children first; the method must be async to await this
  const preExistingPids = new Set(await getChildProcessesForSession(parentPid, 'claude'));

  session.generatorPromise = agent.startSession(session, this.workerService)
    .finally(async () => {
      // Find new child processes that appeared during this session
      const currentPids = await getChildProcessesForSession(parentPid, 'claude');
      const newPids = currentPids.filter(pid => !preExistingPids.has(pid));

      // Terminate orphaned processes
      for (const pid of newPids) {
        await forceKillProcess(pid);
      }

      // ... existing cleanup
    });
}
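
The before/after comparison at the core of this approach can be isolated into a pure function, which makes it easy to test without spawning anything. `newPidsSince` is a hypothetical name used only for this sketch.

```typescript
// Hypothetical helper: returns the PIDs present in `after` but not in `before`.
function newPidsSince(before: Iterable<number>, after: Iterable<number>): number[] {
  const known = new Set(before);
  return Array.from(after).filter(pid => !known.has(pid));
}

// Example: two children existed before the session, one more appeared during it.
console.log(newPidsSince([1927, 2834], [1927, 2834, 3988])); // prints: [ 3988 ]
```

If more than one session can be active at the same time, a plain before/after diff may also pick up children that belong to other sessions, so the result would need to be narrowed further before anything is killed.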

6.3 Solution C: Periodic Orphan Reaper (Mitigation)

Add a background task that periodically identifies and terminates leaked processes:

// src/services/worker/OrphanReaper.ts

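export class OrphanReaper {
  // ReturnType<typeof setInterval> avoids depending on a specific Node timer type
  private interval: ReturnType<typeof setInterval> | null = null;

  start(intervalMs: number = 60000): void {
    if (this.interval) return; // already running
    this.interval = setInterval(async () => {
      try {
        const orphans = await this.findOrphanedClaudeProcesses();
        for (const pid of orphans) {
          await forceKillProcess(pid);
        }
      } catch (error) {
        // A failed sweep must not surface as an unhandled rejection; retry on the next tick
      }
    }, intervalMs);
  }

  stop(): void {
    if (this.interval) clearInterval(this.interval);
    this.interval = null;
  }

  private async findOrphanedClaudeProcesses(): Promise<number[]> {
    // Find claude processes parented to the worker daemon
    // that have been running longer than expected (e.g., > 30 minutes)
    return []; // placeholder until process enumeration is implemented
  }
}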

Pros: Works without SDK modifications
Cons: Reactive rather than proactive; a leaked process survives until it exceeds the age threshold and the next sweep runs
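
The selection step of the reaper could be sketched as follows, assuming process rows have already been collected from `ps` with the `etime` column, whose format is `[[DD-]hh:]mm:ss`. The names `PsRow`, `parseEtimeSeconds`, and `findOrphans` are hypothetical, and matching on a command named `claude` is an assumption based on the leaked process listing in Section 2.1.

```typescript
// Hypothetical row shape for already-parsed `ps` output.
interface PsRow {
  pid: number;
  ppid: number;
  etime: string;   // elapsed time as printed by ps: [[DD-]hh:]mm:ss
  command: string; // full command line
}

// Converts an etime string into seconds; returns NaN if the format is unexpected.
function parseEtimeSeconds(etime: string): number {
  const match = /^(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+)$/.exec(etime.trim());
  if (!match) return NaN;
  const [, days = '0', hours = '0', minutes = '0', seconds = '0'] = match;
  return Number(days) * 86400 + Number(hours) * 3600 + Number(minutes) * 60 + Number(seconds);
}

// Selects claude processes parented to the daemon that are older than the threshold.
// Rows with an unparseable etime are never selected, because NaN > n is false.
function findOrphans(rows: PsRow[], daemonPid: number, maxAgeSeconds: number): number[] {
  const isClaude = (command: string): boolean => /(^|\/)claude(\s|$)/.test(command);
  return rows
    .filter(row => row.ppid === daemonPid)
    .filter(row => isClaude(row.command))
    .filter(row => parseEtimeSeconds(row.etime) > maxAgeSeconds)
    .map(row => row.pid);
}

const exampleRows: PsRow[] = [
  { pid: 1927, ppid: 4118969, etime: '06:21:16', command: 'claude --output-format stream-json' },
  { pid: 9001, ppid: 4118969, etime: '05:12', command: 'claude --output-format stream-json' },
];
console.log(findOrphans(exampleRows, 4118969, 30 * 60)); // prints: [ 1927 ]
```

An age threshold alone cannot distinguish a leaked process from a legitimately long-running session, so the threshold would need to be chosen conservatively.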

6.4 Solution D: Request SDK Enhancement

File an issue with the Claude Agent SDK requesting:

  1. Exposure of child process PID in query result
  2. Built-in cleanup on iterator completion
  3. Explicit close() or terminate() method

6.5 Recommended Approach

  1. Immediate (P0): Implement Solution C (Orphan Reaper) as a mitigation
  2. Short-term (P1): Implement Solution B (Session-Level Cleanup)
  3. Medium-term (P2): Pursue Solution D (SDK Enhancement) with Anthropic
  4. Long-term (P3): Implement Solution A once SDK provides PID access

7. Priority/Severity Assessment

7.1 Severity: Critical

  • Data Loss: No
  • System Instability: Yes - memory exhaustion
  • User Impact: High - system becomes unusable
  • Scope: All users with extended sessions

7.2 Priority: P0 - Immediate

  • Frequency: Every session creates leaked processes
  • Accumulation: Unbounded growth
  • Workaround: Manual daemon restart (disruptive)
  • Business Impact: Renders product unusable for long sessions

7.3 Effort Estimate

Solution             Effort               Risk
Orphan Reaper (C)    2-4 hours            Low
Session Cleanup (B)  4-8 hours            Medium
SDK Enhancement (D)  External dependency  -
Full Tracking (A)    8-16 hours           Medium

8. References

  • Issue: https://github.com/thedotmack/claude-mem/issues/603
  • Source Files:
    • /src/services/worker/SDKAgent.ts - SDK query invocation
    • /src/services/worker/SessionManager.ts - Session lifecycle
    • /src/services/worker/http/routes/SessionRoutes.ts - Generator management
    • /src/services/infrastructure/ProcessManager.ts - Process utilities
    • /src/services/infrastructure/GracefulShutdown.ts - Shutdown cleanup
  • Related Code:
    • @anthropic-ai/claude-agent-sdk - External SDK spawning processes

9. Appendix: Process Enumeration Reference

Current getChildProcesses Implementation

// src/services/infrastructure/ProcessManager.ts
export async function getChildProcesses(parentPid: number): Promise<number[]> {
  if (process.platform !== 'win32') {
    return [];  // NOTE: Only implemented for Windows!
  }

  // Windows implementation using wmic
  const cmd = `wmic process where "parentprocessid=${parentPid}" get processid /format:list`;
  // ...
}

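Critical Finding: The getChildProcesses function is currently Windows-only and returns an empty array on Linux/macOS. As a result, the child cleanup step in performGracefulShutdown (Section 3.4) finds nothing to kill on those platforms, so the Linux user reporting the issue has no built-in cleanup mechanism, even at daemon shutdown.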

Required Fix for Linux/macOS

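import { exec } from 'node:child_process';
import { promisify } from 'node:util';
const execAsync = promisify(exec);
export async function getChildProcesses(parentPid: number): Promise<number[]> {
  if (process.platform === 'win32') {
    // Existing Windows implementation
  } else {
    // Unix implementation: pgrep -P lists the direct children of parentPid
    try {
      const { stdout } = await execAsync(`pgrep -P ${parentPid}`);
      // Keep positive integers only: Number('') is 0, which must never be signalled
      return stdout.split('\n').map(line => Number(line.trim())).filter(n => Number.isInteger(n) && n > 0);
    } catch {
      return []; // pgrep exits non-zero when nothing matches; treat any failure as no children
    }
  }
}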
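
One detail worth checking in any Unix implementation is how empty command output is parsed. In JavaScript, `Number('')` evaluates to `0` rather than `NaN`, so a filter based only on `isNaN` turns empty output into `[0]`, and on POSIX systems signalling PID 0 targets the caller's whole process group. The sketch below contrasts that pitfall with a stricter parser; `parsePidList` is a hypothetical helper name.

```typescript
// Naive parsing: empty output yields [0], because Number('') is 0, not NaN.
function parsePidListNaive(stdout: string): number[] {
  return stdout.trim().split('\n').map(Number).filter(n => !isNaN(n));
}

// Stricter parsing: accept only lines made of digits, and only positive PIDs.
function parsePidList(stdout: string): number[] {
  return stdout
    .split('\n')
    .map(line => line.trim())
    .filter(line => /^\d+$/.test(line))
    .map(Number)
    .filter(pid => pid > 0);
}

console.log(parsePidListNaive(''));       // prints: [ 0 ]
console.log(parsePidList(''));            // prints: []
console.log(parsePidList('1927\n2834\n')); // prints: [ 1927, 2834 ]
```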

Report prepared by Claude Code analysis of codebase and issue #603