Issue #586: Race Condition in memory_session_id Capture

Report Date: 2026-01-07
Issue: #586
Reporter: rocky2431
Environment: claude-mem 9.0.0, macOS Darwin 24.6.0, Node v22.x / Bun 1.x


1. Executive Summary

This issue describes a critical race condition where new sessions frequently have an empty (NULL) memory_session_id in the sdk_sessions table. This prevents observations from being stored, as the ResponseProcessor requires a valid memorySessionId before processing agent responses.

Key Finding: The race condition occurs because session initialization via handleSessionInitByClaudeId() creates the session with a NULL memory_session_id, but the SDK agent may not have responded yet to provide its session ID when subsequent PostToolUse hooks attempt to store observations.

Error Message:

Cannot store observations: memorySessionId not yet captured

Severity: Critical
Priority: P1
Impact: Sessions with NULL memory_session_id cannot store any observations, leading to data loss and incomplete session history.


2. Problem Analysis

2.1 Error Manifestation

The error originates from ResponseProcessor.ts (lines 73-75):

// CRITICAL: Must use memorySessionId (not contentSessionId) for FK constraint
if (!session.memorySessionId) {
  throw new Error('Cannot store observations: memorySessionId not yet captured');
}

2.2 Observed Symptoms

  1. Log Evidence:

    [2026-01-07 04:02:39.872] [INFO ] [SESSION] [session-14379] Session initialized
    {project=claude-task-master, contentSessionId=a48d7f90-27e4-4a1d-b379-bf2195ee333e,
    queueDepth=0, hasGenerator=false}
    

    Note: contentSessionId is present but memorySessionId is missing.

  2. Database State:

    SELECT id, memory_session_id, project FROM sdk_sessions ORDER BY id DESC LIMIT 5;
    
    14379 | (NULL) | claude-task-master   -- Missing!
    14293 | 090b5397-... | .claude        -- OK
    14285 | (NULL) | .claude              -- Missing!
    
  3. Queue Accumulation:

    • Observations are enqueued to pending_messages table
    • Hundreds of unprocessed items accumulate
    • Only user prompts are recorded, no AI analysis

2.3 Race Condition Timeline

Time T0: UserPromptSubmit hook triggers
         └─> new-hook.ts calls /api/sessions/init
             └─> createSDKSession() creates row with memory_session_id = NULL

Time T1: PostToolUse hook triggers (user action)
         └─> save-hook.ts calls /api/sessions/observations
             └─> Observation queued to pending_messages

Time T2: SDK Agent generator starts
         └─> Waiting for first message from Claude SDK

Time T3: First SDK message arrives (RACE CONDITION WINDOW)
         └─> updateMemorySessionId() called with captured ID
         └─> Database updated: memory_session_id = "sdk-gen-abc123"

Time T4: SDK Agent attempts to process queued observations
         └─> processAgentResponse() checks session.memorySessionId
         └─> If NULL (not yet updated): ERROR thrown

The Problem: If PostToolUse events arrive during the window between session creation (T0) and SDK session ID capture (T3), the ResponseProcessor will fail because memorySessionId is still NULL.
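
The ordering problem can be reproduced in isolation. The following is a minimal sketch, not claude-mem code: `SketchSession`, `storeObservation`, `captureMemorySessionId`, and `runTimeline` are hypothetical stand-ins for the session object, the ResponseProcessor guard, and the SDKAgent capture step.

```typescript
// Hypothetical, simplified model of the race; all names are illustrative only.
interface SketchSession {
  contentSessionId: string;
  memorySessionId: string | null; // NULL until the first SDK message arrives
}

// Mirrors the guard in ResponseProcessor.ts: refuse to store without a captured ID.
function storeObservation(session: SketchSession, stored: string[], text: string): void {
  if (!session.memorySessionId) {
    throw new Error('Cannot store observations: memorySessionId not yet captured');
  }
  stored.push(session.memorySessionId + ':' + text);
}

// Mirrors the capture step in SDKAgent.ts: only the first ID is kept.
function captureMemorySessionId(session: SketchSession, sdkSessionId: string): void {
  if (!session.memorySessionId) {
    session.memorySessionId = sdkSessionId;
  }
}

// Runs processing and capture in either order and reports the outcome.
function runTimeline(processBeforeCapture: boolean): { stored: string[]; error: string | null } {
  const session: SketchSession = { contentSessionId: 'content-1', memorySessionId: null };
  const stored: string[] = [];
  let error: string | null = null;
  try {
    if (processBeforeCapture) {
      storeObservation(session, stored, 'obs'); // processing wins the race: throws
      captureMemorySessionId(session, 'sdk-gen-abc123');
    } else {
      captureMemorySessionId(session, 'sdk-gen-abc123'); // capture wins: succeeds
      storeObservation(session, stored, 'obs');
    }
  } catch (e) {
    error = e instanceof Error ? e.message : String(e);
  }
  return { stored, error };
}
```

Calling `runTimeline(true)` yields the error reported in this issue with nothing stored, while `runTimeline(false)` stores the observation under the captured ID.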


3. Technical Details

3.1 Session ID Architecture

Claude-mem uses a dual session ID system (documented in docs/SESSION_ID_ARCHITECTURE.md):

| ID | Purpose | Source | Initial Value |
|---|---|---|---|
| contentSessionId | User's Claude Code conversation ID | Hook system | Set immediately |
| memorySessionId | Memory agent's internal session ID | SDK response | NULL (captured later) |
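
One way to make the NULL initial value explicit at the type level is a type guard. This is only a sketch under assumptions: `SessionIds`, `CapturedSessionIds`, and `hasCapturedMemorySessionId` are hypothetical names and do not exist in claude-mem.

```typescript
// Hypothetical types illustrating the dual session ID system.
interface SessionIds {
  contentSessionId: string;       // set immediately by the hook system
  memorySessionId: string | null; // NULL until captured from the SDK response
}

// Same shape, but with the memory session ID known to be present.
interface CapturedSessionIds extends SessionIds {
  memorySessionId: string;
}

// Narrows a session to the "captured" shape so callers cannot forget the check.
function hasCapturedMemorySessionId(ids: SessionIds): ids is CapturedSessionIds {
  return typeof ids.memorySessionId === 'string' && ids.memorySessionId.length > 0;
}
```

Code that stores observations could then accept only `CapturedSessionIds`, which would move the NULL check from a runtime throw to a compile-time requirement.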

3.2 Session Creation Flow

File: src/services/sqlite/sessions/create.ts (lines 24-47)

export function createSDKSession(
  db: Database,
  contentSessionId: string,
  project: string,
  userPrompt: string
): number {
  // Pure INSERT OR IGNORE - no updates, no complexity
  // NOTE: memory_session_id starts as NULL. It is captured by SDKAgent from the first SDK
  // response and stored via updateMemorySessionId(). CRITICAL: memory_session_id must NEVER
  // equal contentSessionId - that would inject memory messages into the user's transcript!
  db.prepare(`
    INSERT OR IGNORE INTO sdk_sessions
    (content_session_id, memory_session_id, project, user_prompt, started_at, started_at_epoch, status)
    VALUES (?, NULL, ?, ?, ?, ?, 'active')
  `).run(contentSessionId, project, userPrompt, now.toISOString(), nowEpoch);
  // ...
}

3.3 Memory Session ID Capture

File: src/services/worker/SDKAgent.ts (lines 117-141)

// Process SDK messages
for await (const message of queryResult) {
  // Capture memory session ID from first SDK message (any type has session_id)
  if (!session.memorySessionId && message.session_id) {
    session.memorySessionId = message.session_id;
    // Persist to database for cross-restart recovery
    this.dbManager.getSessionStore().updateMemorySessionId(
      session.sessionDbId,
      message.session_id
    );
    // ... verification logging ...
  }
  // ...
}

3.4 Response Processor Validation

File: src/services/worker/agents/ResponseProcessor.ts (lines 72-75)

// CRITICAL: Must use memorySessionId (not contentSessionId) for FK constraint
if (!session.memorySessionId) {
  throw new Error('Cannot store observations: memorySessionId not yet captured');
}

3.5 Session Manager Initialization

File: src/services/worker/SessionManager.ts (lines 127-143)

// Create active session
// Load memorySessionId from database if previously captured (enables resume across restarts)
session = {
  sessionDbId,
  contentSessionId: dbSession.content_session_id,
  memorySessionId: dbSession.memory_session_id || null,  // NULL initially!
  // ...
};

4. Impact Assessment

4.1 Direct Impact

| Impact Area | Description |
|---|---|
| Data Loss | Observations queued during race window are never stored |
| Queue Growth | pending_messages table grows unbounded |
| User Experience | Session history incomplete - only prompts, no analysis |
| System Load | Repeated retry attempts consume resources |

4.2 Frequency

The issue appears intermittent - some sessions initialize correctly while others fail. The race condition depends on:

  • System load
  • Claude SDK response latency
  • Hook timing relative to SDK startup

4.3 Related Issues

  • Issue #520 (CLOSED): Stuck messages in 'processing' status - similar queue recovery problem
  • Issue #591: OpenRouter Agent fails to capture memorySessionId - architectural gap for stateless providers

5. Root Cause Analysis

5.1 Primary Root Cause

Architectural Timing Gap: The session initialization API (/api/sessions/init) creates sessions with a NULL memory_session_id, expecting the SDK agent to capture it from the first response. However, there is no synchronization mechanism to prevent observation processing before this capture occurs.

5.2 Contributing Factors

  1. Asynchronous SDK Agent Startup: The generator starts asynchronously without blocking the hook response
  2. No Capture Wait Mechanism: Observations are queued immediately without waiting for memorySessionId capture
  3. Strict Validation in ResponseProcessor: The processor throws an error rather than handling the NULL case gracefully
  4. No Retry Logic: Failed observations due to missing memorySessionId are not retried after capture

5.3 Timing Window Analysis

Hook Execution Timeline:
├─ new-hook.ts (UserPromptSubmit)
│   ├─ POST /api/sessions/init → createSDKSession(memory_session_id=NULL)
│   └─ POST /sessions/{id}/init → startSession() [async, non-blocking]
│
├─ [RACE CONDITION WINDOW OPENS]
│   └─ SDK agent waiting for Claude response
│
├─ save-hook.ts (PostToolUse) ← CAN TRIGGER DURING WINDOW
│   └─ POST /api/sessions/observations
│       └─ Queued, will fail when processed
│
├─ [SDK FIRST MESSAGE ARRIVES]
│   └─ updateMemorySessionId(captured_id)
│       └─ Database updated, session.memorySessionId set
│
├─ [RACE CONDITION WINDOW CLOSES]
│
└─ Subsequent observations process successfully

6. Proposed Solutions

6.1 Solution A: Retry Mechanism

If memorySessionId is not available, wait briefly with exponential backoff:

// In processAgentResponse():
async function waitForMemorySessionId(
  session: ActiveSession,
  dbManager: DatabaseManager,
  maxRetries: number = 5,
  baseDelayMs: number = 100
): Promise<boolean> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    if (session.memorySessionId) return true;

    // Check database for updates
    const dbSession = dbManager.getSessionById(session.sessionDbId);
    if (dbSession?.memory_session_id) {
      session.memorySessionId = dbSession.memory_session_id;
      return true;
    }

    await new Promise(resolve => setTimeout(resolve, baseDelayMs * Math.pow(2, attempt)));
  }
  return false;
}

// Usage:
const captured = await waitForMemorySessionId(session, dbManager);
if (!captured) {
  throw new Error('Cannot store observations: memorySessionId not yet captured after retries');
}

Pros:

  • Non-breaking change
  • Handles timing variations gracefully
  • Minimal code modification

Cons:

  • Adds latency in worst case
  • Polling-based solution
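
To put a number on the worst-case latency: with the defaults of `waitForMemorySessionId` (maxRetries = 5, baseDelayMs = 100), the loop sleeps after every failed check. The helpers below are hypothetical illustrations, not part of the proposal; they only reproduce the delay formula `baseDelayMs * Math.pow(2, attempt)`.

```typescript
// Hypothetical helper: lists the sleep duration after each failed attempt.
function backoffDelays(maxRetries: number = 5, baseDelayMs: number = 100): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(baseDelayMs * Math.pow(2, attempt));
  }
  return delays;
}

// Hypothetical helper: total time spent sleeping if the ID is never captured.
function worstCaseWaitMs(maxRetries: number = 5, baseDelayMs: number = 100): number {
  return backoffDelays(maxRetries, baseDelayMs).reduce((sum, delay) => sum + delay, 0);
}
```

With the defaults the delays are 100, 200, 400, 800, and 1600 ms, so a session that never captures its ID blocks for about 3.1 seconds before the error is thrown. As written, the final 1600 ms sleep is not followed by another check, so checking once more after the loop (or skipping the last sleep) would be worth considering.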

6.2 Solution B: Lazy Capture on First PostToolUse

Capture memorySessionId on the first PostToolUse if not already set:

// In handleObservationsByClaudeId():
if (!session.memorySessionId && session.contentSessionId) {
  // Generate a placeholder that will be updated when SDK responds
  const tempId = `pending-${session.contentSessionId}`;
  session.memorySessionId = tempId;
  store.updateMemorySessionId(sessionDbId, tempId);
  logger.warn('SESSION', 'Generated temporary memorySessionId', { tempId });
}

Pros:

  • Immediate resolution
  • No retry delays

Cons:

  • Temporary IDs may cause confusion
  • Requires updating when real ID is captured
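
The second drawback has a concrete consequence: the capture guard quoted in section 3.3 only assigns when memorySessionId is falsy, so a `pending-` placeholder would block capture of the real ID unless that guard is relaxed. A minimal sketch of a relaxed guard, assuming the `pending-` prefix from the snippet above; `isPlaceholderId` and `shouldCaptureMemorySessionId` are hypothetical helpers, not existing claude-mem functions.

```typescript
// Assumption: placeholders are generated as `pending-<contentSessionId>`.
const PLACEHOLDER_PREFIX = 'pending-';

// True when the stored ID is a temporary placeholder rather than a real SDK ID.
function isPlaceholderId(id: string | null): boolean {
  return id !== null && id.indexOf(PLACEHOLDER_PREFIX) === 0;
}

// Capture when nothing is stored yet, or when only a placeholder is stored.
function shouldCaptureMemorySessionId(current: string | null, incoming: string | undefined): boolean {
  if (!incoming) return false;
  return !current || isPlaceholderId(current);
}
```

Any rows already written under the placeholder would also need to be re-keyed to the real ID, which this sketch does not cover.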

6.3 Solution C: Use contentSessionId as Fallback

For initial observations before SDK capture, use contentSessionId:

// In processAgentResponse():
const effectiveMemorySessionId = session.memorySessionId || session.contentSessionId;

Pros:

  • Simple implementation
  • No timing issues

Cons:

  • Violates architectural principle that memorySessionId should differ from contentSessionId
  • Risk of FK constraint issues
  • May cause resume problems
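
The first drawback can be made mechanical. The comment quoted in section 3.2 states that memory_session_id must never equal contentSessionId; a guard like the following would reject exactly the value that this fallback produces. `assertDistinctSessionIds` is a hypothetical helper and not part of claude-mem.

```typescript
// Hypothetical invariant check: the two session IDs must never be the same value.
function assertDistinctSessionIds(contentSessionId: string, memorySessionId: string | null): void {
  if (memorySessionId !== null && memorySessionId === contentSessionId) {
    throw new Error('memorySessionId must never equal contentSessionId');
  }
}
```

A NULL memorySessionId passes this check, since it only means capture has not happened yet; the fallback value `session.memorySessionId || session.contentSessionId` fails it whenever capture is still pending.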

6.4 Solution D: Block Until memorySessionId is Captured

Modify handleObservationsByClaudeId to wait for SDK capture:

// In handleObservationsByClaudeId():
const session = this.sessionManager.getSession(sessionDbId);
if (!session?.memorySessionId) {
  // Return a "pending" response, client should retry
  res.status(202).json({
    status: 'pending',
    reason: 'awaiting_memory_session_id',
    retryAfterMs: 500
  });
  return;
}

Pros:

  • Explicit handling
  • Client-controlled retry

Cons:

  • Requires hook changes
  • May cause hook timeout
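
The hook change this would require can be sketched as a small decision function. This assumes the 202 response body shown above; `PendingBody`, `HookAction`, `decideHookAction`, and the `maxAttempts` default are hypothetical and chosen only for illustration.

```typescript
// Shape of the response body from the snippet above (fields beyond status are optional).
interface PendingBody {
  status: string;
  reason?: string;
  retryAfterMs?: number;
}

type HookAction =
  | { kind: 'done' }                     // not a pending response; existing handling applies
  | { kind: 'retry'; delayMs: number }   // wait, then resend the observation
  | { kind: 'give_up' };                 // retry budget exhausted

// attempt is the zero-based index of the request that just completed.
function decideHookAction(
  httpStatus: number,
  body: PendingBody,
  attempt: number,
  maxAttempts: number = 3
): HookAction {
  const pending = httpStatus === 202 && body.status === 'pending';
  if (!pending) return { kind: 'done' };
  if (attempt + 1 >= maxAttempts) return { kind: 'give_up' };
  const delayMs = typeof body.retryAfterMs === 'number' ? body.retryAfterMs : 500;
  return { kind: 'retry', delayMs };
}
```

The retry budget would have to be chosen so that the summed delays stay below the hook's timeout, which is the second drawback listed above.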

6.5 Recommended Solution

Solution A is recommended because:

  1. Handles the race condition transparently
  2. Minimal impact on existing code
  3. Self-healing behavior (retries with backoff until the ID is captured or the retry limit is reached)
  4. Maintains architectural integrity
  5. Low regression risk

7. Priority/Severity Assessment

7.1 Severity Matrix

| Factor | Assessment |
|---|---|
| Data Loss | High - Observations lost during race window |
| Functionality | Partial - Some sessions work, some don't |
| Frequency | Intermittent - Depends on system timing |
| Workaround | Manual SQL fix available |
| Affected Users | All users under specific timing conditions |

7.2 Priority Assignment

Priority: P1 (High)

Rationale:

  • Silent data loss is occurring
  • Affects core functionality (observation storage)
  • Unpredictable - users may not know data is being lost
  • Fix is straightforward with low regression risk

| Action | Timeline |
|---|---|
| Implement Solution A | 2-4 hours |
| Unit tests | 1 hour |
| Integration tests | 1 hour |
| Code review | 30 minutes |
| Release | Same day |

8. Workaround

Users experiencing this issue can manually fix affected sessions:

-- Find sessions with missing memory_session_id
SELECT id, content_session_id, project
FROM sdk_sessions
WHERE memory_session_id IS NULL;

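-- Option 1: Use content_session_id as memory_session_id (not recommended)
-- WARNING: Breaks the rule that memory_session_id must never equal content_session_id
-- (see section 3.2) and may cause issues with session resume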
UPDATE sdk_sessions
SET memory_session_id = content_session_id
WHERE id = <sessionDbId> AND memory_session_id IS NULL;

-- Option 2: Generate a unique ID
UPDATE sdk_sessions
SET memory_session_id = 'manual-' || content_session_id
WHERE id = <sessionDbId> AND memory_session_id IS NULL;

Important: After applying the workaround, the worker must be restarted to pick up the new memory_session_id values.


9. Testing Recommendations

9.1 Unit Tests

describe('ResponseProcessor memorySessionId handling', () => {
  it('should wait for memorySessionId capture with retry', async () => {
    const session = createMockSession({ memorySessionId: null });

    // Simulate delayed capture
    setTimeout(() => {
      session.memorySessionId = 'captured-id';
    }, 200);

    await expect(
      processAgentResponse(text, session, dbManager, sessionManager, worker, 0, null, 'Test')
    ).resolves.not.toThrow();
  });

  it('should throw after max retries if memorySessionId never captured', async () => {
    const session = createMockSession({ memorySessionId: null });

    await expect(
      processAgentResponse(text, session, dbManager, sessionManager, worker, 0, null, 'Test')
    ).rejects.toThrow('memorySessionId not yet captured after retries');
  });
});

9.2 Integration Tests

describe('Session initialization race condition', () => {
  it('should handle rapid PostToolUse events during SDK startup', async () => {
    // Create session
    const sessionDbId = store.createSDKSession(contentSessionId, project, prompt);

    // Immediately queue observations (before SDK responds)
    for (let i = 0; i < 5; i++) {
      sessionManager.queueObservation(sessionDbId, {
        tool_name: 'Read',
        tool_input: { file_path: '/test.txt' },
        tool_response: { content: 'test' },
        prompt_number: 1,
        cwd: '/test'
      });
    }

    // Start SDK agent (will capture memorySessionId)
    await sdkAgent.startSession(session, worker);

    // Verify all observations were stored
    const stored = db.prepare('SELECT COUNT(*) as count FROM observations WHERE memory_session_id = ?')
      .get(session.memorySessionId);
    expect(stored.count).toBeGreaterThanOrEqual(5);
  });
});

10. Related Files

| File | Relevance |
|---|---|
| src/services/worker/agents/ResponseProcessor.ts | Error origin (lines 73-75), primary fix location |
| src/services/worker/SessionManager.ts | Session initialization with NULL memorySessionId |
| src/services/worker/SDKAgent.ts | memorySessionId capture logic |
| src/services/sqlite/sessions/create.ts | Session creation with NULL memory_session_id |
| src/hooks/new-hook.ts | Session initialization hook |
| src/hooks/save-hook.ts | PostToolUse observation queueing |
| docs/SESSION_ID_ARCHITECTURE.md | Architecture documentation |

11. Conclusion

Issue #586 describes a critical race condition in the session initialization process where memory_session_id is not captured before observations are processed. This results in silent data loss as observations fail to store with the error "Cannot store observations: memorySessionId not yet captured".

The recommended fix is to implement a retry mechanism in ResponseProcessor.processAgentResponse() that waits for the memorySessionId to be captured, with exponential backoff. This approach:

  • Maintains the existing architectural integrity
  • Handles timing variations gracefully
  • Has low regression risk
  • Is straightforward to implement and test

Immediate Action Required: Implement Solution A (Retry Mechanism) and release a hotfix to prevent ongoing data loss.