claude-mem/docs/public/usage/manual-recovery.mdx
Alex Newman 266c746d50 feat: Fix observation timestamps, refactor session management, and enhance worker reliability (#437)
2025-12-25 15:36:46 -05:00


---
title: "Manual Recovery"
description: "Recover stuck observations after worker crashes or restarts"
---
# Manual Recovery Guide
## Overview
Claude-mem's manual recovery system helps you recover observations that get stuck in the processing queue after worker crashes, system restarts, or unexpected shutdowns.
**Key Change in v5.x**: Automatic recovery on worker startup is now disabled. This gives you explicit control over when reprocessing happens, preventing unexpected duplicate observations.
## When Do You Need Manual Recovery?
You should trigger manual recovery when:
- **Worker crashed or restarted** - Observations were queued but worker stopped before processing
- **No new summaries appearing** - Observations are being saved but not processed into summaries
- **Stuck messages detected** - Messages showing as "processing" for >5 minutes
- **System crashes** - Unexpected shutdowns left messages in incomplete states
## Quick Start
### Using the CLI Tool (Recommended)
The interactive CLI tool is the safest and easiest way to recover stuck observations:
```bash
# Check status and prompt for recovery
bun scripts/check-pending-queue.ts
```
This will:
1. Check worker health
2. Show queue summary (pending, processing, failed, stuck counts)
3. Display sessions with pending work
4. Prompt you to confirm recovery
5. Show recently processed messages for feedback
### Auto-Process Without Prompts
For scripting or when you're confident recovery is needed:
```bash
# Auto-process without prompting
bun scripts/check-pending-queue.ts --process
# Limit to 5 sessions
bun scripts/check-pending-queue.ts --process --limit 5
```
## Understanding Queue States
A message moves from `pending` to `processing` and then ends in one of two terminal states, `processed` or `failed`:
1. **pending** → Queued, waiting to process
2. **processing** → Currently being processed by SDK agent
3. **processed** → Completed successfully
4. **failed** → Failed after 3 retry attempts
### Stuck Detection
Messages in `processing` state for **>5 minutes** are considered stuck:
- They're automatically reset to `pending` on worker startup
- They're NOT automatically reprocessed (requires manual trigger)
- They appear in the `stuckCount` field when checking queue status
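The stuck rule above can be sketched as a small TypeScript check. This is a minimal illustration, not the worker's actual implementation: the `QueueMessage` shape and the `findStuck` helper are hypothetical, the field names are borrowed from the `pending_messages` columns used elsewhere in this guide, and timestamps are assumed to be epoch milliseconds.

```typescript
// Hypothetical shape; field names mirror the pending_messages columns in this guide.
interface QueueMessage {
  id: number;
  status: "pending" | "processing" | "processed" | "failed";
  started_processing_at_epoch: number | null; // epoch milliseconds (assumed)
}

// The 5-minute threshold described above.
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;

// Return the messages that have been in `processing` longer than the threshold.
function findStuck(messages: QueueMessage[], nowMs: number): QueueMessage[] {
  return messages.filter(
    (m) =>
      m.status === "processing" &&
      m.started_processing_at_epoch !== null &&
      nowMs - m.started_processing_at_epoch > STUCK_THRESHOLD_MS
  );
}

// Illustrative data: one message processing for 2 minutes, one for 7 minutes.
const sampleNow = 1730886600000;
const sampleMessages: QueueMessage[] = [
  { id: 1, status: "processing", started_processing_at_epoch: sampleNow - 2 * 60000 },
  { id: 2, status: "processing", started_processing_at_epoch: sampleNow - 7 * 60000 },
  { id: 3, status: "pending", started_processing_at_epoch: null },
];

// Only message 2 exceeds the threshold.
console.log(findStuck(sampleMessages, sampleNow).map((m) => m.id));
```

Passing `nowMs` in explicitly, rather than calling `Date.now()` inside the helper, keeps the check deterministic and easy to test.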
## Recovery Methods
### Method 1: Interactive CLI Tool
**Best for**: Regular users, interactive sessions, when you want visibility into what's happening
```bash
bun scripts/check-pending-queue.ts
```
**Example Output**:
```
Checking worker health...
Worker is healthy ✓
Queue Summary:
Pending: 12 messages
Processing: 2 messages (1 stuck)
Failed: 0 messages
Recently Processed: 5 messages in last 30 minutes
Sessions with pending work: 3
Session 44: 5 pending, 1 processing (age: 2m)
Session 45: 4 pending, 1 processing (age: 7m - STUCK)
Session 46: 2 pending
Would you like to process these pending queues? (y/n)
```
**Features**:
- ✅ Pre-flight health check (verifies worker is running)
- ✅ Detailed queue breakdown by session
- ✅ Age tracking for stuck detection
- ✅ Confirmation prompt (prevents accidental reprocessing)
- ✅ Non-interactive mode with `--process` flag
- ✅ Session limit control with `--limit N`
### Method 2: HTTP API
**Best for**: Automation, scripting, integration with monitoring systems
#### Check Queue Status
```bash
curl http://localhost:37777/api/pending-queue
```
**Response**:
```json
{
  "queue": {
    "messages": [
      {
        "id": 123,
        "session_db_id": 45,
        "claude_session_id": "abc123",
        "message_type": "observation",
        "status": "pending",
        "retry_count": 0,
        "created_at_epoch": 1730886600000
      }
    ],
    "totalPending": 12,
    "totalProcessing": 2,
    "totalFailed": 0,
    "stuckCount": 1
  },
  "recentlyProcessed": [...],
  "sessionsWithPendingWork": [44, 45, 46]
}
```
**Key Fields**:
- `totalPending` - Messages waiting to process
- `totalProcessing` - Messages currently processing
- `stuckCount` - Processing messages >5 minutes old
- `sessionsWithPendingWork` - Session IDs needing recovery
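As a rough illustration of how these fields might drive a decision, the sketch below turns a status response into a recover-or-wait recommendation. The `PendingQueueStatus` type is inferred from the example response only, and `adviseRecovery` with its policy (recover when something is stuck, or when messages are pending and nothing is processing) is a hypothetical helper, not part of claude-mem.

```typescript
// Shape inferred from the example response above; the real payload may carry more fields.
interface PendingQueueStatus {
  queue: {
    totalPending: number;
    totalProcessing: number;
    totalFailed: number;
    stuckCount: number;
  };
  sessionsWithPendingWork: number[];
}

interface RecoveryAdvice {
  recover: boolean;
  reason: string;
}

// Hypothetical policy: recover when anything is stuck, or when work is
// waiting and nothing is currently processing it.
function adviseRecovery(s: PendingQueueStatus): RecoveryAdvice {
  const { totalPending, totalProcessing, stuckCount } = s.queue;
  if (stuckCount > 0) {
    return { recover: true, reason: `${stuckCount} stuck message(s)` };
  }
  if (totalPending > 0 && totalProcessing === 0) {
    return { recover: true, reason: `${totalPending} pending message(s) and nothing processing` };
  }
  return { recover: false, reason: "queue is empty or still draining" };
}

// Counts taken from the example response above.
const exampleStatus: PendingQueueStatus = {
  queue: { totalPending: 12, totalProcessing: 2, totalFailed: 0, stuckCount: 1 },
  sessionsWithPendingWork: [44, 45, 46],
};

console.log(adviseRecovery(exampleStatus));
```

The thresholds here are deliberately simple; a monitoring setup would likely tune them to its own tolerance for delayed summaries.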
#### Trigger Recovery
```bash
curl -X POST http://localhost:37777/api/pending-queue/process \
  -H "Content-Type: application/json" \
  -d '{"sessionLimit": 10}'
```
**Response**:
```json
{
  "success": true,
  "totalPendingSessions": 15,
  "sessionsStarted": 10,
  "sessionsSkipped": 2,
  "startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
}
```
**Response Fields**:
- `totalPendingSessions` - Total sessions with pending messages in the database
- `sessionsStarted` - Number of sessions this request started processing
- `sessionsSkipped` - Number of sessions skipped because they were already processing (prevents duplicate agents)
- `startedSessionIds` - Database IDs of the sessions this request started
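When a request is capped by `sessionLimit`, the response fields can be combined to estimate how many sessions are still waiting for a later request. The sketch below is a hypothetical helper, and it assumes that skipped sessions are included in `totalPendingSessions`, which this guide does not state explicitly.

```typescript
// Shape taken from the example response above.
interface ProcessResult {
  success: boolean;
  totalPendingSessions: number;
  sessionsStarted: number;
  sessionsSkipped: number;
  startedSessionIds: number[];
}

// Sessions that this request neither started nor skipped.
// Assumption: skipped sessions are counted in totalPendingSessions.
function sessionsRemaining(r: ProcessResult): number {
  return Math.max(0, r.totalPendingSessions - r.sessionsStarted - r.sessionsSkipped);
}

// Values taken from the example response above.
const exampleResult: ProcessResult = {
  success: true,
  totalPendingSessions: 15,
  sessionsStarted: 10,
  sessionsSkipped: 2,
  startedSessionIds: [44, 45, 46, 47, 48, 49, 50, 51, 52, 53],
};

console.log(sessionsRemaining(exampleResult)); // 3
```

A non-zero result suggests issuing another request once the started sessions have finished.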
## Best Practices
### 1. Always Check Before Recovery
```bash
# Check queue status first
curl http://localhost:37777/api/pending-queue
# Or use CLI tool which checks automatically
bun scripts/check-pending-queue.ts
```
### 2. Start with Low Session Limits
```bash
# Process only 5 sessions at a time
bun scripts/check-pending-queue.ts --process --limit 5
```
This prevents overwhelming the worker with too many concurrent SDK agents.
### 3. Monitor During Recovery
Watch worker logs while recovery runs:
```bash
npm run worker:logs
```
Look for:
- SDK agent starts: `Starting SDK agent for session...`
- Processing completions: `Processed observation...`
- Errors: `ERROR` or `Failed to process...`
### 4. Verify Recovery Success
Check recently processed messages:
```bash
curl http://localhost:37777/api/pending-queue | jq '.recentlyProcessed'
```
Or use the CLI tool which shows this automatically.
### 5. Handle Failed Messages
Messages that fail 3 times are marked `failed` and won't auto-retry:
```bash
# View failed messages
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT id, session_db_id, message_type, retry_count
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC;
"
```
You can manually reset them if needed:
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', retry_count = 0
  WHERE status = 'failed';
"
```
## Troubleshooting
### Recovery Not Working
**Symptom**: Triggered recovery but messages still pending
**Solutions**:
1. **Verify worker health**:
   ```bash
   curl http://localhost:37777/health
   ```
2. **Check worker logs for errors**:
   ```bash
   npm run worker:logs | grep -i error
   ```
3. **Restart worker**:
   ```bash
   claude-mem restart
   ```
4. **Check database integrity**:
   ```bash
   sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
   ```
### Messages Stuck Forever
**Symptom**: Messages show as "processing" for hours
**Solution**: Force reset stuck messages
```bash
# Reset every message in 'processing' back to pending
# (this includes any message that is still actively being processed)
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', started_processing_at_epoch = NULL
  WHERE status = 'processing';
"
# Then trigger recovery
bun scripts/check-pending-queue.ts --process
```
### Worker Crashes During Recovery
**Symptom**: Worker stops while processing recovered messages
**Solutions**:
1. **Check available memory**:
   ```bash
   npm run worker:status
   ```
2. **Reduce session limit**:
   ```bash
   bun scripts/check-pending-queue.ts --process --limit 3
   ```
3. **Check for SDK errors in logs**:
   ```bash
   npm run worker:logs | grep -i "sdk"
   ```
4. **Increase worker memory** (if using a custom Node.js runner; `--max-old-space-size` is a V8 option):
   ```bash
   export NODE_OPTIONS="--max-old-space-size=4096"
   claude-mem restart
   ```
## Advanced Usage
### Direct Database Inspection
View all pending messages:
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    status,
    retry_count,
    datetime(created_at_epoch/1000, 'unixepoch') as created_at,
    datetime(started_processing_at_epoch/1000, 'unixepoch') as started_at,
    CAST((strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 AS INTEGER) as age_minutes
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  ORDER BY created_at_epoch;
"
```
### Count Messages by Status
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT status, COUNT(*) as count
  FROM pending_messages
  GROUP BY status;
"
```
### Find Sessions with Pending Work
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    session_db_id,
    COUNT(*) as pending_count,
    GROUP_CONCAT(message_type) as message_types
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  GROUP BY session_db_id;
"
```
### View Recent Failures
```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    retry_count,
    datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC
  LIMIT 10;
"
```
## Integration Examples
### Cron Job for Automatic Recovery
```bash
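#!/bin/bash
# Run every hour to process stuck queues
# Cron does not start in the repository directory, so change into your
# claude-mem checkout first (placeholder path; adjust for your system)
cd /path/to/claude-mem || exit 1
# Check if worker is healthy
if curl -f http://localhost:37777/health > /dev/null 2>&1; then
  # Auto-process up to 5 sessions
  bun scripts/check-pending-queue.ts --process --limit 5
else
  echo "Worker not healthy, skipping recovery"
  exit 1
fi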
```
### Monitoring Script
```bash
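#!/bin/bash
# Alert if stuck count exceeds threshold
# Falls back to 0 when the worker is unreachable or the field is missing,
# so the numeric comparison below never sees an empty string
STUCK_COUNT=$(curl -sf http://localhost:37777/api/pending-queue | jq -r '.queue.stuckCount // 0')
if [ "${STUCK_COUNT:-0}" -gt 5 ]; then
  echo "WARNING: $STUCK_COUNT stuck messages detected"
  # Send alert (email, Slack, etc.)
fi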
```
### Pre-Shutdown Recovery
```bash
#!/bin/bash
# Process pending queues before system shutdown
echo "Processing pending queues before shutdown..."
bun scripts/check-pending-queue.ts --process --limit 20
echo "Waiting for processing to complete..."
sleep 10
echo "Stopping worker..."
claude-mem stop
```
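A fixed `sleep` cannot tell whether processing has actually finished. One alternative is to poll the queue counts until both reach zero, sketched below in TypeScript. The helper names (`isDrained`, `waitForDrain`) and the injected `getCounts` callback are hypothetical; in practice `getCounts` would request `GET /api/pending-queue` and return the `totalPending` and `totalProcessing` values from its `queue` object.

```typescript
// Subset of the `queue` object returned by GET /api/pending-queue.
interface QueueCounts {
  totalPending: number;
  totalProcessing: number;
}

// The queue is drained when nothing is waiting and nothing is in flight.
function isDrained(q: QueueCounts): boolean {
  return q.totalPending === 0 && q.totalProcessing === 0;
}

// Poll until the queue drains or the attempts run out.
// Resolves to true if drained, false if it gave up.
async function waitForDrain(
  getCounts: () => Promise<QueueCounts>,
  maxAttempts: number,
  delayMs: number
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (isDrained(await getCounts())) {
      return true;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}

// Demo with a fake status source that reports 2, then 1, then 0 pending messages.
let fakePending = 2;
const fakeCounts = async (): Promise<QueueCounts> => ({
  totalPending: fakePending--,
  totalProcessing: 0,
});

// Resolves to true on the third poll.
waitForDrain(fakeCounts, 5, 10).then((drained) => console.log("drained:", drained));
```

Capping the number of attempts keeps a shutdown script from hanging indefinitely if the worker has stalled.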
## Migration Note
If you're upgrading from v4.x to v5.x:
**v4.x Behavior** (Automatic Recovery):
- Worker automatically recovered stuck messages on startup
- No user control over reprocessing timing
**v5.x Behavior** (Manual Recovery):
- Stuck messages detected but NOT automatically reprocessed
- User must explicitly trigger recovery via CLI or API
- Prevents unexpected duplicate observations
- Provides explicit control over when processing happens
**Migration Steps**:
1. Upgrade to v5.x
2. Check for stuck messages: `bun scripts/check-pending-queue.ts`
3. Process if needed: `bun scripts/check-pending-queue.ts --process`
4. Add recovery to your workflow (cron job, pre-shutdown script, etc.)
## See Also
- [Worker Service Architecture](../architecture/worker-service) - Technical details on queue processing
- [Troubleshooting - Manual Recovery](../troubleshooting#manual-recovery-for-stuck-observations) - Common issues and solutions
- [Database Schema](../architecture/database) - Pending messages table structure