---
title: "Manual Recovery"
description: "Recover stuck observations after worker crashes or restarts"
---

# Manual Recovery Guide

## Overview

Claude-mem's manual recovery system helps you recover observations that get stuck in the processing queue after worker crashes, system restarts, or unexpected shutdowns.

**Key Change in v5.x**: The worker no longer reprocesses stuck observations automatically on startup. You decide when reprocessing happens, which prevents unexpected duplicate observations.

## When Do You Need Manual Recovery?

You should trigger manual recovery when:

- **Worker crashed or restarted** - Observations were queued, but the worker stopped before processing them
- **No new summaries appearing** - Observations are being saved but not processed into summaries
- **Stuck messages detected** - Messages have shown as "processing" for more than 5 minutes
- **System crashes** - Unexpected shutdowns left messages in incomplete states

## Quick Start

### Using the CLI Tool (Recommended)

The interactive CLI tool is the safest and easiest way to recover stuck observations:

```bash
# Check status and prompt for recovery
bun scripts/check-pending-queue.ts
```

This will:
1. Check worker health
2. Show queue summary (pending, processing, failed, stuck counts)
3. Display sessions with pending work
4. Prompt you to confirm recovery
5. Show recently processed messages for feedback

### Auto-Process Without Prompts

For scripting or when you're confident recovery is needed:

```bash
# Auto-process without prompting
bun scripts/check-pending-queue.ts --process

# Limit to 5 sessions
bun scripts/check-pending-queue.ts --process --limit 5
```

## Understanding Queue States

Messages progress through these lifecycle states:

1. **pending** → Queued, waiting to process
2. **processing** → Currently being processed by an SDK agent
3. **processed** → Completed successfully
4. **failed** → Failed after 3 retry attempts
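
The lifecycle above can be written down as a small state model. The following TypeScript sketch is illustrative only: `MessageStatus`, `MAX_RETRIES`, and `nextStatusAfterAttempt` are hypothetical names, and treating the retry budget as "3 failed attempts, then `failed`" is an assumption based on the list above rather than the worker's confirmed logic.

```typescript
// The four lifecycle states listed above.
type MessageStatus = "pending" | "processing" | "processed" | "failed";

// Assumption: a message is given up on after 3 failed attempts.
const MAX_RETRIES = 3;

// Hypothetical helper: the state a `processing` message moves to when an attempt ends.
// `failedAttempts` counts failures so far, including the one that just happened.
function nextStatusAfterAttempt(succeeded: boolean, failedAttempts: number): MessageStatus {
  if (succeeded) return "processed";
  return failedAttempts >= MAX_RETRIES ? "failed" : "pending";
}
```

Under these assumptions, a message only leaves the queue through `processed` or `failed`; every other outcome returns it to `pending`.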

### Stuck Detection

Messages in `processing` state for **>5 minutes** are considered stuck:

- They're automatically reset to `pending` on worker startup
- They're NOT automatically reprocessed (requires manual trigger)
- They appear in the `stuckCount` field when checking queue status
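
The stuck rule is a simple age check. Here is a minimal TypeScript sketch of it, assuming timestamps are epoch milliseconds as in the `pending_messages` columns used elsewhere in this guide. The names `STUCK_THRESHOLD_MS`, `ProcessingTiming`, and `isStuck` are hypothetical and not part of claude-mem.

```typescript
// Threshold from the rule above: more than 5 minutes in `processing`.
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;

// Only the two columns the check needs; names follow the pending_messages table.
interface ProcessingTiming {
  status: string;
  started_processing_at_epoch: number | null; // milliseconds since the Unix epoch
}

// Hypothetical helper: true when a message has been `processing` longer than the threshold.
function isStuck(msg: ProcessingTiming, nowMs: number = Date.now()): boolean {
  if (msg.status !== "processing" || msg.started_processing_at_epoch === null) {
    return false;
  }
  return nowMs - msg.started_processing_at_epoch > STUCK_THRESHOLD_MS;
}
```

With this check, a message that started processing 2 minutes ago is not stuck, while one that started 7 minutes ago is.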

## Recovery Methods

### Method 1: Interactive CLI Tool

**Best for**: Regular users, interactive sessions, when you want visibility into what's happening

```bash
bun scripts/check-pending-queue.ts
```

**Example Output**:
```
Checking worker health...
Worker is healthy ✓

Queue Summary:
  Pending: 12 messages
  Processing: 2 messages (1 stuck)
  Failed: 0 messages
  Recently Processed: 5 messages in last 30 minutes

Sessions with pending work: 3
  Session 44: 5 pending, 1 processing (age: 2m)
  Session 45: 4 pending, 1 processing (age: 7m - STUCK)
  Session 46: 2 pending

Would you like to process these pending queues? (y/n)
```

**Features**:
- ✅ Pre-flight health check (verifies worker is running)
- ✅ Detailed queue breakdown by session
- ✅ Age tracking for stuck detection
- ✅ Confirmation prompt (prevents accidental reprocessing)
- ✅ Non-interactive mode with `--process` flag
- ✅ Session limit control with `--limit N`

### Method 2: HTTP API

**Best for**: Automation, scripting, integration with monitoring systems

#### Check Queue Status

```bash
curl http://localhost:37777/api/pending-queue
```

**Response**:
```json
{
  "queue": {
    "messages": [
      {
        "id": 123,
        "session_db_id": 45,
        "claude_session_id": "abc123",
        "message_type": "observation",
        "status": "pending",
        "retry_count": 0,
        "created_at_epoch": 1730886600000
      }
    ],
    "totalPending": 12,
    "totalProcessing": 2,
    "totalFailed": 0,
    "stuckCount": 1
  },
  "recentlyProcessed": [...],
  "sessionsWithPendingWork": [44, 45, 46]
}
```

**Key Fields**:
- `totalPending` - Messages waiting to process
- `totalProcessing` - Messages currently processing
- `stuckCount` - Processing messages >5 minutes old
- `sessionsWithPendingWork` - Session IDs needing recovery
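
For scripts that consume this response, it can help to type the fields and turn them into a yes/no decision. The sketch below covers only the fields shown in the sample response. `PendingQueueStatus` and `shouldTriggerRecovery` are hypothetical names, and the decision policy is an assumption you should adapt, not claude-mem's own rule.

```typescript
// Shape of the fields shown in the sample response; other fields may exist.
interface PendingQueueStatus {
  queue: {
    totalPending: number;
    totalProcessing: number;
    totalFailed: number;
    stuckCount: number;
  };
  sessionsWithPendingWork: number[];
}

// Hypothetical policy: suggest recovery when anything is stuck, or when
// sessions have pending work and nothing is currently processing.
function shouldTriggerRecovery(status: PendingQueueStatus): boolean {
  const { stuckCount, totalProcessing } = status.queue;
  if (stuckCount > 0) return true;
  return status.sessionsWithPendingWork.length > 0 && totalProcessing === 0;
}
```

In a bun script, the status object could come from `fetch("http://localhost:37777/api/pending-queue")` followed by `.json()`. The policy deliberately stays quiet while the worker is actively processing and nothing is stuck.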

#### Trigger Recovery

```bash
curl -X POST http://localhost:37777/api/pending-queue/process \
  -H "Content-Type: application/json" \
  -d '{"sessionLimit": 10}'
```

**Response**:
```json
{
  "success": true,
  "totalPendingSessions": 15,
  "sessionsStarted": 10,
  "sessionsSkipped": 2,
  "startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
}
```

**Response Fields**:
- `totalPendingSessions` - Total sessions with pending messages in database
- `sessionsStarted` - Sessions this request started processing
- `sessionsSkipped` - Sessions already processing (prevents duplicate agents)
- `startedSessionIds` - Database IDs of the sessions this request started
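
When recovering in batches, a script may want to know how many sessions a request left untouched. The sketch below derives that from the response fields. `ProcessQueueResult` and `sessionsLeftByLimit` are hypothetical names, and the arithmetic assumes that skipped sessions are counted in `totalPendingSessions`, which this guide does not confirm.

```typescript
// Shape of the fields shown in the sample response.
interface ProcessQueueResult {
  success: boolean;
  totalPendingSessions: number;
  sessionsStarted: number;
  sessionsSkipped: number;
  startedSessionIds: number[];
}

// Assumption: skipped sessions are included in totalPendingSessions, so whatever
// was neither started nor skipped was left out by the session limit.
function sessionsLeftByLimit(result: ProcessQueueResult): number {
  const left = result.totalPendingSessions - result.sessionsStarted - result.sessionsSkipped;
  return Math.max(0, left);
}
```

For the sample response above (15 pending sessions, 10 started, 2 skipped), this yields 3 sessions still waiting for a later request.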

## Best Practices

### 1. Always Check Before Recovery

```bash
# Check queue status first
curl http://localhost:37777/api/pending-queue

# Or use CLI tool which checks automatically
bun scripts/check-pending-queue.ts
```

### 2. Start with Low Session Limits

```bash
# Process only 5 sessions at a time
bun scripts/check-pending-queue.ts --process --limit 5
```

This prevents overwhelming the worker with too many concurrent SDK agents.

### 3. Monitor During Recovery

Watch worker logs while recovery runs:

```bash
npm run worker:logs
```

Look for:
- SDK agent starts: `Starting SDK agent for session...`
- Processing completions: `Processed observation...`
- Errors: `ERROR` or `Failed to process...`
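
If you capture the log output, the three patterns above can be tallied mechanically. The following sketch matches only the substrings listed above; `RecoveryLogEvent`, `classifyLogLine`, and `tallyLogEvents` are hypothetical names, and the exact log format beyond those substrings is not assumed.

```typescript
type RecoveryLogEvent = "agentStart" | "processed" | "error" | "other";

// Classify one log line using the substrings listed above.
function classifyLogLine(line: string): RecoveryLogEvent {
  // Check errors first so a failure line is never counted as a success.
  if (line.includes("ERROR") || line.includes("Failed to process")) return "error";
  if (line.includes("Starting SDK agent for session")) return "agentStart";
  if (line.includes("Processed observation")) return "processed";
  return "other";
}

// Count how many lines fall into each category.
function tallyLogEvents(lines: string[]): Record<RecoveryLogEvent, number> {
  const counts: Record<RecoveryLogEvent, number> = { agentStart: 0, processed: 0, error: 0, other: 0 };
  for (const line of lines) {
    counts[classifyLogLine(line)] += 1;
  }
  return counts;
}
```

A healthy recovery run should show `processed` rising while `error` stays at zero.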

### 4. Verify Recovery Success

Check recently processed messages:

```bash
curl http://localhost:37777/api/pending-queue | jq '.recentlyProcessed'
```

Or use the CLI tool which shows this automatically.

### 5. Handle Failed Messages

Messages that fail 3 times are marked `failed` and won't auto-retry:

```bash
# View failed messages
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT id, session_db_id, message_type, retry_count
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC;
"
```

You can manually reset them if needed:

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', retry_count = 0
  WHERE status = 'failed';
"
```

## Troubleshooting

### Recovery Not Working

**Symptom**: Triggered recovery but messages still pending

**Solutions**:

1. **Verify worker health**:
   ```bash
   curl http://localhost:37777/health
   ```

2. **Check worker logs for errors**:
   ```bash
   npm run worker:logs | grep -i error
   ```

3. **Restart worker**:
   ```bash
   claude-mem restart
   ```

4. **Check database integrity**:
   ```bash
   sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
   ```

### Messages Stuck Forever

**Symptom**: Messages show as "processing" for hours

**Solution**: Force reset stuck messages

```bash
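# Reset messages stuck in processing for more than 5 minutes
# (leaves messages that are actively being processed untouched)
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', started_processing_at_epoch = NULL
  WHERE status = 'processing'
    AND started_processing_at_epoch < (strftime('%s', 'now') - 300) * 1000;
"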

# Then trigger recovery
bun scripts/check-pending-queue.ts --process
```

### Worker Crashes During Recovery

**Symptom**: Worker stops while processing recovered messages

**Solutions**:

1. **Check available memory**:
   ```bash
   npm run worker:status
   ```

2. **Reduce session limit**:
   ```bash
   bun scripts/check-pending-queue.ts --process --limit 3
   ```

3. **Check for SDK errors in logs**:
   ```bash
   npm run worker:logs | grep -i "sdk"
   ```

4. **Increase worker memory** (if using custom runner):
   ```bash
   export NODE_OPTIONS="--max-old-space-size=4096"
   claude-mem restart
   ```

## Advanced Usage

### Direct Database Inspection

View all pending messages:

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    status,
    retry_count,
    datetime(created_at_epoch/1000, 'unixepoch') as created_at,
    datetime(started_processing_at_epoch/1000, 'unixepoch') as started_at,
    CAST((strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 AS INTEGER) as age_minutes
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  ORDER BY created_at_epoch;
"
```

### Count Messages by Status

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT status, COUNT(*) as count
  FROM pending_messages
  GROUP BY status;
"
```

### Find Sessions with Pending Work

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    session_db_id,
    COUNT(*) as pending_count,
    GROUP_CONCAT(message_type) as message_types
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  GROUP BY session_db_id;
"
```

### View Recent Failures

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    retry_count,
    datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC
  LIMIT 10;
"
```

## Integration Examples

### Cron Job for Automatic Recovery

```bash
#!/bin/bash
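# Run every hour to process stuck queues

# Cron starts jobs in $HOME, so change to the claude-mem checkout first.
# CLAUDE_MEM_DIR is a placeholder you set yourself, not a claude-mem setting.
cd "${CLAUDE_MEM_DIR:?set CLAUDE_MEM_DIR to your claude-mem checkout}" || exit 1
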
# Check if worker is healthy
if curl -f http://localhost:37777/health > /dev/null 2>&1; then
  # Auto-process up to 5 sessions
  bun scripts/check-pending-queue.ts --process --limit 5
else
  echo "Worker not healthy, skipping recovery"
  exit 1
fi
```

### Monitoring Script

```bash
#!/bin/bash
# Alert if stuck count exceeds threshold

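# Empty if the worker is unreachable or the field is missing
STUCK_COUNT=$(curl -sf http://localhost:37777/api/pending-queue | jq -r '.queue.stuckCount // empty')

if [ -z "$STUCK_COUNT" ]; then
  echo "WARNING: could not read queue status (is the worker running?)"
elif [ "$STUCK_COUNT" -gt 5 ]; then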
  echo "WARNING: $STUCK_COUNT stuck messages detected"
  # Send alert (email, Slack, etc.)
fi
```

### Pre-Shutdown Recovery

```bash
#!/bin/bash
# Process pending queues before system shutdown

echo "Processing pending queues before shutdown..."
bun scripts/check-pending-queue.ts --process --limit 20

echo "Waiting for processing to complete..."
sleep 10

echo "Stopping worker..."
claude-mem stop
```
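
The fixed `sleep 10` in the script above may be too short for a large queue. One alternative is to poll the queue status until it drains or a timeout expires. The sketch below shows only the decision logic such a loop would need. `QueueCounts`, `isQueueDrained`, and `shouldKeepWaiting` are hypothetical names, and treating "no pending and no processing messages" as drained is an assumption.

```typescript
// The two counters from the queue status response that matter here.
interface QueueCounts {
  totalPending: number;
  totalProcessing: number;
}

// Assumption: the queue is drained when nothing is pending or processing.
function isQueueDrained(counts: QueueCounts): boolean {
  return counts.totalPending === 0 && counts.totalProcessing === 0;
}

// Hypothetical guard for a polling loop: keep waiting until drained or timed out.
function shouldKeepWaiting(counts: QueueCounts, elapsedMs: number, timeoutMs: number): boolean {
  return !isQueueDrained(counts) && elapsedMs < timeoutMs;
}
```

A shutdown script could fetch `/api/pending-queue` every few seconds, pass `queue.totalPending` and `queue.totalProcessing` to `shouldKeepWaiting`, and stop the worker once it returns `false`.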

## Migration Note

If you're upgrading from v4.x to v5.x:

**v4.x Behavior** (Automatic Recovery):
- Worker automatically recovered stuck messages on startup
- No user control over reprocessing timing

**v5.x Behavior** (Manual Recovery):
- Stuck messages detected but NOT automatically reprocessed
- User must explicitly trigger recovery via CLI or API
- Prevents unexpected duplicate observations
- Provides explicit control over when processing happens

**Migration Steps**:
1. Upgrade to v5.x
2. Check for stuck messages: `bun scripts/check-pending-queue.ts`
3. Process if needed: `bun scripts/check-pending-queue.ts --process`
4. Add recovery to your workflow (cron job, pre-shutdown script, etc.)

## See Also

- [Worker Service Architecture](../architecture/worker-service) - Technical details on queue processing
- [Troubleshooting - Manual Recovery](../troubleshooting#manual-recovery-for-stuck-observations) - Common issues and solutions
- [Database Schema](../architecture/database) - Pending messages table structure