mirror of
https://github.com/thedotmack/claude-mem
synced 2026-04-25 17:15:04 +02:00
---
title: "Manual Recovery"
description: "Recover stuck observations after worker crashes or restarts"
---

# Manual Recovery Guide

## Overview

Claude-mem's manual recovery system helps you recover observations that get stuck in the processing queue after worker crashes, system restarts, or unexpected shutdowns.

**Key Change in v5.x**: Automatic recovery on worker startup is now disabled. This gives you explicit control over when reprocessing happens, preventing unexpected duplicate observations.

## When Do You Need Manual Recovery?

You should trigger manual recovery when:

- **Worker crashed or restarted** - Observations were queued but the worker stopped before processing them
- **No new summaries appearing** - Observations are being saved but not processed into summaries
- **Stuck messages detected** - Messages showing as "processing" for more than 5 minutes
- **System crashes** - Unexpected shutdowns left messages in incomplete states

## Quick Start

### Using the CLI Tool (Recommended)

The interactive CLI tool is the safest and easiest way to recover stuck observations:

```bash
# Check status and prompt for recovery
bun scripts/check-pending-queue.ts
```

This will:
1. Check worker health
2. Show queue summary (pending, processing, failed, stuck counts)
3. Display sessions with pending work
4. Prompt you to confirm recovery
5. Show recently processed messages for feedback

### Auto-Process Without Prompts

For scripting or when you're confident recovery is needed:

```bash
# Auto-process without prompting
bun scripts/check-pending-queue.ts --process

# Limit to 5 sessions
bun scripts/check-pending-queue.ts --process --limit 5
```

## Understanding Queue States

Messages progress through these lifecycle states:

1. **pending** → Queued, waiting to process
2. **processing** → Currently being processed by SDK agent
3. **processed** → Completed successfully
4. **failed** → Failed after 3 retry attempts

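
The lifecycle above can be pictured as a small state machine. The sketch below is illustrative only: the names `QueueStatus`, `MAX_RETRIES`, `nextStatusAfterFailure`, and `canTransition` are hypothetical and do not come from the claude-mem source. Only the four state names and the three-attempt limit are taken from this guide, and the assumption that a failed attempt returns the message to `pending` until the limit is reached is ours.

```typescript
// Hypothetical model of the queue lifecycle described in this guide.
// State names and the 3-attempt limit come from the guide; the
// transition details below are assumptions made for illustration.
type QueueStatus = "pending" | "processing" | "processed" | "failed";

const MAX_RETRIES = 3;

// Status a message moves to when a processing attempt fails.
// retryCount is the number of attempts that had already failed.
function nextStatusAfterFailure(retryCount: number): QueueStatus {
  const failedAttempts = retryCount + 1;
  return failedAttempts >= MAX_RETRIES ? "failed" : "pending";
}

// Transitions this sketch treats as legal.
const allowedTransitions: Record<QueueStatus, QueueStatus[]> = {
  pending: ["processing"],
  processing: ["processed", "pending", "failed"],
  processed: [],
  failed: ["pending"], // only through a manual reset
};

function canTransition(from: QueueStatus, to: QueueStatus): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}
```

Under these assumptions, `nextStatusAfterFailure(2)` yields `failed` because the third attempt has just failed, while a `processed` message has no outgoing transitions.
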

### Stuck Detection

Messages in `processing` state for **>5 minutes** are considered stuck:

- They're automatically reset to `pending` on worker startup
- They're NOT automatically reprocessed (requires manual trigger)
- They appear in the `stuckCount` field when checking queue status

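
The stuck rule can be expressed as a simple age check. This is a minimal sketch, not the worker's actual implementation: `isStuck`, `countStuck`, and the `QueueMessage` interface are hypothetical names, and the field `started_processing_at_epoch` is assumed to hold epoch milliseconds, as the SQL queries later in this guide suggest.

```typescript
// Threshold taken from this guide: more than 5 minutes in "processing".
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;

// Minimal shape for this sketch; field names mirror the
// pending_messages columns used in the SQL examples of this guide.
interface QueueMessage {
  id: number;
  status: string;
  started_processing_at_epoch: number | null;
}

// True when the message has been in "processing" longer than the threshold.
function isStuck(message: QueueMessage, nowMs: number): boolean {
  if (message.status !== "processing") return false;
  if (message.started_processing_at_epoch === null) return false;
  return nowMs - message.started_processing_at_epoch > STUCK_THRESHOLD_MS;
}

// Number of stuck messages in a list, evaluated at a given time.
function countStuck(messages: QueueMessage[], nowMs: number): number {
  return messages.filter((m) => isStuck(m, nowMs)).length;
}
```

Passing the current time in as `nowMs` (for example `Date.now()`) keeps the check deterministic and easy to test.
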

## Recovery Methods

### Method 1: Interactive CLI Tool

**Best for**: Regular users, interactive sessions, when you want visibility into what's happening

```bash
bun scripts/check-pending-queue.ts
```

**Example Output**:
```
Checking worker health...
Worker is healthy ✓

Queue Summary:
  Pending: 12 messages
  Processing: 2 messages (1 stuck)
  Failed: 0 messages
  Recently Processed: 5 messages in last 30 minutes

Sessions with pending work: 3
  Session 44: 5 pending, 1 processing (age: 2m)
  Session 45: 4 pending, 1 processing (age: 7m - STUCK)
  Session 46: 2 pending

Would you like to process these pending queues? (y/n)
```

**Features**:
- ✅ Pre-flight health check (verifies worker is running)
- ✅ Detailed queue breakdown by session
- ✅ Age tracking for stuck detection
- ✅ Confirmation prompt (prevents accidental reprocessing)
- ✅ Non-interactive mode with `--process` flag
- ✅ Session limit control with `--limit N`

### Method 2: HTTP API

**Best for**: Automation, scripting, integration with monitoring systems

#### Check Queue Status

```bash
curl http://localhost:37777/api/pending-queue
```

**Response**:
```json
{
  "queue": {
    "messages": [
      {
        "id": 123,
        "session_db_id": 45,
        "claude_session_id": "abc123",
        "message_type": "observation",
        "status": "pending",
        "retry_count": 0,
        "created_at_epoch": 1730886600000
      }
    ],
    "totalPending": 12,
    "totalProcessing": 2,
    "totalFailed": 0,
    "stuckCount": 1
  },
  "recentlyProcessed": [...],
  "sessionsWithPendingWork": [44, 45, 46]
}
```

**Key Fields**:
- `totalPending` - Messages waiting to process
- `totalProcessing` - Messages currently processing
- `stuckCount` - Processing messages >5 minutes old
- `sessionsWithPendingWork` - Session IDs needing recovery

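
For scripts that consume this endpoint, it can help to give the response a type. The interface below is inferred from the example response in this guide and is not an official type definition, so it may omit fields. `needsRecovery` and `summarize` are hypothetical helpers, and treating "any pending or stuck message" as the trigger for recovery is an assumption.

```typescript
// Shape inferred from the example response in this guide.
// Not an official type; fields may be missing.
interface PendingQueueStatus {
  queue: {
    totalPending: number;
    totalProcessing: number;
    totalFailed: number;
    stuckCount: number;
  };
  sessionsWithPendingWork: number[];
}

// Hypothetical rule of thumb: recovery is worth triggering when
// anything is waiting or stuck.
function needsRecovery(status: PendingQueueStatus): boolean {
  return status.queue.totalPending > 0 || status.queue.stuckCount > 0;
}

// One-line, human-readable summary of the queue counts.
function summarize(status: PendingQueueStatus): string {
  const q = status.queue;
  return (
    `${q.totalPending} pending, ${q.totalProcessing} processing ` +
    `(${q.stuckCount} stuck), ${q.totalFailed} failed, ` +
    `${status.sessionsWithPendingWork.length} sessions with pending work`
  );
}
```

Applied to the example response, `summarize` returns `12 pending, 2 processing (1 stuck), 0 failed, 3 sessions with pending work`.
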

#### Trigger Recovery

```bash
curl -X POST http://localhost:37777/api/pending-queue/process \
  -H "Content-Type: application/json" \
  -d '{"sessionLimit": 10}'
```

**Response**:
```json
{
  "success": true,
  "totalPendingSessions": 15,
  "sessionsStarted": 10,
  "sessionsSkipped": 2,
  "startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
}
```

**Response Fields**:
- `totalPendingSessions` - Total sessions with pending messages in database
- `sessionsStarted` - Sessions started by this request
- `sessionsSkipped` - Sessions already processing (prevents duplicate agents)
- `startedSessionIds` - Database IDs of the sessions started by this request

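
A script can use these fields to decide whether a follow-up request is needed. The sketch below assumes that sessions neither started nor skipped by a request are still waiting; the guide does not state this explicitly, so treat it as an interpretation. `ProcessQueueResult` and `sessionsStillWaiting` are hypothetical names.

```typescript
// Shape inferred from the example response in this guide.
interface ProcessQueueResult {
  success: boolean;
  totalPendingSessions: number;
  sessionsStarted: number;
  sessionsSkipped: number;
  startedSessionIds: number[];
}

// Assumption: any session that was neither started nor skipped by this
// request is still waiting and needs another recovery request.
function sessionsStillWaiting(result: ProcessQueueResult): number {
  const remaining =
    result.totalPendingSessions - result.sessionsStarted - result.sessionsSkipped;
  return Math.max(0, remaining);
}
```

For the example response (15 pending, 10 started, 2 skipped), this leaves 3 sessions for a later request.
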

## Best Practices

### 1. Always Check Before Recovery

```bash
# Check queue status first
curl http://localhost:37777/api/pending-queue

# Or use CLI tool which checks automatically
bun scripts/check-pending-queue.ts
```

### 2. Start with Low Session Limits

```bash
# Process only 5 sessions at a time
bun scripts/check-pending-queue.ts --process --limit 5
```

This prevents overwhelming the worker with too many concurrent SDK agents.

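
When working in small batches, it is useful to know roughly how many recovery runs to expect. The helper below is a back-of-the-envelope sketch with a hypothetical name (`roundsNeeded`); it assumes each run starts at most `sessionLimit` sessions, that no new sessions become pending in the meantime, and that none are skipped.

```typescript
// Hypothetical helper: number of recovery runs needed when every run
// starts at most sessionLimit sessions. Assumes the backlog does not
// grow between runs and that no sessions are skipped.
function roundsNeeded(totalPendingSessions: number, sessionLimit: number): number {
  if (sessionLimit < 1 || Math.floor(sessionLimit) !== sessionLimit) {
    throw new Error("sessionLimit must be a positive integer");
  }
  if (totalPendingSessions <= 0) return 0;
  return Math.ceil(totalPendingSessions / sessionLimit);
}
```

With 15 pending sessions and `--limit 5`, this gives three runs; with 12 pending sessions it also gives three, the last one partially filled.
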

### 3. Monitor During Recovery

Watch worker logs while recovery runs:

```bash
npm run worker:logs
```

Look for:
- SDK agent starts: `Starting SDK agent for session...`
- Processing completions: `Processed observation...`
- Errors: `ERROR` or `Failed to process...`

### 4. Verify Recovery Success

Check recently processed messages:

```bash
curl http://localhost:37777/api/pending-queue | jq '.recentlyProcessed'
```

Or use the CLI tool which shows this automatically.

### 5. Handle Failed Messages

Messages that fail 3 times are marked `failed` and won't auto-retry:

```bash
# View failed messages
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT id, session_db_id, message_type, retry_count
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC;
"
```

You can manually reset them if needed:

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', retry_count = 0
  WHERE status = 'failed';
"
```

## Troubleshooting

### Recovery Not Working

**Symptom**: Triggered recovery but messages still pending

**Solutions**:

1. **Verify worker health**:
   ```bash
   curl http://localhost:37777/health
   ```

2. **Check worker logs for errors**:
   ```bash
   npm run worker:logs | grep -i error
   ```

3. **Restart worker**:
   ```bash
   claude-mem restart
   ```

4. **Check database integrity**:
   ```bash
   sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
   ```

### Messages Stuck Forever

**Symptom**: Messages show as "processing" for hours

**Solution**: Force reset stuck messages

```bash
# Reset every message in the processing state to pending
# (this is not limited to messages older than 5 minutes)
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', started_processing_at_epoch = NULL
  WHERE status = 'processing';
"

# Then trigger recovery
bun scripts/check-pending-queue.ts --process
```

### Worker Crashes During Recovery

**Symptom**: Worker stops while processing recovered messages

**Solutions**:

1. **Check available memory**:
   ```bash
   npm run worker:status
   ```

2. **Reduce session limit**:
   ```bash
   bun scripts/check-pending-queue.ts --process --limit 3
   ```

3. **Check for SDK errors in logs**:
   ```bash
   npm run worker:logs | grep -i "sdk"
   ```

4. **Increase worker memory** (if using custom runner):
   ```bash
   export NODE_OPTIONS="--max-old-space-size=4096"
   claude-mem restart
   ```

## Advanced Usage

### Direct Database Inspection

View all pending messages:

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    status,
    retry_count,
    datetime(created_at_epoch/1000, 'unixepoch') as created_at,
    datetime(started_processing_at_epoch/1000, 'unixepoch') as started_at,
    CAST((strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 AS INTEGER) as age_minutes
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  ORDER BY created_at_epoch;
"
```

### Count Messages by Status

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT status, COUNT(*) as count
  FROM pending_messages
  GROUP BY status;
"
```

### Find Sessions with Pending Work

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    session_db_id,
    COUNT(*) as pending_count,
    GROUP_CONCAT(message_type) as message_types
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  GROUP BY session_db_id;
"
```

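
The same grouping can be done in a script on the `messages` array returned by `GET /api/pending-queue`, without touching the database. This is a sketch under assumptions: `PendingMessage`, `SessionWork`, and `groupPendingBySession` are hypothetical names, and the message fields are taken from the example API response in this guide.

```typescript
// Subset of the message fields shown in the example API response.
interface PendingMessage {
  session_db_id: number;
  message_type: string;
  status: string;
}

interface SessionWork {
  sessionDbId: number;
  pendingCount: number;
  messageTypes: string[];
}

// Groups pending and processing messages by session, mirroring the
// GROUP BY query: one entry per session, in first-seen order.
function groupPendingBySession(messages: PendingMessage[]): SessionWork[] {
  const index: { [sessionDbId: number]: SessionWork } = {};
  const result: SessionWork[] = [];
  for (const m of messages) {
    if (m.status !== "pending" && m.status !== "processing") continue;
    if (!index[m.session_db_id]) {
      const entry: SessionWork = {
        sessionDbId: m.session_db_id,
        pendingCount: 0,
        messageTypes: [],
      };
      index[m.session_db_id] = entry;
      result.push(entry);
    }
    index[m.session_db_id].pendingCount += 1;
    index[m.session_db_id].messageTypes.push(m.message_type);
  }
  return result;
}
```

Messages in any other state, such as `processed` or `failed`, are ignored, just as the `WHERE` clause excludes them.
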

### View Recent Failures

```bash
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    retry_count,
    datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC
  LIMIT 10;
"
```

## Integration Examples

### Cron Job for Automatic Recovery

```bash
#!/bin/bash
# Run every hour to process stuck queues
# Run from the claude-mem repository root so the relative script path resolves

# Check if worker is healthy
if curl -f http://localhost:37777/health > /dev/null 2>&1; then
  # Auto-process up to 5 sessions
  bun scripts/check-pending-queue.ts --process --limit 5
else
  echo "Worker not healthy, skipping recovery"
  exit 1
fi
```

### Monitoring Script

```bash
#!/bin/bash
# Alert if stuck count exceeds threshold

# -r prints the raw number; "// 0" falls back to 0 if the field is missing
STUCK_COUNT=$(curl -s http://localhost:37777/api/pending-queue | jq -r '.queue.stuckCount // 0')

# Default to 0 if curl or jq produced no output (for example, worker unreachable)
if [ "${STUCK_COUNT:-0}" -gt 5 ]; then
  echo "WARNING: $STUCK_COUNT stuck messages detected"
  # Send alert (email, Slack, etc.)
fi
```

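
A monitoring check usually needs to distinguish "queue is fine" from "could not read the queue at all". The sketch below shows one way to classify the raw `stuckCount` value before alerting. The function name `evaluateStuckCount` and the three result labels are hypothetical; the threshold of 5 matches the script above.

```typescript
type QueueHealth = "ok" | "alert" | "unknown";

// Classifies a raw stuckCount value (for example, the text printed by jq).
// Anything that is not a non-negative integer is reported as "unknown"
// so that an unreachable worker is not mistaken for a healthy queue.
function evaluateStuckCount(raw: string, threshold: number): QueueHealth {
  const trimmed = raw.trim();
  if (!/^\d+$/.test(trimmed)) return "unknown";
  const stuckCount = parseInt(trimmed, 10);
  return stuckCount > threshold ? "alert" : "ok";
}
```

With a threshold of 5, the input `"7"` is classified as `alert`, `"5"` as `ok`, and an empty string or `"null"` as `unknown`.
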

### Pre-Shutdown Recovery

```bash
#!/bin/bash
# Process pending queues before system shutdown

echo "Processing pending queues before shutdown..."
bun scripts/check-pending-queue.ts --process --limit 20

echo "Waiting for processing to complete..."
sleep 10

echo "Stopping worker..."
claude-mem stop
```

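
A fixed `sleep 10` may be too short for a large backlog or needlessly long for an empty one. An alternative is to poll the queue until it is empty or a retry budget runs out. The following is a sketch, not part of claude-mem: `waitForQueueToDrain` is a hypothetical function, and `getCounts` and `sleep` are injected placeholders that a real script would implement, for example by calling `GET /api/pending-queue` and wrapping a timer.

```typescript
interface QueueCounts {
  totalPending: number;
  totalProcessing: number;
}

// Polls until no messages are pending or processing, or until
// maxAttempts polls have been made. getCounts and sleep are injected
// so the sketch does not depend on a specific HTTP client or runtime.
async function waitForQueueToDrain(
  getCounts: () => Promise<QueueCounts>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts: number,
  intervalMs: number
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const counts = await getCounts();
    if (counts.totalPending === 0 && counts.totalProcessing === 0) {
      return true; // nothing left to process
    }
    if (attempt < maxAttempts - 1) {
      await sleep(intervalMs); // wait before the next poll
    }
  }
  return false; // gave up before the queue emptied
}
```

The function resolves to `true` once the queue is empty and to `false` if the budget is exhausted, which lets a shutdown script decide whether to stop the worker anyway or report the leftover work.
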

## Migration Note

If you're upgrading from v4.x to v5.x:

**v4.x Behavior** (Automatic Recovery):
- Worker automatically recovered stuck messages on startup
- No user control over reprocessing timing

**v5.x Behavior** (Manual Recovery):
- Stuck messages detected but NOT automatically reprocessed
- User must explicitly trigger recovery via CLI or API
- Prevents unexpected duplicate observations
- Provides explicit control over when processing happens

**Migration Steps**:
1. Upgrade to v5.x
2. Check for stuck messages: `bun scripts/check-pending-queue.ts`
3. Process if needed: `bun scripts/check-pending-queue.ts --process`
4. Add recovery to your workflow (cron job, pre-shutdown script, etc.)

## See Also

- [Worker Service Architecture](../architecture/worker-service) - Technical details on queue processing
- [Troubleshooting - Manual Recovery](../troubleshooting#manual-recovery-for-stuck-observations) - Common issues and solutions
- [Database Schema](../architecture/database) - Pending messages table structure