mirror of
https://github.com/thedotmack/claude-mem
synced 2026-04-25 17:15:04 +02:00
docs: add architecture overview and production guide
Architecture overview covers the 4-layer system design, hook lifecycle, data flow, and key patterns (CLAIM-CONFIRM, circuit-breaker, graceful degradation, deduplication, dual session IDs). Production guide provides recommended settings, health monitoring metrics and thresholds, quick health check commands, multi-machine sync setup, growth expectations, common issues with solutions, and log analysis tips. Based on 23 days of production usage with 3,400+ observations across two physical servers and 8 projects. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
committed by
Alex Newman
parent
76207fb8d6
commit
bf8b7dbd9f
139
docs/architecture-overview.md
Normal file
139
docs/architecture-overview.md
Normal file
@@ -0,0 +1,139 @@
|
||||
# claude-mem Architecture Overview
|
||||
|
||||
## System Layers
|
||||
|
||||
```
|
||||
+-----------------------------------------------------------+
|
||||
| Claude Code (host) |
|
||||
| +-- Hook System (5 events) |
|
||||
| +-- MCP Client (search tools) |
|
||||
+-----------------------------------------------------------+
|
||||
| CLI Layer (Bun) |
|
||||
| +-- bun-runner.js (Node->Bun bridge) |
|
||||
| +-- hook-command.ts (orchestrator) |
|
||||
| +-- handlers/ (context, session-init, observation, |
|
||||
| summarize, session-complete) |
|
||||
+-----------------------------------------------------------+
|
||||
| Worker Daemon (Express, port 37777) |
|
||||
| +-- SessionManager (session lifecycle) |
|
||||
| +-- SDKAgent (Claude Agent SDK) |
|
||||
| +-- SearchManager (search orchestration) |
|
||||
| +-- ProcessRegistry (subprocess management) |
|
||||
| +-- ChromaSync (embedding synchronization) |
|
||||
+-----------------------------------------------------------+
|
||||
| Storage Layer |
|
||||
| +-- SQLite (claude-mem.db) -- structured data |
|
||||
| +-- ChromaDB (chroma.sqlite3) -- vector embeddings |
|
||||
| +-- MCP Server (interface for Claude Code) |
|
||||
+-----------------------------------------------------------+
|
||||
```
|
||||
|
||||
## Hook Lifecycle
|
||||
|
||||
| Event | Handler | What it does | Timeout |
|
||||
|-------|---------|-------------|---------|
|
||||
| Setup | setup.sh | Install system dependencies | 300s |
|
||||
| SessionStart | smart-install.js + context | Install deps + start worker + inject context | 60s |
|
||||
| UserPromptSubmit | session-init | Register session + start SDK agent + semantic injection | 60s |
|
||||
| PostToolUse | observation | Capture tool usage -> enqueue in worker | 120s |
|
||||
| Stop | summarize + session-complete | Request summary + end session | 120s+30s |
|
||||
|
||||
## Data Flow
|
||||
|
||||
```
|
||||
User prompt -> session-init -> /api/sessions/init + /api/context/semantic
|
||||
|
|
||||
Tool use -> observation -> /api/sessions/observations
|
||||
| |
|
||||
| PendingMessageStore.enqueue()
|
||||
| |
|
||||
| SDKAgent.startSession()
|
||||
| |
|
||||
| Claude Agent SDK -> ResponseProcessor
|
||||
| |
|
||||
| +-- storeObservations() -> SQLite
|
||||
| +-- chromaSync.sync() -> ChromaDB
|
||||
| +-- broadcastObservation() -> SSE/UI
|
||||
|
|
||||
Stop -> summarize -> /api/sessions/summarize
|
||||
-> session-complete -> /api/sessions/complete + drain
|
||||
```
|
||||
|
||||
## Key Patterns
|
||||
|
||||
### CLAIM-CONFIRM (PendingMessageStore)
|
||||
|
||||
```
|
||||
enqueue() -> INSERT status='pending'
|
||||
claimNextMessage() -> UPDATE status='processing' (atomic)
|
||||
confirmProcessed() -> DELETE (success)
|
||||
markFailed() -> UPDATE status='failed' (retry < 3)
|
||||
|
||||
Self-healing: messages in 'processing' for >60s reset to 'pending'
|
||||
```
|
||||
|
||||
### Circuit-Breaker (SessionRoutes)
|
||||
|
||||
```
|
||||
Generator crash -> retry 1 (1s) -> retry 2 (2s) -> retry 3 (4s)
|
||||
-> consecutiveRestarts > 3 -> CIRCUIT-BREAKER
|
||||
-> markAllSessionMessagesAbandoned(sessionDbId)
|
||||
-> Stop. No infinite loop.
|
||||
```
|
||||
|
||||
Counter resets to 0 when generator completes work naturally.
|
||||
|
||||
### Graceful Degradation (hook-command.ts)
|
||||
|
||||
```
|
||||
Transport errors (ECONNREFUSED, timeout, 5xx) -> exit 0 (never block Claude Code)
|
||||
Client bugs (4xx, TypeError, ReferenceError) -> exit 2 (blocking, needs fix)
|
||||
```
|
||||
|
||||
The worker being unavailable NEVER blocks the user's Claude Code session.
|
||||
|
||||
### Deduplication (observations)
|
||||
|
||||
```
|
||||
SHA256(memory_session_id + title + narrative) -> content_hash
|
||||
If hash exists within 30s window -> return existing ID (no insert)
|
||||
```
|
||||
|
||||
### Two Types of Session ID
|
||||
|
||||
- `contentSessionId` — from Claude Code, invariant during the session
|
||||
- `memorySessionId` — from SDK Agent, changes on each worker restart
|
||||
|
||||
The conversion between them is handled by SessionStore and is critical for FK constraints.
|
||||
|
||||
## Storage
|
||||
|
||||
### SQLite (claude-mem.db)
|
||||
|
||||
| Table | Key fields | Purpose |
|
||||
|-------|-----------|---------|
|
||||
| sdk_sessions | content_session_id, memory_session_id, status | Session lifecycle |
|
||||
| observations | memory_session_id, type, title, narrative, content_hash | Tool usage observations |
|
||||
| session_summaries | memory_session_id, request, learned, completed | Session summaries |
|
||||
| user_prompts | content_session_id, prompt_text | User prompt history |
|
||||
| pending_messages | session_db_id, status, message_type | CLAIM-CONFIRM queue |
|
||||
| observation_feedback | observation_id, signal_type | Usage tracking |
|
||||
|
||||
### ChromaDB (chroma.sqlite3)
|
||||
|
||||
Vector embeddings for semantic search. Each observation generates multiple documents:
|
||||
|
||||
```
|
||||
obs_{id}_narrative -> main text
|
||||
obs_{id}_fact_0 -> first fact
|
||||
obs_{id}_fact_1 -> second fact
|
||||
...
|
||||
```
|
||||
|
||||
Accessed via chroma-mcp (MCP process), communication over stdio.
|
||||
|
||||
## Process Management
|
||||
|
||||
- **ProcessRegistry:** Tracks all Claude SDK subprocesses, manages PID lifecycle
|
||||
- **Orphan Reaper (5min):** Kills processes with no active session
|
||||
- **GracefulShutdown:** 7-step shutdown (PID file, children, HTTP server, sessions, MCP, DB, force-kill)
|
||||
111
docs/production-guide.md
Normal file
111
docs/production-guide.md
Normal file
@@ -0,0 +1,111 @@
|
||||
# claude-mem Production Guide
|
||||
|
||||
Practical guide based on 23 days of production usage with 3,400+ observations across two physical servers and 8 projects.
|
||||
|
||||
## Recommended Settings
|
||||
|
||||
| Setting | Default | Recommended | Why |
|
||||
|---------|---------|-------------|-----|
|
||||
| CLAUDE_MEM_MAX_CONCURRENT_AGENTS | 2 | 3 | Better throughput without overload |
|
||||
| CLAUDE_MEM_SEMANTIC_INJECT | (new) | true | Relevant context >> recent context |
|
||||
| CLAUDE_MEM_SEMANTIC_INJECT_LIMIT | (new) | 5 | Sweet spot for token cost vs coverage |
|
||||
| CLAUDE_MEM_TIER_ROUTING_ENABLED | (new) | true | ~52% cost savings, no quality loss |
|
||||
|
||||
## Health Monitoring
|
||||
|
||||
### Key metrics to watch
|
||||
|
||||
| Metric | Healthy | Warning | Action |
|
||||
|--------|---------|---------|--------|
|
||||
| pending_messages (pending) | 0-5 | >10 | Check worker logs, may need restart |
|
||||
| pending_messages (failed) | 0 | >0 growing | Circuit-breaker may be tripping |
|
||||
| sdk_sessions (active) | 0-3 | >5 stuck | Orphan sessions, worker restart |
|
||||
| WAL size | <10 MB | >20 MB | Run `PRAGMA wal_checkpoint(TRUNCATE)` |
|
||||
| Chroma size | Growing slowly | Sudden jump | Check for sync loops |
|
||||
| Errors/day in logs | 0-2 | >10 | Investigate log patterns |
|
||||
|
||||
### Quick health check
|
||||
|
||||
```bash
|
||||
# Check worker status
|
||||
curl -s http://127.0.0.1:37777/api/health | python3 -m json.tool
|
||||
|
||||
# Check database stats
|
||||
sqlite3 ~/.claude-mem/claude-mem.db "
|
||||
SELECT 'observations' as metric, COUNT(*) as value FROM observations
|
||||
UNION ALL SELECT 'summaries', COUNT(*) FROM session_summaries
|
||||
UNION ALL SELECT 'pending', COUNT(*) FROM pending_messages WHERE status='pending'
|
||||
UNION ALL SELECT 'active_sessions', COUNT(*) FROM sdk_sessions WHERE status='active';
|
||||
"
|
||||
```
|
||||
|
||||
## Multi-Machine Setup
|
||||
|
||||
If running claude-mem on multiple machines, use `claude-mem-sync` to keep observations in sync:
|
||||
|
||||
```bash
|
||||
claude-mem-sync push <remote-host> # local -> remote
|
||||
claude-mem-sync pull <remote-host> # remote -> local
|
||||
claude-mem-sync sync <remote-host> # bidirectional
|
||||
claude-mem-sync status <remote-host> # compare counts
|
||||
```
|
||||
|
||||
Deduplication is by `(created_at, title)` — safe to run repeatedly.
|
||||
|
||||
## Growth Expectations
|
||||
|
||||
Based on active daily development usage:
|
||||
|
||||
| Metric | Per day | Per month | Notes |
|
||||
|--------|---------|-----------|-------|
|
||||
| Observations | ~120 | ~3,600 | Varies with coding activity |
|
||||
| Summaries | ~40 | ~1,200 | One per session |
|
||||
| SQLite | ~0.8 MB | ~24 MB | ~5 KB per observation |
|
||||
| Chroma | ~4 MB | ~120 MB | ~50 KB per observation (embeddings) |
|
||||
|
||||
## Common Issues and Solutions
|
||||
|
||||
### Summarize error loop
|
||||
|
||||
**Symptom:** Repeated `[ERROR] Missing last_assistant_message` in logs.
|
||||
**Cause:** Transcript with no assistant messages triggers summary attempt that fails repeatedly.
|
||||
**Fix:** PR #1566 — skip summary when transcript is empty.
|
||||
|
||||
### Chroma sync failures
|
||||
|
||||
**Symptom:** `[ERROR] Batch add failed... IDs already exist`
|
||||
**Cause:** MCP timeout during add leaves partial writes; retry fails on existing IDs.
|
||||
**Fix:** PR #1566 — fallback to update (upsert pattern).
|
||||
|
||||
### Port conflict on startup
|
||||
|
||||
**Symptom:** `Worker failed to start... Is port 37777 in use?`
|
||||
**Cause:** Two sessions starting simultaneously — HTTP check is non-atomic (TOCTOU race).
|
||||
**Fix:** PR #1566 — atomic socket bind on Unix.
|
||||
|
||||
### Orphaned pending messages
|
||||
|
||||
**Symptom:** `pending_messages` table growing with old entries for completed sessions.
|
||||
**Cause:** SIGTERM kills generator before queue is drained.
|
||||
**Fix:** PR #1567 — drain after deleteSession().
|
||||
|
||||
### Context not relevant to current topic
|
||||
|
||||
**Symptom:** Claude receives observations about CSS when you're asking about authentication.
|
||||
**Cause:** Default recency-based injection selects most recent, not most relevant.
|
||||
**Fix:** PR #1568 — semantic injection via Chroma on every prompt.
|
||||
|
||||
## Log Analysis Tips
|
||||
|
||||
```bash
|
||||
# Count errors by day
|
||||
grep '\[ERROR\]' ~/.claude-mem/logs/claude-mem-*.log | \
|
||||
sed 's/\[20[0-9][0-9]-[0-9][0-9]-/\n&/g' | \
|
||||
grep -oP '^\[20\d{2}-\d{2}-\d{2}' | sort | uniq -c
|
||||
|
||||
# Find circuit-breaker trips
|
||||
grep 'circuit\|Circuit\|ABANDONED\|abandoned' ~/.claude-mem/logs/claude-mem-*.log
|
||||
|
||||
# Check pending message health
|
||||
grep 'CLAIMED\|CONFIRMED\|FAILED\|ABANDONED' ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log | tail -20
|
||||
```
|
||||
Reference in New Issue
Block a user