docs: add architecture overview and production guide

Architecture overview covers the 4-layer system design, hook lifecycle,
data flow, and key patterns (CLAIM-CONFIRM, circuit-breaker, graceful
degradation, deduplication, dual session IDs).

Production guide provides recommended settings, health monitoring
metrics and thresholds, quick health check commands, multi-machine
sync setup, growth expectations, common issues with solutions, and
log analysis tips.

Based on 23 days of production usage with 3,400+ observations
across two physical servers and 8 projects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alessandro Costa
2026-04-01 22:05:06 -03:00
committed by Alex Newman
parent 76207fb8d6
commit bf8b7dbd9f
2 changed files with 250 additions and 0 deletions


@@ -0,0 +1,139 @@
# claude-mem Architecture Overview
## System Layers
```
+-----------------------------------------------------------+
| Claude Code (host) |
| +-- Hook System (5 events) |
| +-- MCP Client (search tools) |
+-----------------------------------------------------------+
| CLI Layer (Bun) |
| +-- bun-runner.js (Node->Bun bridge) |
| +-- hook-command.ts (orchestrator) |
| +-- handlers/ (context, session-init, observation, |
| summarize, session-complete) |
+-----------------------------------------------------------+
| Worker Daemon (Express, port 37777) |
| +-- SessionManager (session lifecycle) |
| +-- SDKAgent (Claude Agent SDK) |
| +-- SearchManager (search orchestration) |
| +-- ProcessRegistry (subprocess management) |
| +-- ChromaSync (embedding synchronization) |
+-----------------------------------------------------------+
| Storage Layer |
| +-- SQLite (claude-mem.db) -- structured data |
| +-- ChromaDB (chroma.sqlite3) -- vector embeddings |
| +-- MCP Server (interface for Claude Code) |
+-----------------------------------------------------------+
```
## Hook Lifecycle
| Event | Handler | What it does | Timeout |
|-------|---------|-------------|---------|
| Setup | setup.sh | Install system dependencies | 300s |
| SessionStart | smart-install.js + context | Install deps + start worker + inject context | 60s |
| UserPromptSubmit | session-init | Register session + start SDK agent + semantic injection | 60s |
| PostToolUse | observation | Capture tool usage -> enqueue in worker | 120s |
| Stop | summarize + session-complete | Request summary + end session | 120s+30s |
## Data Flow
```
User prompt -> session-init -> /api/sessions/init + /api/context/semantic
|
Tool use -> observation -> /api/sessions/observations
| |
| PendingMessageStore.enqueue()
| |
| SDKAgent.startSession()
| |
| Claude Agent SDK -> ResponseProcessor
| |
| +-- storeObservations() -> SQLite
| +-- chromaSync.sync() -> ChromaDB
| +-- broadcastObservation() -> SSE/UI
|
Stop -> summarize -> /api/sessions/summarize
-> session-complete -> /api/sessions/complete + drain
```
## Key Patterns
### CLAIM-CONFIRM (PendingMessageStore)
```
enqueue() -> INSERT status='pending'
claimNextMessage() -> UPDATE status='processing' (atomic)
confirmProcessed() -> DELETE (success)
markFailed() -> UPDATE status='failed' (retry < 3)
Self-healing: messages in 'processing' for >60s reset to 'pending'
```
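The state transitions above can be sketched against SQLite as follows. This is a minimal illustration, not the actual PendingMessageStore code: the `payload`, `retry_count`, and `claimed_at` columns, the Python function names, and the reading of "retry < 3" (a message returns to `pending` until its third failure) are all assumptions.

```python
import sqlite3
import time

STALE_AFTER_S = 60  # from the text: 'processing' for >60s resets to 'pending'
MAX_RETRIES = 3     # assumption: one reading of "(retry < 3)"


def open_queue(path=":memory:"):
    # isolation_level=None gives autocommit; transactions are opened explicitly.
    db = sqlite3.connect(path, isolation_level=None)
    db.execute(
        """CREATE TABLE IF NOT EXISTS pending_messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               session_db_id INTEGER NOT NULL,
               payload TEXT NOT NULL,              -- hypothetical column
               status TEXT NOT NULL DEFAULT 'pending',
               retry_count INTEGER NOT NULL DEFAULT 0,  -- hypothetical column
               claimed_at REAL)                    -- hypothetical column"""
    )
    return db


def enqueue(db, session_db_id, payload):
    cur = db.execute(
        "INSERT INTO pending_messages (session_db_id, payload) VALUES (?, ?)",
        (session_db_id, payload),
    )
    return cur.lastrowid


def claim_next_message(db, now=None):
    """Atomically move the oldest pending message to 'processing'."""
    now = time.time() if now is None else now
    db.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        # Self-healing: release claims that have been held for too long.
        db.execute(
            "UPDATE pending_messages SET status='pending', claimed_at=NULL "
            "WHERE status='processing' AND claimed_at < ?",
            (now - STALE_AFTER_S,),
        )
        row = db.execute(
            "SELECT id, payload FROM pending_messages "
            "WHERE status='pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            db.execute(
                "UPDATE pending_messages SET status='processing', claimed_at=? "
                "WHERE id=?",
                (now, row[0]),
            )
        db.execute("COMMIT")
        return row
    except Exception:
        db.execute("ROLLBACK")
        raise


def confirm_processed(db, message_id):
    # Success: the row is deleted, so nothing is left to replay.
    db.execute("DELETE FROM pending_messages WHERE id=?", (message_id,))


def mark_failed(db, message_id):
    # Assumed policy: retry until MAX_RETRIES failures, then park as 'failed'.
    db.execute(
        "UPDATE pending_messages SET retry_count = retry_count + 1, "
        "status = CASE WHEN retry_count + 1 < ? THEN 'pending' ELSE 'failed' END, "
        "claimed_at = NULL WHERE id = ?",
        (MAX_RETRIES, message_id),
    )
```

Because a message is deleted only after `confirm_processed`, a worker that dies between claim and confirm leaves the row in `processing`, and the next claim after the 60-second window picks it up again.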
### Circuit-Breaker (SessionRoutes)
```
Generator crash -> retry 1 (1s) -> retry 2 (2s) -> retry 3 (4s)
-> consecutiveRestarts > 3 -> CIRCUIT-BREAKER
-> markAllSessionMessagesAbandoned(sessionDbId)
-> Stop. No infinite loop.
```
The counter resets to 0 when the generator completes its work normally.
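A minimal sketch of the restart logic, assuming the delays double from 1s and the breaker trips on the fourth consecutive crash, as the diagram suggests. `RestartGuard` and its method names are hypothetical and do not come from SessionRoutes.

```python
MAX_CONSECUTIVE_RESTARTS = 3  # from the diagram: trip when the count exceeds 3
BASE_DELAY_S = 1.0            # from the diagram: 1s, 2s, 4s


class RestartGuard:
    """Hypothetical helper tracking consecutive generator crashes."""

    def __init__(self):
        self.consecutive_restarts = 0
        self.tripped = False

    def on_crash(self):
        """Return the backoff delay in seconds, or None once the breaker trips."""
        self.consecutive_restarts += 1
        if self.consecutive_restarts > MAX_CONSECUTIVE_RESTARTS:
            # The caller would abandon the session's pending messages here.
            self.tripped = True
            return None
        return BASE_DELAY_S * 2 ** (self.consecutive_restarts - 1)

    def on_natural_completion(self):
        # A clean finish clears the crash history.
        self.consecutive_restarts = 0
```

Returning `None` instead of a delay gives the caller a single place to stop retrying and mark the session's messages as abandoned.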
### Graceful Degradation (hook-command.ts)
```
Transport errors (ECONNREFUSED, timeout, 5xx) -> exit 0 (never block Claude Code)
Client bugs (4xx, TypeError, ReferenceError) -> exit 2 (blocking, needs fix)
```
The worker being unavailable NEVER blocks the user's Claude Code session.
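The exit-code rule can be sketched as a small classifier. This is an illustration in Python rather than the TypeScript in hook-command.ts; `exit_code_for` is a hypothetical name, `NameError` stands in for JavaScript's `ReferenceError`, and treating unclassified errors as blocking is an assumption the source does not state.

```python
EXIT_OK = 0        # never block Claude Code
EXIT_BLOCKING = 2  # client bug that needs a fix


def exit_code_for(error=None, http_status=None):
    """Map a hook failure to an exit code (hypothetical helper)."""
    if http_status is not None:
        if 500 <= http_status <= 599:
            return EXIT_OK        # worker-side failure: degrade gracefully
        if 400 <= http_status <= 499:
            return EXIT_BLOCKING  # the hook sent a bad request
        return EXIT_OK            # 2xx/3xx: nothing went wrong
    if error is None:
        return EXIT_OK
    if isinstance(error, (ConnectionRefusedError, TimeoutError)):
        return EXIT_OK            # transport error: worker is down or slow
    if isinstance(error, (TypeError, NameError)):
        return EXIT_BLOCKING      # programming error in the client
    return EXIT_BLOCKING          # assumption: unknown errors should surface
```

Keeping the decision in one function makes the "worker down means exit 0" guarantee easy to test in isolation.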
### Deduplication (observations)
```
SHA256(memory_session_id + title + narrative) -> content_hash
If hash exists within 30s window -> return existing ID (no insert)
```
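A sketch of the hash-and-window check, using an in-memory list in place of the `observations` table. The plain concatenation mirrors the formula as written; the UTF-8 encoding, the `ObservationStore` class, and the inclusive 30-second comparison are assumptions.

```python
import hashlib

DEDUP_WINDOW_S = 30  # from the text


def content_hash(memory_session_id, title, narrative):
    # Plain concatenation, as in the formula above; UTF-8 is an assumption.
    data = (memory_session_id + title + narrative).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


class ObservationStore:
    """Hypothetical in-memory stand-in for the observations table."""

    def __init__(self):
        self._rows = []  # (id, content_hash, created_at)

    def insert(self, memory_session_id, title, narrative, now):
        h = content_hash(memory_session_id, title, narrative)
        # Newest rows first: a duplicate is most likely among recent inserts.
        for obs_id, row_hash, created_at in reversed(self._rows):
            if row_hash == h and now - created_at <= DEDUP_WINDOW_S:
                return obs_id  # duplicate inside the window: reuse the ID
        obs_id = len(self._rows) + 1
        self._rows.append((obs_id, h, now))
        return obs_id
```

Identical content arriving after the window has passed is stored as a new observation, which keeps genuinely repeated events visible.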
### Two Types of Session ID
- `contentSessionId` — from Claude Code, invariant during the session
- `memorySessionId` — from SDK Agent, changes on each worker restart
SessionStore converts between the two IDs. This mapping is critical for satisfying foreign-key (FK) constraints.
## Storage
### SQLite (claude-mem.db)
| Table | Key fields | Purpose |
|-------|-----------|---------|
| sdk_sessions | content_session_id, memory_session_id, status | Session lifecycle |
| observations | memory_session_id, type, title, narrative, content_hash | Tool usage observations |
| session_summaries | memory_session_id, request, learned, completed | Session summaries |
| user_prompts | content_session_id, prompt_text | User prompt history |
| pending_messages | session_db_id, status, message_type | CLAIM-CONFIRM queue |
| observation_feedback | observation_id, signal_type | Usage tracking |
### ChromaDB (chroma.sqlite3)
Vector embeddings for semantic search. Each observation generates multiple documents:
```
obs_{id}_narrative -> main text
obs_{id}_fact_0 -> first fact
obs_{id}_fact_1 -> second fact
...
```
ChromaDB is accessed through chroma-mcp, a separate MCP process that communicates over stdio.
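The document ID scheme can be illustrated with a short helper. `chroma_documents` is a hypothetical name; only the ID format is taken from the listing above.

```python
def chroma_documents(obs_id, narrative, facts):
    """Return (document_id, text) pairs for one observation."""
    docs = [(f"obs_{obs_id}_narrative", narrative)]
    # One document per fact, numbered from 0.
    docs += [(f"obs_{obs_id}_fact_{i}", fact) for i, fact in enumerate(facts)]
    return docs
```

Because the IDs are derived from the observation ID and the fact index, re-syncing the same observation produces the same IDs.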
## Process Management
- **ProcessRegistry:** Tracks all Claude SDK subprocesses, manages PID lifecycle
- **Orphan Reaper (5min):** Kills processes with no active session
- **GracefulShutdown:** 7-step shutdown (PID file, children, HTTP server, sessions, MCP, DB, force-kill)

docs/production-guide.md Normal file

@@ -0,0 +1,111 @@
# claude-mem Production Guide
A practical guide based on 23 days of production usage with 3,400+ observations across two physical servers and 8 projects.
## Recommended Settings
| Setting | Default | Recommended | Why |
|---------|---------|-------------|-----|
| CLAUDE_MEM_MAX_CONCURRENT_AGENTS | 2 | 3 | Better throughput without overload |
| CLAUDE_MEM_SEMANTIC_INJECT | (new) | true | Relevant context is more useful than merely recent context |
| CLAUDE_MEM_SEMANTIC_INJECT_LIMIT | (new) | 5 | Sweet spot for token cost vs coverage |
| CLAUDE_MEM_TIER_ROUTING_ENABLED | (new) | true | ~52% cost savings, no quality loss |
## Health Monitoring
### Key metrics to watch
| Metric | Healthy | Warning | Action |
|--------|---------|---------|--------|
| pending_messages (pending) | 0-5 | >10 | Check worker logs; the worker may need a restart |
| pending_messages (failed) | 0 | >0 and growing | The circuit-breaker may be tripping |
| sdk_sessions (active) | 0-3 | >5 stuck | Orphaned sessions; restart the worker |
| WAL size | <10 MB | >20 MB | Run `PRAGMA wal_checkpoint(TRUNCATE)` |
| Chroma size | Growing slowly | Sudden jump | Check for sync loops |
| Errors/day in logs | 0-2 | >10 | Investigate log patterns |
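For scripted monitoring, the count-based thresholds from the table can be encoded as data. This is a sketch: the `elevated` label for values that fall between the Healthy and Warning columns is an assumption, as are the names `THRESHOLDS` and `classify`.

```python
# (healthy_max, warning_above) taken from the table above
THRESHOLDS = {
    "pending_messages": (5, 10),
    "errors_per_day": (2, 10),
}


def classify(metric, value):
    healthy_max, warning_above = THRESHOLDS[metric]
    if value <= healthy_max:
        return "healthy"
    if value > warning_above:
        return "warning"
    return "elevated"  # assumption: the table leaves this range unlabelled
```

For example, `classify("pending_messages", 12)` falls in the warning range and would prompt a look at the worker logs.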
### Quick health check
```bash
# Check worker status
curl -s http://127.0.0.1:37777/api/health | python3 -m json.tool
# Check database stats
sqlite3 ~/.claude-mem/claude-mem.db "
SELECT 'observations' as metric, COUNT(*) as value FROM observations
UNION ALL SELECT 'summaries', COUNT(*) FROM session_summaries
UNION ALL SELECT 'pending', COUNT(*) FROM pending_messages WHERE status='pending'
UNION ALL SELECT 'active_sessions', COUNT(*) FROM sdk_sessions WHERE status='active';
"
```
## Multi-Machine Setup
If running claude-mem on multiple machines, use `claude-mem-sync` to keep observations in sync:
```bash
claude-mem-sync push <remote-host> # local -> remote
claude-mem-sync pull <remote-host> # remote -> local
claude-mem-sync sync <remote-host> # bidirectional
claude-mem-sync status <remote-host> # compare counts
```
Deduplication keys on `(created_at, title)`, so these commands are safe to run repeatedly.
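Why a `(created_at, title)` key makes repeated runs safe can be seen in a small merge sketch. It is not the claude-mem-sync implementation; the dictionary shape and the `merge_observations` name are assumptions.

```python
def merge_observations(local, remote):
    """Append remote observations whose (created_at, title) key is new locally."""
    seen = {(o["created_at"], o["title"]) for o in local}
    merged = list(local)
    for obs in remote:
        key = (obs["created_at"], obs["title"])
        if key not in seen:
            seen.add(key)
            merged.append(obs)
    return merged
```

Merging the same remote set a second time adds nothing, because every key is already in `seen`.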
## Growth Expectations
Based on active daily development usage:
| Metric | Per day | Per month | Notes |
|--------|---------|-----------|-------|
| Observations | ~120 | ~3,600 | Varies with coding activity |
| Summaries | ~40 | ~1,200 | One per session |
| SQLite | ~0.8 MB | ~24 MB | ~5 KB per observation |
| Chroma | ~4 MB | ~120 MB | ~50 KB per observation (embeddings) |
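The per-day figures can be projected over any horizon with simple multiplication. The sketch below assumes growth stays linear at the rates in the table; `DAILY` and `project` are illustrative names.

```python
# Per-day rates copied from the table above
DAILY = {
    "observations": 120,
    "summaries": 40,
    "sqlite_mb": 0.8,
    "chroma_mb": 4.0,
}


def project(days):
    """Linear projection of each metric over the given number of days."""
    return {name: rate * days for name, rate in DAILY.items()}
```

Calling `project(30)` reproduces the Per month column, and `project(365)` gives a rough one-year disk budget.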
## Common Issues and Solutions
### Summarize error loop
**Symptom:** Repeated `[ERROR] Missing last_assistant_message` in logs.
**Cause:** A transcript with no assistant messages triggers a summary attempt that fails repeatedly.
**Fix:** PR #1566 — skip summary when transcript is empty.
### Chroma sync failures
**Symptom:** `[ERROR] Batch add failed... IDs already exist`
**Cause:** MCP timeout during add leaves partial writes; retry fails on existing IDs.
**Fix:** PR #1566 — fallback to update (upsert pattern).
### Port conflict on startup
**Symptom:** `Worker failed to start... Is port 37777 in use?`
**Cause:** Two sessions starting simultaneously — HTTP check is non-atomic (TOCTOU race).
**Fix:** PR #1566 — atomic socket bind on Unix.
### Orphaned pending messages
**Symptom:** `pending_messages` table growing with old entries for completed sessions.
**Cause:** SIGTERM kills generator before queue is drained.
**Fix:** PR #1567 — drain after deleteSession().
### Context not relevant to current topic
**Symptom:** Claude receives observations about CSS when you're asking about authentication.
**Cause:** The default recency-based injection selects the most recent observations, not the most relevant ones.
**Fix:** PR #1568 — semantic injection via Chroma on every prompt.
## Log Analysis Tips
```bash
# Count errors by day
# -h drops the filename prefix; -oE avoids GNU-only grep -P and sed newline escapes
grep -h '\[ERROR\]' ~/.claude-mem/logs/claude-mem-*.log | \
grep -oE '\[20[0-9]{2}-[0-9]{2}-[0-9]{2}' | sort | uniq -c
# Find circuit-breaker trips
grep 'circuit\|Circuit\|ABANDONED\|abandoned' ~/.claude-mem/logs/claude-mem-*.log
# Check pending message health
grep 'CLAIMED\|CONFIRMED\|FAILED\|ABANDONED' ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log | tail -20
```
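Where the shell pipeline is inconvenient, the per-day error count can be done in Python. This sketch assumes, as the grep commands do, that each log line carries a bracketed timestamp beginning with `[YYYY-MM-DD` and that errors are tagged `[ERROR]`; `errors_by_day` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed log format: "[2026-04-01 10:00:00] [ERROR] message"
DATE_RE = re.compile(r"\[(20\d{2}-\d{2}-\d{2})")


def errors_by_day(lines):
    """Count [ERROR] lines per calendar day."""
    counts = Counter()
    for line in lines:
        if "[ERROR]" not in line:
            continue
        match = DATE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

The function accepts any iterable of lines, so an open log file can be passed to it directly.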