docs: add architecture overview and production guide

Architecture overview covers the 4-layer system design, hook lifecycle, data flow, and key patterns (CLAIM-CONFIRM, circuit-breaker, graceful degradation, deduplication, dual session IDs).

Production guide provides recommended settings, health monitoring metrics and thresholds, quick health check commands, multi-machine sync setup, growth expectations, common issues with solutions, and log analysis tips.

Based on 23 days of production usage with 3,400+ observations across two physical servers and 8 projects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 17:15:04 +02:00 · 2026-04-01 22:05:06 -03:00
parent 76207fb8d6
commit bf8b7dbd9f
2 changed files with 250 additions and 0 deletions
--- a/docs/architecture-overview.md
+++ b/docs/architecture-overview.md
@@ -0,0 +1,139 @@
+# claude-mem Architecture Overview
+
+## System Layers
+
+```
+-----------------------------------------------------------+
+|  Claude Code (host)                                       |
+|  +-- Hook System (5 events)                               |
+|  +-- MCP Client (search tools)                            |
+-----------------------------------------------------------+
+|  CLI Layer (Bun)                                          |
+|  +-- bun-runner.js (Node->Bun bridge)                     |
+|  +-- hook-command.ts (orchestrator)                        |
+|  +-- handlers/ (context, session-init, observation,        |
+|                 summarize, session-complete)               |
+-----------------------------------------------------------+
+|  Worker Daemon (Express, port 37777)                      |
+|  +-- SessionManager (session lifecycle)                   |
+|  +-- SDKAgent (Claude Agent SDK)                          |
+|  +-- SearchManager (search orchestration)                 |
+|  +-- ProcessRegistry (subprocess management)              |
+|  +-- ChromaSync (embedding synchronization)               |
+-----------------------------------------------------------+
+|  Storage Layer                                            |
+|  +-- SQLite (claude-mem.db) -- structured data            |
+|  +-- ChromaDB (chroma.sqlite3) -- vector embeddings       |
+|  +-- MCP Server (interface for Claude Code)               |
+-----------------------------------------------------------+
+```
+
+## Hook Lifecycle
+
+| Event | Handler | What it does | Timeout |
+|-------|---------|-------------|---------|
+| Setup | setup.sh | Install system dependencies | 300s |
+| SessionStart | smart-install.js + context | Install deps + start worker + inject context | 60s |
+| UserPromptSubmit | session-init | Register session + start SDK agent + semantic injection | 60s |
+| PostToolUse | observation | Capture tool usage -> enqueue in worker | 120s |
+| Stop | summarize + session-complete | Request summary + end session | 120s+30s |
+
+## Data Flow
+
+```
+User prompt -> session-init -> /api/sessions/init + /api/context/semantic
+  |
+Tool use -> observation -> /api/sessions/observations
+  |                              |
+  |                    PendingMessageStore.enqueue()
+  |                              |
+  |                    SDKAgent.startSession()
+  |                              |
+  |                    Claude Agent SDK -> ResponseProcessor
+  |                              |
+  |                    +-- storeObservations() -> SQLite
+  |                    +-- chromaSync.sync() -> ChromaDB
+  |                    +-- broadcastObservation() -> SSE/UI
+  |
+Stop -> summarize -> /api/sessions/summarize
+     -> session-complete -> /api/sessions/complete + drain
+```
+
+## Key Patterns
+
+### CLAIM-CONFIRM (PendingMessageStore)
+
+```
+enqueue()           -> INSERT status='pending'
+claimNextMessage()  -> UPDATE status='processing' (atomic)
+confirmProcessed()  -> DELETE (success)
+markFailed()        -> UPDATE status='failed' (retry < 3)
+
+Self-healing: messages in 'processing' for >60s reset to 'pending'
+```
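The CLAIM-CONFIRM lifecycle above can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, not the actual PendingMessageStore: the column names `payload`, `retry_count`, and `claimed_at` are hypothetical, and reading "(retry < 3)" as "re-queue until the third failure" is an assumption.

```python
import sqlite3
import time

STALE_AFTER_S = 60  # from the doc: 'processing' for >60s resets to 'pending'
MAX_RETRIES = 3     # assumption: "(retry < 3)" read as a cap on retries


def open_queue(path=":memory:"):
    """Create the queue table (hypothetical columns) and return the connection."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS pending_messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               payload TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               retry_count INTEGER NOT NULL DEFAULT 0,
               claimed_at REAL
           )"""
    )
    return db


def enqueue(db, payload):
    with db:
        cur = db.execute(
            "INSERT INTO pending_messages (payload) VALUES (?)", (payload,)
        )
    return cur.lastrowid


def claim_next_message(db, now=None):
    """Self-heal stale claims, then claim the oldest pending row in one transaction."""
    now = time.time() if now is None else now
    with db:
        db.execute(
            "UPDATE pending_messages SET status='pending', claimed_at=NULL "
            "WHERE status='processing' AND claimed_at < ?",
            (now - STALE_AFTER_S,),
        )
        row = db.execute(
            "SELECT id, payload FROM pending_messages "
            "WHERE status='pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        db.execute(
            "UPDATE pending_messages SET status='processing', claimed_at=? "
            "WHERE id=?",
            (now, row[0]),
        )
    return row


def confirm_processed(db, message_id):
    """Success path: the row is deleted only after processing finished."""
    with db:
        db.execute("DELETE FROM pending_messages WHERE id=?", (message_id,))


def mark_failed(db, message_id):
    """Re-queue until MAX_RETRIES failures, then park the row as 'failed'."""
    with db:
        db.execute(
            "UPDATE pending_messages SET retry_count = retry_count + 1, "
            "claimed_at = NULL, "
            "status = CASE WHEN retry_count + 1 < ? THEN 'pending' "
            "ELSE 'failed' END WHERE id=?",
            (MAX_RETRIES, message_id),
        )
```

Because a row is deleted only on confirmation, a worker that dies mid-processing leaves the message in `processing`, and the next claim returns it to the queue once the 60-second window has passed.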
+
+### Circuit-Breaker (SessionRoutes)
+
+```
+Generator crash -> retry 1 (1s) -> retry 2 (2s) -> retry 3 (4s)
+  -> consecutiveRestarts > 3 -> CIRCUIT-BREAKER
+  -> markAllSessionMessagesAbandoned(sessionDbId)
+  -> Stop. No infinite loop.
+```
+
+The counter resets to 0 when the generator completes its work normally.
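The retry schedule and trip condition described above can be sketched as follows. `GeneratorSupervisor` and its callback arguments are hypothetical names for illustration; only the delays (1s, 2s, 4s), the `> 3` threshold, and the reset on completion come from the text.

```python
import time

MAX_CONSECUTIVE_RESTARTS = 3  # from the doc: trip when consecutiveRestarts > 3
BASE_DELAY_S = 1.0            # from the doc: delays of 1s, 2s, 4s


class GeneratorSupervisor:
    """Hypothetical wrapper; not the actual SessionRoutes code."""

    def __init__(self, abandon_messages, sleep=time.sleep):
        self.consecutive_restarts = 0
        self._abandon_messages = abandon_messages
        self._sleep = sleep

    def run(self, run_generator):
        """Return True on normal completion, False if the breaker trips."""
        while True:
            try:
                run_generator()
            except Exception:
                self.consecutive_restarts += 1
                if self.consecutive_restarts > MAX_CONSECUTIVE_RESTARTS:
                    # Breaker tripped: abandon queued work and stop retrying.
                    self._abandon_messages()
                    return False
                # Exponential backoff: 1s, 2s, 4s.
                self._sleep(BASE_DELAY_S * 2 ** (self.consecutive_restarts - 1))
            else:
                # Normal completion resets the counter.
                self.consecutive_restarts = 0
                return True
```

Injecting `sleep` keeps the sketch testable without real delays. A generator that always crashes is attempted four times in total (the initial run plus three retries) before its messages are abandoned.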
+
+### Graceful Degradation (hook-command.ts)
+
+```
+Transport errors (ECONNREFUSED, timeout, 5xx) -> exit 0 (never block Claude Code)
+Client bugs (4xx, TypeError, ReferenceError)  -> exit 2 (blocking, needs fix)
+```
+
+An unavailable worker never blocks the user's Claude Code session.
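The exit-code policy above amounts to a small classification function. The sketch below uses Python exception types as stand-ins: `WorkerHttpError` is a hypothetical class, and `NameError` plays the role of JavaScript's `ReferenceError`. The real logic lives in `hook-command.ts` and may differ.

```python
EXIT_OK = 0        # never block Claude Code
EXIT_BLOCKING = 2  # surface the bug so it gets fixed


class WorkerHttpError(Exception):
    """Hypothetical error carrying the worker's HTTP status code."""

    def __init__(self, status):
        super().__init__(f"worker responded with HTTP {status}")
        self.status = status


def exit_code_for(error):
    """Map an error raised while calling the worker to a hook exit code."""
    # Transport problems: the worker is down or slow, so degrade silently.
    if isinstance(error, (ConnectionRefusedError, TimeoutError)):
        return EXIT_OK
    if isinstance(error, WorkerHttpError):
        # 5xx is the worker's fault; anything else (e.g. 4xx) is treated
        # as a client bug in this sketch.
        return EXIT_OK if error.status >= 500 else EXIT_BLOCKING
    # Programming errors (TypeError, NameError, ...) should block.
    return EXIT_BLOCKING
```

The asymmetry is deliberate: failures outside the hook's control are swallowed, while failures that indicate a bug in the hook itself are made visible.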
+
+### Deduplication (observations)
+
+```
+SHA256(memory_session_id + title + narrative) -> content_hash
+If hash exists within 30s window -> return existing ID (no insert)
+```
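A minimal in-memory sketch of the hash-and-window check above. The hash inputs and the 30-second window come from the text; concatenating the fields without a separator, the strict `<` comparison, and the `ObservationStore` class are assumptions made for illustration.

```python
import hashlib

DEDUP_WINDOW_S = 30  # from the doc


def content_hash(memory_session_id, title, narrative):
    # Assumption: fields are concatenated as-is and encoded as UTF-8.
    data = (memory_session_id + title + narrative).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


class ObservationStore:
    """Hypothetical stand-in for the observations table."""

    def __init__(self):
        self._rows = []  # list of (id, content_hash, created_at)

    def store(self, memory_session_id, title, narrative, now):
        digest = content_hash(memory_session_id, title, narrative)
        # Check newest rows first; return the existing ID on a recent match.
        for obs_id, existing, created_at in reversed(self._rows):
            if existing == digest and now - created_at < DEDUP_WINDOW_S:
                return obs_id
        obs_id = len(self._rows) + 1
        self._rows.append((obs_id, digest, now))
        return obs_id
```

Storing the same observation twice within the window returns the first ID; after the window has passed, an identical observation is inserted as a new row.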
+
+### Two Types of Session ID
+
+- `contentSessionId` — from Claude Code, invariant during the session
+- `memorySessionId` — from SDK Agent, changes on each worker restart
+
+SessionStore handles the mapping between the two, which is critical for satisfying foreign-key (FK) constraints.
+
+## Storage
+
+### SQLite (claude-mem.db)
+
+| Table | Key fields | Purpose |
+|-------|-----------|---------|
+| sdk_sessions | content_session_id, memory_session_id, status | Session lifecycle |
+| observations | memory_session_id, type, title, narrative, content_hash | Tool usage observations |
+| session_summaries | memory_session_id, request, learned, completed | Session summaries |
+| user_prompts | content_session_id, prompt_text | User prompt history |
+| pending_messages | session_db_id, status, message_type | CLAIM-CONFIRM queue |
+| observation_feedback | observation_id, signal_type | Usage tracking |
+
+### ChromaDB (chroma.sqlite3)
+
+Vector embeddings for semantic search. Each observation generates multiple documents:
+
+```
+obs_{id}_narrative  -> main text
+obs_{id}_fact_0     -> first fact
+obs_{id}_fact_1     -> second fact
+...
+```
+
+ChromaDB is accessed via chroma-mcp, an MCP process that communicates over stdio.
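The document-ID scheme above can be sketched as a small helper. The function name and the tuple shape are hypothetical; only the `obs_{id}_narrative` and `obs_{id}_fact_{n}` patterns come from the text.

```python
def chroma_documents(observation_id, narrative, facts):
    """Return (document_id, text) pairs for one observation."""
    docs = [(f"obs_{observation_id}_narrative", narrative)]
    docs.extend(
        (f"obs_{observation_id}_fact_{i}", fact) for i, fact in enumerate(facts)
    )
    return docs
```

Because the IDs are derived from the observation ID and the fact index, re-syncing the same observation produces the same IDs.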
+
+## Process Management
+
+- **ProcessRegistry:** Tracks all Claude SDK subprocesses, manages PID lifecycle
+- **Orphan Reaper (5min):** Kills processes with no active session
+- **GracefulShutdown:** 7-step shutdown (PID file, children, HTTP server, sessions, MCP, DB, force-kill)
--- a/docs/production-guide.md
+++ b/docs/production-guide.md
@@ -0,0 +1,111 @@
+# claude-mem Production Guide
+
+Practical guide based on 23 days of production usage with 3,400+ observations across two physical servers and 8 projects.
+
+## Recommended Settings
+
+| Setting | Default | Recommended | Why |
+|---------|---------|-------------|-----|
+| CLAUDE_MEM_MAX_CONCURRENT_AGENTS | 2 | 3 | Better throughput without overload |
+| CLAUDE_MEM_SEMANTIC_INJECT | (new) | true | Relevant context >> recent context |
+| CLAUDE_MEM_SEMANTIC_INJECT_LIMIT | (new) | 5 | Sweet spot for token cost vs coverage |
+| CLAUDE_MEM_TIER_ROUTING_ENABLED | (new) | true | ~52% cost savings, no quality loss |
+
+## Health Monitoring
+
+### Key metrics to watch
+
+| Metric | Healthy | Warning | Action |
+|--------|---------|---------|--------|
+| pending_messages (pending) | 0-5 | >10 | Check worker logs; the worker may need a restart |
+| pending_messages (failed) | 0 | >0 growing | Circuit-breaker may be tripping |
+| sdk_sessions (active) | 0-3 | >5 stuck | Orphaned sessions; restart the worker |
+| WAL size | <10 MB | >20 MB | Run `PRAGMA wal_checkpoint(TRUNCATE)` |
+| Chroma size | Growing slowly | Sudden jump | Check for sync loops |
+| Errors/day in logs | 0-2 | >10 | Investigate log patterns |
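The count-based thresholds in the table can be encoded as a simple classifier. This is a sketch: the metric keys are hypothetical, and the `"watch"` label for values between the Healthy and Warning columns is an assumption, since the table leaves that range undefined.

```python
# (healthy_max, warning_above) taken from the Healthy and Warning columns.
THRESHOLDS = {
    "pending_messages_pending": (5, 10),
    "sdk_sessions_active": (3, 5),
    "errors_per_day": (2, 10),
}


def classify(metric, value):
    """Return 'healthy', 'watch', or 'warning' for a count-based metric."""
    healthy_max, warning_above = THRESHOLDS[metric]
    if value > warning_above:
        return "warning"
    if value <= healthy_max:
        return "healthy"
    # Assumption: the gap between the two columns is worth watching.
    return "watch"
```

The values would typically come from the SQL queries in the quick health check, for example the `pending` and `active_sessions` counts.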
+
+### Quick health check
+
+```bash
+# Check worker status
+curl -s http://127.0.0.1:37777/api/health | python3 -m json.tool
+
+# Check database stats
+sqlite3 ~/.claude-mem/claude-mem.db "
+  SELECT 'observations' as metric, COUNT(*) as value FROM observations
+  UNION ALL SELECT 'summaries', COUNT(*) FROM session_summaries
+  UNION ALL SELECT 'pending', COUNT(*) FROM pending_messages WHERE status='pending'
+  UNION ALL SELECT 'active_sessions', COUNT(*) FROM sdk_sessions WHERE status='active';
+"
+```
+
+## Multi-Machine Setup
+
+If running claude-mem on multiple machines, use `claude-mem-sync` to keep observations in sync:
+
+```bash
+claude-mem-sync push <remote-host>    # local -> remote
+claude-mem-sync pull <remote-host>    # remote -> local
+claude-mem-sync sync <remote-host>    # bidirectional
+claude-mem-sync status <remote-host>  # compare counts
+```
+
+Deduplication is by `(created_at, title)` — safe to run repeatedly.
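Why keying on `(created_at, title)` makes repeated syncs safe can be seen in a short sketch. The `merge_observations` function and the dict-based records are hypothetical; the actual `claude-mem-sync` implementation is not shown in this guide.

```python
def merge_observations(local, remote):
    """Append remote observations whose (created_at, title) key is new locally."""
    seen = {(obs["created_at"], obs["title"]) for obs in local}
    merged = list(local)
    for obs in remote:
        key = (obs["created_at"], obs["title"])
        if key not in seen:
            seen.add(key)
            merged.append(obs)
    return merged
```

Merging the same remote set a second time adds nothing, because every key is already present. That idempotence is what allows `push`, `pull`, or `sync` to be re-run after an interruption.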
+
+## Growth Expectations
+
+Based on active daily development usage:
+
+| Metric | Per day | Per month | Notes |
+|--------|---------|-----------|-------|
+| Observations | ~120 | ~3,600 | Varies with coding activity |
+| Summaries | ~40 | ~1,200 | One per session |
+| SQLite | ~0.8 MB | ~24 MB | ~5 KB per observation |
+| Chroma | ~4 MB | ~120 MB | ~50 KB per observation (embeddings) |
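For capacity planning, the per-day figures in the table can be projected over any horizon. This sketch assumes linear growth at the table's approximate rates, which real usage will only roughly follow.

```python
# Approximate per-day rates from the table above.
PER_DAY = {
    "observations": 120,
    "summaries": 40,
    "sqlite_mb": 0.8,
    "chroma_mb": 4.0,
}


def project(days):
    """Linearly project each metric over the given number of days."""
    return {name: rate * days for name, rate in PER_DAY.items()}


def total_disk_mb(days):
    """Combined SQLite and Chroma footprint after the given number of days."""
    growth = project(days)
    return growth["sqlite_mb"] + growth["chroma_mb"]
```

At these rates, `project(30)` reproduces the "Per month" column, and a full year comes to roughly 1.7 GB of combined SQLite and Chroma storage.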
+
+## Common Issues and Solutions
+
+### Summarize error loop
+
+- **Symptom:** Repeated `[ERROR] Missing last_assistant_message` in logs.
+- **Cause:** A transcript with no assistant messages triggers a summary attempt that fails repeatedly.
+- **Fix:** PR #1566 skips the summary when the transcript is empty.
+
+### Chroma sync failures
+
+- **Symptom:** `[ERROR] Batch add failed... IDs already exist`
+- **Cause:** An MCP timeout during an add leaves partial writes, so the retry fails on the IDs that already exist.
+- **Fix:** PR #1566 falls back to an update (upsert pattern).
+
+### Port conflict on startup
+
+- **Symptom:** `Worker failed to start... Is port 37777 in use?`
+- **Cause:** Two sessions start simultaneously, and the HTTP check is non-atomic (TOCTOU race).
+- **Fix:** PR #1566 uses an atomic socket bind on Unix.
+
+### Orphaned pending messages
+
+- **Symptom:** The `pending_messages` table grows with old entries for completed sessions.
+- **Cause:** SIGTERM kills the generator before the queue is drained.
+- **Fix:** PR #1567 drains the queue after `deleteSession()`.
+
+### Context not relevant to current topic
+
+- **Symptom:** Claude receives observations about CSS when you're asking about authentication.
+- **Cause:** The default recency-based injection selects the most recent observations, not the most relevant ones.
+- **Fix:** PR #1568 adds semantic injection via Chroma on every prompt.
+
+## Log Analysis Tips
+
+```bash
+# Count errors by day (assumes timestamps appear as "[YYYY-MM-DD" on each log line)
+grep -h '\[ERROR\]' ~/.claude-mem/logs/claude-mem-*.log | \
+  grep -oE '\[20[0-9]{2}-[0-9]{2}-[0-9]{2}' | sort | uniq -c
+
+# Find circuit-breaker trips (case-insensitive, so CIRCUIT-BREAKER also matches)
+grep -iE 'circuit|abandoned' ~/.claude-mem/logs/claude-mem-*.log
+
+# Check pending message health
+grep 'CLAIMED\|CONFIRMED\|FAILED\|ABANDONED' ~/.claude-mem/logs/claude-mem-$(date +%Y-%m-%d).log | tail -20
+```