Files
claude-mem/docs/context/SMART_INSTALL_ANALYSIS.md
Alex Newman c3761a2204 Refactor silent debugging to happy path error handling
- Replaced instances of silentDebug with happy_path_error__with_fallback across multiple files to improve error logging and handling.
- Updated the utility function to provide clearer semantics for error handling when expected values are missing.
- Introduced a script to find potential silent failures in the codebase that may need to be addressed with the new error handling approach.
2025-12-09 15:09:44 -05:00

17 KiB
Raw Blame History

Claude-Mem Smart Install & Plugin Hooks - Comprehensive Analysis

Generated: 2025-12-09 Scope: Smart install system, all plugin hooks, cross-platform compatibility, error handling, edge cases


Executive Summary

This report provides a comprehensive analysis of claude-mem's smart install system and plugin hook infrastructure. The analysis focuses on cross-platform compatibility, error handling patterns, artificial blockers, and edge case handling.

Key Findings:

  • Overall architecture is well-designed with clear separation of concerns
  • ⚠️ Multiple cross-platform compatibility issues identified
  • ⚠️ Several silent failure patterns that hinder debugging
  • ⚠️ Artificial blockers that could prevent legitimate use cases
  • ⚠️ Inconsistent timeout values across different components
  • No nested try-catch anti-patterns found

Architecture Overview

Smart Install System Flow

User Invokes Hook
    ↓
ensureWorkerRunning() [worker-utils.ts]
    ↓
isWorkerHealthy() → fetch /health endpoint
    ↓
    ├─ [HEALTHY] → Continue
    └─ [UNHEALTHY] → startWorker()
        ↓
        ├─ [Windows] → PowerShell Start-Process (hidden window)
        └─ [Unix] → PM2 start ecosystem.config.cjs
            ↓
        Wait for health check (15 retries × 1000ms)
            ↓
            ├─ [SUCCESS] → Continue
            └─ [FAILURE] → Throw error with manual recovery instructions

Plugin Hook Lifecycle

  1. SessionStart (context-hook.ts + user-message-hook.ts)

    • context-hook: Fetches context via HTTP/curl
    • user-message-hook: Displays context to user via stderr
  2. UserPromptSubmit (new-hook.ts)

    • Creates/retrieves SDK session
    • Strips privacy tags from prompt
    • Initializes session via HTTP
  3. PostToolUse (save-hook.ts)

    • Filters skipped tools
    • Sends observation to worker via HTTP
  4. Stop (summary-hook.ts)

    • Parses transcript JSONL
    • Extracts last user/assistant messages
    • Requests summary generation via HTTP
  5. SessionEnd (cleanup-hook.ts)

    • Marks session complete
    • Fire-and-forget HTTP request

Cross-Platform Compatibility Issues

🔴 CRITICAL: curl Dependency (context-hook.ts)

Location: src/hooks/context-hook.ts:32

const result = execSync(`curl -s "${url}"`, { encoding: "utf-8", timeout: 5000 });

Issues:

  1. Windows Compatibility: curl is not guaranteed to be available on Windows systems (though included in Windows 10 1803+, it may be missing on older systems or custom installations)
  2. Error Handling: No try-catch around execSync - will throw unhandled exception if curl fails
  3. Redundancy: Uses curl when JavaScript's native fetch is already used everywhere else in the codebase

Impact: High - SessionStart hook will crash if curl is unavailable or returns non-zero exit code

Edge Cases:

  • Corporate proxies blocking curl
  • Systems without curl in PATH
  • curl returning non-zero exit with valid output (warnings, etc.)

Recommendation:

// Replace curl with fetch (already used in user-message-hook.ts)
const response = await fetch(url, { signal: AbortSignal.timeout(5000) });
const result = await response.text();

🟡 MEDIUM: Platform-Specific Process Spawning (worker-utils.ts)

Location: src/shared/worker-utils.ts:55-93

Windows Implementation:

spawnSync('powershell.exe', [
  '-NoProfile',
  '-NonInteractive',
  '-Command',
  `Start-Process -FilePath 'node' -ArgumentList '${workerScript}' -WorkingDirectory '${MARKETPLACE_ROOT}' -WindowStyle Hidden`
])

Issues:

  1. PowerShell Dependency: Assumes PowerShell is available and in PATH
  2. Command Injection Risk: Worker script path inserted directly into command string without escaping
  3. Process Monitoring: Windows approach launches detached process with no PM2 monitoring - harder to debug/restart
  4. Health Check Timeout: Comment says "Windows needs longer timeouts" but timeout is same for all platforms (500ms)

Edge Cases:

  • Windows systems with PowerShell execution policy restrictions
  • Paths containing single quotes or special characters
  • Windows subsystem for Linux (WSL) environments
  • Wine/Proton compatibility layers

Unix Implementation:

const localPm2Base = path.join(MARKETPLACE_ROOT, 'node_modules', '.bin', 'pm2');
const pm2Command = existsSync(localPm2Base) ? localPm2Base : 'pm2';

Issues:

  1. PM2 Dependency: Falls back to global pm2 if local not found, but doesn't verify it exists
  2. Silent Failure: If PM2 not installed globally, spawnSync will fail with cryptic ENOENT error

Recommendation:

  • Add pm2 existence check before spawn
  • Implement consistent process monitoring across platforms
  • Add path escaping for Windows command construction
  • Actually implement longer timeout for Windows if needed

🟡 MEDIUM: Git Dependency (paths.ts)

Location: src/shared/paths.ts:89-97

export function getCurrentProjectName(): string {
  try {
    const gitRoot = execSync('git rev-parse --show-toplevel', {
      cwd: process.cwd(),
      encoding: 'utf8',
      stdio: ['pipe', 'pipe', 'ignore']
    }).trim();
    return basename(gitRoot);
  } catch {
    return basename(process.cwd());
  }
}

Issues:

  1. Git Assumption: Assumes git is installed and available in PATH
  2. Non-Git Projects: Silently falls back to cwd basename, but this behavior is undocumented

Edge Cases:

  • Projects not using git
  • Monorepos where cwd !== git root is desired
  • Systems without git installed

Status: Already handled with fallback, but could benefit from debug logging


Error Handling Analysis

🔴 CRITICAL: Silent Failures Without Logging

1. Settings File Loading (early-settings.ts:20-28)

try {
  if (existsSync(SETTINGS_PATH)) {
    const data = JSON.parse(readFileSync(SETTINGS_PATH, 'utf-8'));
    const fileValue = data.env?.[key];
    if (fileValue !== undefined) return fileValue;
  }
} catch {
  // Fail silently - fall through to env var
}

Problem:

  • Invalid JSON in settings file fails silently
  • File read permission errors fail silently
  • Users have no way to know their settings file is being ignored

Impact: High - Users may think settings are applied when they're actually using defaults

Recommendation:

} catch (error) {
  logger.warn('SETTINGS', 'Failed to load settings file', { path: SETTINGS_PATH }, error);
}

2. Worker Startup Failure (worker-utils.ts:104-107)

try {
  // ... worker startup logic ...
} catch (error) {
  // Failed to start worker
  return false;
}

Problem:

  • Catches ALL errors during worker startup
  • Returns boolean with no information about what failed
  • User only gets generic error after all retries exhausted

Impact: High - Makes debugging worker startup issues extremely difficult

Recommendation:

} catch (error) {
  logger.error('WORKER', 'Failed to start worker', {}, error as Error);
  return false;
}

3. Worker Health Check (worker-utils.ts:30-40)

async function isWorkerHealthy(): Promise<boolean> {
  try {
    const port = getWorkerPort();
    const response = await fetch(`http://127.0.0.1:${port}/health`, {
      signal: AbortSignal.timeout(HEALTH_CHECK_TIMEOUT_MS)
    });
    return response.ok;
  } catch {
    return false;
  }
}

Problem:

  • Network errors, timeouts, and non-200 responses all indistinguishable
  • No logging at all - completely silent

Impact: Medium - Hard to debug why health checks fail

Recommendation:

} catch (error) {
  logger.debug('WORKER', 'Health check failed', { port }, error);
  return false;
}

4. Tool Formatting (logger.ts:122-124)

try {
  const input = typeof toolInput === 'string' ? JSON.parse(toolInput) : toolInput;
  // ...
} catch {
  return toolName;
}

Problem:

  • Invalid JSON in tool input fails silently
  • Could mask data corruption issues

Impact: Low - Only affects log formatting

Status: Acceptable for log formatting, but could log at DEBUG level


🟢 GOOD: No Nested Try-Catch Anti-Patterns

Analysis confirmed zero instances of nested try-catch blocks. Error handling is consistently at single level per function.


Artificial Blockers & Unnecessary Checks

🔴 CRITICAL: First-Run Detection (user-message-hook.ts:14-40)

const nodeModulesPath = join(pluginDir, 'node_modules');

if (!existsSync(nodeModulesPath)) {
  // Show first-time setup message
  console.error(`...`);
  process.exit(3);
}

Problems:

  1. False Positive: Will trigger if user manually deletes node_modules (e.g., for troubleshooting)
  2. Installation Race: Could fail if installation is still in progress
  3. Hook-Level Check: Runs on EVERY SessionStart, not just actual first run

Impact: High - Prevents usage until node_modules exists, even if dependencies are installed elsewhere

Edge Cases:

  • User runs rm -rf node_modules for troubleshooting
  • Package manager installation interrupted
  • Symlinked node_modules (some package managers)

Recommendation:

  • Use a .first-run-complete marker file instead
  • Move check to npm postinstall script
  • Make check more robust (check for specific required modules)

🟡 MEDIUM: Overly Specific Validation (paths.ts:117-119)

if (!existsSync(join(commandsDir, 'save.md'))) {
  throw new Error('Package commands directory missing required files');
}

Problem:

  • Checks for ONE specific file to validate entire directory
  • Hardcoded filename could break if files reorganized
  • Error message doesn't specify what's missing

Impact: Medium - Could prevent package from working after internal refactoring

Recommendation:

  • Remove check entirely (let actual command invocation fail with better error)
  • Or check all required files if validation is critical

🟡 MEDIUM: Duplicate Health Endpoints

Locations:

  • src/services/worker-service.ts:107 - /api/health
  • src/services/worker/http/routes/ViewerRoutes.ts:27 - /health

Usage:

  • worker-utils.ts uses /health
  • mcp-server.ts uses /api/health

Problem:

  • Redundant endpoints doing the same thing
  • Inconsistent usage across codebase
  • Maintenance burden

Impact: Low - Both work, but creates confusion

Recommendation:

  • Standardize on /api/health (follows REST convention)
  • Remove /health endpoint
  • Update worker-utils.ts to use /api/health

Timeout Configuration Issues

Inconsistent Timeouts Across Components

Component Timeout Location Purpose
Health check 500ms worker-utils.ts:13 Check if worker alive
Worker startup wait 1000ms worker-utils.ts:14 Wait between health checks
Worker startup retries 15x worker-utils.ts:15 Max retries (15s total)
Hook HTTP requests 2000ms cleanup-hook.ts:61, save-hook.ts:70, summary-hook.ts:164 Send data to worker
New hook session init 5000ms new-hook.ts:129 Initialize session
Context hook fetch 5000ms context-hook.ts:32 Fetch context via curl
User message hook 5000ms user-message-hook.ts:52 Fetch context display

Problems:

  1. Health Check Too Aggressive: 500ms may be too short for loaded systems or slow network
  2. No Platform Adjustment: Comment says "Windows needs longer timeouts" but values are same
  3. Hook Timeout Variation: Some hooks use 2s, others use 5s with no clear reasoning

Recommendations:

  • Increase health check timeout to 1000ms minimum
  • Actually implement longer timeouts for Windows
  • Standardize hook timeouts to 5000ms across the board
  • Make timeouts configurable via settings

Edge Case Analysis

Handled Well

  1. JSONL Parsing: summary-hook.ts continues on malformed lines (60-64, 117-121)
  2. Git Not Available: paths.ts falls back to cwd basename (89-97)
  3. Settings File Missing: early-settings.ts falls back to env vars and defaults (20-28)
  4. Privacy Tags: new-hook.ts handles fully-private prompts (99-109)
  5. Tool Skipping: save-hook.ts filters low-value tools (24-30)

Missing Edge Case Handling ⚠️

  1. curl Failure: context-hook.ts has no error handling for curl failures
  2. PM2 Not Installed: worker-utils.ts assumes pm2 exists globally
  3. PowerShell Restrictions: worker-utils.ts doesn't check execution policy
  4. Concurrent Worker Starts: No locking to prevent multiple hooks from starting worker simultaneously
  5. Port Already In Use: No detection or recovery if worker port is taken
  6. Zombie Processes: Windows approach doesn't track PIDs, can't detect/kill zombies

Recommendations Summary

High Priority 🔴

  1. Replace curl with fetch in context-hook.ts

    • Eliminates external dependency
    • Consistent with rest of codebase
    • Better error handling
  2. Add logging to silent failures

    • early-settings.ts: Log when settings file fails to load
    • worker-utils.ts: Log startup failures with details
    • worker-utils.ts: Log health check failures at debug level
  3. Fix first-run detection

    • Use marker file instead of node_modules check
    • More reliable and intentional

Medium Priority 🟡

  1. Verify PM2 availability before attempting to use it

    • Check existence before spawn
    • Provide clear error message if missing
  2. Implement platform-specific timeouts

    • Actually use longer timeouts on Windows as comment suggests
    • Make timeouts configurable
  3. Standardize health endpoints

    • Remove duplicate /health endpoint
    • Use /api/health everywhere
  4. Add path escaping for Windows PowerShell commands

    • Prevent injection issues
    • Handle paths with special characters

Low Priority 🟢

  1. Standardize HTTP timeouts across all hooks
  2. Add concurrent startup protection (locking mechanism)
  3. Improve error messages with actionable recovery steps

Testing Recommendations

Cross-Platform Testing Needed

  1. Windows Environments:

    • Windows 10 (various versions)
    • Windows 11
    • Windows Server
    • WSL/WSL2
    • PowerShell execution policies (Restricted, RemoteSigned, Unrestricted)
  2. Unix Environments:

    • macOS (Intel + Apple Silicon)
    • Linux (Ubuntu, Fedora, Arch)
    • FreeBSD
  3. Edge Environments:

    • Docker containers
    • CI/CD environments
    • Systems without git installed
    • Systems without curl (or with restricted curl)
    • Corporate networks with proxies
    • Low-spec systems (slow startup)

Test Scenarios

  1. Cold Start: First run with no existing data
  2. Corrupt Settings: Invalid JSON in settings.json
  3. Missing Dependencies: No PM2, no git, no curl
  4. Port Conflicts: Worker port already in use
  5. Rapid Hook Invocations: Multiple hooks trying to start worker simultaneously
  6. Permission Issues: Read-only filesystem, restricted execution
  7. Network Issues: Localhost blocked, slow network

Code Quality Assessment

Strengths

  • Clean separation of concerns (hooks → worker → database)
  • No nested try-catch anti-patterns
  • Consistent use of modern async/await
  • Good use of TypeScript for type safety
  • Idempotent database operations
  • Clear documentation in critical sections

Weaknesses ⚠️

  • Silent failures hinder debugging
  • Inconsistent error handling patterns
  • Platform-specific code not fully tested/documented
  • Timeout configuration hardcoded and inconsistent
  • Some artificial blockers prevent legitimate use cases

Technical Debt

  • Duplicate health endpoints
  • curl dependency when fetch available
  • PM2 dependency on Unix but not Windows (inconsistent monitoring)
  • First-run detection using node_modules existence
  • Hardcoded timeout values

Conclusion

The claude-mem smart install and plugin hook system is architecturally sound with a well-designed separation of concerns. However, several cross-platform compatibility issues and silent failure patterns could cause problems in production, particularly on Windows systems or in edge case scenarios.

The highest priority improvements are:

  1. Removing the curl dependency
  2. Adding proper logging to silent failures
  3. Fixing the fragile first-run detection
  4. Verifying external dependencies before use

These changes would significantly improve debuggability and cross-platform reliability without requiring major architectural changes.


Analysis Methodology:

  • Systematic review of all TypeScript source files
  • Static analysis of error handling patterns
  • Cross-platform compatibility assessment
  • Edge case identification through code path analysis
  • Comparison against best practices and KISS principles

Files Analyzed:

  • src/hooks/*.ts (6 files)
  • src/services/worker-service.ts
  • src/services/worker/*.ts (10+ files)
  • src/servers/mcp-server.ts
  • src/shared/*.ts (worker-utils, early-settings, paths)
  • src/utils/*.ts (logger, silent-debug, tag-stripping)