fix: handle Unix kill errors in orphaned process cleanup

Replace execAsync kill command with individual process.kill calls wrapped
in try/catch to gracefully handle processes that have already exited.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Alex Newman
2025-12-26 20:51:41 -05:00
parent d6041ae4ec
commit 2a60794485
3 changed files with 287 additions and 52 deletions

View File

@@ -0,0 +1,229 @@
# PR #458 Fix Plan: Worker Lifecycle Critical Issues
**Branch:** `bugfix/spawn-worker`
**Created:** 2025-12-26
**Context:** PR review found 3 critical bugs, 5 important issues in the worker self-spawn implementation
---
## Phase 1: Fix Critical `waitForProcessesExit` Bug
**Goal:** Fix the logic bug that crashes shutdown when child processes exit
**File:** `src/services/worker-service.ts`
**What to do:**
1. Find the `waitForProcessesExit` method (around line 752-771)
2. The `pids.filter()` callback calls `process.kill(pid, 0)` which throws when the process is dead
3. Wrap in try/catch to return false when process doesn't exist
**Current code:**
```typescript
const stillAlive = pids.filter(pid => {
process.kill(pid, 0); // Signal 0 checks if process exists - throws if dead
return true;
});
```
**Fixed code:**
```typescript
const stillAlive = pids.filter(pid => {
try {
process.kill(pid, 0);
return true;
} catch {
return false;
}
});
```
**Verification:** The filter should now correctly remove PIDs of exited processes instead of throwing.
---
## Phase 2: Fix Spawn PID Validation
**Goal:** Prevent invalid PID file writes when spawn fails
**File:** `src/services/worker-service.ts`
**What to do:**
1. Find the `start` case in the CLI switch (around line 816-844)
2. After spawn(), add check for undefined pid before writing PID file
3. Exit with error if spawn failed
**Current code:**
```typescript
const child = spawn(process.execPath, [__filename, '--daemon'], {...});
child.unref();
writePidFile({ pid: child.pid!, port, startedAt: new Date().toISOString() });
```
**Fixed code:**
```typescript
const child = spawn(process.execPath, [__filename, '--daemon'], {...});
if (child.pid === undefined) {
console.error('Failed to spawn worker daemon');
process.exit(1);
}
child.unref();
writePidFile({ pid: child.pid, port, startedAt: new Date().toISOString() });
```
**Verification:** Remove the `!` non-null assertion since we now properly check.
---
## Phase 3: Fix Unix Process Cleanup Error Handling
**Goal:** Handle errors in orphaned process cleanup
**File:** `src/services/worker-service.ts`
**What to do:**
1. Find `cleanupOrphanedProcesses` method (around line 389-458)
2. The Unix branch at line 454 has `await execAsync(\`kill ${pids.join(' ')}\`)` with no error handling
3. Replace with individual process.kill calls wrapped in try/catch
**Current code:**
```typescript
} else {
await execAsync(`kill ${pids.join(' ')}`);
}
```
**Fixed code:**
```typescript
} else {
for (const pid of pids) {
try {
process.kill(pid, 'SIGKILL');
} catch {
// Process already exited - that's fine
}
}
}
```
---
## Phase 4: Fix Windows taskkill Error Handling
**Goal:** Prevent cleanup abort when one process fails to kill
**File:** `src/services/worker-service.ts`
**What to do:**
1. Find the Windows taskkill loop in `cleanupOrphanedProcesses` (around line 444-452)
2. Wrap `execSync` in try/catch so one failure doesn't abort the loop
**Current code:**
```typescript
for (const pid of pids) {
if (!Number.isInteger(pid) || pid <= 0) {
logger.warn('SYSTEM', 'Skipping invalid PID', { pid });
continue;
}
execSync(`taskkill /PID ${pid} /T /F`, { timeout: 60000, stdio: 'ignore' });
}
```
**Fixed code:**
```typescript
for (const pid of pids) {
if (!Number.isInteger(pid) || pid <= 0) {
logger.warn('SYSTEM', 'Skipping invalid PID', { pid });
continue;
}
try {
execSync(`taskkill /PID ${pid} /T /F`, { timeout: 60000, stdio: 'ignore' });
} catch {
// Process may have already exited - continue cleanup
}
}
```
---
## Phase 5: Use Readiness Endpoint for Health Checks
**Goal:** Ensure `waitForHealth` checks full initialization, not just HTTP server
**File:** `src/services/worker-service.ts`
**What to do:**
1. Find `waitForHealth` function (around line 61-68)
2. Change endpoint from `/api/health` to `/api/readiness`
3. `/api/readiness` returns 503 until background init completes, `/api/health` returns 200 immediately
**Current code:**
```typescript
async function waitForHealth(port: number, timeoutMs: number = 30000): Promise<boolean> {
const start = Date.now();
while (Date.now() - start < timeoutMs) {
if (await isPortInUse(port)) return true;
await new Promise(r => setTimeout(r, 500));
}
return false;
}
```
**Fixed code:**
```typescript
async function waitForHealth(port: number, timeoutMs: number = 30000): Promise<boolean> {
const start = Date.now();
while (Date.now() - start < timeoutMs) {
try {
const response = await fetch(`http://127.0.0.1:${port}/api/readiness`, {
signal: AbortSignal.timeout(2000)
});
if (response.ok) return true;
} catch {
// Not ready yet
}
await new Promise(r => setTimeout(r, 500));
}
return false;
}
```
---
## Phase 6: Build, Test, and Commit
**Goal:** Verify fixes work and commit changes
**Commands:**
```bash
npm run build-and-sync
```
**Commit message:**
```
fix: address critical error handling issues in worker lifecycle
- Fix waitForProcessesExit crash when child processes exit
- Add spawn pid validation before writing PID file
- Handle Unix kill errors in orphaned process cleanup
- Handle Windows taskkill errors in cleanup loop
- Use /api/readiness for health checks instead of /api/health
Fixes issues found in PR #458 review.
```
---
## Summary
| Phase | Priority | Description |
|-------|----------|-------------|
| 1 | CRITICAL | Fix `waitForProcessesExit` filter bug |
| 2 | CRITICAL | Validate `child.pid` after spawn |
| 3 | CRITICAL | Fix Unix kill error handling |
| 4 | HIGH | Fix Windows taskkill error handling |
| 5 | HIGH | Use readiness endpoint for health |
| 6 | - | Build and commit |
**Estimated changes:** ~30 lines modified in `worker-service.ts`

File diff suppressed because one or more lines are too long

View File

@@ -451,7 +451,13 @@ export class WorkerService {
execSync(`taskkill /PID ${pid} /T /F`, { timeout: 60000, stdio: 'ignore' });
}
} else {
await execAsync(`kill ${pids.join(' ')}`);
for (const pid of pids) {
try {
process.kill(pid, 'SIGKILL');
} catch {
// Process already exited - that's fine
}
}
}
logger.info('SYSTEM', 'Orphaned processes cleaned up', { count: pids.length });