docs: add browser process cleanup plan

2026-04-25 17:25:15 +02:00 · 2026-04-08 17:48:44 -05:00
parent 0d270655ab
commit b1e9215375
1 changed files with 238 additions and 0 deletions
--- a/doc/plans/2026-04-08-agent-browser-process-cleanup-plan.md
+++ b/doc/plans/2026-04-08-agent-browser-process-cleanup-plan.md
@@ -0,0 +1,238 @@
+# PAP-1231 Agent Browser Process Cleanup Plan
+
+Status: Proposed
+Date: 2026-04-08
+Related issue: `PAP-1231`
+Audience: Engineering
+
+## Goal
+
+Explain why browser processes accumulate during local agent runs and define a cleanup plan that fixes the general process-ownership problem rather than treating `agent-browser` as a one-off.
+
+## Short answer
+
+Yes, there is a likely root cause in Paperclip's local execution model.
+
+Today, heartbeat-run local adapters persist and manage only the top-level spawned PID. Their timeout/cancel path uses direct `child.kill()` semantics. That is weaker than the runtime-service path, which already tracks and terminates whole process groups.
+
+If Codex, Claude, Cursor, or a skill launched through them starts Chrome or Chromium helpers, Paperclip can lose ownership of those descendants even when it still believes it handled the run correctly.
+
+## Observed implementation facts
+
+### 1. Heartbeat-run local adapters track only one PID
+
+`packages/adapter-utils/src/server-utils.ts`
+
+- `runChildProcess()` spawns the adapter command and records only `child.pid`
+- timeout handling sends `SIGTERM` and then `SIGKILL` to the direct child
+- there is no process-group creation or process-group kill path there today
+
+`packages/db/src/schema/heartbeat_runs.ts`
+
+- `heartbeat_runs` stores `process_pid`
+- there is no persisted `process_group_id`
+
+`server/src/services/heartbeat.ts`
+
+- cancellation logic uses the in-memory child handle and calls `child.kill()`
+- orphaned-run recovery checks whether the recorded direct PID is alive
+- the recovery model is built around one tracked process, not a descendant tree
+
+### 2. Workspace runtime already uses stronger ownership
+
+`server/src/services/workspace-runtime.ts`
+
+- runtime services are spawned with `detached: process.platform !== "win32"`
+- the service record stores `processGroupId`
+- shutdown calls `terminateLocalService()` with group-aware killing
+
+`server/src/services/local-service-supervisor.ts`
+
+- `terminateLocalService()` prefers `process.kill(-processGroupId, signal)` on POSIX
+- it escalates from `SIGTERM` to `SIGKILL`
+
+This is the clearest internal comparison point: Paperclip already has one local-process subsystem that treats process-group ownership as the right abstraction.
+
+### 3. The current recovery path explains why leaks would be visible but hard to reason about
+
+If the direct adapter process exits, hangs, or is cancelled after launching a browser subtree:
+
+- Paperclip may think it cancelled the run because the parent process is gone
+- descendant Chrome helpers may still be running
+- orphan recovery has no persisted process-group identity to reconcile or reap later
+
+That makes the failure look like an `agent-browser` problem when the more general bug is "executor descendants are not owned strongly enough."
+
+## Why `agent-browser` makes the problem obvious
+
+Inference:
+
+- Chromium is intentionally multi-process
+- browser automation often leaves a browser process plus renderer, GPU, utility, and crashpad/helper children
+- skills that open browsers repeatedly amplify the symptom because each run can produce several descendant processes
+
+So `agent-browser` is probably not the root cause. It is the workload that exposes the weak ownership model fastest.
+
+## Success condition
+
+This work is successful when Paperclip can:
+
+1. start a local adapter run and own the full descendant tree it created
+2. cancel, timeout, or recover that run without leaving Chrome descendants behind on POSIX
+3. detect and clean up stale local descendants after server restarts
+4. expose enough metadata that operators can see which run owns which spawned process tree
+
+## Non-goals
+
+Do not:
+
+- special-case `agent-browser` only
+- depend on manual `pkill chrome` cleanup as the primary fix
+- require every skill author to add bespoke browser teardown logic before Paperclip can clean up correctly
+- change remote/http adapter behavior as part of the first pass
+
+## Proposed plan
+
+### Phase 0: reproduce and instrument
+
+Objective:
+
+- make the leak measurable from Paperclip's side before changing execution semantics
+
+Work:
+
+- add a reproducible local test script or fixture that launches a child process which itself launches descendants and ignores normal parent exit
+- capture parent PID, descendant PIDs, and run ID in logs during local adapter execution
+- document current behavior separately for:
+  - normal completion
+  - timeout
+  - explicit cancellation
+  - server restart during run
+
+Deliverable:
+
+- one short repro note attached to the implementation issue or child issue
+
+### Phase 1: give heartbeat-run local adapters process-group ownership
+
+Objective:
+
+- align adapter-run execution with the stronger runtime-service model
+
+Work:
+
+- update `runChildProcess()` to create a dedicated process group on POSIX
+- persist both:
+  - direct PID
+  - process-group ID
+- update the run cancellation and timeout paths to kill the group first, then escalate
+- keep direct-PID fallback behavior for platforms where group kill is not available
+
+Likely touched surfaces:
+
+- `packages/adapter-utils/src/server-utils.ts`
+- `packages/db/src/schema/heartbeat_runs.ts`
+- `packages/shared/src/types/heartbeat.ts`
+- `server/src/services/heartbeat.ts`
+
+Important design choice:
+
+- use the same ownership model for all local child-process adapters, not just Codex or Claude
+
+### Phase 2: make restart recovery group-aware
+
+Objective:
+
+- prevent stale descendants from surviving server crashes or restarts indefinitely
+
+Work:
+
+- teach orphan reconciliation to inspect the persisted process-group ID, not only the direct PID
+- if the direct parent is gone but the group still exists, mark the run as detached-orphaned with clearer metadata
+- decide whether restart recovery should:
+  - adopt the still-running group, or
+  - terminate it as unrecoverable
+
+Recommendation:
+
+- for heartbeat runs, prefer terminating unrecoverable orphan groups rather than adopting them unless we can prove the adapter session remains safe and observable
+
+Reason:
+
+- runtime services are long-lived and adoptable
+- heartbeat runs are task executions with stricter audit and cancellation semantics
+
+### Phase 3: add operator-visible cleanup tools
+
+Objective:
+
+- make the system diagnosable when ownership still fails
+
+Work:
+
+- surface the tracked process metadata in run details or debug endpoints
+- add a control-plane cleanup action or CLI utility for stale local run processes owned by Paperclip
+- scope cleanup by run/agent/company instead of broad browser-name matching
+
+This should replace ad hoc scripts as the general-purpose escape hatch.
+
+### Phase 4: cover platform and regression cases
+
+Objective:
+
+- keep the fix from regressing and define platform behavior explicitly
+
+Tests to add:
+
+- unit tests around process-group-aware cancellation in adapter execution utilities
+- heartbeat recovery tests for:
+  - surviving descendant tree after parent loss
+  - timeout cleanup
+  - cancellation cleanup
+- platform-conditional behavior notes for Windows, where negative-PID group kill does not apply
+
+## Recommended first implementation slice
+
+The first shipping slice should be narrow:
+
+1. introduce process-group ownership for local heartbeat-run adapters on POSIX
+2. persist group metadata on `heartbeat_runs`
+3. switch timeout/cancel paths from direct-child kill to group kill
+4. add one regression test that proves descendants die with the parent run
+
+That should address the main Chrome accumulation path without taking on the full restart-recovery design in the same patch.
+
+## Risks
+
+### 1. Over-killing unrelated processes
+
+If process-group boundaries are created incorrectly, cleanup could terminate more than the run owns.
+
+Mitigation:
+
+- create a fresh process group only for the spawned adapter command
+- persist and target that exact group
+
+### 2. Cross-platform differences
+
+Windows does not support the POSIX negative-PID kill pattern used elsewhere in the repo.
+
+Mitigation:
+
+- ship POSIX-first
+- keep direct-child fallback on Windows
+- document Windows as partial until job-object or equivalent handling is designed
+
+### 3. Session recovery complexity
+
+Adopting a still-running orphaned group may look attractive but can break observability if stdout/stderr pipes are already gone.
+
+Mitigation:
+
+- default to deterministic cleanup for heartbeat runs unless adoption is explicitly proven safe
+
+## Recommendation
+
+Treat this as a Paperclip executor ownership bug, not an `agent-browser` bug.
+
+`agent-browser` should remain a useful repro case, but the implementation should be shared across all local child-process adapters so any descendant process tree spawned by Codex, Claude, Cursor, Gemini, Pi, or OpenCode is owned and cleaned up consistently.