PAP-1231 Agent Browser Process Cleanup Plan
Status: Proposed
Date: 2026-04-08
Related issue: PAP-1231
Audience: Engineering
Goal
Explain why browser processes accumulate during local agent runs and define a cleanup plan that fixes the general process-ownership problem rather than treating agent-browser as a one-off.
Short answer
Yes, there is a likely root cause in Paperclip's local execution model.
Today, heartbeat-run local adapters persist and manage only the top-level spawned PID. Their timeout/cancel path uses direct child.kill() semantics. That is weaker than the runtime-service path, which already tracks and terminates whole process groups.
If Codex, Claude, Cursor, or a skill launched through them starts Chrome or Chromium helpers, Paperclip can lose ownership of those descendants even when it still believes it handled the run correctly.
Observed implementation facts
1. Heartbeat-run local adapters track only one PID
packages/adapter-utils/src/server-utils.ts
- runChildProcess() spawns the adapter command and records only child.pid
- timeout handling sends SIGTERM and then SIGKILL to the direct child
- there is no process-group creation or process-group kill path there today
packages/db/src/schema/heartbeat_runs.ts
- heartbeat_runs stores process_pid
- there is no persisted process_group_id
server/src/services/heartbeat.ts
- cancellation logic uses the in-memory child handle and calls child.kill()
- orphaned-run recovery checks whether the recorded direct PID is alive
- the recovery model is built around one tracked process, not a descendant tree
2. Workspace runtime already uses stronger ownership
server/src/services/workspace-runtime.ts
- runtime services are spawned with detached: process.platform !== "win32"
- the service record stores processGroupId
- shutdown calls terminateLocalService() with group-aware killing
server/src/services/local-service-supervisor.ts
- terminateLocalService() prefers process.kill(-processGroupId, signal) on POSIX
- it escalates from SIGTERM to SIGKILL
This is the clearest internal comparison point: Paperclip already has one local-process subsystem that treats process-group ownership as the right abstraction.
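The difference between the two subsystems comes down to which target receives the signal. The following is a minimal sketch of that choice, not the code in local-service-supervisor.ts: the TrackedProcess shape and the names resolveKillTarget and signalTrackedProcess are hypothetical, and the kill function is injectable so the target selection can be checked without signalling real processes.

```typescript
// Hypothetical record shape for illustration; not the repo's actual types.
interface TrackedProcess {
  pid: number;
  processGroupId: number | null;
}

type KillFn = (target: number, signal: string) => void;

// On POSIX, a negative target asks the kernel to signal every process in the
// group. Without a recorded group ID (or on Windows) only the direct child
// can be signalled, which is what the heartbeat-run path does today.
function resolveKillTarget(proc: TrackedProcess, platform: string): number {
  if (platform !== "win32" && proc.processGroupId !== null) {
    return -proc.processGroupId;
  }
  return proc.pid;
}

// Returns false when the target no longer exists (ESRCH), true otherwise.
function signalTrackedProcess(
  proc: TrackedProcess,
  signal: string,
  platform: string = process.platform,
  kill: KillFn = (target, sig) => {
    process.kill(target, sig);
  },
): boolean {
  try {
    kill(resolveKillTarget(proc, platform), signal);
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "ESRCH") return false;
    throw err;
  }
}
```

A caller would typically send SIGTERM first and follow up with SIGKILL only if the target is still present after a grace period.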
3. The current recovery path explains why leaks would be visible but hard to reason about
If the direct adapter process exits, hangs, or is cancelled after launching a browser subtree:
- Paperclip may think it cancelled the run because the parent process is gone
- descendant Chrome helpers may still be running
- orphan recovery has no persisted process-group identity to reconcile or reap later
That makes the failure look like an agent-browser problem when the more general bug is "executor descendants are not owned strongly enough."
Why agent-browser makes the problem obvious
Inference:
- Chromium is intentionally multi-process
- browser automation often leaves a browser process plus renderer, GPU, utility, and crashpad/helper children
- skills that open browsers repeatedly amplify the symptom because each run can produce several descendant processes
So agent-browser is probably not the root cause. It is the workload that exposes the weak ownership model fastest.
Success condition
This work is successful when Paperclip can:
- start a local adapter run and own the full descendant tree it created
- cancel, timeout, or recover that run without leaving Chrome descendants behind on POSIX
- detect and clean up stale local descendants after server restarts
- expose enough metadata that operators can see which run owns which spawned process tree
Non-goals
Do not:
- special-case agent-browser only
- depend on manual pkill chrome cleanup as the primary fix
- require every skill author to add bespoke browser teardown logic before Paperclip can clean up correctly
- change remote/http adapter behavior as part of the first pass
Proposed plan
Phase 0: reproduce and instrument
Objective:
- make the leak measurable from Paperclip's side before changing execution semantics
Work:
- add a reproducible local test script or fixture that launches a child process which in turn launches descendants that keep running after the parent exits
- capture parent PID, descendant PIDs, and run ID in logs during local adapter execution
- document current behavior separately for:
  - normal completion
  - timeout
  - explicit cancellation
  - server restart during run
Deliverable:
- one short repro note attached to the implementation issue or child issue
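One possible shape for the Phase 0 fixture, covering only the normal-completion case, is sketched below. It is an assumption-laden illustration rather than a proposed test file: PARENT_SCRIPT, isAlive, LeakObservation, and observeNormalCompletion are hypothetical names, the stand-in "adapter" is a plain Node script, and the behaviour described is POSIX behaviour.

```typescript
import { spawnSync } from "node:child_process";

// Stand-in for an adapter command (hypothetical fixture). It starts a
// long-lived descendant, prints the descendant PID, and exits right away.
const PARENT_SCRIPT = `
const { spawn } = require("node:child_process");
const descendant = spawn(process.execPath, ["-e", "setInterval(() => {}, 1000)"], {
  stdio: "ignore",
});
descendant.unref(); // let the parent exit without waiting for the descendant
console.log(descendant.pid);
`;

// Signal 0 performs an existence check without delivering a signal.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

interface LeakObservation {
  runId: string;
  parentPid: number;
  descendantPid: number;
  descendantAliveAfterParentExit: boolean;
}

// Runs the fixture to normal completion and records what survived it.
function observeNormalCompletion(runId: string): LeakObservation {
  const result = spawnSync(process.execPath, ["-e", PARENT_SCRIPT], { encoding: "utf8" });
  const descendantPid = Number(result.stdout.trim());
  return {
    runId,
    parentPid: result.pid,
    descendantPid,
    descendantAliveAfterParentExit: isAlive(descendantPid),
  };
}
```

Logging the returned object as one JSON line per run would give the repro note the parent PID, descendant PID, and run ID in a single record. The caller is responsible for killing the surviving descendant afterwards, for example with process.kill(observation.descendantPid, "SIGKILL").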
Phase 1: give heartbeat-run local adapters process-group ownership
Objective:
- align adapter-run execution with the stronger runtime-service model
Work:
- update runChildProcess() to create a dedicated process group on POSIX
- persist both:
  - direct PID
  - process-group ID
- update the run cancellation and timeout paths to kill the group first, then escalate
- keep direct-PID fallback behavior for platforms where group kill is not available
Likely touched surfaces:
- packages/adapter-utils/src/server-utils.ts
- packages/db/src/schema/heartbeat_runs.ts
- packages/shared/src/types/heartbeat.ts
- server/src/services/heartbeat.ts
Important design choice:
- use the same ownership model for all local child-process adapters, not just Codex or Claude
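The Phase 1 work items could look roughly like the sketch below. It does not reflect the current runChildProcess() signature: OwnedProcess, spawnOwned, signalOwned, and terminateOwned are hypothetical names, and the 5-second grace period is an arbitrary placeholder. It relies on documented Node.js behaviour that detached: true makes the child a process-group leader on POSIX, so the group ID equals the child PID.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Hypothetical shape; mirrors the "persist both" item in the work list.
interface OwnedProcess {
  child: ChildProcess;
  pid: number;
  processGroupId: number | null; // null where group kill is unavailable
}

function spawnOwned(command: string, args: string[]): OwnedProcess {
  const isPosix = process.platform !== "win32";
  // stdio is ignored to keep the sketch short; a real adapter would pipe
  // stdout and stderr.
  const child = spawn(command, args, { detached: isPosix, stdio: "ignore" });
  child.on("error", () => {
    // A failed spawn emits "error" asynchronously; the missing-pid check
    // below reports the failure synchronously.
  });
  if (child.pid === undefined) {
    throw new Error(`failed to spawn ${command}`);
  }
  return { child, pid: child.pid, processGroupId: isPosix ? child.pid : null };
}

// Signals the whole group when one was recorded, else the direct child.
// Returns false if the target no longer exists.
function signalOwned(owned: OwnedProcess, signal: "SIGTERM" | "SIGKILL"): boolean {
  const target = owned.processGroupId !== null ? -owned.processGroupId : owned.pid;
  try {
    process.kill(target, signal);
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "ESRCH") return false;
    throw err;
  }
}

// Group first, then escalate: SIGTERM, wait, then SIGKILL for anything left.
async function terminateOwned(owned: OwnedProcess, graceMs = 5000): Promise<void> {
  if (!signalOwned(owned, "SIGTERM")) return;
  await new Promise<void>((resolve) => setTimeout(resolve, graceMs));
  signalOwned(owned, "SIGKILL");
}
```

The pid and processGroupId fields are the two values the plan proposes persisting. One caveat this sketch does not address is PID or group-ID reuse during the grace period, which a real implementation may want to guard against.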
Phase 2: make restart recovery group-aware
Objective:
- prevent stale descendants from surviving server crashes or restarts indefinitely
Work:
- teach orphan reconciliation to inspect the persisted process-group ID, not only the direct PID
- if the direct parent is gone but the group still exists, mark the run as detached-orphaned with clearer metadata
- decide whether restart recovery should:
  - adopt the still-running group, or
  - terminate it as unrecoverable
Recommendation:
- for heartbeat runs, prefer terminating unrecoverable orphan groups rather than adopting them unless we can prove the adapter session remains safe and observable
Reason:
- runtime services are long-lived and adoptable
- heartbeat runs are task executions with stricter audit and cancellation semantics
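The reconciliation check described in this phase can be sketched as a small classification step. Everything here is illustrative: PersistedRunProcess, OrphanState, classifyPersistedRun, and the state names other than detached-orphaned are hypothetical, and processGroupId stands in for a column that does not exist yet. The probe uses signal 0, which on POSIX checks for existence without delivering a signal and accepts a negative target for a process group.

```typescript
// Hypothetical persisted shape; process_group_id is not in the schema today.
interface PersistedRunProcess {
  runId: string;
  processPid: number;
  processGroupId: number | null;
}

type OrphanState = "parent-alive" | "detached-orphaned" | "gone";
type Probe = (target: number) => boolean;

// Signal 0 checks existence only. A negative target checks whether any
// process remains in that group (POSIX only).
const probeWithSignalZero: Probe = (target) => {
  try {
    process.kill(target, 0);
    return true;
  } catch (err) {
    // EPERM means something exists there but is owned by another user.
    return (err as { code?: string }).code === "EPERM";
  }
};

function classifyPersistedRun(
  run: PersistedRunProcess,
  probe: Probe = probeWithSignalZero,
): OrphanState {
  if (probe(run.processPid)) return "parent-alive";
  if (run.processGroupId !== null && probe(-run.processGroupId)) {
    return "detached-orphaned"; // parent gone, descendants still running
  }
  return "gone";
}
```

Under the recommendation above, a detached-orphaned result for a heartbeat run would lead to terminating the group rather than adopting it. Because PIDs and group IDs can be reused after a restart, a production version would probably need an extra identity check (for example a recorded start time) before signalling anything.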
Phase 3: add operator-visible cleanup tools
Objective:
- make the system diagnosable when ownership still fails
Work:
- surface the tracked process metadata in run details or debug endpoints
- add a control-plane cleanup action or CLI utility for stale local run processes owned by Paperclip
- scope cleanup by run/agent/company instead of broad browser-name matching
This should replace ad hoc scripts as the general-purpose escape hatch.
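Scoping cleanup by ownership rather than by process name could be as simple as filtering tracked records before signalling anything. The sketch below is a hypothetical illustration: TrackedRunProcess, CleanupScope, and selectCleanupTargets are invented names, and the record fields are assumptions about what run metadata would be available.

```typescript
// Hypothetical tracked-process record; field names are assumptions.
interface TrackedRunProcess {
  runId: string;
  agentId: string;
  companyId: string;
  processPid: number;
  processGroupId: number | null;
}

interface CleanupScope {
  runId?: string;
  agentId?: string;
  companyId?: string;
}

// Returns only the records owned by the requested run, agent, or company.
// An empty scope is rejected so the tool can never mean "kill everything".
function selectCleanupTargets(
  records: TrackedRunProcess[],
  scope: CleanupScope,
): TrackedRunProcess[] {
  const hasScope =
    scope.runId !== undefined ||
    scope.agentId !== undefined ||
    scope.companyId !== undefined;
  if (!hasScope) {
    throw new Error("cleanup requires a runId, agentId, or companyId scope");
  }
  return records.filter(
    (record) =>
      (scope.runId === undefined || record.runId === scope.runId) &&
      (scope.agentId === undefined || record.agentId === scope.agentId) &&
      (scope.companyId === undefined || record.companyId === scope.companyId),
  );
}
```

Because selection works from Paperclip's own records, a Chrome instance the operator started by hand is never a candidate, which is the main safety advantage over name-based matching.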
Phase 4: cover platform and regression cases
Objective:
- keep the fix from regressing and define platform behavior explicitly
Tests to add:
- unit tests around process-group-aware cancellation in adapter execution utilities
- heartbeat recovery tests for:
  - surviving descendant tree after parent loss
  - timeout cleanup
  - cancellation cleanup
- platform-conditional behavior notes for Windows, where negative-PID group kill does not apply
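A regression check for the cancellation case might be structured like the following sketch. It is framework-agnostic and hypothetical throughout: ADAPTER_SCRIPT, isRunning, and descendantDiesWithGroup are invented names, the stand-in adapter is a Node script that hangs on purpose, and the 3-second timeout is arbitrary. The /proc lookup is a Linux-specific assumption used to treat a killed but not yet reaped process as dead.

```typescript
import { spawn } from "node:child_process";
import { once } from "node:events";
import { readFileSync } from "node:fs";

// Stand-in adapter (hypothetical fixture): starts a descendant in the same
// process group, reports its PID, then hangs like a stuck adapter would.
const ADAPTER_SCRIPT = `
const { spawn } = require("node:child_process");
const descendant = spawn(process.execPath, ["-e", "setInterval(() => {}, 1000)"], {
  stdio: "ignore",
});
console.log(descendant.pid);
setInterval(() => {}, 1000);
`;

function isRunning(pid: number): boolean {
  try {
    process.kill(pid, 0); // existence check only
  } catch (err) {
    return (err as { code?: string }).code === "EPERM";
  }
  try {
    // On Linux, a killed but not yet reaped process shows state "Z" (zombie).
    const stat = readFileSync(`/proc/${pid}/stat`, "utf8");
    return stat.charAt(stat.lastIndexOf(")") + 2) !== "Z";
  } catch (err) {
    return true; // no /proc entry readable: trust the signal-0 result
  }
}

type Outcome = "passed" | "failed" | "skipped";

async function descendantDiesWithGroup(timeoutMs = 3000): Promise<Outcome> {
  if (process.platform === "win32") return "skipped"; // negative-PID kill is POSIX-only
  const adapter = spawn(process.execPath, ["-e", ADAPTER_SCRIPT], {
    detached: true, // new process group led by the adapter
    stdio: ["ignore", "pipe", "ignore"],
  });
  const [chunk] = await once(adapter.stdout!, "data");
  const descendantPid = Number(String(chunk).trim());
  process.kill(-adapter.pid!, "SIGKILL"); // signal the whole group
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (!isRunning(descendantPid)) return "passed";
    await new Promise<void>((resolve) => setTimeout(resolve, 50));
  }
  return "failed";
}
```

Replacing the group kill with adapter.kill("SIGKILL") should turn the same check into a failing test, which is a useful way to confirm the fixture actually reproduces the leak. The explicit "skipped" outcome keeps the Windows gap visible instead of reporting a misleading pass.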
Recommended first implementation slice
The first shipping slice should be narrow:
- introduce process-group ownership for local heartbeat-run adapters on POSIX
- persist group metadata on heartbeat_runs
- switch timeout/cancel paths from direct-child kill to group kill
- add one regression test that proves descendants die with the parent run
That should address the main Chrome accumulation path without taking on the full restart-recovery design in the same patch.
Risks
1. Over-killing unrelated processes
If process-group boundaries are created incorrectly, cleanup could terminate more than the run owns.
Mitigation:
- create a fresh process group only for the spawned adapter command
- persist and target that exact group
2. Cross-platform differences
Windows does not support the POSIX negative-PID kill pattern used elsewhere in the repo.
Mitigation:
- ship POSIX-first
- keep direct-child fallback on Windows
- document Windows as partial until job-object or equivalent handling is designed
3. Session recovery complexity
Adopting a still-running orphaned group may look attractive but can break observability if stdout/stderr pipes are already gone.
Mitigation:
- default to deterministic cleanup for heartbeat runs unless adoption is explicitly proven safe
Recommendation
Treat this as a Paperclip executor ownership bug, not an agent-browser bug.
agent-browser should remain a useful repro case, but the implementation should be shared across all local child-process adapters so any descendant process tree spawned by Codex, Claude, Cursor, Gemini, Pi, or OpenCode is owned and cleaned up consistently.