eliott/paperclip

Fork 0

mirror of https://github.com/paperclipai/paperclip synced 2026-04-25 17:25:15 +02:00

Files

dotta b1e9215375 docs: add browser process cleanup plan

2026-04-09 06:14:12 -05:00

8.5 KiB

Raw Permalink Blame History

PAP-1231 Agent Browser Process Cleanup Plan

Status: Proposed Date: 2026-04-08 Related issue: PAP-1231 Audience: Engineering

Goal

Explain why browser processes accumulate during local agent runs and define a cleanup plan that fixes the general process-ownership problem rather than treating agent-browser as a one-off.

Short answer

Yes, there is a likely root cause in Paperclip's local execution model.

Today, heartbeat-run local adapters persist and manage only the top-level spawned PID. Their timeout/cancel path uses direct child.kill() semantics. That is weaker than the runtime-service path, which already tracks and terminates whole process groups.

If Codex, Claude, Cursor, or a skill launched through them starts Chrome or Chromium helpers, Paperclip can lose ownership of those descendants even when it still believes it handled the run correctly.

Observed implementation facts

1. Heartbeat-run local adapters track only one PID

packages/adapter-utils/src/server-utils.ts

runChildProcess() spawns the adapter command and records only child.pid
timeout handling sends SIGTERM and then SIGKILL to the direct child
there is no process-group creation or process-group kill path there today

packages/db/src/schema/heartbeat_runs.ts

heartbeat_runs stores process_pid
there is no persisted process_group_id

server/src/services/heartbeat.ts

cancellation logic uses the in-memory child handle and calls child.kill()
orphaned-run recovery checks whether the recorded direct PID is alive
the recovery model is built around one tracked process, not a descendant tree

2. Workspace runtime already uses stronger ownership

server/src/services/workspace-runtime.ts

runtime services are spawned with detached: process.platform !== "win32"
the service record stores processGroupId
shutdown calls terminateLocalService() with group-aware killing

server/src/services/local-service-supervisor.ts

terminateLocalService() prefers process.kill(-processGroupId, signal) on POSIX
it escalates from SIGTERM to SIGKILL

This is the clearest internal comparison point: Paperclip already has one local-process subsystem that treats process-group ownership as the right abstraction.

3. The current recovery path explains why leaks would be visible but hard to reason about

If the direct adapter process exits, hangs, or is cancelled after launching a browser subtree:

Paperclip may think it cancelled the run because the parent process is gone
descendant Chrome helpers may still be running
orphan recovery has no persisted process-group identity to reconcile or reap later

That makes the failure look like an agent-browser problem when the more general bug is "executor descendants are not owned strongly enough."

Why `agent-browser` makes the problem obvious

Inference:

Chromium is intentionally multi-process
browser automation often leaves a browser process plus renderer, GPU, utility, and crashpad/helper children
skills that open browsers repeatedly amplify the symptom because each run can produce several descendant processes

So agent-browser is probably not the root cause. It is the workload that exposes the weak ownership model fastest.

Success condition

This work is successful when Paperclip can:

start a local adapter run and own the full descendant tree it created
cancel, timeout, or recover that run without leaving Chrome descendants behind on POSIX
detect and clean up stale local descendants after server restarts
expose enough metadata that operators can see which run owns which spawned process tree

Non-goals

Do not:

special-case agent-browser only
depend on manual pkill chrome cleanup as the primary fix
require every skill author to add bespoke browser teardown logic before Paperclip can clean up correctly
change remote/http adapter behavior as part of the first pass

Proposed plan

Phase 0: reproduce and instrument

Objective:

make the leak measurable from Paperclip's side before changing execution semantics

Work:

add a reproducible local test script or fixture that launches a child process which itself launches descendants and ignores normal parent exit
capture parent PID, descendant PIDs, and run ID in logs during local adapter execution
document current behavior separately for:
- normal completion
- timeout
- explicit cancellation
- server restart during run

Deliverable:

one short repro note attached to the implementation issue or child issue

Phase 1: give heartbeat-run local adapters process-group ownership

Objective:

align adapter-run execution with the stronger runtime-service model

Work:

update runChildProcess() to create a dedicated process group on POSIX
persist both:
- direct PID
- process-group ID
update the run cancellation and timeout paths to kill the group first, then escalate
keep direct-PID fallback behavior for platforms where group kill is not available

Likely touched surfaces:

packages/adapter-utils/src/server-utils.ts
packages/db/src/schema/heartbeat_runs.ts
packages/shared/src/types/heartbeat.ts
server/src/services/heartbeat.ts

Important design choice:

use the same ownership model for all local child-process adapters, not just Codex or Claude

Phase 2: make restart recovery group-aware

Objective:

prevent stale descendants from surviving server crashes or restarts indefinitely

Work:

teach orphan reconciliation to inspect the persisted process-group ID, not only the direct PID
if the direct parent is gone but the group still exists, mark the run as detached-orphaned with clearer metadata
decide whether restart recovery should:
- adopt the still-running group, or
- terminate it as unrecoverable

Recommendation:

for heartbeat runs, prefer terminating unrecoverable orphan groups rather than adopting them unless we can prove the adapter session remains safe and observable

Reason:

runtime services are long-lived and adoptable
heartbeat runs are task executions with stricter audit and cancellation semantics

Phase 3: add operator-visible cleanup tools

Objective:

make the system diagnosable when ownership still fails

Work:

surface the tracked process metadata in run details or debug endpoints
add a control-plane cleanup action or CLI utility for stale local run processes owned by Paperclip
scope cleanup by run/agent/company instead of broad browser-name matching

This should replace ad hoc scripts as the general-purpose escape hatch.

Phase 4: cover platform and regression cases

Objective:

keep the fix from regressing and define platform behavior explicitly

Tests to add:

unit tests around process-group-aware cancellation in adapter execution utilities
heartbeat recovery tests for:
- surviving descendant tree after parent loss
- timeout cleanup
- cancellation cleanup
platform-conditional behavior notes for Windows, where negative-PID group kill does not apply

Recommended first implementation slice

The first shipping slice should be narrow:

introduce process-group ownership for local heartbeat-run adapters on POSIX
persist group metadata on heartbeat_runs
switch timeout/cancel paths from direct-child kill to group kill
add one regression test that proves descendants die with the parent run

That should address the main Chrome accumulation path without taking on the full restart-recovery design in the same patch.

Risks

1. Over-killing unrelated processes

If process-group boundaries are created incorrectly, cleanup could terminate more than the run owns.

Mitigation:

create a fresh process group only for the spawned adapter command
persist and target that exact group

2. Cross-platform differences

Windows does not support the POSIX negative-PID kill pattern used elsewhere in the repo.

Mitigation:

ship POSIX-first
keep direct-child fallback on Windows
document Windows as partial until job-object or equivalent handling is designed

3. Session recovery complexity

Adopting a still-running orphaned group may look attractive but can break observability if stdout/stderr pipes are already gone.

Mitigation:

default to deterministic cleanup for heartbeat runs unless adoption is explicitly proven safe

Recommendation

Treat this as a Paperclip executor ownership bug, not an agent-browser bug.

agent-browser should remain a useful repro case, but the implementation should be shared across all local child-process adapters so any descendant process tree spawned by Codex, Claude, Cursor, Gemini, Pi, or OpenCode is owned and cleaned up consistently.

8.5 KiB Raw Permalink Blame History

PAP-1231 Agent Browser Process Cleanup Plan

Goal

Short answer

Observed implementation facts

1. Heartbeat-run local adapters track only one PID

2. Workspace runtime already uses stronger ownership

3. The current recovery path explains why leaks would be visible but hard to reason about

Why agent-browser makes the problem obvious

Success condition

Non-goals

Proposed plan

Phase 0: reproduce and instrument

Phase 1: give heartbeat-run local adapters process-group ownership

Phase 2: make restart recovery group-aware

Phase 3: add operator-visible cleanup tools

Phase 4: cover platform and regression cases

Recommended first implementation slice

Risks

1. Over-killing unrelated processes

2. Cross-platform differences

3. Session recovery complexity

Recommendation

8.5 KiB

Raw Permalink Blame History

Why `agent-browser` makes the problem obvious