fix(#2410): emit wave + plan checkpoint heartbeats to prevent stream idle timeout (#2626)

/gsd:manager's background execute-phase Task fails with
"Stream idle timeout - partial response received" on multi-plan phases
(Claude Code + Opus 4.7 at ~200K+ cache_read) because the long subagent
never emits tokens fast enough between large tool_results — the SSE layer
times out mid-assistant-turn and the harness retries hit the same TTFT
wall after prompt cache TTL expires.

Root cause: no orchestrator-level activity at wave/plan boundaries.

Fix (maintainer-approved A+B):
- A (wave boundary): execute-phase.md now emits a `[checkpoint]`
  heartbeat before each wave spawns and after each wave completes.
- B (plan boundary): also emit `[checkpoint]` before each Task()
  dispatch and after each executor returns (complete/failed/checkpoint).
  Heartbeats are literal assistant-text lines (no tool call) with a
  monotonic `{P}/{Q} plans done` counter so partial-transcript recovery
  tools can grep progress even when a run dies mid-phase.

Docs: COMMANDS.md /gsd-manager section documents the marker format.
Tests: tests/bug-2410-stream-checkpoint-heartbeats.test.cjs (12 cases)
asserts the heartbeats exist at every boundary and in the right workflow
step. Full suite: 5422 node:test cases pass. Pre-existing vitest
failures on main are unrelated to this change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Tom Boucher
2026-04-23 11:54:11 -04:00
committed by GitHub
parent 807db75d55
commit df0ab0c0c9
3 changed files with 246 additions and 1 deletions

View File

@@ -562,6 +562,24 @@ Interactive command center for managing multiple phases from one terminal.
/gsd-manager # Open command center dashboard
```
**Checkpoint Heartbeats (#2410):**
Background `execute-phase` runs emit `[checkpoint]` markers at every wave and plan
boundary so the Claude API SSE stream never idles long enough to trigger
`Stream idle timeout - partial response received` on multi-plan phases. The
format is:
```
[checkpoint] phase {N} wave {W}/{M} starting, {count} plan(s), {P}/{Q} plans done
[checkpoint] phase {N} wave {W}/{M} plan {plan_id} starting ({P}/{Q} plans done)
[checkpoint] phase {N} wave {W}/{M} plan {plan_id} complete ({P}/{Q} plans done)
[checkpoint] phase {N} wave {W}/{M} complete, {P}/{Q} plans done ({ok}/{count} ok)
```
If a background phase fails partway through, grep the transcript for `[checkpoint]`
to see the last confirmed boundary. The manager's background-completion handler
uses these markers to report partial progress when an agent errors out.
**Manager Passthrough Flags:**
Configure per-step flags in `.planning/config.json` under `manager.flags`. These flags are appended to each dispatched command: