diff --git a/get-shit-done/workflows/execute-phase.md b/get-shit-done/workflows/execute-phase.md
index 649247f4..68000cfb 100644
--- a/get-shit-done/workflows/execute-phase.md
+++ b/get-shit-done/workflows/execute-phase.md
@@ -366,11 +366,16 @@ Execute each selected wave in sequence. Within a wave: parallel if `PARALLELIZAT
- You are running as a PARALLEL executor agent. Use --no-verify on all git
- commits to avoid pre-commit hook contention with other agents. The
- orchestrator validates hooks once after all agents complete.
+ You are running as a PARALLEL executor agent in a git worktree.
+ Use --no-verify on all git commits to avoid pre-commit hook contention
+ with other agents. The orchestrator validates hooks once after all agents complete.
For gsd-tools commits: add --no-verify flag.
For direct git commits: use git commit --no-verify -m "..."
+
+ IMPORTANT: Do NOT modify STATE.md or ROADMAP.md. execute-plan.md
+ auto-detects worktree mode (`.git` is a file, not a directory) and skips
+ shared file updates automatically. The orchestrator updates them centrally
+ after merge.
@@ -408,6 +413,7 @@ Execute each selected wave in sequence. Within a wave: parallel if `PARALLELIZAT
- [ ] All tasks executed
- [ ] Each task committed individually
- [ ] SUMMARY.md created in plan directory
+ - [ ] No modifications to shared orchestrator artifacts (the orchestrator handles all post-wave shared-file writes)
"
)
@@ -548,22 +554,113 @@ Execute each selected wave in sequence. Within a wave: parallel if `PARALLELIZAT
**If no worktrees found:** Skip silently — agents may have been spawned without worktree isolation.
-5.6. **Post-wave shared artifact update (worktree mode only):**
+5.6. **Post-merge test gate (parallel mode only):**
- When executor agents ran with `isolation="worktree"`, they skipped STATE.md and ROADMAP.md updates to avoid last-merge-wins overwrites. The orchestrator is the single writer for these files. After worktrees are merged back, update shared artifacts once:
+ After merging all worktrees in a wave, run the project's test suite to catch
+ cross-plan integration issues that individual worktree self-checks cannot detect
+ (e.g., conflicting type definitions, removed exports, import changes).
+
+ This addresses the Generator self-evaluation blind spot identified in Anthropic's
+ harness engineering research: agents reliably report Self-Check: PASSED even when
+ merging their work creates failures.
```bash
- # Update ROADMAP.md for each completed plan in this wave
- for PLAN_ID in ${WAVE_PLAN_IDS}; do
- node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" roadmap update-plan-progress "${PHASE_NUMBER}" "${PLAN_ID}" completed
- done
+ # Detect test runner and run quick smoke test (timeout: 5 minutes)
+ TEST_EXIT=0
+ timeout 300 bash -c '
+ if [ -f "package.json" ]; then
+ npm test 2>&1
+ elif [ -f "Cargo.toml" ]; then
+ cargo test 2>&1
+ elif [ -f "go.mod" ]; then
+ go test ./... 2>&1
+ elif [ -f "pyproject.toml" ] || [ -f "requirements.txt" ]; then
+ python -m pytest -x -q --tb=short 2>&1 || uv run python -m pytest -x -q --tb=short 2>&1
+ else
+ echo "⚠ No test runner detected — skipping post-merge test gate"
+ exit 0
+ fi
+ '
+ TEST_EXIT=$?
+ if [ "${TEST_EXIT}" -eq 0 ]; then
+ echo "✓ Post-merge test gate passed — no cross-plan conflicts"
+ elif [ "${TEST_EXIT}" -eq 124 ]; then
+ echo "⚠ Post-merge test gate timed out after 5 minutes"
+ else
+ echo "✗ Post-merge test gate failed (exit code ${TEST_EXIT})"
+ WAVE_FAILURE_COUNT=$((WAVE_FAILURE_COUNT + 1))
+ fi
+ ```
+ **If `TEST_EXIT` is 0 (pass):** `✓ Post-merge test gate: {N} tests passed — no cross-plan conflicts` → continue to orchestrator tracking update.
+
+ **If `TEST_EXIT` is 124 (timeout):** Log warning, treat as non-blocking, continue. Tests may need a longer budget or manual run.
+
+ **If `TEST_EXIT` is non-zero (test failure):** Increment `WAVE_FAILURE_COUNT` to track
+ cumulative failures across waves. Subsequent waves should report:
+ `⚠ Note: ${WAVE_FAILURE_COUNT} prior wave(s) had test failures`
+
+5.7. **Post-wave shared artifact update (worktree mode only, skip if tests failed):**
+
+ When executor agents ran with `isolation="worktree"`, they skipped STATE.md and ROADMAP.md updates to avoid last-merge-wins overwrites. The orchestrator is the single writer for these files. After worktrees are merged back, update shared artifacts once.
+
+ **Only update tracking when tests passed (TEST_EXIT=0).**
+ If tests failed or timed out, skip the tracking update — plans should
+ not be marked as complete when integration tests are failing or inconclusive.
+
+ ```bash
+ # Guard: only update tracking if post-merge tests passed
+ # Timeout (124) is treated as inconclusive — do NOT mark plans complete
+ if [ "${TEST_EXIT}" -eq 0 ]; then
+ # Update ROADMAP plan progress for each completed plan in this wave
+ for plan_id in {completed_plan_ids}; do
+ node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" roadmap update-plan-progress "${PHASE_NUMBER}" "${plan_id}" "complete"
+ done
+
+ # Only commit tracking files if they actually changed
+ if ! git diff --quiet .planning/ROADMAP.md .planning/STATE.md 2>/dev/null; then
+ node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" commit "docs(phase-${PHASE_NUMBER}): update tracking after wave ${N}" --files .planning/ROADMAP.md .planning/STATE.md
+ fi
+ elif [ "${TEST_EXIT}" -eq 124 ]; then
+ echo "⚠ Skipping tracking update — test suite timed out. Plans remain in-progress. Run tests manually to confirm."
+ else
+ echo "⚠ Skipping tracking update — post-merge tests failed (exit ${TEST_EXIT}). Plans remain in-progress until tests pass."
+ fi
```
Where `WAVE_PLAN_IDS` is the space-separated list of plan IDs that completed in this wave.
**If `workflow.use_worktrees` is `false`:** Sequential agents already updated STATE.md and ROADMAP.md themselves — skip this step.
+5.8. **Handle test gate failures (when `WAVE_FAILURE_COUNT > 0`):**
+
+ ```
+ ## ⚠ Post-Merge Test Failure (cumulative failures: ${WAVE_FAILURE_COUNT})
+
+ Wave {N} worktrees merged successfully, but {M} tests fail after merge.
+ This typically indicates conflicting changes across parallel plans
+ (e.g., type definitions, shared imports, API contracts).
+
+ Failed tests:
+ {first 10 lines of failure output}
+
+ Options:
+ 1. Fix now (recommended) — resolve conflicts before next wave
+ 2. Continue — failures may compound in subsequent waves
+ ```
+
+ Note: If `WAVE_FAILURE_COUNT > 1`, strongly recommend "Fix now" — compounding
+ failures across multiple waves become exponentially harder to diagnose.
+
+ If "Fix now": diagnose failures (typically import conflicts, missing types,
+ or changed function signatures from parallel plans modifying the same module).
+ Fix, commit as `fix: resolve post-merge conflicts from wave {N}`, re-run tests.
+
+ **Why this matters:** Worktree isolation means each agent's Self-Check passes
+ in isolation. But when merged, add/add conflicts in shared files (models, registries,
+ CLI entry points) can silently drop code. The post-merge gate catches this before
+ the next wave builds on a broken foundation.
+
6. **Report completion — spot-check claims first:**
For each SUMMARY.md:
@@ -849,10 +946,11 @@ Collect all unique test file paths into `REGRESSION_FILES`.
```bash
# Detect test runner and run prior phase tests
if [ -f "package.json" ]; then
- # Node.js — use project's test runner
- npx jest ${REGRESSION_FILES} --passWithNoTests --no-coverage -q 2>&1 || npx vitest run ${REGRESSION_FILES} 2>&1
+ npm test 2>&1
elif [ -f "Cargo.toml" ]; then
cargo test 2>&1
+elif [ -f "go.mod" ]; then
+ go test ./... 2>&1
elif [ -f "requirements.txt" ] || [ -f "pyproject.toml" ]; then
python -m pytest ${REGRESSION_FILES} -q --tb=short 2>&1
fi
diff --git a/get-shit-done/workflows/execute-plan.md b/get-shit-done/workflows/execute-plan.md
index c78ce13f..e341c1ce 100644
--- a/get-shit-done/workflows/execute-plan.md
+++ b/get-shit-done/workflows/execute-plan.md
@@ -396,19 +396,29 @@ Next: more plans → "Ready for {next-plan}" | last → "Phase complete, ready f
+**Skip this step if running in parallel mode** (the orchestrator in execute-phase.md
+handles STATE.md/ROADMAP.md updates centrally after merging worktrees to avoid
+merge conflicts).
+
Update STATE.md using gsd-tools:
```bash
-# Advance plan counter (handles last-plan edge case)
-node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" state advance-plan
+# Auto-detect parallel mode: .git is a file in worktrees, a directory in main repo
+IS_WORKTREE=$([ -f .git ] && echo "true" || echo "false")
-# Recalculate progress bar from disk state
-node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" state update-progress
+# Skip in parallel mode — orchestrator handles STATE.md centrally
+if [ "$IS_WORKTREE" != "true" ]; then
+ # Advance plan counter (handles last-plan edge case)
+ node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" state advance-plan
-# Record execution metrics
-node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" state record-metric \
- --phase "${PHASE}" --plan "${PLAN}" --duration "${DURATION}" \
- --tasks "${TASK_COUNT}" --files "${FILE_COUNT}"
+ # Recalculate progress bar from disk state
+ node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" state update-progress
+
+ # Record execution metrics
+ node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" state record-metric \
+ --phase "${PHASE}" --plan "${PLAN}" --duration "${DURATION}" \
+ --tasks "${TASK_COUNT}" --files "${FILE_COUNT}"
+fi
```
@@ -443,8 +453,17 @@ If SUMMARY "Issues Encountered" ≠ "None": yolo → log and continue. Interacti
+**Skip this step if running in parallel mode** (the orchestrator handles ROADMAP.md
+updates centrally after merging worktrees).
+
```bash
-node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" roadmap update-plan-progress "${PHASE}"
+# Auto-detect parallel mode: .git is a file in worktrees, a directory in main repo
+IS_WORKTREE=$([ -f .git ] && echo "true" || echo "false")
+
+# Skip in parallel mode — orchestrator handles ROADMAP.md centrally
+if [ "$IS_WORKTREE" != "true" ]; then
+ node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" roadmap update-plan-progress "${PHASE}"
+fi
```
Counts PLAN vs SUMMARY files on disk. Updates progress table row with correct count and status (`In Progress` or `Complete` with date).
@@ -463,7 +482,15 @@ Extract requirement IDs from the plan's frontmatter (e.g., `requirements: [AUTH-
Task code already committed per-task. Commit plan metadata:
```bash
-node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" commit "docs({phase}-{plan}): complete [plan-name] plan" --files .planning/phases/XX-name/{phase}-{plan}-SUMMARY.md .planning/STATE.md .planning/ROADMAP.md .planning/REQUIREMENTS.md
+# Auto-detect parallel mode: .git is a file in worktrees, a directory in main repo
+IS_WORKTREE=$([ -f .git ] && echo "true" || echo "false")
+
+# In parallel mode: exclude STATE.md and ROADMAP.md (orchestrator commits these)
+if [ "$IS_WORKTREE" = "true" ]; then
+ node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" commit "docs({phase}-{plan}): complete [plan-name] plan" --files .planning/phases/XX-name/{phase}-{plan}-SUMMARY.md .planning/REQUIREMENTS.md
+else
+ node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" commit "docs({phase}-{plan}): complete [plan-name] plan" --files .planning/phases/XX-name/{phase}-{plan}-SUMMARY.md .planning/STATE.md .planning/ROADMAP.md .planning/REQUIREMENTS.md
+fi
```
@@ -507,8 +534,8 @@ All routes: `/clear` first for fresh context.
- All verifications pass
- USER-SETUP.md generated if user_setup in frontmatter
- SUMMARY.md created with substantive content
-- STATE.md updated (position, decisions, issues, session)
-- ROADMAP.md updated
+- STATE.md updated (position, decisions, issues, session) — unless parallel mode (orchestrator handles)
+- ROADMAP.md updated — unless parallel mode (orchestrator handles)
- If codebase map exists: map updated with execution changes (or skipped if no significant changes)
- If USER-SETUP.md created: prominently surfaced in completion output
diff --git a/get-shit-done/workflows/verify-phase.md b/get-shit-done/workflows/verify-phase.md
index 31e953df..468d2753 100644
--- a/get-shit-done/workflows/verify-phase.md
+++ b/get-shit-done/workflows/verify-phase.md
@@ -183,6 +183,89 @@ grep -E "Phase ${PHASE_NUM}" .planning/REQUIREMENTS.md 2>/dev/null || true
For each requirement: parse description → identify supporting truths/artifacts → status: ✓ SATISFIED / ✗ BLOCKED / ? NEEDS HUMAN.
+
+**Run the project's test suite and CLI commands to verify behavior, not just structure.**
+
+Static checks (grep, file existence, wiring) catch structural gaps but miss runtime
+failures. This step runs actual tests and project commands to verify the phase goal
+is behaviorally achieved.
+
+This follows Anthropic's harness engineering principle: separating generation from
+evaluation, with the evaluator interacting with the running system rather than
+inspecting static artifacts.
+
+**Step 1: Run test suite**
+
+```bash
+# Detect test runner and run all tests (timeout: 5 minutes)
+TEST_EXIT=0
+timeout 300 bash -c '
+if [ -f "package.json" ]; then
+ npm test 2>&1
+elif [ -f "Cargo.toml" ]; then
+ cargo test 2>&1
+elif [ -f "go.mod" ]; then
+ go test ./... 2>&1
+elif [ -f "pyproject.toml" ] || [ -f "requirements.txt" ]; then
+ python -m pytest -q --tb=short 2>&1 || uv run python -m pytest -q --tb=short 2>&1
+else
+ echo "⚠ No test runner detected — skipping test suite"
+ exit 1
+fi
+'
+TEST_EXIT=$?
+if [ "${TEST_EXIT}" -eq 0 ]; then
+ echo "✓ Test suite passed"
+elif [ "${TEST_EXIT}" -eq 124 ]; then
+ echo "⚠ Test suite timed out after 5 minutes"
+else
+ echo "✗ Test suite failed (exit code ${TEST_EXIT})"
+fi
+```
+
+Record: total tests, passed, failed, coverage (if available).
+
+**If any tests fail:** Mark as `behavioral_failures` — these are BLOCKER severity
+regardless of whether static checks passed. A phase cannot be verified if tests fail.
+
+**Step 2: Run project CLI/commands from success criteria (if testable)**
+
+For each success criterion that describes a user command (e.g., "User can run
+`mixtiq validate`", "User can run `npm start`"):
+
+1. Check if the command exists and required inputs are available:
+ - Look for example files in `templates/`, `fixtures/`, `test/`, `examples/`, or `testdata/`
+ - Check if the CLI binary/script exists on PATH or in the project
+2. **If no suitable inputs or fixtures exist:** Mark as `? NEEDS HUMAN` with reason
+ "No test fixtures available — requires manual verification" and move on.
+ Do NOT invent example inputs.
+3. If inputs are available: run the command and verify it exits successfully.
+
+```bash
+# Only run if both command and input exist
+if command -v {project_cli} &>/dev/null && [ -f "{example_input}" ]; then
+ {project_cli} {example_input} 2>&1
+fi
+```
+
+Record: command, exit code, output summary, pass/fail (or SKIPPED if no fixtures).
+
+**Step 3: Report**
+
+```
+## Behavioral Verification
+
+| Check | Result | Detail |
+|-------|--------|--------|
+| Test suite | {N} passed, {M} failed | {first failure if any} |
+| {CLI command 1} | ✓ / ✗ | {output summary} |
+| {CLI command 2} | ✓ / ✗ | {output summary} |
+```
+
+**If all behavioral checks pass:** Continue to scan_antipatterns.
+**If any fail:** Add to verification gaps with BLOCKER severity.
+
+
Extract files modified in this phase from SUMMARY.md, scan each: