diff --git a/agents/gsd-plan-checker.md b/agents/gsd-plan-checker.md index 53645e5a..7b727566 100644 --- a/agents/gsd-plan-checker.md +++ b/agents/gsd-plan-checker.md @@ -83,6 +83,9 @@ Same methodology (goal-backward), different timing, different subject matter. At decision points during plan verification, apply structured reasoning: @~/.claude/get-shit-done/references/thinking-models-planning.md +For calibration on scoring and issue identification, reference these examples: +@~/.claude/get-shit-done/references/few-shot-examples/plan-checker.md + ## Dimension 1: Requirement Coverage **Question:** Does every phase requirement have task(s) addressing it? diff --git a/agents/gsd-verifier.md b/agents/gsd-verifier.md index 6c1c842a..85e1efb6 100644 --- a/agents/gsd-verifier.md +++ b/agents/gsd-verifier.md @@ -60,6 +60,9 @@ Then verify each level against the actual codebase. At verification decision points, apply structured reasoning: @~/.claude/get-shit-done/references/thinking-models-verification.md +At verification decision points, reference calibration examples: +@~/.claude/get-shit-done/references/few-shot-examples/verifier.md + ## Step 0: Check for Previous Verification ```bash diff --git a/get-shit-done/references/few-shot-examples/plan-checker.md b/get-shit-done/references/few-shot-examples/plan-checker.md new file mode 100644 index 00000000..0eccf838 --- /dev/null +++ b/get-shit-done/references/few-shot-examples/plan-checker.md @@ -0,0 +1,73 @@ +--- +component: plan-checker +version: 1 +last_calibrated: 2026-03-24 +--- + +# Plan-Checker Few-Shot Examples + +## Positive Examples + +### Example 1: BLOCKER flagged for vague task action missing file and function names + +**Input:** Plan task with action: "Implement the authentication feature for the API routes." 
+ +**Output:** + +> ```yaml +> issues: +> - dimension: task_completeness +> severity: BLOCKER +> finding: "Task T1 action says 'implement the authentication feature' without naming target files, functions to create, or middleware to apply. Executor cannot determine what to build." +> affected_field: "" +> suggested_fix: "Specify: create authMiddleware in src/middleware/auth.js, apply to routes in src/routes/api.js lines 12-45, verify with integration test" +> ``` + +**Why this is good:** The checker cited the specific dimension (task_completeness), quoted the problematic text, explained why it is a blocker (executor cannot determine what to build), and gave a concrete fix with file paths and function names. The finding is actionable -- the planner knows exactly what to add. + +### Example 2: BLOCKER for same-wave file conflict between two plans + +**Input:** Two plans in wave 1 -- Plan 01 modifies `get-shit-done/workflows/execute-phase.md`, Plan 02 also modifies `get-shit-done/workflows/execute-phase.md`. No `depends_on` declared. + +**Output:** + +> ```yaml +> issues: +> - dimension: dependency_correctness +> severity: BLOCKER +> finding: "Plans 01 and 02 both modify get-shit-done/workflows/execute-phase.md in wave 1 with no depends_on relationship. Concurrent execution will cause merge conflicts or lost changes." +> affected_field: "files_modified" +> suggested_fix: "Either move Plan 02 to wave 2 with depends_on: ['01'] or consolidate the file changes into a single plan" +> ``` + +**Why this is good:** The checker identified a real structural problem -- two plans modifying the same file in the same wave without a dependency relationship. It cited dependency_correctness, named both plans, the conflicting file, and provided two alternative fixes. + +## Negative Examples + +### Example 1: All dimensions pass when verify commands are missing + +**Input:** Plan with 3 tasks. Task T2 has `` (empty). 
Task T3 has `echo "done"` (trivial, cannot distinguish pass from fail). + +**Output:** + +> ## CHECK PASSED +> Plans: 1 | Tasks: 3 | Dimensions: 10 | Issues: 0 + +**Why this is bad:** The checker missed two violations. verification_derivation requires verify commands to be runnable and test actual output -- an empty verify block is a BLOCKER. Verify commands must distinguish pass from fail -- `echo "done"` always returns 0 regardless of implementation state. Both should have been flagged. + +### Example 2: False positive INFO for a correctly-sized plan + +**Input:** Plan with 3 tasks, 6 unique files, 1 dependency. All within limits. + +**Output:** + +> ```yaml +> issues: +> - dimension: scope_sanity +> severity: INFO +> finding: "Plan has 3 tasks -- consider splitting into smaller plans for faster iteration" +> affected_field: "task count" +> suggested_fix: "Split tasks into separate plans" +> ``` + +**Why this is bad:** The checker flagged a non-issue. scope_sanity allows 2-3 tasks per plan -- 3 tasks is within limits. The checker applied a personal preference ("smaller is better") rather than the documented threshold. This wastes planner time on false positives and erodes trust in the checker's judgment. A correct check would produce no issue for this plan. 
diff --git a/get-shit-done/references/few-shot-examples/verifier.md b/get-shit-done/references/few-shot-examples/verifier.md new file mode 100644 index 00000000..3cd0f44e --- /dev/null +++ b/get-shit-done/references/few-shot-examples/verifier.md @@ -0,0 +1,109 @@ +--- +component: verifier +version: 2 +last_calibrated: 2026-03-25 +calibration_source: "Adapted from thinking-partner verification corpus" +--- + +# Verifier Few-Shot Examples + +## Positive Examples + +### Example 1: Thorough multi-level artifact verification with real gap discovery + +**Input:** Must-have artifact: `get-shit-done/workflows/context-bridge.md: >50 lines` + +**Output:** + +> **L1 (Existence):** EXISTS -- `get-shit-done/workflows/context-bridge.md` found, 187 lines. +> **L2 (Substantive):** SUBSTANTIVE -- No TODO/FIXME markers, 4 defined sections (`updateContextTier`, `getContextBudget`, `injectContextWarning`, `resetTierState`), non-trivial logic with config reads and threshold calculations. +> **L3 (Wired):** WIRED -- Referenced by `get-shit-done/workflows/execute-phase.md` at step 3. Called by `agents/gsd-executor.md` via @-reference include. Arguments verified: receives planning directory path matching the expected parameter. +> **Status: PASSED** + +**Why this is good:** The verifier checked all three levels independently with specific evidence at each level -- file path, line count, exported symbols, import location, and argument verification. It did not stop at "file exists" but traced wiring through to actual usage with correct arguments. + +### Example 2: Correct FAILED classification with grep evidence + +**Input:** Must-have truth: `validate-commit.js rejects commits missing scope parentheses` + +**Output:** + +> **Method:** Ran `grep -n "scope" get-shit-done/bin/validate-commit.js` to locate scope validation logic. +> **Evidence:** Lines 34-41 show regex `/^(feat|fix|refactor|test|docs|chore|wip)\(/` which requires opening paren after type. 
However, nothing validates the scope content or the closing paren -- the regex accepts `feat(: description` (empty scope, no closing paren).
+> **Ran:** `echo "feat(: test" | node get-shit-done/bin/validate-commit.js` -- exit code 0 (accepted).
+> **Status: FAILED** -- Commit messages with empty scope `feat(: desc)` are accepted. The truth "rejects commits missing scope parentheses" does not hold for the empty-scope edge case.
+
+**Why this is good:** The verifier did not trust the code's intent -- it ran an actual test with an edge-case input and reported the specific failure. Evidence includes the regex, the test command, the exit code, and why the truth does not hold.
+
+### Example 3: Catching missing wiring (calibration-derived gap pattern)
+
+**Input:** Must-have key_link: `check-subagent-output.js imports and calls feedback-loop.js extractFeedback()`
+
+**Output:**
+
+> **Method:** Ran `grep -n "feedback-loop\|extractFeedback" get-shit-done/bin/check-subagent-output.js`.
+> **Evidence:** Zero matches. The file `feedback-loop.js` exists (172 lines, substantive), and `tests/feedback-loop.test.js` passes (229 lines). But `check-subagent-output.js` never imports or calls it.
+> **Cross-check:** `grep -rn "require.*feedback-loop" get-shit-done/bin/` returns zero results across all scripts.
+> **Status: FAILED** -- The artifact exists and is tested, but is completely orphaned. The key_link "imports and calls" does not hold. This is a **missing wiring** gap.
+
+**Why this is good:** The verifier caught a common calibration pattern: code that exists, is tested, but is never actually wired into the system. L1 and L2 pass, but L3 (wiring) fails. This pattern accounts for 38% of all gaps found in the calibration corpus.
+ +### Example 4: Catching missing config schema entries (calibration-derived gap pattern) + +**Input:** Must-have truth: `Config toggles features.architecture_graph and features.architecture_guard exist with defaults` + +**Output:** + +> **Method:** Ran `grep "architecture_graph\|architecture_guard" get-shit-done/bin/lib/config-schema.json`. +> **Evidence:** Zero matches in schema. However, `grep "architecture" get-shit-done/bin/lib/config.cjs` shows both flags in CONFIG_DEFAULTS (lines 426, 428). +> **Gap:** Defaults exist in code but schema validation doesn't know about them. `configValidate()` would strip these fields on load, making the defaults unreachable. +> **Status: FAILED** -- Config flags work in code but are invisible to schema validation. This is a **missing wiring** gap between config layers. + +**Why this is good:** The verifier checked multiple config layers (schema vs defaults vs code) rather than just one. Config wiring requires schema + defaults + code to agree -- checking only one layer misses layer-mismatch bugs. + +### Example 5: Distinguishing planning gaps from execution failures + +**Input:** Must-have truth: `All MSYS regex patterns consolidated into shared module` + +**Output:** + +> **Method:** Ran `grep -rn "\/\[a-zA-Z\]\/" get-shit-done/bin/` to find inline MSYS regex. +> **Evidence:** 2 matches in `commands/gsd/misc.md` (lines 26, 487). These were NOT in the original plan scope -- Phase 148 extracted code into misc.md AFTER Phase 149's plan was written. +> **Assessment:** The executor correctly completed all planned tasks. The gap is a **planning ordering issue**, not an execution failure. +> **Status: PASS_WITH_NOTES** -- Executor did what was planned. Unconsolidated regex noted as follow-up, not as a failure of this phase. + +**Why this is good:** The verifier distinguished between what was planned vs what exists now. Code that changed AFTER plan creation is a planning gap, not an executor failure. 
This prevents false FAIL verdicts on correctly-executed work.
+
+## Calibration-Derived Gap Patterns
+
+Based on analysis of the verification corpus (80% pass rate, 8 gaps):
+
+| Pattern | Count | % of Gaps | What to Check |
+|---------|-------|-----------|---------------|
+| Missing wiring | 3 | 38% | File exists + tested but never imported/called by consumer |
+| Missing tests | 2 | 25% | Feature works but test file excluded from CI or never runs |
+| Other | 3 | 38% | Config layer mismatches, planning ordering issues, schema gaps |
+
+## Negative Examples
+
+### Example 1: Blanket pass with no per-criterion evidence
+
+**Input:** Must-haves: 5 truths, 3 artifacts, 2 key_links from plan frontmatter.
+
+**Output:**
+
+> All must-haves have been verified. The implementation matches the plan requirements. Files exist and are properly connected. Status: PASSED.
+> must_haves_checked: 10, must_haves_passed: 10
+
+**Why this is bad:** No individual must-have is listed or checked. No file paths, no line numbers, no grep output, no level-by-level breakdown. The verifier restated the must-haves as passing without showing any evidence. This would miss stubs, orphaned exports, and broken wiring.
+
+### Example 2: File existence check without substance or wiring verification
+
+**Input:** Must-have artifact: `get-shit-done/bin/lib/audit-checks/index.cjs: >100 lines`
+
+**Output:**
+
+> **L1 (Existence):** EXISTS -- File found at `get-shit-done/bin/lib/audit-checks/index.cjs`, 142 lines.
+> **Status: PASSED**
+
+**Why this is bad:** The verifier stopped at Level 1. The file has 142 lines but could contain `// TODO: implement all checks` with stub functions returning empty objects. Level 2 (substantive) and Level 3 (wired) were skipped entirely. A file that exists but is never imported or contains only placeholder code should not pass.
diff --git a/tests/few-shot-calibration.test.cjs b/tests/few-shot-calibration.test.cjs new file mode 100644 index 00000000..297230a8 --- /dev/null +++ b/tests/few-shot-calibration.test.cjs @@ -0,0 +1,146 @@ +const { describe, test } = require('node:test'); +const assert = require('node:assert/strict'); +const fs = require('fs'); +const path = require('path'); + +const REFS_DIR = path.join(__dirname, '..', 'get-shit-done', 'references', 'few-shot-examples'); +const AGENTS_DIR = path.join(__dirname, '..', 'agents'); + +// ── Helpers ──────────────────────────────────────────────────────── +function readFile(filePath) { + return fs.readFileSync(filePath, 'utf-8'); +} + +function countPattern(content, pattern) { + const matches = content.match(pattern); + return matches ? matches.length : 0; +} + +// ── File existence ───────────────────────────────────────────────── +describe('few-shot calibration examples', () => { + describe('reference files exist', () => { + test('plan-checker.md exists in references/few-shot-examples/', () => { + assert.ok(fs.existsSync(path.join(REFS_DIR, 'plan-checker.md'))); + }); + + test('verifier.md exists in references/few-shot-examples/', () => { + assert.ok(fs.existsSync(path.join(REFS_DIR, 'verifier.md'))); + }); + }); + + // ── Version/format metadata ──────────────────────────────────── + describe('frontmatter metadata', () => { + test('plan-checker.md has version and component in frontmatter', () => { + const content = readFile(path.join(REFS_DIR, 'plan-checker.md')); + assert.match(content, /^---\n/); + assert.match(content, /component:\s*plan-checker/); + assert.match(content, /version:\s*\d+/); + assert.match(content, /last_calibrated:\s*\d{4}-\d{2}-\d{2}/); + }); + + test('verifier.md has version and component in frontmatter', () => { + const content = readFile(path.join(REFS_DIR, 'verifier.md')); + assert.match(content, /^---\n/); + assert.match(content, /component:\s*verifier/); + assert.match(content, /version:\s*\d+/); + 
assert.match(content, /last_calibrated:\s*\d{4}-\d{2}-\d{2}/); + }); + + // Version difference is intentional: plan-checker was calibrated first (v1), + // verifier later with updated format (v2) including calibration_source field. + test('version metadata values are present and numeric', () => { + const pcContent = readFile(path.join(REFS_DIR, 'plan-checker.md')); + const vContent = readFile(path.join(REFS_DIR, 'verifier.md')); + const pcVersion = pcContent.match(/version:\s*(\d+)/); + const vVersion = vContent.match(/version:\s*(\d+)/); + assert.ok(pcVersion, 'plan-checker.md must have a numeric version'); + assert.ok(vVersion, 'verifier.md must have a numeric version'); + }); + }); + + // ── Example counts ───────────────────────────────────────────── + describe('example counts', () => { + test('plan-checker.md contains exactly 4 examples (2 positive, 2 negative)', () => { + const content = readFile(path.join(REFS_DIR, 'plan-checker.md')); + const totalExamples = countPattern(content, /^### Example \d+/gm); + assert.strictEqual(totalExamples, 4); + + // Verify section breakdown + const positiveSection = content.indexOf('## Positive Examples'); + const negativeSection = content.indexOf('## Negative Examples'); + assert.ok(positiveSection >= 0, 'must have Positive Examples section'); + assert.ok(negativeSection >= 0, 'must have Negative Examples section'); + assert.ok(positiveSection < negativeSection, 'positive examples come before negative'); + }); + + test('verifier.md contains exactly 7 examples (5 positive, 2 negative)', () => { + const content = readFile(path.join(REFS_DIR, 'verifier.md')); + const totalExamples = countPattern(content, /^### Example \d+/gm); + assert.strictEqual(totalExamples, 7); + + const positiveSection = content.indexOf('## Positive Examples'); + const negativeSection = content.indexOf('## Negative Examples'); + assert.ok(positiveSection >= 0, 'must have Positive Examples section'); + assert.ok(negativeSection >= 0, 'must have Negative 
Examples section'); + assert.ok(positiveSection < negativeSection, 'positive examples come before negative'); + }); + }); + + // ── WHY annotations ──────────────────────────────────────────── + describe('WHY annotations', () => { + test('every plan-checker example has a WHY annotation', () => { + const content = readFile(path.join(REFS_DIR, 'plan-checker.md')); + const exampleCount = countPattern(content, /^### Example \d+/gm); + const whyCount = countPattern(content, /^\*\*Why this is (good|bad):\*\*/gm); + assert.strictEqual(whyCount, exampleCount, + `expected ${exampleCount} WHY annotations, found ${whyCount}`); + }); + + test('every verifier example has a WHY annotation', () => { + const content = readFile(path.join(REFS_DIR, 'verifier.md')); + const exampleCount = countPattern(content, /^### Example \d+/gm); + const whyCount = countPattern(content, /^\*\*Why this is (good|bad):\*\*/gm); + assert.strictEqual(whyCount, exampleCount, + `expected ${exampleCount} WHY annotations, found ${whyCount}`); + }); + }); + + // ── Agent reference lines ────────────────────────────────────── + describe('agent files reference few-shot examples', () => { + test('gsd-plan-checker.md contains reference to plan-checker few-shot examples', () => { + const content = readFile(path.join(AGENTS_DIR, 'gsd-plan-checker.md')); + assert.match(content, /@~\/\.claude\/get-shit-done\/references\/few-shot-examples\/plan-checker\.md/); + }); + + test('gsd-verifier.md contains reference to verifier few-shot examples', () => { + const content = readFile(path.join(AGENTS_DIR, 'gsd-verifier.md')); + assert.match(content, /@~\/\.claude\/get-shit-done\/references\/few-shot-examples\/verifier\.md/); + }); + }); + + // ── Content structure ────────────────────────────────────────── + describe('content structure', () => { + test('plan-checker examples include input/output pairs', () => { + const content = readFile(path.join(REFS_DIR, 'plan-checker.md')); + const inputCount = countPattern(content, 
/^\*\*Input:\*\*/gm); + const outputCount = countPattern(content, /^\*\*Output:\*\*/gm); + assert.ok(inputCount >= 4, `expected at least 4 Input blocks, found ${inputCount}`); + assert.ok(outputCount >= 4, `expected at least 4 Output blocks, found ${outputCount}`); + }); + + test('verifier examples include input/output pairs', () => { + const content = readFile(path.join(REFS_DIR, 'verifier.md')); + const inputCount = countPattern(content, /^\*\*Input:\*\*/gm); + const outputCount = countPattern(content, /^\*\*Output:\*\*/gm); + assert.ok(inputCount >= 7, `expected at least 7 Input blocks, found ${inputCount}`); + assert.ok(outputCount >= 7, `expected at least 7 Output blocks, found ${outputCount}`); + }); + + test('verifier.md includes calibration-derived gap patterns table', () => { + const content = readFile(path.join(REFS_DIR, 'verifier.md')); + assert.match(content, /## Calibration-Derived Gap Patterns/); + assert.match(content, /Missing wiring/); + assert.match(content, /Missing tests/); + }); + }); +});
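
Reviewer note: the structural tests above hinge on `countPattern` with `m`-flagged regexes anchoring `^` at each line start. A standalone sketch of that helper, run against a hypothetical markdown sample (not from the repository), shows the counting behavior the suite relies on:

```javascript
// Standalone copy of the countPattern helper from the new test file.
// The sample markdown below is made up for illustration only.
function countPattern(content, pattern) {
  const matches = content.match(pattern);
  return matches ? matches.length : 0;
}

const sample = [
  '## Positive Examples',
  '### Example 1: something good',
  '**Why this is good:** reasons',
  '## Negative Examples',
  '### Example 2: something bad',
  '**Why this is bad:** reasons',
].join('\n');

// Same regexes the test suite uses: with the m flag, ^ anchors at every line.
console.log(countPattern(sample, /^### Example \d+/gm));                 // 2
console.log(countPattern(sample, /^\*\*Why this is (good|bad):\*\*/gm)); // 2
```

Without the `m` flag, `^` would only anchor at the start of the whole file and both counts would drop to 0, so the flag is load-bearing for every count assertion in the suite.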