feat: add test quality audit step to verify-phase workflow

Adds a new `audit_test_quality` step between `scan_antipatterns` and `identify_human_verification` that catches test-level deceptions: - Disabled tests (it.skip/xit/test.todo) covering phase requirements - Circular tests (system generating its own expected values) - Weak assertions (existence-only when value-level proof needed) - Expected value provenance tracking for parity/migration phases Any blocker from this audit forces `gaps_found` status, preventing phases from being marked complete with inadequate test evidence. Fixes #1457 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-25 17:25:23 +02:00 · 2026-03-28 17:24:59 -03:00
parent 8af7ad96fc
commit 759ff38d44
2 changed files with 295 additions and 2 deletions
--- a/get-shit-done/workflows/verify-phase.md
+++ b/get-shit-done/workflows/verify-phase.md
@@ -13,6 +13,7 @@ Goal-backward verification:
 1. What must be TRUE for the goal to be achieved?
 2. What must EXIST for those truths to hold?
 3. What must be WIRED for those artifacts to function?
 4. What must TESTS PROVE for those truths to be evidenced?
 Then verify each level against the actual codebase.
 </core_principle>
@@ -190,6 +191,93 @@ Extract files modified in this phase from SUMMARY.md, scan each:
 Categorize: 🛑 Blocker (prevents goal) | ⚠️ Warning (incomplete) | ℹ️ Info (notable).
 </step>
 <step name="audit_test_quality">
 **Verify that tests PROVE what they claim to prove.**
 This step catches test-level deceptions that pass all prior checks: files exist, are substantive, are wired, and tests pass — but the tests don't actually validate the requirement.
 **1. Identify requirement-linked test files**
 From PLAN and SUMMARY files, map each requirement to the test files that are supposed to prove it.
 **2. Disabled test scan**
 For ALL test files linked to requirements, search for disabled/skipped patterns:
 ```bash
 grep -rn -E "it\.skip|describe\.skip|test\.skip|xit\(|xdescribe\(|xtest\(|@pytest\.mark\.skip|@unittest\.skip|#\[ignore\]|\.pending|it\.todo|test\.todo" "$TEST_FILE"
 ```
 **Rule:** A disabled test linked to a requirement = requirement NOT tested.
 - 🛑 BLOCKER if the disabled test is the only test proving that requirement
 - ⚠️ WARNING if other active tests also cover the requirement
 **3. Circular test detection**
 Search for scripts/utilities that generate expected values by running the system under test:
 ```bash
 grep -rn -E "writeFileSync|writeFile|fs\.write|open\(.*w\)" "$TEST_DIRS"
 ```
 For each match, check if it also imports the system/service/module being tested. If a script both imports the system-under-test AND writes expected output values → CIRCULAR.
 **Circular test indicators:**
 - Script imports a service AND writes to fixture files
 - Expected values have comments like "computed from engine", "captured from baseline"
 - Script filename contains "capture", "baseline", "generate", "snapshot" in test context
 - Expected values were added in the same commit as the test assertions
 **Rule:** A test comparing system output against values generated by the same system is circular. It proves consistency, not correctness.
 **4. Expected value provenance** (for comparison/parity/migration requirements)
 When a requirement demands comparison with an external source ("identical to X", "matches Y", "same output as Z"):
 - Is the external source actually invoked or referenced in the test pipeline?
 - Do fixture files contain data sourced from the external system?
 - Or do all expected values come from the new system itself or from mathematical formulas?
 **Provenance classification:**
 - VALID: Expected value from external/legacy system output, manual capture, or independent oracle
 - PARTIAL: Expected value from mathematical derivation (proves formula, not system match)
 - CIRCULAR: Expected value from the system being tested
 - UNKNOWN: No provenance information — treat as SUSPECT
 **5. Assertion strength**
 For each test linked to a requirement, classify the strongest assertion:
 | Level | Examples | Proves |
 |-------|---------|--------|
 | Existence | `toBeDefined()`, `!= null` | Something returned |
 | Type | `typeof x === 'number'` | Correct shape |
 | Status | `code === 200` | No error |
 | Value | `toEqual(expected)`, `toBeCloseTo(x)` | Specific value |
 | Behavioral | Multi-step workflow assertions | End-to-end correctness |
 If a requirement demands value-level or behavioral-level proof and the test only has existence/type/status assertions → INSUFFICIENT.
 **6. Coverage quantity**
 If a requirement specifies a quantity of test cases (e.g., "30 calculations"), check if the actual number of active (non-skipped) test cases meets the requirement.
 **Reporting — add to VERIFICATION.md:**
 ```markdown
 ### Test Quality Audit
 | Test File | Linked Req | Active | Skipped | Circular | Assertion Level | Verdict |
 |-----------|-----------|--------|---------|----------|----------------|---------|
 **Disabled tests on requirements:** {N} → {BLOCKER if any req has ONLY disabled tests}
 **Circular patterns detected:** {N} → {BLOCKER if any}
 **Insufficient assertions:** {N} → {WARNING}
 ```
 **Impact on status:** Any BLOCKER from test quality audit <20><><EFBFBD> overall status = `gaps_found`, regardless of other checks passing.
 </step>
 <step name="identify_human_verification">
 **Always needs human:** Visual appearance, user flow completion, real-time behavior (WebSocket/SSE), external service integration, performance feel, error message clarity.
@@ -201,7 +289,7 @@ Format each as: Test Name → What to do → Expected result → Why can't verif
 <step name="determine_status">
 Classify status using this decision tree IN ORDER (most restrictive first):
-1. IF any truth FAILED, artifact MISSING/STUB, key link NOT_WIRED, or blocker found:
+1. IF any truth FAILED, artifact MISSING/STUB, key link NOT_WIRED, blocker found, **or test quality audit found blockers (disabled requirement tests, circular tests)**:
   → **gaps_found**
 2. IF the previous step produced ANY human verification items:
@@ -222,7 +310,7 @@ If gaps_found:
 2. **Generate plan per cluster:** Objective, 2-3 tasks (files/action/verify each), re-verify step. Keep focused: single concern per plan.
-3. **Order by dependency:** Fix missing → fix stubs → fix wiring → verify.
+3. **Order by dependency:** Fix missing → fix stubs → fix wiring → **fix test evidence** → verify.
 </step>
 <step name="create_report">
@@ -253,6 +341,7 @@ Orchestrator routes: `passed` → update_roadmap | `gaps_found` → create/execu
 - [ ] All key links verified
 - [ ] Requirements coverage assessed (if applicable)
 - [ ] Anti-patterns scanned and categorized
 - [ ] Test quality audited (disabled tests, circular patterns, assertion strength, provenance)
 - [ ] Human verification items identified
 - [ ] Overall status determined
 - [ ] Fix plans generated (if gaps_found)
--- a/tests/verify-test-quality.test.cjs
+++ b/tests/verify-test-quality.test.cjs
@@ -0,0 +1,204 @@
 /**
 * Tests for the audit_test_quality step in verify-phase.md
 *
 * Validates that the verifier's test quality audit detects:
 * - Disabled tests (it.skip) covering requirements
 * - Circular tests (system generating its own expected values)
 * - Weak assertions on requirement-linked tests
 */
 const { describe, test, beforeEach, afterEach } = require('node:test');
 const assert = require('node:assert/strict');
 const fs = require('fs');
 const path = require('path');
 const { createTempProject, cleanup } = require('./helpers.cjs');
 describe('audit_test_quality step', () => {
  let tmpDir;
  beforeEach(() => {
    tmpDir = createTempProject();
  });
  afterEach(() => {
    cleanup(tmpDir);
  });
  describe('disabled test detection', () => {
    test('detects it.skip in test files', () => {
      const testDir = path.join(tmpDir, 'tests');
      fs.mkdirSync(testDir, { recursive: true });
      fs.writeFileSync(path.join(testDir, 'parity.test.js'), [
        'describe("parity", () => {',
        '  it.skip("matches PHP output", async () => {',
        '    expect(result).toBeCloseTo(155.96, 2);',
        '  });',
        '});',
      ].join('\n'));
      const content = fs.readFileSync(path.join(testDir, 'parity.test.js'), 'utf8');
      const skipPatterns = /it\.skip|describe\.skip|test\.skip|xit\(|xdescribe\(|xtest\(/g;
      const matches = content.match(skipPatterns);
      assert.ok(matches, 'Should detect skip patterns');
      assert.strictEqual(matches.length, 1);
    });
    test('detects multiple skip patterns across frameworks', () => {
      const testDir = path.join(tmpDir, 'tests');
      fs.mkdirSync(testDir, { recursive: true });
      fs.writeFileSync(path.join(testDir, 'multi.test.js'), [
        'describe.skip("suite", () => {});',
        'xit("old jasmine", () => {});',
        'test.skip("jest skip", () => {});',
        'it.todo("not implemented");',
      ].join('\n'));
      const content = fs.readFileSync(path.join(testDir, 'multi.test.js'), 'utf8');
      const skipPatterns = /it\.skip|describe\.skip|test\.skip|xit\(|xdescribe\(|xtest\(|it\.todo|test\.todo/g;
      const matches = content.match(skipPatterns);
      assert.ok(matches, 'Should detect all skip variants');
      assert.strictEqual(matches.length, 4);
    });
    test('does not flag active tests as skipped', () => {
      const testDir = path.join(tmpDir, 'tests');
      fs.mkdirSync(testDir, { recursive: true });
      fs.writeFileSync(path.join(testDir, 'active.test.js'), [
        'describe("active suite", () => {',
        '  it("does the thing", () => {',
        '    expect(result).toBe(true);',
        '  });',
        '  test("also works", () => {',
        '    expect(other).toBe(42);',
        '  });',
        '});',
      ].join('\n'));
      const content = fs.readFileSync(path.join(testDir, 'active.test.js'), 'utf8');
      const skipPatterns = /it\.skip|describe\.skip|test\.skip|xit\(|xdescribe\(|xtest\(|it\.todo|test\.todo/g;
      const matches = content.match(skipPatterns);
      assert.strictEqual(matches, null, 'Active tests should not match skip patterns');
    });
  });
  describe('circular test detection', () => {
    test('detects script that imports system-under-test and writes fixtures', () => {
      const testDir = path.join(tmpDir, 'tests');
      fs.mkdirSync(testDir, { recursive: true });
      fs.writeFileSync(path.join(testDir, 'captureBaseline.js'), [
        'import { CalculationService } from "../server/services/calculationService.js";',
        'import { writeFileSync } from "fs";',
        '',
        'const result = await CalculationService.execute(input);',
        'fixture.expectedOutput = result.value;',
        'writeFileSync("fixtures/data.json", JSON.stringify(fixture));',
      ].join('\n'));
      const content = fs.readFileSync(path.join(testDir, 'captureBaseline.js'), 'utf8');
      const importsSystem = /import.*(?:Service|Engine|Calculator|Controller)/.test(content);
      const writesFiles = /writeFileSync|writeFile|fs\.write/.test(content);
      assert.ok(importsSystem, 'Should detect system-under-test import');
      assert.ok(writesFiles, 'Should detect file writing');
      assert.ok(importsSystem && writesFiles, 'Script that imports SUT and writes fixtures is CIRCULAR');
    });
    test('does not flag test helpers that only read fixtures', () => {
      const testDir = path.join(tmpDir, 'tests');
      fs.mkdirSync(testDir, { recursive: true });
      fs.writeFileSync(path.join(testDir, 'loadFixtures.js'), [
        'import { readFileSync } from "fs";',
        'export function loadFixture(name) {',
        '  return JSON.parse(readFileSync(`fixtures/${name}.json`, "utf8"));',
        '}',
      ].join('\n'));
      const content = fs.readFileSync(path.join(testDir, 'loadFixtures.js'), 'utf8');
      const importsSystem = /import.*(?:Service|Engine|Calculator|Controller)/.test(content);
      const writesFiles = /writeFileSync|writeFile|fs\.write/.test(content);
      assert.ok(!importsSystem, 'Should not flag read-only helper as importing SUT');
      assert.ok(!writesFiles, 'Should not flag read-only helper as writing files');
    });
  });
  describe('assertion strength classification', () => {
    test('classifies existence-only assertions as INSUFFICIENT for value requirements', () => {
      const assertions = [
        'expect(result).toBeDefined()',
        'expect(result).not.toBeNull()',
        'assert.ok(result)',
      ];
      const existencePattern = /toBeDefined|not\.toBeNull|assert\.ok\(/;
      const valuePattern = /toEqual|toBeCloseTo|strictEqual|deepStrictEqual/;
      for (const assertion of assertions) {
        assert.ok(existencePattern.test(assertion), `"${assertion}" should match existence pattern`);
        assert.ok(!valuePattern.test(assertion), `"${assertion}" should NOT match value pattern`);
      }
    });
    test('classifies value assertions as sufficient', () => {
      const assertions = [
        'expect(result).toBeCloseTo(155.96, 2)',
        'expect(result).toEqual({ amount: 100 })',
        'assert.strictEqual(result, 42)',
      ];
      const valuePattern = /toEqual|toBeCloseTo|strictEqual|deepStrictEqual/;
      for (const assertion of assertions) {
        assert.ok(valuePattern.test(assertion), `"${assertion}" should match value pattern`);
      }
    });
  });
  describe('provenance classification', () => {
    test('fixture with legacy system comment classified as VALID', () => {
      const fixture = {
        legacyId: 10341,
        comment: 'Real PHP fixture - output from legacy system',
        dbDependent: true,
        expectedOutput: { value: 155.96 },
      };
      const hasLegacySource = /legacy|php|real|manual|captured from/i.test(fixture.comment || '');
      assert.ok(hasLegacySource, 'Comment referencing legacy system = VALID provenance');
    });
    test('fixture with synthetic/baseline comment classified as SUSPECT', () => {
      const fixture = {
        legacyId: null,
        comment: 'Synthetic offline fixture - computed from known algorithm',
        dbDependent: false,
        expectedOutput: { value: 1240.68 },
      };
      const hasSyntheticSource = /synthetic|computed|baseline|generated|captured from engine/i.test(fixture.comment || '');
      const hasLegacySource = /legacy|php|real output|manual capture/i.test(fixture.comment || '');
      assert.ok(hasSyntheticSource, 'Comment indicating synthetic source detected');
      assert.ok(!hasLegacySource, 'Should NOT be classified as legacy source');
    });
    test('fixture with no comment classified as UNKNOWN', () => {
      const fixture = {
        expectedOutput: { value: 42 },
      };
      const hasAnyProvenance = (fixture.comment || '').length > 0;
      assert.ok(!hasAnyProvenance, 'No comment = UNKNOWN provenance');
    });
  });
 });