get-shit-done/tests/ai-evals.test.cjs
Fana 33575ba91d feat: /gsd-ai-integration-phase + /gsd-eval-review — AI framework selection and eval coverage layer (#1971)
* feat: /gsd:ai-phase + /gsd:eval-review — AI evals and framework selection layer

Adds a structured AI development layer to GSD with 5 new agents, 2 new
commands, 2 new workflows, 2 reference files, and 1 template.

Commands:
- /gsd:ai-phase [N] — pre-planning AI design contract (inserts between
  discuss-phase and plan-phase). Orchestrates 4 agents in sequence:
  framework-selector → ai-researcher → domain-researcher → eval-planner.
  Output: AI-SPEC.md with framework decision, implementation guidance,
  domain expert context, and evaluation strategy.
- /gsd:eval-review [N] — retroactive eval coverage audit. Scores each
  planned eval dimension as COVERED/PARTIAL/MISSING. Output: EVAL-REVIEW.md
  with 0-100 score, verdict, and remediation plan.
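
A rough sketch of how per-dimension statuses could roll up into the 0-100 score. This is an illustration only: the weights (COVERED = 1, PARTIAL = 0.5, MISSING = 0), the `scoreCoverage` name, and the dimension names are assumptions, not the auditor's actual rubric.

```javascript
// Hypothetical weights; the real gsd-eval-auditor rubric may differ.
const STATUS_WEIGHT = { COVERED: 1, PARTIAL: 0.5, MISSING: 0 };

// Average the per-dimension weights and scale to an integer in 0..100.
function scoreCoverage(dimensions) {
  if (dimensions.length === 0) return 0;
  const total = dimensions.reduce((sum, d) => {
    if (!Object.hasOwn(STATUS_WEIGHT, d.status)) {
      throw new Error(`Unknown status: ${d.status}`);
    }
    return sum + STATUS_WEIGHT[d.status];
  }, 0);
  return Math.round((total / dimensions.length) * 100);
}

// Illustrative dimensions (names are made up for this example).
const exampleDimensions = [
  { name: 'faithfulness', status: 'COVERED' },
  { name: 'latency budget', status: 'PARTIAL' },
  { name: 'refusal behaviour', status: 'MISSING' },
  { name: 'tracing', status: 'PARTIAL' },
];
console.log(scoreCoverage(exampleDimensions)); // 50
```

An unweighted average keeps the sketch small; a real rubric might weight critical dimensions more heavily.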

Agents:
- gsd-framework-selector: interactive decision matrix (6 questions) →
  scored framework recommendation for CrewAI, LlamaIndex, LangChain,
  LangGraph, OpenAI Agents SDK, Claude Agent SDK, AutoGen/AG2, Haystack
- gsd-ai-researcher: fetches official framework docs + writes AI systems
  best practices (Pydantic structured outputs, async-first, prompt
  discipline, context window management, cost/latency budget)
- gsd-domain-researcher: researches business domain and use-case context —
  surfaces domain expert evaluation criteria, industry failure modes,
  regulatory constraints, and practitioner rubric ingredients before
  eval-planner writes measurable criteria
- gsd-eval-planner: designs evaluation strategy grounded in domain context;
  defaults to Arize Phoenix (tracing) + RAGAS (RAG eval) with detect-first
  guard for existing tooling
- gsd-eval-auditor: retroactive codebase scan → scores eval coverage

Integration points:
- plan-phase: non-blocking nudge (step 4.5) when AI keywords detected and
  no AI-SPEC.md present
- settings: new workflow.ai_phase toggle (default on)
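
The nudge condition above might look roughly like this. It is a sketch under stated assumptions: the keyword list, the `shouldNudge` name, and the whole-word matching are hypothetical; only the three inputs (AI keywords, missing AI-SPEC.md, config toggle) come from this commit.

```javascript
// Hypothetical keyword list; the actual step 4.5 list is not shown here.
const AI_KEYWORDS = ['llm', 'agent', 'rag', 'embedding', 'prompt'];

// Returns true when the non-blocking nudge should be shown.
function shouldNudge({ phaseText, hasAiSpec, config }) {
  // A missing key is treated as enabled, matching the documented default (on).
  const enabled = config?.workflow?.ai_phase ?? true;
  if (!enabled || hasAiSpec) return false;
  // Whole-word matching, so "storage" does not trigger on "rag".
  const words = phaseText.toLowerCase().split(/[^a-z0-9]+/);
  return AI_KEYWORDS.some((kw) => words.includes(kw));
}

console.log(shouldNudge({
  phaseText: 'Build a RAG pipeline over support tickets',
  hasAiSpec: false,
  config: {},
})); // true
```

Because the function only returns a boolean, the caller can print a suggestion and carry on planning, which is what makes the nudge non-blocking.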

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: refine ai-integration-phase layer — rename, house style, consistency fixes

Amends the ai-evals framework layer (df8cb6c) with post-review improvements
before opening upstream PR.

Rename /gsd:ai-phase → /gsd:ai-integration-phase:
- Renamed commands/gsd/ai-phase.md → ai-integration-phase.md
- Renamed get-shit-done/workflows/ai-phase.md → ai-integration-phase.md
- Updated config key: workflow.ai_phase → workflow.ai_integration_phase
- Updated repair action: addAiPhaseKey → addAiIntegrationPhaseKey
- Updated all 84 cross-references across agents, workflows, templates, tests

Consistency fixes (same class as PR #1380 review):
- commands/gsd: objective described 3-agent chain, missing gsd-domain-researcher
- workflows/ai-integration-phase: purpose tag described 3-agent chain + "locks
  three things" — updated to 4 agents + 4 outputs
- workflows/ai-integration-phase: missing DOMAIN_MODEL resolve-model call in
  step 1 (domain-researcher was spawned in step 7.5 with no model variable)
- workflows/ai-integration-phase: fractional step ## 7.5 renumbered to integers
  (steps 8–12 shifted)

Agent house style (GSD meta-prompting conformance):
- All 5 new agents refactored to execution_flow + step name="" structure
- Role blocks compressed to 2 lines (removed verbose "Core responsibilities")
- Added skills: frontmatter to all 5 agents (agent-frontmatter tests)
- Added # hooks: commented pattern to file-writing agents
- Added ALWAYS use Write tool anti-heredoc instruction to file-writing agents
- Line reductions: ai-researcher −41%, domain-researcher −25%, eval-planner −26%,
  eval-auditor −25%, framework-selector −9%

Test coverage (tests/ai-evals.test.cjs — 48 tests):
- CONFIG: workflow.ai_integration_phase defaults and config-set/get
- HEALTH: W010 warning emission and addAiIntegrationPhaseKey repair
- TEMPLATE: AI-SPEC.md section completeness (10 sections)
- COMMAND: ai-integration-phase + eval-review frontmatter validity
- AGENTS: all 5 new agent files exist
- REFERENCES: ai-evals.md + ai-frameworks.md exist and are non-empty
- WORKFLOW: plan-phase nudge integration, workflow files exist + agent coverage

603/603 tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add Google ADK to framework selector and reference matrix

Google ADK (released March 2025) was missing from the framework options.
Adds Python + Java multi-agent framework optimised for Gemini / Vertex AI.

- get-shit-done/references/ai-frameworks.md: add Google ADK profile (type,
  language, model support, best for, avoid if, strengths, weaknesses, eval
  concerns); update Quick Picks, By System Type, and By Model Commitment tables
- agents/gsd-framework-selector.md: add "Google (Gemini)" to model provider
  interview question
- agents/gsd-ai-researcher.md: add Google ADK docs URL to documentation_sources

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: adapt to upstream conventions post-rebase

- Remove skills: frontmatter from all 5 new agents (upstream changed
  convention — skills: breaks Gemini CLI and must not be present)
- Add workflow.ai_integration_phase to VALID_CONFIG_KEYS whitelist in
  config.cjs (config-set blocked unknown keys)
- Add ai_integration_phase: true to CONFIG_DEFAULTS in core.cjs
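
For context, a whitelist-guarded config-set could behave roughly as below. This is a minimal sketch, not the code in config.cjs: the `configSet` function, the contents of the set, and the string-to-boolean coercion are assumptions for illustration.

```javascript
// Hypothetical stand-in for the VALID_CONFIG_KEYS whitelist in config.cjs.
const VALID_CONFIG_KEYS = new Set([
  'model_profile',
  'workflow.research',
  'workflow.ai_integration_phase',
]);

// Set a dotted key on a config object, rejecting keys that are not whitelisted.
function configSet(config, dottedKey, rawValue) {
  if (!VALID_CONFIG_KEYS.has(dottedKey)) {
    throw new Error(`Unknown config key: ${dottedKey}`);
  }
  // CLI arguments arrive as strings; coerce 'true'/'false' to real booleans.
  const value = rawValue === 'true' ? true : rawValue === 'false' ? false : rawValue;
  const parts = dottedKey.split('.');
  let node = config;
  for (const part of parts.slice(0, -1)) {
    node[part] = node[part] ?? {};
    node = node[part];
  }
  node[parts[parts.length - 1]] = value;
  return config;
}

console.log(configSet({}, 'workflow.ai_integration_phase', 'false'));
// { workflow: { ai_integration_phase: false } }
```

With a guard like this, a new key that is missing from the whitelist is rejected even when a default exists, which is why both lists had to be updated.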

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: rephrase 4b.1 line to avoid false-positive in prompt-injection scan

"contract as a Pydantic model" matched the `act as a` pattern case-insensitively.
Rephrased to "output schema using a Pydantic model".
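
The false positive is easy to reproduce. The sketch below assumes the scanner uses a plain case-insensitive substring pattern; the word-boundary variant is shown only as one possible alternative, not something this commit changed.

```javascript
// Assumed shape of the scanner pattern: case-insensitive, no word boundaries.
const naivePattern = /act as a/i;
// Possible tightening (an illustration, not part of this commit).
const boundedPattern = /\bact as a\b/i;

const originalPhrase = 'contract as a Pydantic model';
const rephrased = 'output schema using a Pydantic model';

console.log(naivePattern.test(originalPhrase));   // true  ("contr" + "act as a")
console.log(naivePattern.test(rephrased));        // false
console.log(boundedPattern.test(originalPhrase)); // false (no boundary inside "contract")
console.log(boundedPattern.test('Act as a system administrator')); // true
```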

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: adapt to upstream conventions (W016, colon refs, config docs)

- Replace verify.cjs from upstream to restore W010-W015 + cmdValidateAgents,
  lost when rebase conflict was resolved with --theirs
- Add W016 (workflow.ai_integration_phase absent) inside the config try block,
  avoids collision with upstream's W010 agent-installation check
- Add addAiIntegrationPhaseKey repair case mirroring addNyquistKey pattern
- Replace /gsd: colon format with /gsd- hyphen format across all new files
  (agents, workflows, templates, verify.cjs) per stale-colon-refs guard (#1748)
- Add workflow.ai_integration_phase to planning-config.md reference table
- Add ai_integration_phase → workflow.ai_integration_phase to NAMESPACE_MAP
  in config-field-docs.test.cjs so CONFIG_DEFAULTS coverage check passes
- Update ai-evals tests to use W016 instead of W010
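
The W016 check and its repair can be pictured as the pair of functions below. This is a sketch only: the function names, the `message` text, and the warning object shape beyond `code` are assumptions, and the real check lives inside verify.cjs.

```javascript
// Emit W016 only when the key is absent; an explicit false is a valid opt-out.
function checkAiIntegrationPhase(config) {
  const warnings = [];
  if (config.workflow?.ai_integration_phase === undefined) {
    warnings.push({
      code: 'W016',
      message: 'workflow.ai_integration_phase absent from config', // hypothetical wording
      repair: 'addAiIntegrationPhaseKey',
    });
  }
  return warnings;
}

// Repair: add the key with its default (true) without touching other settings.
function addAiIntegrationPhaseKey(config) {
  return {
    ...config,
    workflow: { ...config.workflow, ai_integration_phase: true },
  };
}

const legacyConfig = { model_profile: 'balanced', workflow: { research: true } };
console.log(checkAiIntegrationPhase(legacyConfig).map((w) => w.code)); // [ 'W016' ]
console.log(checkAiIntegrationPhase(addAiIntegrationPhaseKey(legacyConfig)).length); // 0
```

Comparing against `undefined` rather than testing truthiness is the important detail: a user who sets the key to false has made a choice and should not be warned.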

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: add 5 new agents to E2E Copilot install expected list

gsd-ai-researcher, gsd-domain-researcher, gsd-eval-auditor,
gsd-eval-planner, gsd-framework-selector added to the hardcoded
expected agent list in copilot-install.test.cjs (#1890).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-10 10:49:00 -04:00


/**
 * GSD AI Evals Framework Tests
 *
 * Validates the /gsd-ai-integration-phase + /gsd-eval-review contribution:
 * - workflow.ai_integration_phase key in config defaults and config-set/get
 * - W016 validate-health warning when ai_integration_phase absent
 * - addAiIntegrationPhaseKey repair action
 * - AI-SPEC.md template section completeness
 * - New agent frontmatter (picked up by agent-frontmatter.test.cjs — covered there)
 * - plan-phase.md Step 4.5 AI-keyword nudge block
 * - ai-integration-phase and eval-review command frontmatter
 * - ai-evals.md and ai-frameworks.md reference files exist and are non-empty
 */
const { test, describe, beforeEach, afterEach } = require('node:test');
const assert = require('node:assert');
const fs = require('fs');
const path = require('path');
const { runGsdTools, createTempProject, cleanup } = require('./helpers.cjs');
const REPO_ROOT = path.join(__dirname, '..');
const AGENTS_DIR = path.join(REPO_ROOT, 'agents');
const COMMANDS_DIR = path.join(REPO_ROOT, 'commands', 'gsd');
const WORKFLOWS_DIR = path.join(REPO_ROOT, 'get-shit-done', 'workflows');
const TEMPLATES_DIR = path.join(REPO_ROOT, 'get-shit-done', 'templates');
const REFERENCES_DIR = path.join(REPO_ROOT, 'get-shit-done', 'references');
// ─── Helpers ─────────────────────────────────────────────────────────────────
// Read and parse .planning/config.json from a temp project.
function readConfig(tmpDir) {
  return JSON.parse(fs.readFileSync(path.join(tmpDir, '.planning', 'config.json'), 'utf-8'));
}

// Overwrite .planning/config.json with the given object.
function writeConfig(tmpDir, obj) {
  fs.writeFileSync(
    path.join(tmpDir, '.planning', 'config.json'),
    JSON.stringify(obj, null, 2),
    'utf-8'
  );
}

// Write the minimal set of planning files that `validate health` expects.
function writeMinimalHealth(tmpDir) {
  fs.writeFileSync(path.join(tmpDir, '.planning', 'PROJECT.md'),
    '# Project\n\n## What This Is\n\nFoo.\n\n## Core Value\n\nBar.\n\n## Requirements\n\nBaz.\n');
  fs.writeFileSync(path.join(tmpDir, '.planning', 'ROADMAP.md'),
    '# Roadmap\n\n### Phase 1: Setup\n');
  fs.writeFileSync(path.join(tmpDir, '.planning', 'STATE.md'),
    '# Session State\n\nPhase 1 in progress.\n');
  fs.mkdirSync(path.join(tmpDir, '.planning', 'phases', '01-setup'), { recursive: true });
}
// ─── Config: workflow.ai_integration_phase default ───────────────────────────────────────
describe('CONFIG: workflow.ai_integration_phase default', () => {
  let tmpDir;
  beforeEach(() => { tmpDir = createTempProject(); });
  afterEach(() => { cleanup(tmpDir); });

  test('config-ensure-section includes workflow.ai_integration_phase as boolean', () => {
    const result = runGsdTools('config-ensure-section', tmpDir);
    assert.ok(result.success, `Command failed: ${result.error}`);
    const config = readConfig(tmpDir);
    assert.ok(config.workflow && typeof config.workflow === 'object', 'workflow should exist');
    assert.strictEqual(typeof config.workflow.ai_integration_phase, 'boolean', 'workflow.ai_integration_phase should be boolean');
  });

  test('workflow.ai_integration_phase defaults to true', () => {
    runGsdTools('config-ensure-section', tmpDir);
    const config = readConfig(tmpDir);
    assert.strictEqual(config.workflow.ai_integration_phase, true, 'workflow.ai_integration_phase should default to true');
  });
});
// ─── Config: config-set / config-get workflow.ai_integration_phase ───────────────────────
describe('CONFIG: config-set / config-get workflow.ai_integration_phase', () => {
  let tmpDir;
  beforeEach(() => {
    tmpDir = createTempProject();
    runGsdTools('config-ensure-section', tmpDir);
  });
  afterEach(() => { cleanup(tmpDir); });

  test('config-set workflow.ai_integration_phase false persists as boolean false', () => {
    const result = runGsdTools('config-set workflow.ai_integration_phase false', tmpDir);
    assert.ok(result.success, `config-set failed: ${result.error}`);
    const config = readConfig(tmpDir);
    assert.strictEqual(config.workflow.ai_integration_phase, false);
    assert.strictEqual(typeof config.workflow.ai_integration_phase, 'boolean');
  });

  test('config-set workflow.ai_integration_phase true persists as boolean true', () => {
    runGsdTools('config-set workflow.ai_integration_phase false', tmpDir);
    const result = runGsdTools('config-set workflow.ai_integration_phase true', tmpDir);
    assert.ok(result.success, `config-set failed: ${result.error}`);
    const config = readConfig(tmpDir);
    assert.strictEqual(config.workflow.ai_integration_phase, true);
  });

  test('config-get workflow.ai_integration_phase returns the stored value', () => {
    runGsdTools('config-set workflow.ai_integration_phase false', tmpDir);
    const result = runGsdTools('config-get workflow.ai_integration_phase', tmpDir);
    assert.ok(result.success, `config-get failed: ${result.error}`);
    assert.strictEqual(JSON.parse(result.output), false);
  });
});
// ─── Validate Health: W016 ────────────────────────────────────────────────────
describe('HEALTH: W016 — workflow.ai_integration_phase absent', () => {
  let tmpDir;
  beforeEach(() => { tmpDir = createTempProject(); });
  afterEach(() => { cleanup(tmpDir); });

  test('emits W016 when workflow.ai_integration_phase absent from config', () => {
    writeMinimalHealth(tmpDir);
    writeConfig(tmpDir, { model_profile: 'balanced', workflow: { research: true, nyquist_validation: true } });
    const result = runGsdTools('validate health', tmpDir);
    assert.ok(result.success, `Command failed: ${result.error}`);
    const output = JSON.parse(result.output);
    assert.ok(
      output.warnings.some(w => w.code === 'W016'),
      `Expected W016 in warnings: ${JSON.stringify(output.warnings)}`
    );
  });

  test('does not emit W016 when workflow.ai_integration_phase is explicitly set', () => {
    writeMinimalHealth(tmpDir);
    writeConfig(tmpDir, {
      model_profile: 'balanced',
      workflow: { research: true, nyquist_validation: true, ai_integration_phase: true },
    });
    const result = runGsdTools('validate health', tmpDir);
    assert.ok(result.success, `Command failed: ${result.error}`);
    const output = JSON.parse(result.output);
    assert.ok(
      !output.warnings.some(w => w.code === 'W016'),
      `Should not have W016: ${JSON.stringify(output.warnings)}`
    );
  });

  test('does not emit W016 when workflow.ai_integration_phase is false (explicit opt-out)', () => {
    writeMinimalHealth(tmpDir);
    writeConfig(tmpDir, {
      model_profile: 'balanced',
      workflow: { research: true, nyquist_validation: true, ai_integration_phase: false },
    });
    const result = runGsdTools('validate health', tmpDir);
    assert.ok(result.success, `Command failed: ${result.error}`);
    const output = JSON.parse(result.output);
    assert.ok(
      !output.warnings.some(w => w.code === 'W016'),
      `Should not have W016: ${JSON.stringify(output.warnings)}`
    );
  });
});
// ─── Validate Health --repair: addAiIntegrationPhaseKey ─────────────────────────────────
describe('HEALTH --repair: addAiIntegrationPhaseKey', () => {
  let tmpDir;
  beforeEach(() => { tmpDir = createTempProject(); });
  afterEach(() => { cleanup(tmpDir); });

  test('adds workflow.ai_integration_phase via addAiIntegrationPhaseKey repair', () => {
    writeMinimalHealth(tmpDir);
    // Start from a config that lacks the key so the repair has something to do.
    writeConfig(tmpDir, { model_profile: 'balanced', workflow: { research: true, nyquist_validation: true } });
    const result = runGsdTools('validate health --repair', tmpDir);
    assert.ok(result.success, `Command failed: ${result.error}`);
    const output = JSON.parse(result.output);
    const addAction = output.repairs_performed.find(r => r.action === 'addAiIntegrationPhaseKey');
    assert.ok(addAction, `Expected addAiIntegrationPhaseKey action: ${JSON.stringify(output.repairs_performed)}`);
    assert.strictEqual(addAction.success, true);
    const config = readConfig(tmpDir);
    assert.strictEqual(config.workflow.ai_integration_phase, true);
  });
});
// ─── AI-SPEC.md Template Structure ───────────────────────────────────────────
describe('TEMPLATE: AI-SPEC.md section completeness', () => {
  const templatePath = path.join(TEMPLATES_DIR, 'AI-SPEC.md');

  test('AI-SPEC.md template exists and is non-empty', () => {
    assert.ok(fs.existsSync(templatePath), 'AI-SPEC.md template should exist');
    const content = fs.readFileSync(templatePath, 'utf-8');
    assert.ok(content.length > 100, 'AI-SPEC.md should be non-empty');
  });

  // [heading that must appear in the template, label used in the test name]
  const requiredSections = [
    ['## 1. System Classification', 'Section 1 (System Classification)'],
    ['## 1b. Domain Context', 'Section 1b (Domain Context)'],
    ['## 2. Framework Decision', 'Section 2 (Framework Decision)'],
    ['## 3. Framework Quick Reference', 'Section 3 (Framework Quick Reference)'],
    ['## 4. Implementation Guidance', 'Section 4 (Implementation Guidance)'],
    ['## 4b. AI Systems Best Practices', 'Section 4b (AI Systems Best Practices)'],
    ['## 5. Evaluation Strategy', 'Section 5 (Evaluation Strategy)'],
    ['## 6. Guardrails', 'Section 6 (Guardrails)'],
    ['## 7. Production Monitoring', 'Section 7 (Production Monitoring)'],
    ['## Checklist', 'Checklist section'],
  ];

  for (const [heading, label] of requiredSections) {
    test(`template contains ${label}`, () => {
      const c = fs.readFileSync(templatePath, 'utf-8');
      assert.ok(c.includes(heading), `Template missing: ${heading}`);
    });
  }

  test('template checklist has at least 10 items', () => {
    const c = fs.readFileSync(templatePath, 'utf-8');
    const items = (c.match(/^- \[[ x]\]/gm) || []);
    assert.ok(items.length >= 10, `Expected ≥10 checklist items, found ${items.length}`);
  });

  test('template Section 1b has domain rubric subsection', () => {
    const c = fs.readFileSync(templatePath, 'utf-8');
    assert.ok(c.includes('What Domain Experts Evaluate Against'), 'Missing domain rubric subsection');
  });

  test('template Section 4b has Pydantic structured outputs guidance', () => {
    const c = fs.readFileSync(templatePath, 'utf-8');
    assert.ok(c.includes('Pydantic'), 'Section 4b missing Pydantic guidance');
  });

  test('template Section 6 has online guardrails and offline flywheel tables', () => {
    const c = fs.readFileSync(templatePath, 'utf-8');
    assert.ok(c.includes('Online'), 'Section 6 missing Online guardrails');
    assert.ok(c.includes('Offline'), 'Section 6 missing Offline flywheel');
  });
});
// ─── Command Frontmatter ──────────────────────────────────────────────────────
describe('COMMAND: ai-integration-phase and eval-review frontmatter', () => {
  const commands = ['ai-integration-phase', 'eval-review'];

  for (const cmd of commands) {
    test(`${cmd}.md exists`, () => {
      const p = path.join(COMMANDS_DIR, `${cmd}.md`);
      assert.ok(fs.existsSync(p), `commands/gsd/${cmd}.md should exist`);
    });

    test(`${cmd}.md has name, description, argument-hint`, () => {
      const content = fs.readFileSync(path.join(COMMANDS_DIR, `${cmd}.md`), 'utf-8');
      // Frontmatter is the text between the first two '---' delimiters.
      const frontmatter = content.split('---')[1] || '';
      assert.ok(frontmatter.includes('name:'), `${cmd}.md missing name:`);
      assert.ok(frontmatter.includes('description:'), `${cmd}.md missing description:`);
      assert.ok(frontmatter.includes('argument-hint:'), `${cmd}.md missing argument-hint:`);
    });
  }

  test('ai-integration-phase.md name is gsd:ai-integration-phase', () => {
    const content = fs.readFileSync(path.join(COMMANDS_DIR, 'ai-integration-phase.md'), 'utf-8');
    assert.ok(content.includes('name: gsd:ai-integration-phase'), 'ai-integration-phase command name mismatch');
  });

  test('eval-review.md name is gsd:eval-review', () => {
    const content = fs.readFileSync(path.join(COMMANDS_DIR, 'eval-review.md'), 'utf-8');
    assert.ok(content.includes('name: gsd:eval-review'), 'eval-review command name mismatch');
  });
});
// ─── New Agents Exist ─────────────────────────────────────────────────────────
describe('AGENTS: new AI-evals agents exist', () => {
  const newAgents = [
    'gsd-framework-selector',
    'gsd-ai-researcher',
    'gsd-domain-researcher',
    'gsd-eval-planner',
    'gsd-eval-auditor',
  ];

  for (const agent of newAgents) {
    test(`${agent}.md exists`, () => {
      assert.ok(
        fs.existsSync(path.join(AGENTS_DIR, `${agent}.md`)),
        `agents/${agent}.md should exist`
      );
    });
  }
});
// ─── Reference Files ──────────────────────────────────────────────────────────
describe('REFERENCES: ai-frameworks.md and ai-evals.md', () => {
  const refs = ['ai-frameworks.md', 'ai-evals.md'];

  for (const ref of refs) {
    test(`${ref} exists and is non-empty`, () => {
      const p = path.join(REFERENCES_DIR, ref);
      assert.ok(fs.existsSync(p), `references/${ref} should exist`);
      const content = fs.readFileSync(p, 'utf-8');
      assert.ok(content.length > 200, `references/${ref} should have substantial content`);
    });
  }

  test('ai-frameworks.md covers key frameworks', () => {
    const content = fs.readFileSync(path.join(REFERENCES_DIR, 'ai-frameworks.md'), 'utf-8');
    for (const fw of ['CrewAI', 'LlamaIndex', 'LangChain', 'LangGraph']) {
      assert.ok(content.includes(fw), `ai-frameworks.md should mention ${fw}`);
    }
  });

  test('ai-evals.md covers eval tooling defaults', () => {
    const content = fs.readFileSync(path.join(REFERENCES_DIR, 'ai-evals.md'), 'utf-8');
    assert.ok(content.includes('Arize Phoenix') || content.includes('Phoenix'), 'ai-evals.md should mention Arize Phoenix');
    assert.ok(content.includes('RAGAS'), 'ai-evals.md should mention RAGAS');
  });
});
// ─── Workflow: plan-phase Step 4.5 AI keyword nudge ──────────────────────────
describe('WORKFLOW: plan-phase.md AI nudge integration', () => {
  const planPhasePath = path.join(WORKFLOWS_DIR, 'plan-phase.md');

  test('plan-phase.md exists', () => {
    assert.ok(fs.existsSync(planPhasePath), 'workflows/plan-phase.md should exist');
  });

  test('plan-phase.md contains AI keyword detection for LLM/agent/RAG terms', () => {
    const content = fs.readFileSync(planPhasePath, 'utf-8');
    // `&&` binds tighter than `||`; group the alternatives so 'agent' is always required.
    assert.ok(
      content.includes('agent') &&
        (content.includes('llm') || content.includes('rag') || content.includes('AI')),
      'plan-phase.md should contain AI keyword detection'
    );
  });

  test('plan-phase.md references /gsd-ai-integration-phase nudge', () => {
    const content = fs.readFileSync(planPhasePath, 'utf-8');
    assert.ok(
      content.includes('ai-integration-phase') || content.includes('ai_integration_phase'),
      'plan-phase.md should reference ai-integration-phase workflow'
    );
  });

  test('plan-phase.md references workflow.ai_integration_phase config toggle', () => {
    const content = fs.readFileSync(planPhasePath, 'utf-8');
    assert.ok(
      content.includes('ai_integration_phase'),
      'plan-phase.md should check workflow.ai_integration_phase config'
    );
  });
});
// ─── Workflow: ai-integration-phase and eval-review workflows exist ──────────────────────
describe('WORKFLOW: ai-integration-phase and eval-review workflow files', () => {
  const workflows = ['ai-integration-phase', 'eval-review'];

  for (const wf of workflows) {
    test(`${wf}.md workflow exists`, () => {
      assert.ok(
        fs.existsSync(path.join(WORKFLOWS_DIR, `${wf}.md`)),
        `workflows/${wf}.md should exist`
      );
    });
  }

  test('ai-integration-phase.md orchestrates 4 agents', () => {
    const content = fs.readFileSync(path.join(WORKFLOWS_DIR, 'ai-integration-phase.md'), 'utf-8');
    for (const agent of ['gsd-framework-selector', 'gsd-ai-researcher', 'gsd-domain-researcher', 'gsd-eval-planner']) {
      assert.ok(content.includes(agent), `ai-integration-phase.md should reference ${agent}`);
    }
  });

  test('eval-review.md references gsd-eval-auditor', () => {
    const content = fs.readFileSync(path.join(WORKFLOWS_DIR, 'eval-review.md'), 'utf-8');
    assert.ok(content.includes('gsd-eval-auditor'), 'eval-review.md should reference gsd-eval-auditor');
  });

  test('select-framework.md does NOT exist (removed per design)', () => {
    assert.ok(
      !fs.existsSync(path.join(WORKFLOWS_DIR, 'select-framework.md')),
      'select-framework.md should not exist — removed in favour of ai-integration-phase nudge'
    );
  });
});