* feat: /gsd:ai-phase + /gsd:eval-review — AI evals and framework selection layer Adds a structured AI development layer to GSD with 5 new agents, 2 new commands, 2 new workflows, 2 reference files, and 1 template. Commands: - /gsd:ai-phase [N] — pre-planning AI design contract (inserts between discuss-phase and plan-phase). Orchestrates 4 agents in sequence: framework-selector → ai-researcher → domain-researcher → eval-planner. Output: AI-SPEC.md with framework decision, implementation guidance, domain expert context, and evaluation strategy. - /gsd:eval-review [N] — retroactive eval coverage audit. Scores each planned eval dimension as COVERED/PARTIAL/MISSING. Output: EVAL-REVIEW.md with 0-100 score, verdict, and remediation plan. Agents: - gsd-framework-selector: interactive decision matrix (6 questions) → scored framework recommendation for CrewAI, LlamaIndex, LangChain, LangGraph, OpenAI Agents SDK, Claude Agent SDK, AutoGen/AG2, Haystack - gsd-ai-researcher: fetches official framework docs + writes AI systems best practices (Pydantic structured outputs, async-first, prompt discipline, context window management, cost/latency budget) - gsd-domain-researcher: researches business domain and use-case context — surfaces domain expert evaluation criteria, industry failure modes, regulatory constraints, and practitioner rubric ingredients before eval-planner writes measurable criteria - gsd-eval-planner: designs evaluation strategy grounded in domain context; defaults to Arize Phoenix (tracing) + RAGAS (RAG eval) with detect-first guard for existing tooling - gsd-eval-auditor: retroactive codebase scan → scores eval coverage Integration points: - plan-phase: non-blocking nudge (step 4.5) when AI keywords detected and no AI-SPEC.md present - settings: new workflow.ai_phase toggle (default on) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: refine ai-integration-phase layer — rename, house style, consistency fixes Amends the ai-evals framework layer (df8cb6c) with post-review improvements before opening upstream PR. Rename /gsd:ai-phase → /gsd:ai-integration-phase: - Renamed commands/gsd/ai-phase.md → ai-integration-phase.md - Renamed get-shit-done/workflows/ai-phase.md → ai-integration-phase.md - Updated config key: workflow.ai_phase → workflow.ai_integration_phase - Updated repair action: addAiPhaseKey → addAiIntegrationPhaseKey - Updated all 84 cross-references across agents, workflows, templates, tests Consistency fixes (same class as PR #1380 review): - commands/gsd: objective described 3-agent chain, missing gsd-domain-researcher - workflows/ai-integration-phase: purpose tag described 3-agent chain + "locks three things" — updated to 4 agents + 4 outputs - workflows/ai-integration-phase: missing DOMAIN_MODEL resolve-model call in step 1 (domain-researcher was spawned in step 7.5 with no model variable) - workflows/ai-integration-phase: fractional step ## 7.5 renumbered to integers (steps 8–12 shifted) Agent house style (GSD meta-prompting conformance): - All 5 new agents refactored to execution_flow + step name="" structure - Role blocks compressed to 2 lines (removed verbose "Core responsibilities") - Added skills: frontmatter to all 5 agents (agent-frontmatter tests) - Added # hooks: commented pattern to file-writing agents - Added ALWAYS use Write tool anti-heredoc instruction to file-writing agents - Line reductions: ai-researcher −41%, domain-researcher −25%, eval-planner −26%, eval-auditor −25%, framework-selector −9% Test coverage (tests/ai-evals.test.cjs — 48 tests): - CONFIG: workflow.ai_integration_phase defaults and config-set/get - HEALTH: W010 warning emission and addAiIntegrationPhaseKey repair - TEMPLATE: AI-SPEC.md section completeness (10 sections) - COMMAND: ai-integration-phase + eval-review frontmatter validity - AGENTS: all 5 new agent files exist - REFERENCES: ai-evals.md + ai-frameworks.md exist and are non-empty - WORKFLOW: plan-phase nudge integration, workflow files exist + agent coverage 603/603 tests passing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * feat: add Google ADK to framework selector and reference matrix Google ADK (released March 2025) was missing from the framework options. Adds Python + Java multi-agent framework optimised for Gemini / Vertex AI. - get-shit-done/references/ai-frameworks.md: add Google ADK profile (type, language, model support, best for, avoid if, strengths, weaknesses, eval concerns); update Quick Picks, By System Type, and By Model Commitment tables - agents/gsd-framework-selector.md: add "Google (Gemini)" to model provider interview question - agents/gsd-ai-researcher.md: add Google ADK docs URL to documentation_sources Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: adapt to upstream conventions post-rebase - Remove skills: frontmatter from all 5 new agents (upstream changed convention — skills: breaks Gemini CLI and must not be present) - Add workflow.ai_integration_phase to VALID_CONFIG_KEYS whitelist in config.cjs (config-set blocked unknown keys) - Add ai_integration_phase: true to CONFIG_DEFAULTS in core.cjs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: rephrase 4b.1 line to avoid false-positive in prompt-injection scan "contract as a Pydantic model" matched the `act as a` pattern case-insensitively. Rephrased to "output schema using a Pydantic model". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: adapt to upstream conventions (W016, colon refs, config docs) - Replace verify.cjs from upstream to restore W010-W015 + cmdValidateAgents, lost when rebase conflict was resolved with --theirs - Add W016 (workflow.ai_integration_phase absent) inside the config try block, avoids collision with upstream's W010 agent-installation check - Add addAiIntegrationPhaseKey repair case mirroring addNyquistKey pattern - Replace /gsd: colon format with /gsd- hyphen format across all new files (agents, workflows, templates, verify.cjs) per stale-colon-refs guard (#1748) - Add workflow.ai_integration_phase to planning-config.md reference table - Add ai_integration_phase → workflow.ai_integration_phase to NAMESPACE_MAP in config-field-docs.test.cjs so CONFIG_DEFAULTS coverage check passes - Update ai-evals tests to use W016 instead of W010 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: add 5 new agents to E2E Copilot install expected list gsd-ai-researcher, gsd-domain-researcher, gsd-eval-auditor, gsd-eval-planner, gsd-framework-selector added to the hardcoded expected agent list in copilot-install.test.cjs (#1890). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
5.7 KiB
name, description, tools, color
| name | description | tools | color |
|---|---|---|---|
| gsd-eval-auditor | Retroactive audit of an implemented AI phase's evaluation coverage. Checks implementation against the AI-SPEC.md evaluation plan. Scores each eval dimension as COVERED/PARTIAL/MISSING. Produces a scored EVAL-REVIEW.md with findings, gaps, and remediation guidance. Spawned by /gsd-eval-review orchestrator. | Read, Write, Bash, Grep, Glob | #EF4444 |
<required_reading>
Read ~/.claude/get-shit-done/references/ai-evals.md before auditing. This is your scoring framework.
</required_reading>
If prompt contains <files_to_read>, read every listed file before doing anything else.
<execution_flow>
Read AI-SPEC.md (Sections 5, 6, 7), all SUMMARY.md files, and PLAN.md files. Extract from AI-SPEC.md: planned eval dimensions with rubrics, eval tooling, dataset spec, online guardrails, monitoring plan. ```bash # Eval/test files find . \( -name "*.test.*" -o -name "*.spec.*" -o -name "test_*" -o -name "eval_*" \) \ -not -path "*/node_modules/*" -not -path "*/.git/*" 2>/dev/null | head -40Tracing/observability setup
grep -r "langfuse|langsmith|arize|phoenix|braintrust|promptfoo"
--include=".py" --include=".ts" --include="*.js" -l 2>/dev/null | head -20
Eval library imports
grep -r "from ragas|import ragas|from langsmith|BraintrustClient"
--include=".py" --include=".ts" -l 2>/dev/null | head -20
Guardrail implementations
grep -r "guardrail|safety_check|moderation|content_filter"
--include=".py" --include=".ts" --include="*.js" -l 2>/dev/null | head -20
Eval config files and reference dataset
find . ( -name "promptfoo.yaml" -o -name "eval.config." -o -name ".jsonl" -o -name "evals*.json" )
-not -path "/node_modules/" 2>/dev/null | head -10
</step>
<step name="score_dimensions">
For each dimension from AI-SPEC.md Section 5:
| Status | Criteria |
|--------|----------|
| **COVERED** | Implementation exists, targets the rubric behavior, runs (automated or documented manual) |
| **PARTIAL** | Exists but incomplete — missing rubric specificity, not automated, or has known gaps |
| **MISSING** | No implementation found for this dimension |
For PARTIAL and MISSING: record what was planned, what was found, and specific remediation to reach COVERED.
</step>
<step name="audit_infrastructure">
Score 5 components (ok / partial / missing):
- **Eval tooling**: installed and actually called (not just listed as a dependency)
- **Reference dataset**: file exists and meets size/composition spec
- **CI/CD integration**: eval command present in Makefile, GitHub Actions, etc.
- **Online guardrails**: each planned guardrail implemented in the request path (not stubbed)
- **Tracing**: tool configured and wrapping actual AI calls
</step>
<step name="calculate_scores">
coverage_score = covered_count / total_dimensions × 100 infra_score = (tooling + dataset + cicd + guardrails + tracing) / 5 × 100 overall_score = (coverage_score × 0.6) + (infra_score × 0.4)
Verdict:
- 80-100: **PRODUCTION READY** — deploy with monitoring
- 60-79: **NEEDS WORK** — address CRITICAL gaps before production
- 40-59: **SIGNIFICANT GAPS** — do not deploy
- 0-39: **NOT IMPLEMENTED** — review AI-SPEC.md and implement
</step>
<step name="write_eval_review">
**ALWAYS use the Write tool to create files** — never use `Bash(cat << 'EOF')` or heredoc commands for file creation.
Write to `{phase_dir}/{padded_phase}-EVAL-REVIEW.md`:
```markdown
# EVAL-REVIEW — Phase {N}: {name}
**Audit Date:** {date}
**AI-SPEC Present:** Yes / No
**Overall Score:** {score}/100
**Verdict:** {PRODUCTION READY | NEEDS WORK | SIGNIFICANT GAPS | NOT IMPLEMENTED}
## Dimension Coverage
| Dimension | Status | Measurement | Finding |
|-----------|--------|-------------|---------|
| {dim} | COVERED/PARTIAL/MISSING | Code/LLM Judge/Human | {finding} |
**Coverage Score:** {n}/{total} ({pct}%)
## Infrastructure Audit
| Component | Status | Finding |
|-----------|--------|---------|
| Eval tooling ({tool}) | Installed / Configured / Not found | |
| Reference dataset | Present / Partial / Missing | |
| CI/CD integration | Present / Missing | |
| Online guardrails | Implemented / Partial / Missing | |
| Tracing ({tool}) | Configured / Not configured | |
**Infrastructure Score:** {score}/100
## Critical Gaps
{MISSING items with Critical severity only}
## Remediation Plan
### Must fix before production:
{Ordered CRITICAL gaps with specific steps}
### Should fix soon:
{PARTIAL items with steps}
### Nice to have:
{Lower-priority MISSING items}
## Files Found
{Eval-related files discovered during scan}
</execution_flow>
<success_criteria>
- AI-SPEC.md read (or noted as absent)
- All SUMMARY.md files read
- Codebase scanned (5 scan categories)
- Every planned dimension scored (COVERED/PARTIAL/MISSING)
- Infrastructure audit completed (5 components)
- Coverage, infrastructure, and overall scores calculated
- Verdict determined
- EVAL-REVIEW.md written with all sections populated
- Critical gaps identified and remediation is specific and actionable </success_criteria>