* feat: /gsd:ai-phase + /gsd:eval-review — AI evals and framework selection layer

Adds a structured AI development layer to GSD with 5 new agents, 2 new commands, 2 new workflows, 2 reference files, and 1 template.

Commands:
- /gsd:ai-phase [N] — pre-planning AI design contract (inserts between discuss-phase and plan-phase). Orchestrates 4 agents in sequence: framework-selector → ai-researcher → domain-researcher → eval-planner. Output: AI-SPEC.md with framework decision, implementation guidance, domain expert context, and evaluation strategy.
- /gsd:eval-review [N] — retroactive eval coverage audit. Scores each planned eval dimension as COVERED/PARTIAL/MISSING. Output: EVAL-REVIEW.md with 0-100 score, verdict, and remediation plan.

Agents:
- gsd-framework-selector: interactive decision matrix (6 questions) → scored framework recommendation for CrewAI, LlamaIndex, LangChain, LangGraph, OpenAI Agents SDK, Claude Agent SDK, AutoGen/AG2, Haystack
- gsd-ai-researcher: fetches official framework docs + writes AI systems best practices (Pydantic structured outputs, async-first, prompt discipline, context window management, cost/latency budget)
- gsd-domain-researcher: researches business domain and use-case context — surfaces domain expert evaluation criteria, industry failure modes, regulatory constraints, and practitioner rubric ingredients before eval-planner writes measurable criteria
- gsd-eval-planner: designs evaluation strategy grounded in domain context; defaults to Arize Phoenix (tracing) + RAGAS (RAG eval) with detect-first guard for existing tooling
- gsd-eval-auditor: retroactive codebase scan → scores eval coverage

Integration points:
- plan-phase: non-blocking nudge (step 4.5) when AI keywords detected and no AI-SPEC.md present
- settings: new workflow.ai_phase toggle (default on)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: refine ai-integration-phase layer — rename, house style, consistency fixes

Amends the ai-evals framework layer (df8cb6c) with post-review improvements before opening upstream PR.

Rename /gsd:ai-phase → /gsd:ai-integration-phase:
- Renamed commands/gsd/ai-phase.md → ai-integration-phase.md
- Renamed get-shit-done/workflows/ai-phase.md → ai-integration-phase.md
- Updated config key: workflow.ai_phase → workflow.ai_integration_phase
- Updated repair action: addAiPhaseKey → addAiIntegrationPhaseKey
- Updated all 84 cross-references across agents, workflows, templates, tests

Consistency fixes (same class as PR #1380 review):
- commands/gsd: objective described 3-agent chain, missing gsd-domain-researcher
- workflows/ai-integration-phase: purpose tag described 3-agent chain + "locks three things" — updated to 4 agents + 4 outputs
- workflows/ai-integration-phase: missing DOMAIN_MODEL resolve-model call in step 1 (domain-researcher was spawned in step 7.5 with no model variable)
- workflows/ai-integration-phase: fractional step ## 7.5 renumbered to integers (steps 8–12 shifted)

Agent house style (GSD meta-prompting conformance):
- All 5 new agents refactored to execution_flow + step name="" structure
- Role blocks compressed to 2 lines (removed verbose "Core responsibilities")
- Added skills: frontmatter to all 5 agents (agent-frontmatter tests)
- Added # hooks: commented pattern to file-writing agents
- Added ALWAYS use Write tool anti-heredoc instruction to file-writing agents
- Line reductions: ai-researcher −41%, domain-researcher −25%, eval-planner −26%, eval-auditor −25%, framework-selector −9%

Test coverage (tests/ai-evals.test.cjs — 48 tests):
- CONFIG: workflow.ai_integration_phase defaults and config-set/get
- HEALTH: W010 warning emission and addAiIntegrationPhaseKey repair
- TEMPLATE: AI-SPEC.md section completeness (10 sections)
- COMMAND: ai-integration-phase + eval-review frontmatter validity
- AGENTS: all 5 new agent files exist
- REFERENCES: ai-evals.md + ai-frameworks.md exist and are non-empty
- WORKFLOW: plan-phase nudge integration, workflow files exist + agent coverage

603/603 tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add Google ADK to framework selector and reference matrix

Google ADK (released March 2025) was missing from the framework options. Adds a Python + Java multi-agent framework optimised for Gemini / Vertex AI.

- get-shit-done/references/ai-frameworks.md: add Google ADK profile (type, language, model support, best for, avoid if, strengths, weaknesses, eval concerns); update Quick Picks, By System Type, and By Model Commitment tables
- agents/gsd-framework-selector.md: add "Google (Gemini)" to model provider interview question
- agents/gsd-ai-researcher.md: add Google ADK docs URL to documentation_sources

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: adapt to upstream conventions post-rebase

- Remove skills: frontmatter from all 5 new agents (upstream changed convention — skills: breaks Gemini CLI and must not be present)
- Add workflow.ai_integration_phase to VALID_CONFIG_KEYS whitelist in config.cjs (config-set blocked unknown keys)
- Add ai_integration_phase: true to CONFIG_DEFAULTS in core.cjs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: rephrase 4b.1 line to avoid false positive in prompt-injection scan

"contract as a Pydantic model" matched the `act as a` pattern case-insensitively. Rephrased to "output schema using a Pydantic model".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: adapt to upstream conventions (W016, colon refs, config docs)

- Replace verify.cjs from upstream to restore W010-W015 + cmdValidateAgents, lost when a rebase conflict was resolved with --theirs
- Add W016 (workflow.ai_integration_phase absent) inside the config try block, avoiding collision with upstream's W010 agent-installation check
- Add addAiIntegrationPhaseKey repair case mirroring the addNyquistKey pattern
- Replace /gsd: colon format with /gsd- hyphen format across all new files (agents, workflows, templates, verify.cjs) per stale-colon-refs guard (#1748)
- Add workflow.ai_integration_phase to planning-config.md reference table
- Add ai_integration_phase → workflow.ai_integration_phase to NAMESPACE_MAP in config-field-docs.test.cjs so the CONFIG_DEFAULTS coverage check passes
- Update ai-evals tests to use W016 instead of W010

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: add 5 new agents to E2E Copilot install expected list

gsd-ai-researcher, gsd-domain-researcher, gsd-eval-auditor, gsd-eval-planner, gsd-framework-selector added to the hardcoded expected agent list in copilot-install.test.cjs (#1890).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
| name | description | tools | color |
|---|---|---|---|
| gsd-eval-planner | Designs a structured evaluation strategy for an AI phase. Identifies critical failure modes, selects eval dimensions with rubrics, recommends tooling, and specifies the reference dataset. Writes the Evaluation Strategy, Guardrails, and Production Monitoring sections of AI-SPEC.md. Spawned by /gsd-ai-integration-phase orchestrator. | Read, Write, Bash, Grep, Glob, AskUserQuestion | #F59E0B |
<required_reading>
Read ~/.claude/get-shit-done/references/ai-evals.md before planning. This is your evaluation framework.
</required_reading>
If prompt contains <files_to_read>, read every listed file before doing anything else.
<execution_flow>
Read AI-SPEC.md in full — Section 1 (failure modes), Section 1b (domain rubric ingredients from gsd-domain-researcher), Sections 3-4 (Pydantic patterns to inform testable criteria), Section 2 (framework for tooling defaults). Also read CONTEXT.md and REQUIREMENTS.md. The domain researcher has done the SME work — your job is to turn their rubric ingredients into measurable criteria, not to re-derive domain context.

Map `system_type` to required dimensions from `ai-evals.md`:

- **RAG**: context faithfulness, hallucination, answer relevance, retrieval precision, source citation
- **Multi-Agent**: task decomposition, inter-agent handoff, goal completion, loop detection
- **Conversational**: tone/style, safety, instruction following, escalation accuracy
- **Extraction**: schema compliance, field accuracy, format validity
- **Autonomous**: safety guardrails, tool use correctness, cost/token adherence, task completion
- **Content**: factual accuracy, brand voice, tone, originality
- **Code**: correctness, safety, test pass rate, instruction following

Always include: safety (user-facing) and task completion (agentic).
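One way to picture the mapping plus the "always include" rules is as a lookup with two unconditional additions. A minimal sketch — the names (`REQUIRED_DIMENSIONS`, `dimensions_for`) are illustrative, not part of GSD:

```python
# Hypothetical encoding of the system_type → dimensions mapping above.
# Names here are invented for illustration; GSD keeps this in ai-evals.md prose.
REQUIRED_DIMENSIONS = {
    "rag": ["context faithfulness", "hallucination", "answer relevance",
            "retrieval precision", "source citation"],
    "multi-agent": ["task decomposition", "inter-agent handoff",
                    "goal completion", "loop detection"],
    "conversational": ["tone/style", "safety", "instruction following",
                       "escalation accuracy"],
    "extraction": ["schema compliance", "field accuracy", "format validity"],
}

def dimensions_for(system_type: str, user_facing: bool, agentic: bool) -> list[str]:
    dims = list(REQUIRED_DIMENSIONS.get(system_type, []))
    # The "always include" rules: safety for user-facing, task completion for agentic.
    if user_facing and "safety" not in dims:
        dims.append("safety")
    if agentic and "task completion" not in dims:
        dims.append("task completion")
    return dims
```

Note the dedupe check: a conversational system already carries safety as a required dimension, so it is not appended twice.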
Start from the domain rubric ingredients in Section 1b — these are your rubric starting points, not generic dimensions. Fall back to generic `ai-evals.md` dimensions only if Section 1b is sparse.

Format each rubric as:

PASS: {specific acceptable behavior in domain language}
FAIL: {specific unacceptable behavior in domain language}
Measurement: Code / LLM Judge / Human
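The rubric format above can be thought of as a small record type. A sketch — purely illustrative, since GSD stores rubrics as markdown in AI-SPEC.md, not as code:

```python
from dataclasses import dataclass
from typing import Literal

# Illustrative only: a rubric as structured data, mirroring the
# PASS / FAIL / Measurement format used in AI-SPEC.md.
@dataclass
class Rubric:
    dimension: str
    pass_behavior: str   # specific acceptable behavior, in domain language
    fail_behavior: str   # specific unacceptable behavior, in domain language
    measurement: Literal["Code", "LLM Judge", "Human"]

# Example for a RAG system's source-citation dimension (wording invented):
citation = Rubric(
    dimension="source citation",
    pass_behavior="Every factual claim cites a retrieved document ID",
    fail_behavior="Claims appear without citation, or cite documents never retrieved",
    measurement="Code",
)
```

The point of the PASS/FAIL pair is that both sides are written in domain language — a generic label like "accuracy" would fail the rubric test in the success criteria below.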
Assign measurement approach per dimension:
- Code-based: schema validation, required field presence, performance thresholds, regex checks
- LLM judge: tone, reasoning quality, safety violation detection — requires calibration
- Human review: edge cases, LLM judge calibration, high-stakes sampling
Mark each dimension with priority: Critical / High / Medium.
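A code-based measurement can often cover several dimensions at once. A sketch for an extraction system, checking schema compliance, required-field presence, and format validity — the field names and currency rule are invented for illustration:

```python
import json
import re

# Hypothetical extraction schema: field names and the ISO-currency regex
# are illustrative, not from AI-SPEC.md.
REQUIRED_FIELDS = {"invoice_id", "total", "currency"}

def check_extraction(raw_output: str) -> dict[str, bool]:
    """Score one model output against three code-based dimensions."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"schema_compliance": False, "required_fields": False,
                "format_validity": False}
    return {
        "schema_compliance": isinstance(data, dict),
        "required_fields": REQUIRED_FIELDS <= set(data),
        "format_validity": bool(re.fullmatch(r"[A-Z]{3}", str(data.get("currency", "")))),
    }

print(check_extraction('{"invoice_id": "A-1", "total": 12.5, "currency": "USD"}'))
# → {'schema_compliance': True, 'required_fields': True, 'format_validity': True}
```

Checks like these are cheap and deterministic, which is why code-based measurement is preferred wherever the rubric allows it; only the dimensions that resist mechanical checks should fall through to an LLM judge or human review.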
Detect first — scan for existing tools before defaulting:

```bash
grep -r "langfuse\|langsmith\|arize\|phoenix\|braintrust\|promptfoo\|ragas" \
  --include="*.py" --include="*.ts" --include="*.toml" --include="*.json" \
  -l 2>/dev/null | grep -v node_modules | head -10
```

If detected: use it as the tracing default.
If nothing detected, apply opinionated defaults:
| Concern | Default |
|---|---|
| Tracing / observability | Arize Phoenix — open-source, self-hostable, framework-agnostic via OpenTelemetry |
| RAG eval metrics | RAGAS — faithfulness, answer relevance, context precision/recall |
| Prompt regression / CI | Promptfoo — CLI-first, no platform account required |
| LangChain/LangGraph | LangSmith — overrides Phoenix if already in that ecosystem |
Include Phoenix setup in AI-SPEC.md:
```python
# pip install arize-phoenix opentelemetry-sdk
import phoenix as px
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

px.launch_app()  # http://localhost:6006
provider = TracerProvider()
trace.set_tracer_provider(provider)
# Instrument: LlamaIndexInstrumentor().instrument() / LangChainInstrumentor().instrument()
```
Keep guardrails minimal — each adds latency.
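The cheapest online guardrails are pure-code checks that run before a response reaches the user. A minimal sketch — the pattern and function name are invented, and a regex pass costs microseconds rather than an extra model call:

```python
import re

# Hypothetical online guardrail: redact obvious email addresses from output
# before returning it. Pattern is deliberately simple and illustrative; a real
# PII guardrail would cover more identifier types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def guard_output(text: str) -> str:
    """Last-mile redaction applied to every user-facing response."""
    return EMAIL.sub("[REDACTED]", text)

print(guard_output("Contact alice@example.com for details."))
# → Contact [REDACTED] for details.
```

Model-based guardrails (toxicity classifiers, judge calls on every response) are where the latency cost appears, which is the reason for keeping the online set minimal and pushing the heavier checks into the offline flywheel.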
**ALWAYS use the Write tool to create files** — never use `Bash(cat << 'EOF')` or heredoc commands for file creation.

Update AI-SPEC.md at ai_spec_path:
- Section 5 (Evaluation Strategy): dimensions table with rubrics, tooling, dataset spec, CI/CD command
- Section 6 (Guardrails): online guardrails table, offline flywheel table
- Section 7 (Production Monitoring): tracing tool, key metrics, alert thresholds, sampling strategy
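The CI/CD eval command specified in Section 5 typically gates merges on a pass-rate threshold. A hypothetical sketch — the `gate` helper, the hardcoded results, and the 0.9 threshold are all illustrative, not GSD defaults; real results would come from the selected tool (Promptfoo, RAGAS, etc.):

```python
# Hypothetical CI gate: fail the pipeline when eval pass rate drops below a
# threshold. Results are hardcoded here for illustration only.
THRESHOLD = 0.9  # illustrative, not a GSD default

def gate(results: list[bool]) -> int:
    """Return a process exit code: 0 passes the pipeline, 1 fails it."""
    pass_rate = sum(results) / len(results)
    print(f"eval pass rate: {pass_rate:.2f} (threshold {THRESHOLD})")
    return 0 if pass_rate >= THRESHOLD else 1

print(gate([True] * 19 + [False]))  # 19/20 = 0.95 ≥ 0.9 → prints 0
```

Wiring the exit code into CI (e.g. `sys.exit(gate(results))` at the end of the eval script) is what turns Section 5's dimensions table into an enforced regression gate.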
If domain context is genuinely unclear after reading all artifacts, ask ONE question:
```js
AskUserQuestion([{
  question: "What is the primary domain/industry context for this AI system?",
  header: "Domain Context",
  multiSelect: false,
  options: [
    { label: "Internal developer tooling" },
    { label: "Customer-facing (B2C)" },
    { label: "Business tool (B2B)" },
    { label: "Regulated industry (healthcare, finance, legal)" },
    { label: "Research / experimental" }
  ]
}])
```
</execution_flow>
<success_criteria>
- Critical failure modes confirmed (minimum 3)
- Eval dimensions selected (minimum 3, appropriate to system type)
- Each dimension has a concrete rubric (not a generic label)
- Each dimension has a measurement approach (Code / LLM Judge / Human)
- Eval tooling selected with install command
- Reference dataset spec written (size + composition + labeling)
- CI/CD eval integration command specified
- Online guardrails defined (minimum 1 for user-facing systems)
- Offline flywheel metrics defined
- Sections 5, 6, 7 of AI-SPEC.md written and non-empty
</success_criteria>