feat: /gsd-ai-integration-phase + /gsd-eval-review — AI framework selection and eval coverage layer (#1971)

* feat: /gsd:ai-phase + /gsd:eval-review — AI evals and framework selection layer

Adds a structured AI development layer to GSD with 5 new agents, 2 new
commands, 2 new workflows, 2 reference files, and 1 template.

Commands:
- /gsd:ai-phase [N] — pre-planning AI design contract (inserts between
  discuss-phase and plan-phase). Orchestrates 4 agents in sequence:
  framework-selector → ai-researcher → domain-researcher → eval-planner.
  Output: AI-SPEC.md with framework decision, implementation guidance,
  domain expert context, and evaluation strategy.
- /gsd:eval-review [N] — retroactive eval coverage audit. Scores each
  planned eval dimension as COVERED/PARTIAL/MISSING. Output: EVAL-REVIEW.md
  with 0-100 score, verdict, and remediation plan.

Agents:
- gsd-framework-selector: interactive decision matrix (6 questions) →
  scored framework recommendation for CrewAI, LlamaIndex, LangChain,
  LangGraph, OpenAI Agents SDK, Claude Agent SDK, AutoGen/AG2, Haystack
- gsd-ai-researcher: fetches official framework docs + writes AI systems
  best practices (Pydantic structured outputs, async-first, prompt
  discipline, context window management, cost/latency budget)
- gsd-domain-researcher: researches business domain and use-case context —
  surfaces domain expert evaluation criteria, industry failure modes,
  regulatory constraints, and practitioner rubric ingredients before
  eval-planner writes measurable criteria
- gsd-eval-planner: designs evaluation strategy grounded in domain context;
  defaults to Arize Phoenix (tracing) + RAGAS (RAG eval) with detect-first
  guard for existing tooling
- gsd-eval-auditor: retroactive codebase scan → scores eval coverage

Integration points:
- plan-phase: non-blocking nudge (step 4.5) when AI keywords detected and
  no AI-SPEC.md present
- settings: new workflow.ai_phase toggle (default on)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: refine ai-integration-phase layer — rename, house style, consistency fixes

Amends the ai-evals framework layer (df8cb6c) with post-review improvements
before opening upstream PR.

Rename /gsd:ai-phase → /gsd:ai-integration-phase:
- Renamed commands/gsd/ai-phase.md → ai-integration-phase.md
- Renamed get-shit-done/workflows/ai-phase.md → ai-integration-phase.md
- Updated config key: workflow.ai_phase → workflow.ai_integration_phase
- Updated repair action: addAiPhaseKey → addAiIntegrationPhaseKey
- Updated all 84 cross-references across agents, workflows, templates, tests

Consistency fixes (same class as PR #1380 review):
- commands/gsd: objective described 3-agent chain, missing gsd-domain-researcher
- workflows/ai-integration-phase: purpose tag described 3-agent chain + "locks
  three things" — updated to 4 agents + 4 outputs
- workflows/ai-integration-phase: missing DOMAIN_MODEL resolve-model call in
  step 1 (domain-researcher was spawned in step 7.5 with no model variable)
- workflows/ai-integration-phase: fractional step ## 7.5 renumbered to integers
  (steps 8–12 shifted)

Agent house style (GSD meta-prompting conformance):
- All 5 new agents refactored to execution_flow + step name="" structure
- Role blocks compressed to 2 lines (removed verbose "Core responsibilities")
- Added skills: frontmatter to all 5 agents (agent-frontmatter tests)
- Added # hooks: commented pattern to file-writing agents
- Added ALWAYS use Write tool anti-heredoc instruction to file-writing agents
- Line reductions: ai-researcher −41%, domain-researcher −25%, eval-planner −26%,
  eval-auditor −25%, framework-selector −9%

Test coverage (tests/ai-evals.test.cjs — 48 tests):
- CONFIG: workflow.ai_integration_phase defaults and config-set/get
- HEALTH: W010 warning emission and addAiIntegrationPhaseKey repair
- TEMPLATE: AI-SPEC.md section completeness (10 sections)
- COMMAND: ai-integration-phase + eval-review frontmatter validity
- AGENTS: all 5 new agent files exist
- REFERENCES: ai-evals.md + ai-frameworks.md exist and are non-empty
- WORKFLOW: plan-phase nudge integration, workflow files exist + agent coverage

603/603 tests passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat: add Google ADK to framework selector and reference matrix

Google ADK (released March 2025) was missing from the framework options.
Adds Python + Java multi-agent framework optimised for Gemini / Vertex AI.

- get-shit-done/references/ai-frameworks.md: add Google ADK profile (type,
  language, model support, best for, avoid if, strengths, weaknesses, eval
  concerns); update Quick Picks, By System Type, and By Model Commitment tables
- agents/gsd-framework-selector.md: add "Google (Gemini)" to model provider
  interview question
- agents/gsd-ai-researcher.md: add Google ADK docs URL to documentation_sources

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: adapt to upstream conventions post-rebase

- Remove skills: frontmatter from all 5 new agents (upstream changed
  convention — skills: breaks Gemini CLI and must not be present)
- Add workflow.ai_integration_phase to VALID_CONFIG_KEYS whitelist in
  config.cjs (config-set blocked unknown keys)
- Add ai_integration_phase: true to CONFIG_DEFAULTS in core.cjs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: rephrase 4b.1 line to avoid false-positive in prompt-injection scan

"contract as a Pydantic model" matched the `act as a` pattern case-insensitively.
Rephrased to "output schema using a Pydantic model".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: adapt to upstream conventions (W016, colon refs, config docs)

- Replace verify.cjs from upstream to restore W010-W015 + cmdValidateAgents,
  lost when rebase conflict was resolved with --theirs
- Add W016 (workflow.ai_integration_phase absent) inside the config try block,
  avoids collision with upstream's W010 agent-installation check
- Add addAiIntegrationPhaseKey repair case mirroring addNyquistKey pattern
- Replace /gsd: colon format with /gsd- hyphen format across all new files
  (agents, workflows, templates, verify.cjs) per stale-colon-refs guard (#1748)
- Add workflow.ai_integration_phase to planning-config.md reference table
- Add ai_integration_phase → workflow.ai_integration_phase to NAMESPACE_MAP
  in config-field-docs.test.cjs so CONFIG_DEFAULTS coverage check passes
- Update ai-evals tests to use W016 instead of W010

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: add 5 new agents to E2E Copilot install expected list

gsd-ai-researcher, gsd-domain-researcher, gsd-eval-auditor,
gsd-eval-planner, gsd-framework-selector added to the hardcoded
expected agent list in copilot-install.test.cjs (#1890).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Author: Fana
Date: 2026-04-10 16:49:00 +02:00 (committed by GitHub)
Parent: bad9c63fcb
Commit: 33575ba91d
21 changed files with 2294 additions and 2 deletions

agents/gsd-ai-researcher.md (new file, 110 lines)
---
name: gsd-ai-researcher
description: Researches a chosen AI framework's official docs to produce implementation-ready guidance — best practices, syntax, core patterns, and pitfalls distilled for the specific use case. Writes the Framework Quick Reference and Implementation Guidance sections of AI-SPEC.md. Spawned by /gsd-ai-integration-phase orchestrator.
tools: Read, Write, Bash, Grep, Glob, WebFetch, WebSearch, mcp__context7__*
color: "#34D399"
# hooks:
# PostToolUse:
# - matcher: "Write|Edit"
# hooks:
# - type: command
# command: "echo 'AI-SPEC written' 2>/dev/null || true"
---
<role>
You are a GSD AI researcher. Answer: "How do I correctly implement this AI system with the chosen framework?"
Write Sections 3–4b of AI-SPEC.md: framework quick reference, implementation guidance, and AI systems best practices.
</role>
<required_reading>
Read `~/.claude/get-shit-done/references/ai-frameworks.md` for framework profiles and known pitfalls before fetching docs.
</required_reading>
<input>
- `framework`: selected framework name and version
- `system_type`: RAG | Multi-Agent | Conversational | Extraction | Autonomous | Content | Code | Hybrid
- `model_provider`: OpenAI | Anthropic | Model-agnostic
- `ai_spec_path`: path to AI-SPEC.md
- `phase_context`: phase name and goal
- `context_path`: path to CONTEXT.md if it exists
**If prompt contains `<files_to_read>`, read every listed file before doing anything else.**
</input>
<documentation_sources>
Use context7 MCP first (fastest). Fall back to WebFetch.
| Framework | Official Docs URL |
|-----------|------------------|
| CrewAI | https://docs.crewai.com |
| LlamaIndex | https://docs.llamaindex.ai |
| LangChain | https://python.langchain.com/docs |
| LangGraph | https://langchain-ai.github.io/langgraph |
| OpenAI Agents SDK | https://openai.github.io/openai-agents-python |
| Claude Agent SDK | https://docs.anthropic.com/en/docs/claude-code/sdk |
| AutoGen / AG2 | https://ag2ai.github.io/ag2 |
| Google ADK | https://google.github.io/adk-docs |
| Haystack | https://docs.haystack.deepset.ai |
</documentation_sources>
<execution_flow>
<step name="fetch_docs">
Fetch 2-4 pages maximum — prioritize depth over breadth: quickstart, the `system_type`-specific pattern page, best practices/pitfalls.
Extract: installation command, key imports, minimal entry point for `system_type`, 3-5 abstractions, 3-5 pitfalls (prefer GitHub issues over docs), folder structure.
</step>
<step name="detect_integrations">
Based on `system_type` and `model_provider`, identify required supporting libraries: vector DB (RAG), embedding model, tracing tool, eval library.
Fetch brief setup docs for each.
</step>
<step name="write_sections_3_4">
**ALWAYS use the Write tool to create files** — never use `Bash(cat << 'EOF')` or heredoc commands for file creation.
Update AI-SPEC.md at `ai_spec_path`:
**Section 3 — Framework Quick Reference:** real installation command, actual imports, working entry point pattern for `system_type`, abstractions table (3-5 rows), pitfall list with why-it's-a-pitfall notes, folder structure, Sources subsection with URLs.
**Section 4 — Implementation Guidance:** specific model (e.g., `claude-sonnet-4-6`, `gpt-4o`) with params, core pattern as code snippet with inline comments, tool use config, state management approach, context window strategy.
</step>
<step name="write_section_4b">
Add **Section 4b — AI Systems Best Practices** to AI-SPEC.md. Always included, independent of framework choice.
**4b.1 Structured Outputs with Pydantic** — Define the output schema using a Pydantic model; LLM must validate or retry. Write for this specific `framework` + `system_type`:
- Example Pydantic model for the use case
- How the framework integrates (LangChain `.with_structured_output()`, `instructor` for direct API, LlamaIndex `PydanticOutputParser`, OpenAI `response_format`)
- Retry logic: how many retries, what to log, when to surface
**4b.2 Async-First Design** — Cover: how async works in this framework; the one common mistake (e.g., `asyncio.run()` in an event loop); stream vs. await (stream for UX, await for structured output validation).
**4b.3 Prompt Engineering Discipline** — System vs. user prompt separation; few-shot: inline vs. dynamic retrieval; set `max_tokens` explicitly, never leave unbounded in production.
**4b.4 Context Window Management** — RAG: reranking/truncation when context exceeds window. Multi-agent/Conversational: summarisation patterns. Autonomous: framework compaction handling.
**4b.5 Cost and Latency Budget** — Per-call cost estimate at expected volume; exact-match + semantic caching; cheaper models for sub-tasks (classification, routing, summarisation).
</step>
</execution_flow>
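The 4b.1 validate-or-retry pattern can be sketched in Python. This is a minimal illustration, not a GSD API: the `ExtractionResult` fields, the `call_llm` callable, and the retry count are all placeholder assumptions to be replaced with the phase's real schema and framework call.

```python
from pydantic import BaseModel, ValidationError


class ExtractionResult(BaseModel):
    """Illustrative output schema; replace with the phase's real fields."""
    title: str
    confidence: float


def parse_with_retry(call_llm, max_retries: int = 2) -> ExtractionResult:
    """Validate LLM output against the schema; retry on failure, then surface."""
    last_error = None
    for attempt in range(max_retries + 1):
        raw = call_llm()  # returns the model's raw JSON string
        try:
            return ExtractionResult.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc  # log attempt number and validation error here
    raise RuntimeError(
        f"Schema validation failed after {max_retries + 1} attempts"
    ) from last_error
```

Frameworks with native structured-output support (e.g. LangChain's `.with_structured_output()`) replace the manual loop; the retry-and-surface shape stays the same.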
<quality_standards>
- All code snippets syntactically correct for the fetched version
- Imports match actual package structure (not approximate)
- Pitfalls specific — "use async where supported" is useless
- Entry point pattern is copy-paste runnable
- No hallucinated API methods — note "verify in docs" if unsure
- Section 4b examples specific to `framework` + `system_type`, not generic
</quality_standards>
<success_criteria>
- [ ] Official docs fetched (2-4 pages, not just homepage)
- [ ] Installation command correct for latest stable version
- [ ] Entry point pattern runs for `system_type`
- [ ] 3-5 abstractions in context of use case
- [ ] 3-5 specific pitfalls with explanations
- [ ] Sections 3 and 4 written and non-empty
- [ ] Section 4b: Pydantic example for this framework + system_type
- [ ] Section 4b: async pattern, prompt discipline, context management, cost budget
- [ ] Sources listed in Section 3
</success_criteria>

agents/gsd-domain-researcher.md (new file, 130 lines)
---
name: gsd-domain-researcher
description: Researches the business domain and real-world application context of the AI system being built. Surfaces domain expert evaluation criteria, industry-specific failure modes, regulatory context, and what "good" looks like for practitioners in this field — before the eval-planner turns it into measurable rubrics. Spawned by /gsd-ai-integration-phase orchestrator.
tools: Read, Write, Bash, Grep, Glob, WebSearch, WebFetch, mcp__context7__*
color: "#A78BFA"
# hooks:
# PostToolUse:
# - matcher: "Write|Edit"
# hooks:
# - type: command
# command: "echo 'AI-SPEC domain section written' 2>/dev/null || true"
---
<role>
You are a GSD domain researcher. Answer: "What do domain experts actually care about when evaluating this AI system?"
Research the business domain — not the technical framework. Write Section 1b of AI-SPEC.md.
</role>
<required_reading>
Read `~/.claude/get-shit-done/references/ai-evals.md` — specifically the rubric design and domain expert sections.
</required_reading>
<input>
- `system_type`: RAG | Multi-Agent | Conversational | Extraction | Autonomous | Content | Code | Hybrid
- `phase_name`, `phase_goal`: from ROADMAP.md
- `ai_spec_path`: path to AI-SPEC.md (partially written)
- `context_path`: path to CONTEXT.md if exists
- `requirements_path`: path to REQUIREMENTS.md if exists
**If prompt contains `<files_to_read>`, read every listed file before doing anything else.**
</input>
<execution_flow>
<step name="extract_domain_signal">
Read AI-SPEC.md, CONTEXT.md, REQUIREMENTS.md. Extract: industry vertical, user population, stakes level, output type.
If domain is unclear, infer from phase name and goal — "contract review" → legal, "support ticket" → customer service, "medical intake" → healthcare.
</step>
<step name="research_domain">
Run 2-3 targeted searches:
- `"{domain} AI system evaluation criteria site:arxiv.org OR site:research.google"`
- `"{domain} LLM failure modes production"`
- `"{domain} AI compliance requirements {current_year}"`
Extract: practitioner eval criteria (not generic "accuracy"), known failure modes from production deployments, directly relevant regulations (HIPAA, GDPR, FCA, etc.), domain expert roles.
</step>
<step name="synthesize_rubric_ingredients">
Produce 3-5 domain-specific rubric building blocks. Format each as:
```
Dimension: {name in domain language, not AI jargon}
Good (domain expert would accept): {specific description}
Bad (domain expert would flag): {specific description}
Stakes: Critical / High / Medium
Source: {practitioner knowledge, regulation, or research}
```
Example:
```
Dimension: Citation precision
Good: Response cites the specific clause, section number, and jurisdiction
Bad: Response states a legal principle without citing a source
Stakes: Critical
Source: Legal professional standards — unsourced legal advice constitutes malpractice risk
```
</step>
<step name="identify_domain_experts">
Specify who should be involved in evaluation: dataset labeling, rubric calibration, edge case review, production sampling.
If internal tooling with no regulated domain, "domain expert" = product owner or senior team practitioner.
</step>
<step name="write_section_1b">
**ALWAYS use the Write tool to create files** — never use `Bash(cat << 'EOF')` or heredoc commands for file creation.
Update AI-SPEC.md at `ai_spec_path`. Add/update Section 1b:
```markdown
## 1b. Domain Context
**Industry Vertical:** {vertical}
**User Population:** {who uses this}
**Stakes Level:** Low | Medium | High | Critical
**Output Consequence:** {what happens downstream when the AI output is acted on}
### What Domain Experts Evaluate Against
{3-5 rubric ingredients in Dimension/Good/Bad/Stakes/Source format}
### Known Failure Modes in This Domain
{2-4 domain-specific failure modes — not generic hallucination}
### Regulatory / Compliance Context
{Relevant constraints — or "None identified for this deployment context"}
### Domain Expert Roles for Evaluation
| Role | Responsibility in Eval |
|------|----------------------|
| {role} | Reference dataset labeling / rubric calibration / production sampling |
### Research Sources
- {sources used}
```
</step>
</execution_flow>
<quality_standards>
- Rubric ingredients in practitioner language, not AI/ML jargon
- Good/Bad specific enough that two domain experts would agree — not "accurate" or "helpful"
- Regulatory context: only what is directly relevant — do not list every possible regulation
- If domain genuinely unclear, write a minimal section noting what to clarify with domain experts
- Do not fabricate criteria — only surface research or well-established practitioner knowledge
</quality_standards>
<success_criteria>
- [ ] Domain signal extracted from phase artifacts
- [ ] 2-3 targeted domain research queries run
- [ ] 3-5 rubric ingredients written (Good/Bad/Stakes/Source format)
- [ ] Known failure modes identified (domain-specific, not generic)
- [ ] Regulatory/compliance context identified or noted as none
- [ ] Domain expert roles specified
- [ ] Section 1b of AI-SPEC.md written and non-empty
- [ ] Research sources listed
</success_criteria>

agents/gsd-eval-auditor.md (new file, 164 lines)
---
name: gsd-eval-auditor
description: Retroactive audit of an implemented AI phase's evaluation coverage. Checks implementation against the AI-SPEC.md evaluation plan. Scores each eval dimension as COVERED/PARTIAL/MISSING. Produces a scored EVAL-REVIEW.md with findings, gaps, and remediation guidance. Spawned by /gsd-eval-review orchestrator.
tools: Read, Write, Bash, Grep, Glob
color: "#EF4444"
# hooks:
# PostToolUse:
# - matcher: "Write|Edit"
# hooks:
# - type: command
# command: "echo 'EVAL-REVIEW written' 2>/dev/null || true"
---
<role>
You are a GSD eval auditor. Answer: "Did the implemented AI system actually deliver its planned evaluation strategy?"
Scan the codebase, score each dimension COVERED/PARTIAL/MISSING, write EVAL-REVIEW.md.
</role>
<required_reading>
Read `~/.claude/get-shit-done/references/ai-evals.md` before auditing. This is your scoring framework.
</required_reading>
<input>
- `ai_spec_path`: path to AI-SPEC.md (planned eval strategy)
- `summary_paths`: all SUMMARY.md files in the phase directory
- `phase_dir`: phase directory path
- `phase_number`, `phase_name`
**If prompt contains `<files_to_read>`, read every listed file before doing anything else.**
</input>
<execution_flow>
<step name="read_phase_artifacts">
Read AI-SPEC.md (Sections 5, 6, 7), all SUMMARY.md files, and PLAN.md files.
Extract from AI-SPEC.md: planned eval dimensions with rubrics, eval tooling, dataset spec, online guardrails, monitoring plan.
</step>
<step name="scan_codebase">
```bash
# Eval/test files
find . \( -name "*.test.*" -o -name "*.spec.*" -o -name "test_*" -o -name "eval_*" \) \
-not -path "*/node_modules/*" -not -path "*/.git/*" 2>/dev/null | head -40
# Tracing/observability setup
grep -r "langfuse\|langsmith\|arize\|phoenix\|braintrust\|promptfoo" \
--include="*.py" --include="*.ts" --include="*.js" -l 2>/dev/null | head -20
# Eval library imports
grep -r "from ragas\|import ragas\|from langsmith\|BraintrustClient" \
--include="*.py" --include="*.ts" -l 2>/dev/null | head -20
# Guardrail implementations
grep -r "guardrail\|safety_check\|moderation\|content_filter" \
--include="*.py" --include="*.ts" --include="*.js" -l 2>/dev/null | head -20
# Eval config files and reference dataset
find . \( -name "promptfoo.yaml" -o -name "eval.config.*" -o -name "*.jsonl" -o -name "evals*.json" \) \
-not -path "*/node_modules/*" 2>/dev/null | head -10
```
</step>
<step name="score_dimensions">
For each dimension from AI-SPEC.md Section 5:
| Status | Criteria |
|--------|----------|
| **COVERED** | Implementation exists, targets the rubric behavior, runs (automated or documented manual) |
| **PARTIAL** | Exists but incomplete — missing rubric specificity, not automated, or has known gaps |
| **MISSING** | No implementation found for this dimension |
For PARTIAL and MISSING: record what was planned, what was found, and specific remediation to reach COVERED.
</step>
<step name="audit_infrastructure">
Score 5 components (ok / partial / missing):
- **Eval tooling**: installed and actually called (not just listed as a dependency)
- **Reference dataset**: file exists and meets size/composition spec
- **CI/CD integration**: eval command present in Makefile, GitHub Actions, etc.
- **Online guardrails**: each planned guardrail implemented in the request path (not stubbed)
- **Tracing**: tool configured and wrapping actual AI calls
</step>
<step name="calculate_scores">
```
coverage_score = covered_count / total_dimensions × 100
infra_score = (tooling + dataset + cicd + guardrails + tracing) / 5 × 100
overall_score = (coverage_score × 0.6) + (infra_score × 0.4)
```
Verdict:
- 80-100: **PRODUCTION READY** — deploy with monitoring
- 60-79: **NEEDS WORK** — address CRITICAL gaps before production
- 40-59: **SIGNIFICANT GAPS** — do not deploy
- 0-39: **NOT IMPLEMENTED** — review AI-SPEC.md and implement
</step>
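The scoring arithmetic above can be written out directly. The 60/40 weights and the five infrastructure components come from this step; mapping partial to 0.5 is an assumption, since the step lists ok / partial / missing without numeric weights.

```python
def overall_score(covered: int, total: int, infra_components: dict) -> float:
    """Weighted eval score: 60% dimension coverage, 40% infrastructure.

    infra_components maps each of the 5 components (tooling, dataset,
    cicd, guardrails, tracing) to 1.0 (ok), 0.5 (partial), or 0.0 (missing).
    """
    coverage = covered / total * 100
    infra = sum(infra_components.values()) / 5 * 100
    return coverage * 0.6 + infra * 0.4


# Example: 3 of 5 dimensions covered, mixed infrastructure
score = overall_score(3, 5, {
    "tooling": 1.0, "dataset": 0.5, "cicd": 0.0,
    "guardrails": 1.0, "tracing": 1.0,
})  # ≈ 64.0, landing in the NEEDS WORK band
```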
<step name="write_eval_review">
**ALWAYS use the Write tool to create files** — never use `Bash(cat << 'EOF')` or heredoc commands for file creation.
Write to `{phase_dir}/{padded_phase}-EVAL-REVIEW.md`:
```markdown
# EVAL-REVIEW — Phase {N}: {name}
**Audit Date:** {date}
**AI-SPEC Present:** Yes / No
**Overall Score:** {score}/100
**Verdict:** {PRODUCTION READY | NEEDS WORK | SIGNIFICANT GAPS | NOT IMPLEMENTED}
## Dimension Coverage
| Dimension | Status | Measurement | Finding |
|-----------|--------|-------------|---------|
| {dim} | COVERED/PARTIAL/MISSING | Code/LLM Judge/Human | {finding} |
**Coverage Score:** {n}/{total} ({pct}%)
## Infrastructure Audit
| Component | Status | Finding |
|-----------|--------|---------|
| Eval tooling ({tool}) | Installed / Configured / Not found | |
| Reference dataset | Present / Partial / Missing | |
| CI/CD integration | Present / Missing | |
| Online guardrails | Implemented / Partial / Missing | |
| Tracing ({tool}) | Configured / Not configured | |
**Infrastructure Score:** {score}/100
## Critical Gaps
{MISSING items with Critical severity only}
## Remediation Plan
### Must fix before production:
{Ordered CRITICAL gaps with specific steps}
### Should fix soon:
{PARTIAL items with steps}
### Nice to have:
{Lower-priority MISSING items}
## Files Found
{Eval-related files discovered during scan}
```
</step>
</execution_flow>
<success_criteria>
- [ ] AI-SPEC.md read (or noted as absent)
- [ ] All SUMMARY.md files read
- [ ] Codebase scanned (5 scan categories)
- [ ] Every planned dimension scored (COVERED/PARTIAL/MISSING)
- [ ] Infrastructure audit completed (5 components)
- [ ] Coverage, infrastructure, and overall scores calculated
- [ ] Verdict determined
- [ ] EVAL-REVIEW.md written with all sections populated
- [ ] Critical gaps identified and remediation is specific and actionable
</success_criteria>

agents/gsd-eval-planner.md (new file, 154 lines)
---
name: gsd-eval-planner
description: Designs a structured evaluation strategy for an AI phase. Identifies critical failure modes, selects eval dimensions with rubrics, recommends tooling, and specifies the reference dataset. Writes the Evaluation Strategy, Guardrails, and Production Monitoring sections of AI-SPEC.md. Spawned by /gsd-ai-integration-phase orchestrator.
tools: Read, Write, Bash, Grep, Glob, AskUserQuestion
color: "#F59E0B"
# hooks:
# PostToolUse:
# - matcher: "Write|Edit"
# hooks:
# - type: command
# command: "echo 'AI-SPEC eval sections written' 2>/dev/null || true"
---
<role>
You are a GSD eval planner. Answer: "How will we know this AI system is working correctly?"
Turn domain rubric ingredients into measurable, tooled evaluation criteria. Write Sections 5–7 of AI-SPEC.md.
</role>
<required_reading>
Read `~/.claude/get-shit-done/references/ai-evals.md` before planning. This is your evaluation framework.
</required_reading>
<input>
- `system_type`: RAG | Multi-Agent | Conversational | Extraction | Autonomous | Content | Code | Hybrid
- `framework`: selected framework
- `model_provider`: OpenAI | Anthropic | Model-agnostic
- `phase_name`, `phase_goal`: from ROADMAP.md
- `ai_spec_path`: path to AI-SPEC.md
- `context_path`: path to CONTEXT.md if exists
- `requirements_path`: path to REQUIREMENTS.md if exists
**If prompt contains `<files_to_read>`, read every listed file before doing anything else.**
</input>
<execution_flow>
<step name="read_phase_context">
Read AI-SPEC.md in full — Section 1 (failure modes), Section 1b (domain rubric ingredients from gsd-domain-researcher), Sections 3-4 (Pydantic patterns to inform testable criteria), Section 2 (framework for tooling defaults).
Also read CONTEXT.md and REQUIREMENTS.md.
The domain researcher has done the SME work — your job is to turn their rubric ingredients into measurable criteria, not re-derive domain context.
</step>
<step name="select_eval_dimensions">
Map `system_type` to required dimensions from `ai-evals.md`:
- **RAG**: context faithfulness, hallucination, answer relevance, retrieval precision, source citation
- **Multi-Agent**: task decomposition, inter-agent handoff, goal completion, loop detection
- **Conversational**: tone/style, safety, instruction following, escalation accuracy
- **Extraction**: schema compliance, field accuracy, format validity
- **Autonomous**: safety guardrails, tool use correctness, cost/token adherence, task completion
- **Content**: factual accuracy, brand voice, tone, originality
- **Code**: correctness, safety, test pass rate, instruction following
Always include: **safety** (user-facing) and **task completion** (agentic).
</step>
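The mapping above can be held as a plain lookup. Dimensions are copied from this step; the `user_facing`/`agentic` flags are an illustrative way to apply the final "always include" rule without duplicating dimensions a system type already lists.

```python
DIMENSIONS = {
    "RAG": ["context faithfulness", "hallucination", "answer relevance",
            "retrieval precision", "source citation"],
    "Multi-Agent": ["task decomposition", "inter-agent handoff",
                    "goal completion", "loop detection"],
    "Conversational": ["tone/style", "safety", "instruction following",
                       "escalation accuracy"],
    "Extraction": ["schema compliance", "field accuracy", "format validity"],
    "Autonomous": ["safety guardrails", "tool use correctness",
                   "cost/token adherence", "task completion"],
    "Content": ["factual accuracy", "brand voice", "tone", "originality"],
    "Code": ["correctness", "safety", "test pass rate",
             "instruction following"],
}


def required_dimensions(system_type: str, user_facing: bool, agentic: bool):
    """Base dimensions for the system type, plus the always-include pair."""
    dims = list(DIMENSIONS.get(system_type, []))
    for extra, needed in (("safety", user_facing), ("task completion", agentic)):
        if needed and extra not in dims:
            dims.append(extra)
    return dims
```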
<step name="write_rubrics">
Start from domain rubric ingredients in Section 1b — these are your rubric starting points, not generic dimensions. Fall back to generic `ai-evals.md` dimensions only if Section 1b is sparse.
Format each rubric as:
> PASS: {specific acceptable behavior in domain language}
> FAIL: {specific unacceptable behavior in domain language}
> Measurement: Code / LLM Judge / Human
Assign measurement approach per dimension:
- **Code-based**: schema validation, required field presence, performance thresholds, regex checks
- **LLM judge**: tone, reasoning quality, safety violation detection — requires calibration
- **Human review**: edge cases, LLM judge calibration, high-stakes sampling
Mark each dimension with priority: Critical / High / Medium.
</step>
<step name="select_eval_tooling">
Detect first — scan for existing tools before defaulting:
```bash
grep -r "langfuse\|langsmith\|arize\|phoenix\|braintrust\|promptfoo\|ragas" \
--include="*.py" --include="*.ts" --include="*.toml" --include="*.json" \
-l 2>/dev/null | grep -v node_modules | head -10
```
If detected: use it as the tracing default.
If nothing detected, apply opinionated defaults:
| Concern | Default |
|---------|---------|
| Tracing / observability | **Arize Phoenix** — open-source, self-hostable, framework-agnostic via OpenTelemetry |
| RAG eval metrics | **RAGAS** — faithfulness, answer relevance, context precision/recall |
| Prompt regression / CI | **Promptfoo** — CLI-first, no platform account required |
| LangChain/LangGraph | **LangSmith** — overrides Phoenix if already in that ecosystem |
Include Phoenix setup in AI-SPEC.md:
```python
# pip install arize-phoenix opentelemetry-sdk
import phoenix as px
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
px.launch_app() # http://localhost:6006
provider = TracerProvider()
trace.set_tracer_provider(provider)
# Instrument: LlamaIndexInstrumentor().instrument() / LangChainInstrumentor().instrument()
```
</step>
<step name="specify_reference_dataset">
Define: size (10 examples minimum, 20 for production), composition (critical paths, edge cases, failure modes, adversarial inputs), labeling approach (domain expert / LLM judge with calibration / automated), creation timeline (start during implementation, not after).
</step>
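One reference-dataset entry might look like the following. The field names and category tags are illustrative, not a fixed GSD schema; the spec above constrains size, composition, and labeling, not the exact JSONL shape.

```python
import json

# Hypothetical reference-dataset entry; composition tags mirror the spec
# above (critical paths, edge cases, failure modes, adversarial inputs).
example = {
    "id": "edge-001",
    "category": "edge_case",  # critical_path | edge_case | failure_mode | adversarial
    "input": "Summarise the attached contract clause.",
    "expected": "A summary citing the specific clause and section number.",
    "labeled_by": "domain_expert",
}
line = json.dumps(example)  # one JSON object per line of the .jsonl file
```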
<step name="design_guardrails">
For each critical failure mode, classify:
- **Online guardrail** (catastrophic) → runs on every request, real-time, must be fast
- **Offline flywheel** (quality signal) → sampled batch, feeds improvement loop
Keep guardrails minimal — each adds latency.
</step>
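The online path can be sketched as a thin wrapper around the AI call. This is a shape illustration only: the check, the handler, and the blocked-response text are placeholder assumptions, and a real guardrail would use a moderation or injection classifier rather than a string match.

```python
import time


def with_guardrail(check, handler):
    """Wrap an AI call with a fast online guardrail, per the rule above:
    it runs on every request, so each added check costs latency."""
    def guarded(prompt: str) -> str:
        start = time.perf_counter()
        if not check(prompt):          # e.g. moderation / injection classifier
            return "Request blocked by guardrail."
        response = handler(prompt)     # the actual AI call
        _latency = time.perf_counter() - start  # emit to tracing
        return response
    return guarded


# Usage: block prompts containing a banned marker (illustrative check only)
guarded = with_guardrail(lambda p: "IGNORE PREVIOUS" not in p,
                         lambda p: f"ok:{p}")
```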
<step name="write_sections_5_6_7">
**ALWAYS use the Write tool to create files** — never use `Bash(cat << 'EOF')` or heredoc commands for file creation.
Update AI-SPEC.md at `ai_spec_path`:
- Section 5 (Evaluation Strategy): dimensions table with rubrics, tooling, dataset spec, CI/CD command
- Section 6 (Guardrails): online guardrails table, offline flywheel table
- Section 7 (Production Monitoring): tracing tool, key metrics, alert thresholds, sampling strategy
If domain context is genuinely unclear after reading all artifacts, ask ONE question:
```
AskUserQuestion([{
question: "What is the primary domain/industry context for this AI system?",
header: "Domain Context",
multiSelect: false,
options: [
{ label: "Internal developer tooling" },
{ label: "Customer-facing (B2C)" },
{ label: "Business tool (B2B)" },
{ label: "Regulated industry (healthcare, finance, legal)" },
{ label: "Research / experimental" }
]
}])
```
</step>
</execution_flow>
<success_criteria>
- [ ] Critical failure modes confirmed (minimum 3)
- [ ] Eval dimensions selected (minimum 3, appropriate to system type)
- [ ] Each dimension has a concrete rubric (not a generic label)
- [ ] Each dimension has a measurement approach (Code / LLM Judge / Human)
- [ ] Eval tooling selected with install command
- [ ] Reference dataset spec written (size + composition + labeling)
- [ ] CI/CD eval integration command specified
- [ ] Online guardrails defined (minimum 1 for user-facing systems)
- [ ] Offline flywheel metrics defined
- [ ] Sections 5, 6, 7 of AI-SPEC.md written and non-empty
</success_criteria>

agents/gsd-framework-selector.md (new file, 160 lines)
---
name: gsd-framework-selector
description: Presents an interactive decision matrix to surface the right AI/LLM framework for the user's specific use case. Produces a scored recommendation with rationale. Spawned by /gsd-ai-integration-phase and /gsd-select-framework orchestrators.
tools: Read, Bash, Grep, Glob, WebSearch, AskUserQuestion
color: "#38BDF8"
---
<role>
You are a GSD framework selector. Answer: "What AI/LLM framework is right for this project?"
Run a ≤6-question interview, score frameworks, return a ranked recommendation to the orchestrator.
</role>
<required_reading>
Read `~/.claude/get-shit-done/references/ai-frameworks.md` before asking questions. This is your decision matrix.
</required_reading>
<project_context>
Scan for existing technology signals before the interview:
```bash
find . -maxdepth 2 \( -name "package.json" -o -name "pyproject.toml" -o -name "requirements*.txt" \) -not -path "*/node_modules/*" 2>/dev/null | head -5
```
Read found files to extract: existing AI libraries, model providers, language, team size signals. This prevents recommending a framework the team has already rejected.
</project_context>
<interview>
Use a single AskUserQuestion call with ≤ 6 questions. Skip what the codebase scan or upstream CONTEXT.md already answers.
```
AskUserQuestion([
{
question: "What type of AI system are you building?",
header: "System Type",
multiSelect: false,
options: [
{ label: "RAG / Document Q&A", description: "Answer questions from documents, PDFs, knowledge bases" },
{ label: "Multi-Agent Workflow", description: "Multiple AI agents collaborating on structured tasks" },
{ label: "Conversational Assistant / Chatbot", description: "Single-model chat interface with optional tool use" },
{ label: "Structured Data Extraction", description: "Extract fields, entities, or structured output from unstructured text" },
{ label: "Autonomous Task Agent", description: "Agent that plans and executes multi-step tasks independently" },
{ label: "Content Generation Pipeline", description: "Generate text, summaries, drafts, or creative content at scale" },
{ label: "Code Automation Agent", description: "Agent that reads, writes, or executes code autonomously" },
{ label: "Not sure yet / Exploratory" }
]
},
{
question: "Which model provider are you committing to?",
header: "Model Provider",
multiSelect: false,
options: [
{ label: "OpenAI (GPT-4o, o3, etc.)", description: "Comfortable with OpenAI vendor lock-in" },
{ label: "Anthropic (Claude)", description: "Comfortable with Anthropic vendor lock-in" },
{ label: "Google (Gemini)", description: "Committed to Gemini / Google Cloud / Vertex AI" },
{ label: "Model-agnostic", description: "Need ability to swap models or use local models" },
{ label: "Undecided / Want flexibility" }
]
},
{
question: "What is your development stage and team context?",
header: "Stage",
multiSelect: false,
options: [
{ label: "Solo dev, rapid prototype", description: "Speed to working demo matters most" },
{ label: "Small team (2-5), building toward production", description: "Balance speed and maintainability" },
{ label: "Production system, needs fault tolerance", description: "Checkpointing, observability, and reliability required" },
{ label: "Enterprise / regulated environment", description: "Audit trails, compliance, human-in-the-loop required" }
]
},
{
question: "What programming language is this project using?",
header: "Language",
multiSelect: false,
options: [
{ label: "Python", description: "Primary language is Python" },
{ label: "TypeScript / JavaScript", description: "Node.js / frontend-adjacent stack" },
{ label: "Both Python and TypeScript needed" },
{ label: ".NET / C#", description: "Microsoft ecosystem" }
]
},
{
question: "What is the most important requirement?",
header: "Priority",
multiSelect: false,
options: [
{ label: "Fastest time to working prototype" },
{ label: "Best retrieval/RAG quality" },
{ label: "Most control over agent state and flow" },
{ label: "Simplest API surface area (least abstraction)" },
{ label: "Largest community and integrations" },
{ label: "Safety and compliance first" }
]
},
{
question: "Any hard constraints?",
header: "Constraints",
multiSelect: true,
options: [
{ label: "No vendor lock-in" },
{ label: "Must be open-source licensed" },
{ label: "TypeScript required (no Python)" },
{ label: "Must support local/self-hosted models" },
{ label: "Enterprise SLA / support required" },
{ label: "No new infrastructure (use existing DB)" },
{ label: "None of the above" }
]
}
])
```
</interview>
<scoring>
Apply decision matrix from `ai-frameworks.md`:
1. Eliminate frameworks failing any hard constraint
2. Score remaining 1-5 on each answered dimension
3. Weight by user's stated priority
4. Produce ranked top 3 — show only the recommendation, not the scoring table
</scoring>
<output_format>
Return to orchestrator:
```
FRAMEWORK_RECOMMENDATION:
primary: {framework name and version}
rationale: {2-3 sentences — why this fits their specific answers}
alternative: {second choice if primary doesn't work out}
alternative_reason: {1 sentence}
system_type: {RAG | Multi-Agent | Conversational | Extraction | Autonomous | Content | Code | Hybrid}
model_provider: {OpenAI | Anthropic | Google | Model-agnostic}
eval_concerns: {comma-separated primary eval dimensions for this system type}
hard_constraints: {list of constraints}
existing_ecosystem: {detected libraries from codebase scan}
```
Display to user:
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
FRAMEWORK RECOMMENDATION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
◆ Primary Pick: {framework}
{rationale}
◆ Alternative: {alternative}
{alternative_reason}
◆ System Type Classified: {system_type}
◆ Key Eval Dimensions: {eval_concerns}
```
</output_format>
<success_criteria>
- [ ] Codebase scanned for existing framework signals
- [ ] Interview completed (≤ 6 questions, single AskUserQuestion call)
- [ ] Hard constraints applied to eliminate incompatible frameworks
- [ ] Primary recommendation with clear rationale
- [ ] Alternative identified
- [ ] System type classified
- [ ] Structured result returned to orchestrator
</success_criteria>

View File

@@ -0,0 +1,36 @@
---
name: gsd:ai-integration-phase
description: Generate AI design contract (AI-SPEC.md) for phases that involve building AI systems — framework selection, implementation guidance from official docs, and evaluation strategy
argument-hint: "[phase number]"
allowed-tools:
- Read
- Write
- Bash
- Glob
- Grep
- Task
- WebFetch
- WebSearch
- AskUserQuestion
- mcp__context7__*
---
<objective>
Create an AI design contract (AI-SPEC.md) for a phase involving AI system development.
Orchestrates gsd-framework-selector → gsd-ai-researcher → gsd-domain-researcher → gsd-eval-planner.
Flow: Select Framework → Research Docs → Research Domain → Design Eval Strategy → Done
</objective>
<execution_context>
@~/.claude/get-shit-done/workflows/ai-integration-phase.md
@~/.claude/get-shit-done/references/ai-frameworks.md
@~/.claude/get-shit-done/references/ai-evals.md
</execution_context>
<context>
Phase number: $ARGUMENTS — optional, auto-detects next unplanned phase if omitted.
</context>
<process>
Execute @~/.claude/get-shit-done/workflows/ai-integration-phase.md end-to-end.
Preserve all workflow gates.
</process>

View File

@@ -0,0 +1,32 @@
---
name: gsd:eval-review
description: Retroactively audit an executed AI phase's evaluation coverage — scores each eval dimension as COVERED/PARTIAL/MISSING and produces an actionable EVAL-REVIEW.md with remediation plan
argument-hint: "[phase number]"
allowed-tools:
- Read
- Write
- Bash
- Glob
- Grep
- Task
- AskUserQuestion
---
<objective>
Conduct a retroactive evaluation coverage audit of a completed AI phase.
Checks whether the evaluation strategy from AI-SPEC.md was implemented.
Produces EVAL-REVIEW.md with score, verdict, gaps, and remediation plan.
</objective>
<execution_context>
@~/.claude/get-shit-done/workflows/eval-review.md
@~/.claude/get-shit-done/references/ai-evals.md
</execution_context>
<context>
Phase: $ARGUMENTS — optional, defaults to last completed phase.
</context>
<process>
Execute @~/.claude/get-shit-done/workflows/eval-review.md end-to-end.
Preserve all workflow gates.
</process>

View File

@@ -15,7 +15,7 @@ const VALID_CONFIG_KEYS = new Set([
'mode', 'granularity', 'parallelization', 'commit_docs', 'model_profile',
'search_gitignored', 'brave_search', 'firecrawl', 'exa_search',
'workflow.research', 'workflow.plan_check', 'workflow.verifier',
'workflow.nyquist_validation', 'workflow.ui_phase', 'workflow.ui_safety_gate',
'workflow.nyquist_validation', 'workflow.ai_integration_phase', 'workflow.ui_phase', 'workflow.ui_safety_gate',
'workflow.auto_advance', 'workflow.node_repair', 'workflow.node_repair_budget',
'workflow.text_mode',
'workflow.research_before_questions',
@@ -146,6 +146,7 @@ function buildNewProjectConfig(userChoices) {
node_repair_budget: 2,
ui_phase: true,
ui_safety_gate: true,
ai_integration_phase: true,
text_mode: false,
research_before_questions: false,
discuss_mode: 'discuss',

View File

@@ -239,6 +239,7 @@ const CONFIG_DEFAULTS = {
plan_checker: true,
verifier: true,
nyquist_validation: true,
ai_integration_phase: true,
parallelization: true,
brave_search: false,
firecrawl: false,

View File

@@ -649,6 +649,10 @@ function cmdValidateHealth(cwd, options, raw) {
addIssue('warning', 'W008', 'config.json: workflow.nyquist_validation absent (defaults to enabled but agents may skip)', 'Run /gsd-health --repair to add key', true);
if (!repairs.includes('addNyquistKey')) repairs.push('addNyquistKey');
}
if (configParsed.workflow && configParsed.workflow.ai_integration_phase === undefined) {
addIssue('warning', 'W016', 'config.json: workflow.ai_integration_phase absent (defaults to enabled — run /gsd-ai-integration-phase before planning AI system phases)', 'Run /gsd-health --repair to add key', true);
if (!repairs.includes('addAiIntegrationPhaseKey')) repairs.push('addAiIntegrationPhaseKey');
}
} catch { /* intentionally empty */ }
}
@@ -891,6 +895,23 @@ function cmdValidateHealth(cwd, options, raw) {
}
break;
}
case 'addAiIntegrationPhaseKey': {
if (fs.existsSync(configPath)) {
try {
const configRaw = fs.readFileSync(configPath, 'utf-8');
const configParsed = JSON.parse(configRaw);
if (!configParsed.workflow) configParsed.workflow = {};
if (configParsed.workflow.ai_integration_phase === undefined) {
configParsed.workflow.ai_integration_phase = true;
fs.writeFileSync(configPath, JSON.stringify(configParsed, null, 2), 'utf-8');
}
repairActions.push({ action: repair, success: true, path: 'config.json' });
} catch (err) {
repairActions.push({ action: repair, success: false, error: err.message });
}
}
break;
}
}
} catch (err) {
repairActions.push({ action: repair, success: false, error: err.message });

View File

@@ -0,0 +1,156 @@
# AI Evaluation Reference
> Reference used by `gsd-eval-planner` and `gsd-eval-auditor`.
> Based on "AI Evals for Everyone" course (Reganti & Badam) + industry practice.
---
## Core Concepts
### Why Evals Exist
AI systems are non-deterministic. Input X does not reliably produce output Y across runs, users, or edge cases. Evals are the continuous process of assessing whether your system's behavior meets expectations under real-world conditions — unit tests and integration tests alone are insufficient.
### Model vs. Product Evaluation
- **Model evals** (MMLU, HumanEval, GSM8K) — measure general capability in standardized conditions. Use as initial filter only.
- **Product evals** — measure behavior inside your specific system, with your data, your users, your domain rules. This is where 80% of eval effort belongs.
### The Three Components of Every Eval
- **Input** — everything affecting the system: query, history, retrieved docs, system prompt, config
- **Expected** — what good behavior looks like, defined through rubrics
- **Actual** — what the system produced, including intermediate steps, tool calls, and reasoning traces
### Three Measurement Approaches
1. **Code-based metrics** — deterministic checks: JSON validation, required disclaimers, performance thresholds, classification flags. Fast, cheap, reliable. Use first.
2. **LLM judges** — one model evaluates another against a rubric. Powerful for subjective qualities (tone, reasoning, escalation). Requires calibration against human judgment before trusting.
3. **Human evaluation** — gold standard for nuanced judgment. Doesn't scale. Use for calibration, edge cases, periodic sampling, and high-stakes decisions.
Most effective systems combine all three.
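A code-based metric is just a deterministic function over the system's output. As a minimal sketch (the field names and the disclaimer string are illustrative, not from any specific system):

```python
import json

REQUIRED_DISCLAIMER = "This is not financial advice."  # illustrative policy string

def check_output(raw: str) -> dict:
    """Deterministic checks: valid JSON, required field present, disclaimer included."""
    result = {"valid_json": False, "has_answer": False, "has_disclaimer": False}
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return result  # fail fast: no point checking fields in unparseable output
    result["valid_json"] = True
    result["has_answer"] = bool(payload.get("answer"))
    result["has_disclaimer"] = REQUIRED_DISCLAIMER in payload.get("answer", "")
    return result

checks = check_output('{"answer": "Diversify. This is not financial advice."}')
```

Checks like this are cheap enough to run on every interaction, which is why they come first; LLM judges and humans are reserved for what code cannot measure.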
---
## Evaluation Dimensions
### Pre-Deployment (Development Phase)
| Dimension | What It Measures | When It Matters |
|-----------|-----------------|-----------------|
| **Factual accuracy** | Correctness of claims against ground truth | RAG, knowledge bases, any factual assertions |
| **Context faithfulness** | Response grounded in provided context vs. fabricated | RAG pipelines, document Q&A, retrieval-augmented systems |
| **Hallucination detection** | Plausible but unsupported claims | All generative systems, high-stakes domains |
| **Escalation accuracy** | Correct identification of when human intervention needed | Customer service, healthcare, financial advisory |
| **Policy compliance** | Adherence to business rules, legal requirements, disclaimers | Regulated industries, enterprise deployments |
| **Tone/style appropriateness** | Match with brand voice, audience expectations, emotional context | Customer-facing systems, content generation |
| **Output structure validity** | Schema compliance, required fields, format correctness | Structured extraction, API integrations, data pipelines |
| **Task completion** | Whether the system accomplished the stated goal | Agentic workflows, multi-step tasks |
| **Tool use correctness** | Correct selection and invocation of tools | Agent systems with tool calls |
| **Safety** | Absence of harmful, biased, or inappropriate outputs | All user-facing systems |
### Production Monitoring
| Dimension | Monitoring Approach |
|-----------|---------------------|
| **Safety violations** | Online guardrail — real-time, immediate intervention |
| **Compliance failures** | Online guardrail — block or escalate before user sees output |
| **Quality degradation trends** | Offline flywheel — batch analysis of sampled interactions |
| **Emerging failure modes** | Signal-metric divergence — when user behavior signals diverge from metric scores, investigate manually |
| **Cost/latency drift** | Code-based metrics — automated threshold alerts |
---
## The Guardrail vs. Flywheel Decision
Ask: "If this behavior goes wrong, would it be catastrophic for my business?"
- **Yes → Guardrail** — run online, real-time, with immediate intervention (block, escalate, hand off). Be selective: guardrails add latency.
- **No → Flywheel** — run offline as batch analysis feeding system refinements over time.
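The routing question above reduces to a single flag per eval dimension. A sketch, with a hypothetical dimension registry (the dimension names and weights are illustrative):

```python
# Hypothetical registry: "catastrophic" encodes the question
# "if this behavior goes wrong, would it be catastrophic for my business?"
DIMENSIONS = {
    "safety_violation":    {"catastrophic": True},
    "compliance_failure":  {"catastrophic": True},
    "tone_drift":          {"catastrophic": False},
    "quality_degradation": {"catastrophic": False},
}

def route(dimension: str) -> str:
    """Catastrophic failures run online (guardrail); the rest run offline (flywheel)."""
    return "guardrail" if DIMENSIONS[dimension]["catastrophic"] else "flywheel"
```

Keeping the registry small is the point: every dimension routed to "guardrail" adds latency to every user-facing response.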
---
## Rubric Design
Generic metrics are meaningless without context. "Helpfulness" in real estate means summarizing listings clearly. In healthcare it means knowing when *not* to answer.
A rubric must define:
1. The dimension being measured
2. What scores 1, 3, and 5 on a 5-point scale (or pass/fail criteria)
3. Domain-specific examples of acceptable vs. unacceptable behavior
Without rubrics, LLM judges produce noise rather than signal.
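A rubric can be expressed as plain data and rendered into the judge's instructions. A sketch for a hypothetical healthcare assistant; the anchor descriptions are illustrative, not a real clinical rubric:

```python
rubric = {
    "dimension": "helpfulness",
    "scale": {
        1: "Answers clinical questions it should have declined; no escalation",
        3: "Declines out-of-scope questions but offers no next step",
        5: "Declines out-of-scope questions and routes the user to a clinician",
    },
    "examples": {
        "acceptable": "I can't interpret lab results. Please contact your care team.",
        "unacceptable": "Your results look normal to me.",
    },
}

def judge_prompt(r: dict) -> str:
    """Render the rubric into an LLM-judge instruction block."""
    anchors = "\n".join(f"Score {s}: {d}" for s, d in sorted(r["scale"].items()))
    return f"Evaluate the response for {r['dimension']} on a 1-5 scale.\n{anchors}"
```

The same structure works for pass/fail criteria: replace the scale with two entries and have the judge return a binary verdict.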
---
## Reference Dataset Guidelines
- Start with **10-20 high-quality examples** — not 200 mediocre ones
- Cover: critical success scenarios, common user workflows, known edge cases, historical failure modes
- Have domain experts label the examples (not just engineers)
- Expand based on what you learn in production — don't build for hypothetical coverage
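One shape a reference dataset entry can take, following the Input / Expected / Actual split from Core Concepts (all field names and values here are illustrative):

```python
# One labeled example; "category" tracks which coverage bucket it fills.
example = {
    "id": "edge-017",
    "category": "known_edge_case",  # or: critical_success, common_workflow, historical_failure
    "input": {
        "query": "Can I cancel after the refund window?",
        "retrieved_docs": ["policy_v3.md#refunds"],
    },
    "expected": {
        "must_mention": ["refund window", "support escalation"],
        "must_not_claim": ["automatic refund"],
    },
    "labeled_by": "support-team-lead",  # a domain expert, not an engineer
}

def covers_categories(dataset: list[dict]) -> set[str]:
    """Quick composition check: which coverage categories are represented?"""
    return {e["category"] for e in dataset}
```

A composition check like `covers_categories` makes the "cover all four buckets" guideline mechanically verifiable before the dataset is declared done.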
---
## Eval Tooling Guide
| Tool | Type | Best For | Key Strength |
|------|------|----------|-------------|
| **RAGAS** | Python library | RAG evaluation | Purpose-built metrics: faithfulness, answer relevance, context precision/recall |
| **Langfuse** | Platform (open-source, self-hostable) | All system types | Strong tracing, prompt management, good for teams wanting infrastructure control |
| **LangSmith** | Platform (commercial) | LangChain/LangGraph ecosystems | Tightest integration with LangChain; best if already in that ecosystem |
| **Arize Phoenix** | Platform (open-source + hosted) | RAG + multi-agent tracing | Strong RAG eval + trace visualization; open-source with hosted option |
| **Braintrust** | Platform (commercial) | Model-agnostic evaluation | Dataset and experiment management; good for comparing across frameworks |
| **Promptfoo** | CLI tool (open-source) | Prompt testing, CI/CD | CLI-first, excellent for CI/CD prompt regression testing |
### Tool Selection by System Type
| System Type | Recommended Tooling |
|-------------|---------------------|
| RAG / Knowledge Q&A | RAGAS + Arize Phoenix or Braintrust |
| Multi-agent systems | Langfuse + Arize Phoenix |
| Conversational / single-model | Promptfoo + Braintrust |
| Structured extraction | Promptfoo + code-based validators |
| LangChain/LangGraph projects | LangSmith (native integration) |
| Production monitoring (all types) | Langfuse, Arize Phoenix, or LangSmith |
---
## Evals in the Development Lifecycle
### Plan Phase (Evaluation-Aware Design)
Before writing code, define:
1. What type of AI system is being built → determines framework and dominant eval concerns
2. Critical failure modes (3-5 behaviors that cannot go wrong)
3. Rubrics — explicit definitions of acceptable/unacceptable behavior per dimension
4. Evaluation strategy — which dimensions use code metrics, LLM judges, or human review
5. Reference dataset requirements — size, composition, labeling approach
6. Eval tooling selection
Output: EVALS-SPEC section of AI-SPEC.md
### Execute Phase (Instrument While Building)
- Add tracing from day one (Langfuse, Arize Phoenix, or LangSmith)
- Build reference dataset concurrently with implementation
- Implement code-based checks first; add LLM judges only for subjective dimensions
- Run evals in CI/CD via Promptfoo or Braintrust
### Verify Phase (Pre-Deployment Validation)
- Run full reference dataset against all metrics
- Conduct human review of edge cases and LLM judge disagreements
- Calibrate LLM judges against human scores (target ≥ 0.7 correlation before trusting)
- Define and configure production guardrails
- Establish monitoring baseline
### Monitor Phase (Production Evaluation Loop)
- Smart sampling — weight toward interactions with concerning signals (retries, unusual length, explicit escalations)
- Online guardrails on every interaction
- Offline flywheel on sampled batch
- Watch for signal-metric divergence — the early warning system for evaluation gaps
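Smart sampling can be as simple as weighted random selection over interaction records. A sketch with made-up records and weights (the signal fields and weight values are illustrative):

```python
import random

# Hypothetical interaction log entries with concerning-signal fields.
interactions = [
    {"id": 1, "retries": 0, "escalated": False, "length": 120},
    {"id": 2, "retries": 2, "escalated": False, "length": 3400},
    {"id": 3, "retries": 0, "escalated": True,  "length": 90},
]

def weight(x: dict) -> float:
    """Higher weight for retries, escalations, and unusually long outputs."""
    w = 1.0
    w += 2.0 * x["retries"]
    w += 3.0 if x["escalated"] else 0.0
    w += 1.0 if x["length"] > 2000 else 0.0
    return w

# Draw the offline-review batch, biased toward concerning interactions.
sample = random.choices(interactions, weights=[weight(x) for x in interactions], k=1)
```

Uniform sampling mostly surfaces routine interactions; weighting by signals concentrates the offline flywheel's limited review budget where failures are likelier to hide.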
---
## Common Pitfalls
1. **Assuming benchmarks predict product success** — they don't; model evals are a filter, not a verdict
2. **Engineering evals in isolation** — domain experts must co-define rubrics; engineers alone miss critical nuances
3. **Building comprehensive coverage on day one** — start small (10-20 examples), expand from real failure modes
4. **Trusting uncalibrated LLM judges** — validate against human judgment before relying on them
5. **Measuring everything** — only track metrics that drive decisions; "collect it all" produces noise
6. **Treating evaluation as one-time setup** — user behavior evolves, requirements change, failure modes emerge; evaluation is continuous

View File

@@ -0,0 +1,186 @@
# AI Framework Decision Matrix
> Reference used by `gsd-framework-selector` and `gsd-ai-researcher`.
> Distilled from official docs, benchmarks, and developer reports (2026).
---
## Quick Picks
| Situation | Pick |
|-----------|------|
| Simplest path to a working agent (OpenAI) | OpenAI Agents SDK |
| Simplest path to a working agent (model-agnostic) | CrewAI |
| Production RAG / document Q&A | LlamaIndex |
| Complex stateful workflows with branching | LangGraph |
| Multi-agent teams with defined roles | CrewAI |
| Code-aware autonomous agents (Anthropic) | Claude Agent SDK |
| "I don't know my requirements yet" | LangChain |
| Regulated / audit-trail required | LangGraph |
| Enterprise Microsoft/.NET shops | AutoGen/AG2 |
| Google Cloud / Gemini-committed teams | Google ADK |
| Pure NLP pipelines with explicit control | Haystack |
---
## Framework Profiles
### CrewAI
- **Type:** Multi-agent orchestration
- **Language:** Python only
- **Model support:** Model-agnostic
- **Learning curve:** Beginner (role/task/crew maps to real teams)
- **Best for:** Content pipelines, research automation, business process workflows, rapid prototyping
- **Avoid if:** Fine-grained state management, TypeScript, fault-tolerant checkpointing, complex conditional branching
- **Strengths:** Fastest multi-agent prototyping, 5.76x faster than LangGraph on QA tasks, built-in memory (short/long/entity/contextual), Flows architecture, standalone (no LangChain dep)
- **Weaknesses:** Limited checkpointing, coarse error handling, Python only
- **Eval concerns:** Task decomposition accuracy, inter-agent handoff, goal completion rate, loop detection
### LlamaIndex
- **Type:** RAG and data ingestion
- **Language:** Python + TypeScript
- **Model support:** Model-agnostic
- **Learning curve:** Intermediate
- **Best for:** Legal research, internal knowledge assistants, enterprise document search, any system where retrieval quality is the #1 priority
- **Avoid if:** Primary need is agent orchestration, multi-agent collaboration, or chatbot conversation flow
- **Strengths:** Best-in-class document parsing (LlamaParse), 35% retrieval accuracy improvement, 20-30% faster queries, mixed retrieval strategies (vector + graph + reranker)
- **Weaknesses:** Data framework first — agent orchestration is secondary
- **Eval concerns:** Context faithfulness, hallucination, answer relevance, retrieval precision/recall
### LangChain
- **Type:** General-purpose LLM framework
- **Language:** Python + TypeScript
- **Model support:** Model-agnostic (widest ecosystem)
- **Learning curve:** Intermediate-Advanced
- **Best for:** Evolving requirements, many third-party integrations, teams wanting one framework for everything, RAG + agents + chains
- **Avoid if:** Simple well-defined use case, RAG-primary (use LlamaIndex), complex stateful workflows (use LangGraph), performance at scale is critical
- **Strengths:** Largest community and integration ecosystem, 25% faster development vs scratch, covers RAG/agents/chains/memory
- **Weaknesses:** Abstraction overhead, p99 latency degrades under load, complexity creep risk
- **Eval concerns:** End-to-end task completion, chain correctness, retrieval quality
### LangGraph
- **Type:** Stateful agent workflows (graph-based)
- **Language:** Python + TypeScript (full parity)
- **Model support:** Model-agnostic (inherits LangChain integrations)
- **Learning curve:** Intermediate-Advanced (graph mental model)
- **Best for:** Production-grade stateful workflows, regulated industries, audit trails, human-in-the-loop flows, fault-tolerant multi-step agents
- **Avoid if:** Simple chatbot, purely linear workflow, rapid prototyping
- **Strengths:** Best checkpointing (every node), time-travel debugging, native Postgres/Redis persistence, streaming support, chosen by 62% of developers for stateful agent work (2026)
- **Weaknesses:** More upfront scaffolding, steeper curve, overkill for simple cases
- **Eval concerns:** State transition correctness, goal completion rate, tool use accuracy, safety guardrails
### OpenAI Agents SDK
- **Type:** Native OpenAI agent framework
- **Language:** Python + TypeScript
- **Model support:** Optimized for OpenAI (supports 100+ via Chat Completions compatibility)
- **Learning curve:** Beginner (4 primitives: Agents, Handoffs, Guardrails, Tracing)
- **Best for:** OpenAI-committed teams, rapid agent prototyping, voice agents (gpt-realtime), teams wanting visual builder (AgentKit)
- **Avoid if:** Model flexibility needed, complex multi-agent collaboration, persistent state management required, vendor lock-in concern
- **Strengths:** Simplest mental model, built-in tracing and guardrails, Handoffs for agent delegation, Realtime Agents for voice
- **Weaknesses:** OpenAI vendor lock-in, no built-in persistent state, younger ecosystem
- **Eval concerns:** Instruction following, safety guardrails, escalation accuracy, tone consistency
### Claude Agent SDK (Anthropic)
- **Type:** Code-aware autonomous agent framework
- **Language:** Python + TypeScript
- **Model support:** Claude models only
- **Learning curve:** Intermediate (18 hook events, MCP, tool decorators)
- **Best for:** Developer tooling, code generation/review agents, autonomous coding assistants, MCP-heavy architectures, safety-critical applications
- **Avoid if:** Model flexibility needed, stable/mature API required, use case unrelated to code/tool-use
- **Strengths:** Deepest MCP integration, built-in filesystem/shell access, 18 lifecycle hooks, automatic context compaction, extended thinking, safety-first design
- **Weaknesses:** Claude-only vendor lock-in, newer/evolving API, smaller community
- **Eval concerns:** Tool use correctness, safety, code quality, instruction following
### AutoGen / AG2 / Microsoft Agent Framework
- **Type:** Multi-agent conversational framework
- **Language:** Python (AG2), Python + .NET (Microsoft Agent Framework)
- **Model support:** Model-agnostic
- **Learning curve:** Intermediate-Advanced
- **Best for:** Research applications, conversational problem-solving, code generation + execution loops, Microsoft/.NET shops
- **Avoid if:** You want ecosystem stability, deterministic workflows, or "safest long-term bet" (fragmentation risk)
- **Strengths:** Most sophisticated conversational agent patterns, code generation + execution loop, async event-driven (v0.4+), cross-language interop (Microsoft Agent Framework)
- **Weaknesses:** Ecosystem fragmented (AutoGen maintenance mode, AG2 fork, Microsoft Agent Framework preview) — genuine long-term risk
- **Eval concerns:** Conversation goal completion, consensus quality, code execution correctness
### Google ADK (Agent Development Kit)
- **Type:** Multi-agent orchestration framework
- **Language:** Python + Java
- **Model support:** Optimized for Gemini; supports other models via LiteLLM
- **Learning curve:** Intermediate (agent/tool/session model, familiar if you know LangGraph)
- **Best for:** Google Cloud / Vertex AI shops, multi-agent workflows needing built-in session management and memory, teams already committed to Gemini, agent pipelines that need Google Search / BigQuery tool integration
- **Avoid if:** Model flexibility is required beyond Gemini, no Google Cloud dependency acceptable, TypeScript-only stack
- **Strengths:** First-party Google support, built-in session/memory/artifact management, tight Vertex AI and Google Search integration, own eval framework (RAGAS-compatible), multi-agent by design (sequential, parallel, loop patterns), Java SDK for enterprise teams
- **Weaknesses:** Gemini vendor lock-in in practice, younger community than LangChain/LlamaIndex, less third-party integration depth
- **Eval concerns:** Multi-agent task decomposition, tool use correctness, session state consistency, goal completion rate
### Haystack
- **Type:** NLP pipeline framework
- **Language:** Python
- **Model support:** Model-agnostic
- **Learning curve:** Intermediate
- **Best for:** Explicit, auditable NLP pipelines, document processing with fine-grained control, enterprise search, regulated industries needing transparency
- **Avoid if:** Rapid prototyping, multi-agent workflows, or you want a large community
- **Strengths:** Explicit pipeline control, strong for structured data pipelines, good documentation
- **Weaknesses:** Smaller community, less agent-oriented than alternatives
- **Eval concerns:** Extraction accuracy, pipeline output validity, retrieval quality
---
## Decision Dimensions
### By System Type
| System Type | Primary Framework(s) | Key Eval Concerns |
|-------------|---------------------|-------------------|
| RAG / Knowledge Q&A | LlamaIndex, LangChain | Context faithfulness, hallucination, retrieval precision/recall |
| Multi-agent orchestration | CrewAI, LangGraph, Google ADK | Task decomposition, handoff quality, goal completion |
| Conversational assistants | OpenAI Agents SDK, Claude Agent SDK | Tone, safety, instruction following, escalation |
| Structured data extraction | LangChain, LlamaIndex | Schema compliance, extraction accuracy |
| Autonomous task agents | LangGraph, OpenAI Agents SDK | Safety guardrails, tool correctness, cost adherence |
| Content generation | Claude Agent SDK, OpenAI Agents SDK | Brand voice, factual accuracy, tone |
| Code automation | Claude Agent SDK | Code correctness, safety, test pass rate |
### By Team Size and Stage
| Context | Recommendation |
|---------|----------------|
| Solo dev, prototyping | OpenAI Agents SDK or CrewAI (fastest to running) |
| Solo dev, RAG | LlamaIndex (batteries included) |
| Team, production, stateful | LangGraph (best fault tolerance) |
| Team, evolving requirements | LangChain (broadest escape hatches) |
| Team, multi-agent | CrewAI (simplest role abstraction) |
| Enterprise, .NET | AutoGen/AG2 or Microsoft Agent Framework |
### By Model Commitment
| Preference | Framework |
|-----------|-----------|
| OpenAI-only | OpenAI Agents SDK |
| Anthropic/Claude-only | Claude Agent SDK |
| Google/Gemini-committed | Google ADK |
| Model-agnostic (full flexibility) | LangChain, LlamaIndex, CrewAI, LangGraph, Haystack |
---
## Anti-Patterns
1. **Using LangChain for simple chatbots** — Direct SDK call is less code, faster, and easier to debug
2. **Using CrewAI for complex stateful workflows** — Checkpointing gaps will bite you in production
3. **Using OpenAI Agents SDK with non-OpenAI models** — Loses the integration benefits you chose it for
4. **Using LlamaIndex as a multi-agent framework** — It can do agents, but that's not its strength
5. **Defaulting to LangChain without evaluating alternatives** — "Everyone uses it" ≠ right for your use case
6. **Starting a new project on AutoGen (not AG2)** — AutoGen is in maintenance mode; use AG2 or wait for Microsoft Agent Framework GA
7. **Choosing LangGraph for simple linear flows** — The graph overhead is not worth it; use LangChain chains instead
8. **Ignoring vendor lock-in** — Provider-native SDKs (OpenAI, Claude) trade flexibility for integration depth; decide consciously
---
## Combination Plays (Multi-Framework Stacks)
| Production Pattern | Stack |
|-------------------|-------|
| RAG with observability | LlamaIndex + LangSmith or Langfuse |
| Stateful agent with RAG | LangGraph + LlamaIndex |
| Multi-agent with tracing | CrewAI + Langfuse |
| OpenAI agents with evals | OpenAI Agents SDK + Promptfoo or Braintrust |
| Claude agents with MCP | Claude Agent SDK + LangSmith or Arize Phoenix |

View File

@@ -250,6 +250,7 @@ Set via `workflow.*` namespace in config.json (e.g., `"workflow": { "research":
| `workflow.auto_advance` | boolean | `false` | `true`, `false` | Auto-advance to next phase after completion |
| `workflow.node_repair` | boolean | `true` | `true`, `false` | Attempt automatic repair of failed plan nodes |
| `workflow.node_repair_budget` | number | `2` | Any positive integer | Max repair retries per failed node |
| `workflow.ai_integration_phase` | boolean | `true` | `true`, `false` | Run /gsd-ai-integration-phase before planning AI system phases |
| `workflow.ui_phase` | boolean | `true` | `true`, `false` | Generate UI-SPEC.md for frontend phases |
| `workflow.ui_safety_gate` | boolean | `true` | `true`, `false` | Require safety gate approval for UI changes |
| `workflow.text_mode` | boolean | `false` | `true`, `false` | Use plain-text numbered lists instead of AskUserQuestion menus |

View File

@@ -0,0 +1,246 @@
# AI-SPEC — Phase {N}: {phase_name}
> AI design contract generated by `/gsd-ai-integration-phase`. Consumed by `gsd-planner` and `gsd-eval-auditor`.
> Locks framework selection, implementation guidance, and evaluation strategy before planning begins.
---
## 1. System Classification
**System Type:** <!-- RAG | Multi-Agent | Conversational | Extraction | Autonomous Agent | Content Generation | Code Automation | Hybrid -->
**Description:**
<!-- One-paragraph description of what this AI system does, who uses it, and what "good" looks like -->
**Critical Failure Modes:**
<!-- The 3-5 behaviors that absolutely cannot go wrong in this system -->
1.
2.
3.
---
## 1b. Domain Context
> Researched by `gsd-domain-researcher`. Grounds the evaluation strategy in domain expert knowledge.
**Industry Vertical:** <!-- healthcare | legal | finance | customer service | education | developer tooling | e-commerce | etc. -->
**User Population:** <!-- who uses this system and in what context -->
**Stakes Level:** <!-- Low | Medium | High | Critical -->
**Output Consequence:** <!-- what happens downstream when the AI output is acted on -->
### What Domain Experts Evaluate Against
<!-- Domain-specific rubric ingredients — in practitioner language, not AI jargon -->
<!-- Format: Dimension / Good (expert accepts) / Bad (expert flags) / Stakes / Source -->
### Known Failure Modes in This Domain
<!-- Domain-specific failure modes from research — not generic hallucination, but how it manifests here -->
### Regulatory / Compliance Context
<!-- Relevant regulations or constraints — or "None identified" if genuinely none apply -->
### Domain Expert Roles for Evaluation
| Role | Responsibility |
|------|---------------|
| <!-- e.g., Senior practitioner --> | <!-- Dataset labeling / rubric calibration / production sampling --> |
---
## 2. Framework Decision
**Selected Framework:** <!-- e.g., LlamaIndex v0.10.x -->
**Version:** <!-- Pin the version -->
**Rationale:**
<!-- Why this framework fits this system type, team context, and production requirements -->
**Alternatives Considered:**
| Framework | Ruled Out Because |
|-----------|------------------|
| | |
**Vendor Lock-In Accepted:** <!-- Yes / No / Partial — document the trade-off consciously -->
---
## 3. Framework Quick Reference
> Fetched from official docs by `gsd-ai-researcher`. Distilled for this specific use case.
### Installation
```bash
# Install command(s)
```
### Core Imports
```python
# Key imports for this use case
```
### Entry Point Pattern
```python
# Minimal working example for this system type
```
### Key Abstractions
<!-- Framework-specific concepts the developer must understand before coding -->
| Concept | What It Is | When You Use It |
|---------|-----------|-----------------|
| | | |
### Common Pitfalls
<!-- Gotchas specific to this framework and system type — from docs, issues, and community reports -->
1.
2.
3.
### Recommended Project Structure
```
project/
├── # Framework-specific folder layout
```
---
## 4. Implementation Guidance
**Model Configuration:**
<!-- Which model(s), temperature, max tokens, and other key parameters -->
**Core Pattern:**
<!-- The primary implementation pattern for this system type in this framework -->
**Tool Use:**
<!-- Tools/integrations needed and how to configure them -->
**State Management:**
<!-- How state is persisted, retrieved, and updated -->
**Context Window Strategy:**
<!-- How to manage context limits for this system type -->
---
## 4b. AI Systems Best Practices
> Written by `gsd-ai-researcher`. Cross-cutting patterns every developer building AI systems needs — independent of framework choice.
### Structured Outputs with Pydantic
<!-- Framework-specific Pydantic integration pattern for this use case -->
<!-- Include: output model definition, how the framework uses it, retry logic on validation failure -->
```python
# Pydantic output model for this system type
```
### Async-First Design
<!-- How async is handled in this framework, the one common mistake, and when to stream vs. await -->
### Prompt Engineering Discipline
<!-- System vs. user prompt separation, few-shot guidance, token budget strategy -->
### Context Window Management
<!-- Strategy specific to this system type: RAG chunking / conversation summarization / agent compaction -->
### Cost and Latency Budget
<!-- Per-call cost estimate, caching strategy, sub-task model routing -->
---
## 5. Evaluation Strategy
### Dimensions
| Dimension | Rubric (Pass/Fail or 1-5) | Measurement Approach | Priority |
|-----------|--------------------------|---------------------|----------|
| | | Code / LLM Judge / Human | Critical / High / Medium |
### Eval Tooling
**Primary Tool:** <!-- e.g., RAGAS + Langfuse -->
**Setup:**
```bash
# Install and configure
```
**CI/CD Integration:**
```bash
# Command to run evals in CI/CD pipeline
```
### Reference Dataset
**Size:** <!-- e.g., 20 examples to start -->
**Composition:**
<!-- What scenario types the dataset covers: critical paths, edge cases, failure modes -->
**Labeling:**
<!-- Who labels examples and how (domain expert, LLM judge with calibration, etc.) -->
---
## 6. Guardrails
### Online (Real-Time)
| Guardrail | Trigger | Intervention |
|-----------|---------|--------------|
| | | Block / Escalate / Flag |
### Offline (Flywheel)
| Metric | Sampling Strategy | Action on Degradation |
|--------|------------------|----------------------|
| | | |
---
## 7. Production Monitoring
**Tracing Tool:** <!-- e.g., Langfuse self-hosted -->
**Key Metrics to Track:**
<!-- 3-5 metrics that will be monitored in production -->
**Alert Thresholds:**
<!-- When to page/alert -->
**Smart Sampling Strategy:**
<!-- How to select interactions for human review — signal-based filters -->
---
## Checklist
- [ ] System type classified
- [ ] Critical failure modes identified (≥ 3)
- [ ] Domain context researched (Section 1b: vertical, stakes, expert criteria, failure modes)
- [ ] Regulatory/compliance context identified or explicitly noted as none
- [ ] Domain expert roles defined for evaluation involvement
- [ ] Framework selected with rationale documented
- [ ] Alternatives considered and ruled out
- [ ] Framework quick reference written (install, imports, pattern, pitfalls)
- [ ] AI systems best practices written (Section 4b: Pydantic, async, prompt discipline, context)
- [ ] Evaluation dimensions grounded in domain rubric ingredients
- [ ] Each eval dimension has a concrete rubric (Good/Bad in domain language)
- [ ] Eval tooling selected — Arize Phoenix default confirmed or override noted
- [ ] Reference dataset spec written (size ≥ 10, composition + labeling defined)
- [ ] CI/CD eval integration specified
- [ ] Online guardrails defined
- [ ] Production monitoring configured (tracing tool + sampling strategy)


@@ -0,0 +1,282 @@
<purpose>
Generate an AI design contract (AI-SPEC.md) for phases that involve building AI systems. Orchestrates gsd-framework-selector → gsd-ai-researcher → gsd-domain-researcher → gsd-eval-planner with a validation gate. Inserts between discuss-phase and plan-phase in the GSD lifecycle.
AI-SPEC.md locks four things before the planner creates tasks:
1. Framework selection (with rationale and alternatives)
2. Implementation guidance (correct syntax, patterns, pitfalls from official docs)
3. Domain context (practitioner rubric ingredients, failure modes, regulatory constraints)
4. Evaluation strategy (dimensions, rubrics, tooling, reference dataset, guardrails)
This prevents the two most common AI development failures: choosing the wrong framework for the use case, and treating evaluation as an afterthought.
</purpose>
<required_reading>
@~/.claude/get-shit-done/references/ai-frameworks.md
@~/.claude/get-shit-done/references/ai-evals.md
</required_reading>
<process>
## 1. Initialize
```bash
INIT=$(node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" init plan-phase "$PHASE")
if [[ "$INIT" == @file:* ]]; then INIT=$(cat "${INIT#@file:}"); fi
```
Parse JSON for: `phase_dir`, `phase_number`, `phase_name`, `phase_slug`, `padded_phase`, `has_context`, `has_research`, `commit_docs`, `planning_exists`.
**File paths:** `state_path`, `roadmap_path`, `requirements_path`, `context_path`.
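The workflow does not mandate a parsing mechanism for the init JSON; one sketch, reusing the same `node` binary the surrounding tool calls already depend on, is a small helper (`init_field` is illustrative, not part of gsd-tools):

```bash
# init_field: extract one field from the $INIT JSON by name (sketch; field
# names are the ones listed above, e.g. phase_dir, padded_phase)
init_field() {
  printf '%s' "$INIT" | node -pe "JSON.parse(require('fs').readFileSync(0, 'utf8')).$1"
}
# e.g. PHASE_DIR=$(init_field phase_dir); PADDED_PHASE=$(init_field padded_phase)
```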
Resolve agent models:
```bash
SELECTOR_MODEL=$(node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" resolve-model gsd-framework-selector --raw)
RESEARCHER_MODEL=$(node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" resolve-model gsd-ai-researcher --raw)
DOMAIN_MODEL=$(node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" resolve-model gsd-domain-researcher --raw)
PLANNER_MODEL=$(node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" resolve-model gsd-eval-planner --raw)
```
Check config:
```bash
AI_PHASE_ENABLED=$(node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" config-get workflow.ai_integration_phase 2>/dev/null || echo "true")
```
**If `AI_PHASE_ENABLED` is `false`:**
```
AI phase is disabled in config. Enable via /gsd-settings.
```
Exit workflow.
**If `planning_exists` is false:** Error — run `/gsd-new-project` first.
## 2. Parse and Validate Phase
Extract phase number from $ARGUMENTS. If not provided, detect next unplanned phase.
```bash
PHASE_INFO=$(node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" roadmap get-phase "${PHASE}")
```
**If `found` is false:** Error with available phases.
## 3. Check Prerequisites
**If `has_context` is false:**
```
No CONTEXT.md found for Phase {N}.
Recommended: run /gsd-discuss-phase {N} first to capture framework preferences.
Continuing without user decisions — framework selector will ask all questions.
```
Continue (non-blocking).
## 4. Check Existing AI-SPEC
```bash
AI_SPEC_FILE=$(ls "${PHASE_DIR}"/*-AI-SPEC.md 2>/dev/null | head -1)
```
**If exists:** Use AskUserQuestion:
- header: "Existing AI-SPEC"
- question: "AI-SPEC.md already exists for Phase {N}. What would you like to do?"
- options:
- "Update — re-run with existing as baseline"
- "View — display current AI-SPEC and exit"
- "Skip — keep current AI-SPEC and exit"
If "View": display file contents, exit.
If "Skip": exit.
If "Update": continue to step 5.
## 5. Spawn gsd-framework-selector
Display:
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GSD ► AI DESIGN CONTRACT — PHASE {N}: {name}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
◆ Step 1/4 — Framework Selection...
```
Spawn `gsd-framework-selector` with:
```markdown
Read ~/.claude/agents/gsd-framework-selector.md for instructions.
<objective>
Select the right AI framework for Phase {phase_number}: {phase_name}
Goal: {phase_goal}
</objective>
<files_to_read>
{context_path if exists}
{requirements_path if exists}
</files_to_read>
<phase_context>
Phase: {phase_number} — {phase_name}
Goal: {phase_goal}
</phase_context>
```
Parse selector output for: `primary_framework`, `system_type`, `model_provider`, `eval_concerns`, `alternative_framework`.
**If selector fails or returns empty:** Exit with error — "Framework selection failed. Re-run /gsd-ai-integration-phase {N} or answer the framework question in /gsd-discuss-phase {N} first."
## 6. Initialize AI-SPEC.md
Copy template:
```bash
cp "$HOME/.claude/get-shit-done/templates/AI-SPEC.md" "${PHASE_DIR}/${PADDED_PHASE}-AI-SPEC.md"
```
Fill in header fields:
- Phase number and name
- System classification (from selector)
- Selected framework (from selector)
- Alternative considered (from selector)
## 7. Spawn gsd-ai-researcher
Display:
```
◆ Step 2/4 — Researching {primary_framework} docs + AI systems best practices...
```
Spawn `gsd-ai-researcher` with:
```markdown
Read ~/.claude/agents/gsd-ai-researcher.md for instructions.
<objective>
Research {primary_framework} for Phase {phase_number}: {phase_name}
Write Sections 3 and 4 of AI-SPEC.md
</objective>
<files_to_read>
{ai_spec_path}
{context_path if exists}
</files_to_read>
<input>
framework: {primary_framework}
system_type: {system_type}
model_provider: {model_provider}
ai_spec_path: {ai_spec_path}
phase_context: Phase {phase_number}: {phase_name} — {phase_goal}
</input>
```
## 8. Spawn gsd-domain-researcher
Display:
```
◆ Step 3/4 — Researching domain context and expert evaluation criteria...
```
Spawn `gsd-domain-researcher` with:
```markdown
Read ~/.claude/agents/gsd-domain-researcher.md for instructions.
<objective>
Research the business domain and expert evaluation criteria for Phase {phase_number}: {phase_name}
Write Section 1b (Domain Context) of AI-SPEC.md
</objective>
<files_to_read>
{ai_spec_path}
{context_path if exists}
{requirements_path if exists}
</files_to_read>
<input>
system_type: {system_type}
phase_name: {phase_name}
phase_goal: {phase_goal}
ai_spec_path: {ai_spec_path}
</input>
```
## 9. Spawn gsd-eval-planner
Display:
```
◆ Step 4/4 — Designing evaluation strategy from domain + technical context...
```
Spawn `gsd-eval-planner` with:
```markdown
Read ~/.claude/agents/gsd-eval-planner.md for instructions.
<objective>
Design evaluation strategy for Phase {phase_number}: {phase_name}
Write Sections 5, 6, and 7 of AI-SPEC.md
AI-SPEC.md now contains domain context (Section 1b) — use it as your rubric starting point.
</objective>
<files_to_read>
{ai_spec_path}
{context_path if exists}
{requirements_path if exists}
</files_to_read>
<input>
system_type: {system_type}
framework: {primary_framework}
model_provider: {model_provider}
phase_name: {phase_name}
phase_goal: {phase_goal}
ai_spec_path: {ai_spec_path}
</input>
```
## 10. Validate AI-SPEC Completeness
Read the completed AI-SPEC.md. Check that:
- Section 2 has a framework name (not placeholder)
- Section 1b has at least one domain rubric ingredient (Good/Bad/Stakes)
- Section 3 has a non-empty code block (entry point pattern)
- Section 4b has a Pydantic example
- Section 5 has at least one row in the dimensions table
- Section 6 has at least one guardrail or explicit "N/A for internal tool" note
- Checklist section at end has 3+ items checked
**If validation fails:** Display specific missing sections. Ask user if they want to re-run the specific step or continue anyway.
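These checks can be sketched as greps over the generated file. This is a sketch only: the heading names match the AI-SPEC template in this diff, the `check_ai_spec` helper name is illustrative, and a real pass would also reject unfilled `<!-- -->` placeholders:

```bash
# check_ai_spec: print the names of missing AI-SPEC sections (empty output = pass)
check_ai_spec() {
  local spec="$1" missing=""
  grep -q "^## 1b. Domain Context"     "$spec" || missing="$missing section-1b"
  grep -q "^## 2. Framework Decision"  "$spec" || missing="$missing section-2"
  grep -q "^## 5. Evaluation Strategy" "$spec" || missing="$missing section-5"
  # A checked checklist item looks like "- [x] ..."
  [ "$(grep -c '^- \[x\]' "$spec")" -ge 3 ]    || missing="$missing checklist"
  printf '%s\n' "$missing"
}
```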
## 11. Commit
**If `commit_docs` is true:**
```bash
git add "${AI_SPEC_FILE}"
git commit -m "docs({phase_slug}): generate AI-SPEC.md — {primary_framework} + domain context + eval strategy"
```
## 12. Display Completion
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GSD ► AI-SPEC COMPLETE — PHASE {N}: {name}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
◆ Framework: {primary_framework}
◆ System Type: {system_type}
◆ Domain: {domain_vertical from Section 1b}
◆ Eval Dimensions: {eval_concerns}
◆ Tracing Default: Arize Phoenix (or detected existing tool)
◆ Output: {ai_spec_path}
Next step:
/gsd-plan-phase {N} — planner will consume AI-SPEC.md
```
</process>
<success_criteria>
- [ ] Framework selected with rationale (Section 2)
- [ ] AI-SPEC.md created from template
- [ ] Framework docs + AI best practices researched (Sections 3, 4, 4b populated)
- [ ] Domain context + expert rubric ingredients researched (Section 1b populated)
- [ ] Eval strategy grounded in domain context (Sections 5-7 populated)
- [ ] Arize Phoenix (or detected tool) set as tracing default in Section 7
- [ ] AI-SPEC.md validated (Sections 1b, 2, 3, 4b, 5, 6 all non-empty)
- [ ] Committed if commit_docs enabled
- [ ] Next step surfaced to user
</success_criteria>


@@ -0,0 +1,153 @@
<purpose>
Retroactive audit of an implemented AI phase's evaluation coverage. Standalone command that works on any GSD-managed AI phase. Produces a scored EVAL-REVIEW.md with gap analysis and remediation plan.
Use after /gsd-execute-phase to verify that the evaluation strategy from AI-SPEC.md was actually implemented. Mirrors the pattern of /gsd-ui-review and /gsd-validate-phase.
</purpose>
<required_reading>
@~/.claude/get-shit-done/references/ai-evals.md
</required_reading>
<process>
## 0. Initialize
```bash
INIT=$(node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" init phase-op "${PHASE_ARG}")
if [[ "$INIT" == @file:* ]]; then INIT=$(cat "${INIT#@file:}"); fi
```
Parse: `phase_dir`, `phase_number`, `phase_name`, `phase_slug`, `padded_phase`, `commit_docs`.
```bash
AUDITOR_MODEL=$(node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" resolve-model gsd-eval-auditor --raw)
```
Display banner:
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GSD ► EVAL AUDIT — PHASE {N}: {name}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
## 1. Detect Input State
```bash
SUMMARY_FILES=$(ls "${PHASE_DIR}"/*-SUMMARY.md 2>/dev/null)
AI_SPEC_FILE=$(ls "${PHASE_DIR}"/*-AI-SPEC.md 2>/dev/null | head -1)
EVAL_REVIEW_FILE=$(ls "${PHASE_DIR}"/*-EVAL-REVIEW.md 2>/dev/null | head -1)
```
**State A** — AI-SPEC.md + SUMMARY.md exist: Full audit against spec
**State B** — SUMMARY.md exists, no AI-SPEC.md: Audit against general best practices
**State C** — No SUMMARY.md: Exit — "Phase {N} not executed. Run /gsd-execute-phase {N} first."
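The three states can be sketched as a small classifier over the two `ls` results above (a sketch; the real workflow branches inline, and `audit_state` is an illustrative name):

```bash
# audit_state: classify a phase into State A/B/C from file presence
# $1 = SUMMARY file list (may be empty), $2 = AI-SPEC file path (may be empty)
audit_state() {
  if [ -n "$1" ] && [ -n "$2" ]; then
    echo A   # full audit against AI-SPEC
  elif [ -n "$1" ]; then
    echo B   # audit against general best practices
  else
    echo C   # phase not executed — exit
  fi
}
```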
**If `EVAL_REVIEW_FILE` non-empty:** Use AskUserQuestion:
- header: "Existing Eval Review"
- question: "EVAL-REVIEW.md already exists for Phase {N}."
- options:
- "Re-audit — run fresh audit"
- "View — display current review and exit"
If "View": display file, exit.
If "Re-audit": continue.
**If State B (no AI-SPEC.md):** Warn:
```
No AI-SPEC.md found for Phase {N}.
Audit will evaluate against general AI eval best practices rather than a phase-specific plan.
Consider running /gsd-ai-integration-phase {N} before implementation next time.
```
Continue (non-blocking).
## 2. Gather Context Paths
Build file list for auditor:
- AI-SPEC.md (if exists — the planned eval strategy)
- All SUMMARY.md files in phase dir
- All PLAN.md files in phase dir
## 3. Spawn gsd-eval-auditor
```
◆ Spawning eval auditor...
```
Build prompt:
```markdown
Read ~/.claude/agents/gsd-eval-auditor.md for instructions.
<objective>
Conduct evaluation coverage audit of Phase {phase_number}: {phase_name}
{If AI-SPEC exists: "Audit against AI-SPEC.md evaluation plan."}
{If no AI-SPEC: "Audit against general AI eval best practices."}
</objective>
<files_to_read>
- {summary_paths}
- {plan_paths}
- {ai_spec_path if exists}
</files_to_read>
<input>
ai_spec_path: {ai_spec_path or "none"}
phase_dir: {phase_dir}
phase_number: {phase_number}
phase_name: {phase_name}
padded_phase: {padded_phase}
state: {A or B}
</input>
```
Spawn as Task with model `AUDITOR_MODEL`.
## 4. Parse Auditor Result
Read the written EVAL-REVIEW.md. Extract:
- `overall_score`
- `verdict` (PRODUCTION READY | NEEDS WORK | SIGNIFICANT GAPS | NOT IMPLEMENTED)
- `critical_gap_count`
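A sketch of pulling these fields out with grep/sed, assuming the review file carries `Score:` and `Verdict:` lines — the exact labels come from the auditor's template, and `extract_review_fields` is an illustrative helper name:

```bash
# extract_review_fields: set OVERALL_SCORE and VERDICT from an EVAL-REVIEW.md
extract_review_fields() {
  local f="$1"
  OVERALL_SCORE=$(grep -m1 -o 'Score: *[0-9][0-9]*' "$f" | grep -o '[0-9][0-9]*' | head -1)
  VERDICT=$(sed -n 's/^.*Verdict: *//p' "$f" | head -1)
}
```

Usage: `extract_review_fields "$EVAL_REVIEW_FILE"`, then read `$OVERALL_SCORE` and `$VERDICT`.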
## 5. Display Summary
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
GSD ► EVAL AUDIT COMPLETE — PHASE {N}: {name}
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
◆ Score: {overall_score}/100
◆ Verdict: {verdict}
◆ Critical Gaps: {critical_gap_count}
◆ Output: {eval_review_path}
{If PRODUCTION READY:}
Next step: /gsd-plan-phase (next phase) or deploy
{If NEEDS WORK:}
Address critical gaps in EVAL-REVIEW.md, then re-run /gsd-eval-review {N}
{If SIGNIFICANT GAPS or NOT IMPLEMENTED:}
Review AI-SPEC.md evaluation plan. Critical eval dimensions are not implemented.
Do not deploy until gaps are addressed.
```
## 6. Commit
**If `commit_docs` is true:**
```bash
git add "${EVAL_REVIEW_FILE}"
git commit -m "docs({phase_slug}): add EVAL-REVIEW.md — score {overall_score}/100 ({verdict})"
```
</process>
<success_criteria>
- [ ] Phase execution state detected correctly
- [ ] AI-SPEC.md presence handled (with or without)
- [ ] gsd-eval-auditor spawned with correct context
- [ ] EVAL-REVIEW.md written (by auditor)
- [ ] Score and verdict displayed to user
- [ ] Appropriate next steps surfaced based on verdict
- [ ] Committed if commit_docs enabled
</success_criteria>


@@ -241,6 +241,46 @@ If "Run discuss-phase first":
```
**Exit the plan-phase workflow. Do not continue.**
## 4.5. Check AI-SPEC
**Skip if:** the `--skip-ai-spec` flag is provided.
```bash
AI_SPEC_FILE=$(ls "${PHASE_DIR}"/*-AI-SPEC.md 2>/dev/null | head -1)
AI_PHASE_CFG=$(node "$HOME/.claude/get-shit-done/bin/gsd-tools.cjs" config-get workflow.ai_integration_phase 2>/dev/null || echo "true")
```
**Skip if `AI_PHASE_CFG` is `false`.**
**If `AI_SPEC_FILE` is empty:** Check phase goal for AI keywords:
```bash
echo "${phase_goal}" | grep -qi "agent\|llm\|rag\|chatbot\|embedding\|langchain\|llamaindex\|crewai\|langgraph\|openai\|anthropic\|vector\|eval\|ai system"
```
**If AI keywords detected AND no AI-SPEC.md:**
```
◆ Note: This phase appears to involve AI system development.
Consider running /gsd-ai-integration-phase {N} before planning to:
- Select the right framework for your use case
- Research its docs and best practices
- Design an evaluation strategy
Continue planning without AI-SPEC? (non-blocking — /gsd-ai-integration-phase can be run after)
```
Use AskUserQuestion with options:
- "Continue — plan without AI-SPEC"
- "Stop — I'll run /gsd-ai-integration-phase {N} first"
If "Stop": Exit with `/gsd-ai-integration-phase {N}` reminder.
If "Continue": Proceed. (Non-blocking — planner will note AI-SPEC is absent.)
**If `AI_SPEC_FILE` is non-empty:** Extract framework for planner context:
```bash
FRAMEWORK_LINE=$(grep "Selected Framework:" "${AI_SPEC_FILE}" | head -1)
```
Pass `ai_spec_path` and `framework_line` to planner in step 7 so it can reference the AI design contract.
## 5. Handle Research
**Skip if:** `--gaps` flag or `--skip-research` flag or `--reviews` flag.


@@ -32,6 +32,7 @@ Parse current values (default to `true` if not present):
- `workflow.nyquist_validation` — validation architecture research during plan-phase (default: true if absent)
- `workflow.ui_phase` — generate UI-SPEC.md design contracts for frontend phases (default: true if absent)
- `workflow.ui_safety_gate` — prompt to run /gsd-ui-phase before planning frontend phases (default: true if absent)
- `workflow.ai_integration_phase` — framework selection + eval strategy for AI phases (default: true if absent)
- `model_profile` — which model each agent uses (default: `balanced`)
- `git.branching_strategy` — branching approach (default: `"none"`)
- `workflow.use_worktrees` — whether parallel executor agents run in worktree isolation (default: `true`)
@@ -118,6 +119,15 @@ AskUserQuestion([
{ label: "No", description: "No prompt — plan-phase proceeds without UI-SPEC check." }
]
},
{
question: "Enable AI Phase? (framework selection + eval strategy for AI phases)",
header: "AI Phase",
multiSelect: false,
options: [
{ label: "Yes (Recommended)", description: "Run /gsd-ai-integration-phase before planning AI system phases. Surfaces the right framework, researches its docs, and designs the evaluation strategy." },
{ label: "No", description: "Skip AI design contract. Good for non-AI phases or when framework is already decided." }
]
},
{
question: "Git branching strategy?",
header: "Branching",
@@ -183,6 +193,7 @@ Merge new settings into existing config.json:
"nyquist_validation": true/false,
"ui_phase": true/false,
"ui_safety_gate": true/false,
"ai_integration_phase": true/false,
"text_mode": true/false,
"research_before_questions": true/false,
"discuss_mode": "discuss" | "assumptions",
@@ -244,6 +255,7 @@ Write `~/.gsd/defaults.json` with:
"nyquist_validation": <current>,
"ui_phase": <current>,
"ui_safety_gate": <current>,
"ai_integration_phase": <current>,
"skip_discuss": <current>
}
}
@@ -268,6 +280,7 @@ Display:
| Nyquist Validation | {On/Off} |
| UI Phase | {On/Off} |
| UI Safety Gate | {On/Off} |
| AI Integration Phase | {On/Off} |
| Git Branching | {None/Per Phase/Per Milestone} |
| Skip Discuss | {On/Off} |
| Context Warnings | {On/Off} |
@@ -287,7 +300,7 @@ Quick commands:
<success_criteria>
- [ ] Current config read
- [ ] User presented with 14 settings (profile + 11 workflow toggles + git branching + ctx warnings)
- [ ] Config updated with model_profile, workflow, and git sections
- [ ] User offered to save as global defaults (~/.gsd/defaults.json)
- [ ] Changes confirmed to user

tests/ai-evals.test.cjs

@@ -0,0 +1,400 @@
/**
* GSD AI Evals Framework Tests
*
* Validates the /gsd-ai-integration-phase + /gsd-eval-review contribution:
* - workflow.ai_integration_phase key in config defaults and config-set/get
* - W016 validate-health warning when ai_integration_phase absent
* - addAiIntegrationPhaseKey repair action
* - AI-SPEC.md template section completeness
* - New agent frontmatter (picked up by agent-frontmatter.test.cjs — covered there)
* - plan-phase.md Step 4.5 AI-keyword nudge block
* - ai-integration-phase and eval-review command frontmatter
* - ai-evals.md and ai-frameworks.md reference files exist and are non-empty
*/
const { test, describe, beforeEach, afterEach } = require('node:test');
const assert = require('node:assert');
const fs = require('fs');
const path = require('path');
const { runGsdTools, createTempProject, cleanup } = require('./helpers.cjs');
const REPO_ROOT = path.join(__dirname, '..');
const AGENTS_DIR = path.join(REPO_ROOT, 'agents');
const COMMANDS_DIR = path.join(REPO_ROOT, 'commands', 'gsd');
const WORKFLOWS_DIR = path.join(REPO_ROOT, 'get-shit-done', 'workflows');
const TEMPLATES_DIR = path.join(REPO_ROOT, 'get-shit-done', 'templates');
const REFERENCES_DIR = path.join(REPO_ROOT, 'get-shit-done', 'references');
// ─── Helpers ─────────────────────────────────────────────────────────────────
function readConfig(tmpDir) {
return JSON.parse(fs.readFileSync(path.join(tmpDir, '.planning', 'config.json'), 'utf-8'));
}
function writeConfig(tmpDir, obj) {
fs.writeFileSync(
path.join(tmpDir, '.planning', 'config.json'),
JSON.stringify(obj, null, 2),
'utf-8'
);
}
function writeMinimalHealth(tmpDir) {
fs.writeFileSync(path.join(tmpDir, '.planning', 'PROJECT.md'),
'# Project\n\n## What This Is\n\nFoo.\n\n## Core Value\n\nBar.\n\n## Requirements\n\nBaz.\n');
fs.writeFileSync(path.join(tmpDir, '.planning', 'ROADMAP.md'),
'# Roadmap\n\n### Phase 1: Setup\n');
fs.writeFileSync(path.join(tmpDir, '.planning', 'STATE.md'),
'# Session State\n\nPhase 1 in progress.\n');
fs.mkdirSync(path.join(tmpDir, '.planning', 'phases', '01-setup'), { recursive: true });
}
// ─── Config: workflow.ai_integration_phase default ───────────────────────────────────────
describe('CONFIG: workflow.ai_integration_phase default', () => {
let tmpDir;
beforeEach(() => { tmpDir = createTempProject(); });
afterEach(() => { cleanup(tmpDir); });
test('config-ensure-section includes workflow.ai_integration_phase as boolean', () => {
const result = runGsdTools('config-ensure-section', tmpDir);
assert.ok(result.success, `Command failed: ${result.error}`);
const config = readConfig(tmpDir);
assert.ok(config.workflow && typeof config.workflow === 'object', 'workflow should exist');
assert.strictEqual(typeof config.workflow.ai_integration_phase, 'boolean', 'workflow.ai_integration_phase should be boolean');
});
test('workflow.ai_integration_phase defaults to true', () => {
runGsdTools('config-ensure-section', tmpDir);
const config = readConfig(tmpDir);
assert.strictEqual(config.workflow.ai_integration_phase, true, 'workflow.ai_integration_phase should default to true');
});
});
// ─── Config: config-set / config-get workflow.ai_integration_phase ───────────────────────
describe('CONFIG: config-set / config-get workflow.ai_integration_phase', () => {
let tmpDir;
beforeEach(() => {
tmpDir = createTempProject();
runGsdTools('config-ensure-section', tmpDir);
});
afterEach(() => { cleanup(tmpDir); });
test('config-set workflow.ai_integration_phase false persists as boolean false', () => {
const result = runGsdTools('config-set workflow.ai_integration_phase false', tmpDir);
assert.ok(result.success, `config-set failed: ${result.error}`);
const config = readConfig(tmpDir);
assert.strictEqual(config.workflow.ai_integration_phase, false);
assert.strictEqual(typeof config.workflow.ai_integration_phase, 'boolean');
});
test('config-set workflow.ai_integration_phase true persists as boolean true', () => {
runGsdTools('config-set workflow.ai_integration_phase false', tmpDir);
const result = runGsdTools('config-set workflow.ai_integration_phase true', tmpDir);
assert.ok(result.success, `config-set failed: ${result.error}`);
const config = readConfig(tmpDir);
assert.strictEqual(config.workflow.ai_integration_phase, true);
});
test('config-get workflow.ai_integration_phase returns the stored value', () => {
runGsdTools('config-set workflow.ai_integration_phase false', tmpDir);
const result = runGsdTools('config-get workflow.ai_integration_phase', tmpDir);
assert.ok(result.success, `config-get failed: ${result.error}`);
assert.strictEqual(JSON.parse(result.output), false);
});
});
// ─── Validate Health: W016 ────────────────────────────────────────────────────
describe('HEALTH: W016 — workflow.ai_integration_phase absent', () => {
let tmpDir;
beforeEach(() => { tmpDir = createTempProject(); });
afterEach(() => { cleanup(tmpDir); });
test('emits W016 when workflow.ai_integration_phase absent from config', () => {
writeMinimalHealth(tmpDir);
writeConfig(tmpDir, { model_profile: 'balanced', workflow: { research: true, nyquist_validation: true } });
const result = runGsdTools('validate health', tmpDir);
assert.ok(result.success, `Command failed: ${result.error}`);
const output = JSON.parse(result.output);
assert.ok(
output.warnings.some(w => w.code === 'W016'),
`Expected W016 in warnings: ${JSON.stringify(output.warnings)}`
);
});
test('does not emit W016 when workflow.ai_integration_phase is explicitly set', () => {
writeMinimalHealth(tmpDir);
writeConfig(tmpDir, {
model_profile: 'balanced',
workflow: { research: true, nyquist_validation: true, ai_integration_phase: true },
});
const result = runGsdTools('validate health', tmpDir);
assert.ok(result.success, `Command failed: ${result.error}`);
const output = JSON.parse(result.output);
assert.ok(
!output.warnings.some(w => w.code === 'W016'),
`Should not have W016: ${JSON.stringify(output.warnings)}`
);
});
test('does not emit W016 when workflow.ai_integration_phase is false (explicit opt-out)', () => {
writeMinimalHealth(tmpDir);
writeConfig(tmpDir, {
model_profile: 'balanced',
workflow: { research: true, nyquist_validation: true, ai_integration_phase: false },
});
const result = runGsdTools('validate health', tmpDir);
assert.ok(result.success, `Command failed: ${result.error}`);
const output = JSON.parse(result.output);
assert.ok(
!output.warnings.some(w => w.code === 'W016'),
`Should not have W016: ${JSON.stringify(output.warnings)}`
);
});
});
// ─── Validate Health --repair: addAiIntegrationPhaseKey ─────────────────────────────────
describe('HEALTH --repair: addAiIntegrationPhaseKey', () => {
let tmpDir;
beforeEach(() => { tmpDir = createTempProject(); });
afterEach(() => { cleanup(tmpDir); });
test('adds workflow.ai_integration_phase via addAiIntegrationPhaseKey repair', () => {
writeMinimalHealth(tmpDir);
const configPath = path.join(tmpDir, '.planning', 'config.json');
fs.writeFileSync(configPath,
JSON.stringify({ model_profile: 'balanced', workflow: { research: true, nyquist_validation: true } }, null, 2)
);
const result = runGsdTools('validate health --repair', tmpDir);
assert.ok(result.success, `Command failed: ${result.error}`);
const output = JSON.parse(result.output);
const addAction = output.repairs_performed.find(r => r.action === 'addAiIntegrationPhaseKey');
assert.ok(addAction, `Expected addAiIntegrationPhaseKey action: ${JSON.stringify(output.repairs_performed)}`);
assert.strictEqual(addAction.success, true);
const config = readConfig(tmpDir);
assert.strictEqual(config.workflow.ai_integration_phase, true);
});
});
// ─── AI-SPEC.md Template Structure ───────────────────────────────────────────
describe('TEMPLATE: AI-SPEC.md section completeness', () => {
const templatePath = path.join(TEMPLATES_DIR, 'AI-SPEC.md');
let content;
test('AI-SPEC.md template exists', () => {
assert.ok(fs.existsSync(templatePath), 'AI-SPEC.md template should exist');
content = fs.readFileSync(templatePath, 'utf-8');
assert.ok(content.length > 100, 'AI-SPEC.md should be non-empty');
});
const requiredSections = [
['## 1. System Classification', 'Section 1 (System Classification)'],
['## 1b. Domain Context', 'Section 1b (Domain Context)'],
['## 2. Framework Decision', 'Section 2 (Framework Decision)'],
['## 3. Framework Quick Reference', 'Section 3 (Framework Quick Reference)'],
['## 4. Implementation Guidance', 'Section 4 (Implementation Guidance)'],
['## 4b. AI Systems Best Practices', 'Section 4b (AI Systems Best Practices)'],
['## 5. Evaluation Strategy', 'Section 5 (Evaluation Strategy)'],
['## 6. Guardrails', 'Section 6 (Guardrails)'],
['## 7. Production Monitoring', 'Section 7 (Production Monitoring)'],
['## Checklist', 'Checklist section'],
];
for (const [heading, label] of requiredSections) {
test(`template contains ${label}`, () => {
const c = fs.readFileSync(templatePath, 'utf-8');
assert.ok(c.includes(heading), `Template missing: ${heading}`);
});
}
test('template checklist has at least 10 items', () => {
const c = fs.readFileSync(templatePath, 'utf-8');
const items = (c.match(/^- \[[ x]\]/gm) || []);
assert.ok(items.length >= 10, `Expected ≥10 checklist items, found ${items.length}`);
});
test('template Section 1b has domain rubric table columns (Good/Bad/Stakes)', () => {
const c = fs.readFileSync(templatePath, 'utf-8');
assert.ok(c.includes('What Domain Experts Evaluate Against'), 'Missing domain rubric subsection');
for (const col of ['Good', 'Bad', 'Stakes']) {
assert.ok(c.includes(col), `Domain rubric table missing "${col}" column`);
}
});
test('template Section 4b has Pydantic structured outputs guidance', () => {
const c = fs.readFileSync(templatePath, 'utf-8');
assert.ok(c.includes('Pydantic'), 'Section 4b missing Pydantic guidance');
});
test('template Section 6 has online guardrails and offline flywheel tables', () => {
const c = fs.readFileSync(templatePath, 'utf-8');
assert.ok(c.includes('Online'), 'Section 6 missing Online guardrails');
assert.ok(c.includes('Offline'), 'Section 6 missing Offline flywheel');
});
});
// ─── Command Frontmatter ──────────────────────────────────────────────────────
describe('COMMAND: ai-integration-phase and eval-review frontmatter', () => {
const commands = ['ai-integration-phase', 'eval-review'];
for (const cmd of commands) {
test(`${cmd}.md exists`, () => {
const p = path.join(COMMANDS_DIR, `${cmd}.md`);
assert.ok(fs.existsSync(p), `commands/gsd/${cmd}.md should exist`);
});
test(`${cmd}.md has name, description, argument-hint`, () => {
const content = fs.readFileSync(path.join(COMMANDS_DIR, `${cmd}.md`), 'utf-8');
const frontmatter = content.split('---')[1] || '';
assert.ok(frontmatter.includes('name:'), `${cmd}.md missing name:`);
assert.ok(frontmatter.includes('description:'), `${cmd}.md missing description:`);
assert.ok(frontmatter.includes('argument-hint:'), `${cmd}.md missing argument-hint:`);
});
}
test('ai-integration-phase.md name is gsd:ai-integration-phase', () => {
const content = fs.readFileSync(path.join(COMMANDS_DIR, 'ai-integration-phase.md'), 'utf-8');
assert.ok(content.includes('name: gsd:ai-integration-phase'), 'ai-integration-phase command name mismatch');
});
test('eval-review.md name is gsd:eval-review', () => {
const content = fs.readFileSync(path.join(COMMANDS_DIR, 'eval-review.md'), 'utf-8');
assert.ok(content.includes('name: gsd:eval-review'), 'eval-review command name mismatch');
});
});
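The `split('---')[1]` frontmatter extraction used in the tests above relies on the command file opening with a `---` fence; a minimal sketch of why that works (sample content is hypothetical):

```javascript
// Sketch (assumes the file opens with a --- fence): splitting on '---'
// makes index 0 the empty prefix before the first fence and index 1 the
// frontmatter body between the fences.
const file = '---\nname: gsd:eval-review\ndescription: audit eval coverage\n---\n# Body\n';
const frontmatter = file.split('---')[1] || '';
// frontmatter now contains the name: and description: keys the tests check
```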
// ─── New Agents Exist ─────────────────────────────────────────────────────────
describe('AGENTS: new AI-evals agents exist', () => {
const newAgents = [
'gsd-framework-selector',
'gsd-ai-researcher',
'gsd-domain-researcher',
'gsd-eval-planner',
'gsd-eval-auditor',
];
for (const agent of newAgents) {
test(`${agent}.md exists`, () => {
assert.ok(
fs.existsSync(path.join(AGENTS_DIR, `${agent}.md`)),
`agents/${agent}.md should exist`
);
});
}
});
// ─── Reference Files ──────────────────────────────────────────────────────────
describe('REFERENCES: ai-frameworks.md and ai-evals.md', () => {
const refs = ['ai-frameworks.md', 'ai-evals.md'];
for (const ref of refs) {
test(`${ref} exists and is non-empty`, () => {
const p = path.join(REFERENCES_DIR, ref);
assert.ok(fs.existsSync(p), `references/${ref} should exist`);
const content = fs.readFileSync(p, 'utf-8');
assert.ok(content.length > 200, `references/${ref} should have substantial content`);
});
}
test('ai-frameworks.md covers key frameworks', () => {
const content = fs.readFileSync(path.join(REFERENCES_DIR, 'ai-frameworks.md'), 'utf-8');
for (const fw of ['CrewAI', 'LlamaIndex', 'LangChain', 'LangGraph']) {
assert.ok(content.includes(fw), `ai-frameworks.md should mention ${fw}`);
}
});
test('ai-evals.md covers eval tooling defaults', () => {
const content = fs.readFileSync(path.join(REFERENCES_DIR, 'ai-evals.md'), 'utf-8');
assert.ok(content.includes('Phoenix'), 'ai-evals.md should mention Arize Phoenix');
assert.ok(content.includes('RAGAS'), 'ai-evals.md should mention RAGAS');
});
});
// ─── Workflow: plan-phase Step 4.5 AI keyword nudge ──────────────────────────
describe('WORKFLOW: plan-phase.md AI nudge integration', () => {
const planPhasePath = path.join(WORKFLOWS_DIR, 'plan-phase.md');
test('plan-phase.md exists', () => {
assert.ok(fs.existsSync(planPhasePath), 'workflows/plan-phase.md should exist');
});
test('plan-phase.md contains AI keyword detection for LLM/agent/RAG terms', () => {
const content = fs.readFileSync(planPhasePath, 'utf-8');
const lower = content.toLowerCase();
for (const kw of ['llm', 'agent', 'rag']) {
assert.ok(lower.includes(kw), `plan-phase.md keyword detection should mention "${kw}"`);
}
});
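The intent of the detection check can be sketched standalone (illustrative only; assumes case-insensitive matching and hypothetical workflow text):

```javascript
// Illustrative sketch (assumes case-insensitive matching): the nudge's
// detection prose should cover every AI trigger term, not just one.
const keywords = ['llm', 'agent', 'rag'];
const workflowText = 'Step 4.5: nudge when the phase mentions an LLM, an agent loop, or a RAG pipeline.';
const missing = keywords.filter(kw => !workflowText.toLowerCase().includes(kw));
// missing is [] when all trigger terms are covered
```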
test('plan-phase.md references /gsd-ai-integration-phase nudge', () => {
const content = fs.readFileSync(planPhasePath, 'utf-8');
assert.ok(
content.includes('ai-integration-phase') || content.includes('ai_integration_phase'),
'plan-phase.md should reference ai-integration-phase workflow'
);
});
test('plan-phase.md references workflow.ai_integration_phase config toggle', () => {
const content = fs.readFileSync(planPhasePath, 'utf-8');
assert.ok(
content.includes('ai_integration_phase'),
'plan-phase.md should check workflow.ai_integration_phase config'
);
});
});
// ─── Workflow: ai-integration-phase and eval-review workflows exist ──────────────────────
describe('WORKFLOW: ai-integration-phase and eval-review workflow files', () => {
const workflows = ['ai-integration-phase', 'eval-review'];
for (const wf of workflows) {
test(`${wf}.md workflow exists`, () => {
assert.ok(
fs.existsSync(path.join(WORKFLOWS_DIR, `${wf}.md`)),
`workflows/${wf}.md should exist`
);
});
}
test('ai-integration-phase.md orchestrates 4 agents', () => {
const content = fs.readFileSync(path.join(WORKFLOWS_DIR, 'ai-integration-phase.md'), 'utf-8');
for (const agent of ['gsd-framework-selector', 'gsd-ai-researcher', 'gsd-domain-researcher', 'gsd-eval-planner']) {
assert.ok(content.includes(agent), `ai-integration-phase.md should reference ${agent}`);
}
});
test('eval-review.md references gsd-eval-auditor', () => {
const content = fs.readFileSync(path.join(WORKFLOWS_DIR, 'eval-review.md'), 'utf-8');
assert.ok(content.includes('gsd-eval-auditor'), 'eval-review.md should reference gsd-eval-auditor');
});
test('select-framework.md does NOT exist (removed per design)', () => {
assert.ok(
!fs.existsSync(path.join(WORKFLOWS_DIR, 'select-framework.md')),
'select-framework.md should not exist — removed in favour of ai-integration-phase nudge'
);
});
});


@@ -73,6 +73,7 @@ describe('config-field-docs', () => {
plan_checker: 'workflow.plan_check',
verifier: 'workflow.verifier',
nyquist_validation: 'workflow.nyquist_validation',
ai_integration_phase: 'workflow.ai_integration_phase',
text_mode: 'workflow.text_mode',
subagent_timeout: 'workflow.subagent_timeout',
branching_strategy: 'git.branching_strategy',


@@ -1180,6 +1180,7 @@ describe('E2E: Copilot full install verification', () => {
const gsdAgents = files.filter(f => f.startsWith('gsd-') && f.endsWith('.agent.md')).sort();
const expected = [
'gsd-advisor-researcher.agent.md',
'gsd-ai-researcher.agent.md',
'gsd-assumptions-analyzer.agent.md',
'gsd-code-fixer.agent.md',
'gsd-code-reviewer.agent.md',
@@ -1187,7 +1188,11 @@ describe('E2E: Copilot full install verification', () => {
'gsd-debugger.agent.md',
'gsd-doc-verifier.agent.md',
'gsd-doc-writer.agent.md',
'gsd-domain-researcher.agent.md',
'gsd-eval-auditor.agent.md',
'gsd-eval-planner.agent.md',
'gsd-executor.agent.md',
'gsd-framework-selector.agent.md',
'gsd-integration-checker.agent.md',
'gsd-intel-updater.agent.md',
'gsd-nyquist-auditor.agent.md',