mirror of
https://github.com/thedotmack/claude-mem
synced 2026-04-25 17:15:04 +02:00
* feat(evals): SWE-bench Docker scaffolding for claude-mem resolve-rate measurement
Adds evals/swebench/ scaffolding per .claude/plans/swebench-claude-mem-docker.md.
Agent image builds Claude Code 2.1.114 + locally-built claude-mem plugin;
run-instance.sh executes the two-turn ingest/fix protocol per instance;
run-batch.py orchestrates parallel Docker runs with per-instance isolation;
eval.sh wraps the upstream SWE-bench harness; summarize.py aggregates reports.
Orchestrator owns JSONL writes under a lock to avoid racy concurrent appends;
agent writes its authoritative diff to CLAUDE_MEM_OUTPUT_DIR (/scratch in
container mode) and the orchestrator reads it back. Scaffolding only — no
Docker build or smoke test run yet.
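The orchestrator-owned, lock-guarded JSONL append can be sketched as below; the function and lock names are illustrative, not the repo's actual identifiers:

```python
import json
import threading
from pathlib import Path

# One lock per orchestrator process; workers hand finished records to the
# orchestrator instead of appending to predictions.jsonl themselves.
_predictions_lock = threading.Lock()

def append_prediction(path: Path, record: dict) -> None:
    # Serialize the whole append so concurrent results never interleave
    # mid-line; each record stays one valid JSON object per line.
    line = json.dumps(record, ensure_ascii=False)
    with _predictions_lock:
        with path.open("a", encoding="utf-8") as fh:
            fh.write(line + "\n")
```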
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(evals): OAuth credential mounting for Claude Max/Pro subscriptions
Skips per-call API billing by extracting OAuth creds from host Keychain
(macOS) or ~/.claude/.credentials.json (Linux) and bind-mounting them
read-only into each agent container. Creds are copied into HOME=$SCRATCH/.claude
at container start so the per-instance isolation model still holds.
Adds run-batch.py --auth {oauth,api-key,auto} (auto prefers OAuth, falls
back to API key). run-instance.sh accepts either ANTHROPIC_API_KEY or
CLAUDE_MEM_CREDENTIALS_FILE. smoke-test.sh runs one instance end-to-end
using OAuth for quick verification before batch runs.
Caveat surfaced in docstrings: Max/Pro has per-window usage limits and is
framed for individual developer use — batch evaluation may exhaust the
quota or raise compliance questions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(docker): basic claude-mem container for ad-hoc testing
Adds docker/claude-mem/ with a fresh spin-up image:
- Dockerfile: FROM node:20 (reproduces anthropics/claude-code .devcontainer
pattern — Anthropic ships the Dockerfile, not a pullable image); layers
Bun + uv + locally-built plugin/; runs as non-root node user
- entrypoint.sh: seeds OAuth creds from CLAUDE_MEM_CREDENTIALS_FILE into
$HOME/.claude/.credentials.json, then exec's the command (default: bash)
- build.sh: npm run build + docker build
- run.sh: interactive launcher; auto-extracts OAuth from macOS Keychain
(security find-generic-password) or ~/.claude/.credentials.json on Linux,
mounts host .docker-claude-mem-data/ at /home/node/.claude-mem so the
observations DB survives container exit
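The credential-seeding step in entrypoint.sh might look roughly like this sketch; the exact copy and permission details are assumptions from the description above:

```shell
seed_credentials() {
    # Copy the mounted (read-only) creds into the writable HOME so the
    # container can use them; 0600 to match what `claude login` writes.
    if [ -n "${CLAUDE_MEM_CREDENTIALS_FILE:-}" ] && [ -f "$CLAUDE_MEM_CREDENTIALS_FILE" ]; then
        mkdir -p "$HOME/.claude"
        cp "$CLAUDE_MEM_CREDENTIALS_FILE" "$HOME/.claude/.credentials.json"
        chmod 600 "$HOME/.claude/.credentials.json"
    fi
}
seed_credentials
# The real entrypoint then exec's the requested command (default: bash).
```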
Validated end-to-end: PostToolUse hook fires, queue enqueues, worker's SDK
compression runs under subscription OAuth, observations row lands with
populated facts/concepts/files_read, Chroma sync triggers.
Also updates .gitignore/.dockerignore for the new runtime-output paths.
Built plugin artifacts refreshed by the build step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(evals/swebench): non-root user, OAuth mount, Lite dataset default
- Dockerfile.agent: switch to non-root `node` user (uid 1000); Claude Code
refuses --permission-mode bypassPermissions when euid==0, which made every
agent run exit 1 before producing a diff. Also move Bun + uv installs to
system paths so the non-root user can exec them.
- run-batch.py: add extract_oauth_credentials() that pulls from macOS
Keychain / Linux ~/.claude/.credentials.json into a temp file and bind-
mounts it at /auth/.credentials.json:ro with CLAUDE_MEM_CREDENTIALS_FILE.
New --auth {oauth,api-key,auto} flag. New --dataset flag so the batch can
target SWE-bench_Lite without editing the script.
- smoke-test.sh: default DATASET to princeton-nlp/SWE-bench_Lite (Lite
contains sympy__sympy-24152, Verified does not); accept DATASET env
override.
Caveat surfaced during testing: Max/Pro subscriptions have per-window usage
limits; running 5 instances in parallel with the "read every source file"
ingest prompt exhausted the 5h window within ~25 minutes (3/5 hit HTTP 429).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address PR #2076 review comments
- docker/claude-mem/run.sh: chmod 600 (not 644) on extracted OAuth creds
to match what `claude login` writes; avoids exposing tokens to other
host users. Verified readable inside the container under Docker
Desktop's UID translation.
- docker/claude-mem/Dockerfile: pin Bun + uv via --build-arg BUN_VERSION
/ UV_VERSION (defaults: 1.3.12, 0.11.7). Bun via `bash -s "bun-v<V>"`;
uv via versioned installer URL `https://astral.sh/uv/<V>/install.sh`.
- evals/swebench/smoke-test.sh: pipe JSON through stdin to `python3 -c`
so paths with spaces/special chars can't break shell interpolation.
- evals/swebench/run-batch.py: add --overwrite flag; abort by default
when predictions.jsonl for the run-id already exists, preventing
accidental silent discard of partial results.
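The overwrite guard can be as small as the following sketch; the function name and message text are illustrative, not run-batch.py's actual code:

```python
from pathlib import Path

def guard_predictions_path(predictions_path: Path, overwrite: bool) -> None:
    # Abort by default so an existing run's partial results aren't
    # silently clobbered; --overwrite opts in explicitly.
    if predictions_path.exists() and not overwrite:
        raise SystemExit(
            f"{predictions_path} already exists; pass --overwrite to replace it"
        )
```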
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address coderabbit review on PR #2076
Actionable (4):
- Dockerfile uv install: wrap `chmod ... || true` in braces so the trailing
`|| true` no longer masks failures from `curl|sh`: in shell, `&&` and `||`
have equal precedence and group left-to-right, so `a && b || true` parses as
`(a && b) || true`. Applied to both docker/claude-mem/ and
evals/swebench/Dockerfile.agent. Added `set -eux` to the RUN lines.
- docker/claude-mem/Dockerfile: drop unused `sudo` apt package (~2 MB).
- run-batch.py: name each agent container (`swebench-agent-<id>-<pid>-<tid>`)
and force-remove via `docker rm -f <name>` in the TimeoutExpired handler
so timed-out runs don't leave orphan containers.
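The `|| true` grouping pitfall above is easy to reproduce in isolation (toy commands, not the Dockerfile's actual install lines):

```shell
install_step() { return 1; }   # stand-in for a failing `curl | sh`

# Unbraced: `a && b || true` groups as `(a && b) || true`, so the failing
# install is masked and the compound command exits 0.
install_step && chmod +x /tmp/tool || true
echo "unbraced exit: $?"       # prints: unbraced exit: 0

# Braced: `|| true` now applies only to the chmod, so the install's
# failure still propagates.
status=0
install_step && { chmod +x /tmp/tool || true; } || status=$?
echo "braced exit: $status"    # prints: braced exit: 1
```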
Nitpicks (2):
- smoke-test.sh: collapse 3 python3 invocations into 1 — parse the instance
JSON once, print `repo base_commit`, and write problem.txt in the same
call.
- run-instance.sh: shallow clone via `--depth 1 --no-single-branch` +
`fetch --depth 1 origin $BASE_COMMIT`. Falls back to a full clone if the
server rejects the by-commit fetch.
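The clone flow described above could be sketched as follows; the function name is illustrative, and the fallback is shown as an unshallow fetch rather than the fresh full clone the script actually performs:

```shell
clone_at_commit() {
    # Shallow clone, then try to fetch just the pinned commit; some git
    # servers refuse fetch-by-commit, so fall back to full history.
    local repo_url="$1" base_commit="$2" dest="$3"
    git clone --depth 1 --no-single-branch "$repo_url" "$dest"
    if ! git -C "$dest" fetch --depth 1 origin "$base_commit"; then
        git -C "$dest" fetch --unshallow origin
    fi
    git -C "$dest" checkout -q "$base_commit"
}
```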
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address second coderabbit review on PR #2076
Actionable (3):
- docker/claude-mem/run.sh: on macOS, fall back to ~/.claude/.credentials.json
when the Keychain lookup misses (some setups still have file-only creds).
Unified into a single creds_obtained gate so the error surface lists both
sources tried.
- docker/claude-mem/run.sh: drop `exec docker run` — `exec` replaces the shell
so the EXIT trap (`rm -f "$CREDS_FILE"`) never fires and the extracted
OAuth JSON leaks to disk until tmpfs cleanup. Run as a child instead so
the trap runs on exit.
- evals/swebench/smoke-test.sh: actually enforce the TIMEOUT env var. Pick
`timeout` or `gtimeout` (coreutils on macOS), fall back to uncapped with
a warning. Name the container so exit-124 from timeout can `docker rm -f`
it deterministically.
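The `exec`-versus-EXIT-trap point above can be seen with a toy script, where a temp file stands in for the extracted OAuth JSON:

```shell
# A child shell sets an EXIT trap, then exits normally: the trap fires
# and the temp file is cleaned up.
kept=$(bash -c 'f=$(mktemp); trap '\''rm -f "$f"'\'' EXIT; echo "$f"')
ls "$kept" 2>/dev/null || echo "cleaned up"     # prints: cleaned up

# The same, but ending in `exec`: the shell image is replaced, the EXIT
# trap never runs, and the file is left behind.
leaked=$(bash -c 'f=$(mktemp); trap '\''rm -f "$f"'\'' EXIT; echo "$f"; exec true')
ls "$leaked" >/dev/null && echo "leaked"        # prints: leaked
rm -f "$leaked"
```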
Nitpick from the same review (consolidated python3 calls in smoke-test.sh)
was already addressed in the prior commit ef621e00.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address third coderabbit review on PR #2076
Actionable (1):
- evals/swebench/smoke-test.sh: the consolidated python heredoc had competing
stdin redirections — `<<'PY'` (script body) AND `< "$INSTANCE_JSON"` (data).
The heredoc won, so `json.load(sys.stdin)` saw an empty stream and the parse
would have failed at runtime. Pass INSTANCE_JSON as argv[2] and `open()` it
inside the script instead; the heredoc is now only the script body, which
is what `python3 -` needs.
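A minimal repro of the fixed shape (file path and JSON keys are illustrative; the real script passes the file at a different argv position):

```shell
instance_json=$(mktemp)
printf '%s' '{"repo": "example/repo", "base_commit": "abc123"}' > "$instance_json"

# Broken shape (sketch): `python3 - <<'PY' ... PY < "$instance_json"` wires two
# sources to stdin; only one wins, so either the script body or the data is lost.

# Fixed: stdin carries only the script body; the data file arrives via argv.
python3 - "$instance_json" <<'PY'
import json, sys
with open(sys.argv[1], encoding="utf-8") as fh:
    instance = json.load(fh)
print(instance["repo"], instance["base_commit"])
PY
```

Running this prints `example/repo abc123`, confirming the heredoc and the data file no longer compete for stdin.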
Nitpicks (2):
- evals/swebench/smoke-test.sh: macOS Keychain lookup now falls through to
~/.claude/.credentials.json on miss (matches docker/claude-mem/run.sh).
- evals/swebench/run-batch.py: extract_oauth_credentials() no longer
early-returns on Darwin keychain miss; falls through to the on-disk creds
file so macOS setups with file-only credentials work in batch mode too.
Functional spot-check of the parse fix confirmed: REPO/BASE_COMMIT populated
and problem.txt written from a synthetic INSTANCE_JSON.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
309 lines
10 KiB
Python
Executable File
#!/usr/bin/env python3
"""Summarize SWE-bench evaluation run results.

Walks the SWE-bench harness output directory, tallies resolved/unresolved/error
counts, and emits a markdown summary. Optionally diffs against another run.
"""

import argparse
import json
import sys
from pathlib import Path


def load_expected_instance_ids(predictions_path: Path) -> list[str]:
    """Read instance_ids from a predictions.jsonl file (one JSON object per line)."""
    instance_ids: list[str] = []
    if not predictions_path.exists():
        print(
            f"warning: predictions file not found: {predictions_path}",
            file=sys.stderr,
        )
        return instance_ids
    with predictions_path.open("r", encoding="utf-8") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            stripped = raw_line.strip()
            if not stripped:
                continue
            try:
                record = json.loads(stripped)
            except json.JSONDecodeError as exc:
                print(
                    f"warning: could not parse predictions line {line_number}: {exc}",
                    file=sys.stderr,
                )
                continue
            instance_id = record.get("instance_id")
            if instance_id:
                instance_ids.append(instance_id)
    return instance_ids


def load_run_results(
    run_id: str,
    model_name: str,
    expected_instance_ids: list[str],
    repo_root: Path,
) -> dict:
    """Walk logs/run_evaluation/<run_id>/<model_name>/*/report.json and tally results.

    Returns a dict:
        {
            "per_instance": {instance_id: {"resolved": bool|None, "notes": str}},
            "resolved_count": int,
            "unresolved_count": int,
            "error_count": int,
        }
    """
    run_logs_root = repo_root / "logs" / "run_evaluation" / run_id / model_name
    per_instance: dict[str, dict] = {}
    resolved_count = 0
    unresolved_count = 0
    error_count = 0

    for instance_id in expected_instance_ids:
        report_path = run_logs_root / instance_id / "report.json"
        if not report_path.exists():
            per_instance[instance_id] = {
                "resolved": None,
                "notes": "missing report.json",
            }
            error_count += 1
            continue
        try:
            with report_path.open("r", encoding="utf-8") as handle:
                report_data = json.load(handle)
        except (json.JSONDecodeError, OSError) as exc:
            per_instance[instance_id] = {
                "resolved": None,
                "notes": f"failed to parse report.json: {exc}",
            }
            error_count += 1
            continue

        # SWE-bench harness typically nests per-instance data under the
        # instance_id key; fall back to the top-level dict for flexibility.
        inner = report_data.get(instance_id, report_data)
        resolved_value = inner.get("resolved")
        if resolved_value is True:
            per_instance[instance_id] = {"resolved": True, "notes": ""}
            resolved_count += 1
        elif resolved_value is False:
            notes_parts: list[str] = []
            tests_status = inner.get("tests_status")
            if isinstance(tests_status, dict):
                fail_to_pass = tests_status.get("FAIL_TO_PASS", {})
                if isinstance(fail_to_pass, dict):
                    failed = fail_to_pass.get("failure", []) or []
                    if failed:
                        notes_parts.append(f"FAIL_TO_PASS failures: {len(failed)}")
            per_instance[instance_id] = {
                "resolved": False,
                "notes": "; ".join(notes_parts),
            }
            unresolved_count += 1
        else:
            per_instance[instance_id] = {
                "resolved": None,
                "notes": "report.json missing 'resolved' field",
            }
            error_count += 1

    return {
        "per_instance": per_instance,
        "resolved_count": resolved_count,
        "unresolved_count": unresolved_count,
        "error_count": error_count,
    }


def format_resolved_cell(resolved: bool | None) -> str:
    if resolved is True:
        return "yes"
    if resolved is False:
        return "no"
    return "error"


def render_summary_markdown(run_id: str, results: dict) -> str:
    total = (
        results["resolved_count"]
        + results["unresolved_count"]
        + results["error_count"]
    )
    resolved = results["resolved_count"]
    resolve_rate = (resolved / total * 100.0) if total > 0 else 0.0

    lines: list[str] = []
    lines.append(f"# Run {run_id}")
    lines.append(f"- Total: {total}")
    lines.append(f"- Resolved: {resolved} ({resolve_rate:.2f}%)")
    lines.append(f"- Unresolved: {results['unresolved_count']}")
    lines.append(f"- Errors: {results['error_count']}")
    lines.append("")
    lines.append("## Per-instance")
    lines.append("| instance_id | resolved | notes |")
    lines.append("|---|---|---|")
    for instance_id, record in results["per_instance"].items():
        resolved_cell = format_resolved_cell(record["resolved"])
        notes_cell = record.get("notes", "") or ""
        # Escape pipe chars in notes to avoid breaking markdown tables.
        notes_cell = notes_cell.replace("|", "\\|")
        lines.append(f"| {instance_id} | {resolved_cell} | {notes_cell} |")
    lines.append("")
    return "\n".join(lines)


def render_diff_markdown(
    current_run_id: str,
    other_run_id: str,
    current_results: dict,
    other_results: dict,
) -> str:
    def resolve_rate(results: dict) -> tuple[int, float]:
        total = (
            results["resolved_count"]
            + results["unresolved_count"]
            + results["error_count"]
        )
        rate = (results["resolved_count"] / total * 100.0) if total > 0 else 0.0
        return total, rate

    current_total, current_rate = resolve_rate(current_results)
    other_total, other_rate = resolve_rate(other_results)
    rate_delta = current_rate - other_rate

    lines: list[str] = []
    lines.append(f"# Diff vs {other_run_id}")
    lines.append(
        f"- {current_run_id}: {current_results['resolved_count']}/{current_total} "
        f"({current_rate:.2f}%)"
    )
    lines.append(
        f"- {other_run_id}: {other_results['resolved_count']}/{other_total} "
        f"({other_rate:.2f}%)"
    )
    lines.append(f"- Delta: {rate_delta:+.2f} percentage points")
    lines.append("")
    lines.append("## Per-instance status changes")
    lines.append(f"| instance_id | {other_run_id} | {current_run_id} |")
    lines.append("|---|---|---|")

    all_instance_ids = set(current_results["per_instance"].keys()) | set(
        other_results["per_instance"].keys()
    )
    changes_found = False
    for instance_id in sorted(all_instance_ids):
        current_record = current_results["per_instance"].get(instance_id)
        other_record = other_results["per_instance"].get(instance_id)
        current_status = (
            format_resolved_cell(current_record["resolved"])
            if current_record
            else "absent"
        )
        other_status = (
            format_resolved_cell(other_record["resolved"])
            if other_record
            else "absent"
        )
        if current_status != other_status:
            lines.append(f"| {instance_id} | {other_status} | {current_status} |")
            changes_found = True
    if not changes_found:
        lines.append("| (no status changes) | | |")
    lines.append("")
    return "\n".join(lines)


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Summarize SWE-bench evaluation run results."
    )
    parser.add_argument(
        "--run-id",
        required=True,
        help="Run identifier used in logs/run_evaluation/<run_id>/ and evals/swebench/runs/<run_id>/.",
    )
    parser.add_argument(
        "--compare",
        metavar="OTHER_RUN_ID",
        default=None,
        help="Optional other run_id to diff resolve rates and per-instance status changes against.",
    )
    parser.add_argument(
        "--model-name",
        default="claude-opus-4-7+claude-mem",
        help="Model name directory inside logs/run_evaluation/<run_id>/.",
    )
    parser.add_argument(
        "--out",
        default=None,
        help="Output path for the markdown summary (default: evals/swebench/runs/<run_id>/summary.md).",
    )
    args = parser.parse_args()

    # Resolve repo root from this script's location: evals/swebench/summarize.py
    script_path = Path(__file__).resolve()
    repo_root = script_path.parent.parent.parent

    current_predictions_path = (
        repo_root / "evals" / "swebench" / "runs" / args.run_id / "predictions.jsonl"
    )
    current_instance_ids = load_expected_instance_ids(current_predictions_path)
    current_results = load_run_results(
        run_id=args.run_id,
        model_name=args.model_name,
        expected_instance_ids=current_instance_ids,
        repo_root=repo_root,
    )

    summary_markdown = render_summary_markdown(args.run_id, current_results)

    if args.compare:
        other_predictions_path = (
            repo_root
            / "evals"
            / "swebench"
            / "runs"
            / args.compare
            / "predictions.jsonl"
        )
        other_instance_ids = load_expected_instance_ids(other_predictions_path)
        other_results = load_run_results(
            run_id=args.compare,
            model_name=args.model_name,
            expected_instance_ids=other_instance_ids,
            repo_root=repo_root,
        )
        diff_markdown = render_diff_markdown(
            current_run_id=args.run_id,
            other_run_id=args.compare,
            current_results=current_results,
            other_results=other_results,
        )
        summary_markdown = summary_markdown + "\n" + diff_markdown

    if args.out:
        output_path = Path(args.out)
        if not output_path.is_absolute():
            output_path = (Path.cwd() / output_path).resolve()
    else:
        output_path = (
            repo_root
            / "evals"
            / "swebench"
            / "runs"
            / args.run_id
            / "summary.md"
        )

    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(summary_markdown, encoding="utf-8")

    print(str(output_path))
    return 0


if __name__ == "__main__":
    raise SystemExit(main())