mirror of
https://github.com/thedotmack/claude-mem
synced 2026-04-25 17:15:04 +02:00
* feat(evals): SWE-bench Docker scaffolding for claude-mem resolve-rate measurement
Adds evals/swebench/ scaffolding per .claude/plans/swebench-claude-mem-docker.md.
Agent image builds Claude Code 2.1.114 + locally-built claude-mem plugin;
run-instance.sh executes the two-turn ingest/fix protocol per instance;
run-batch.py orchestrates parallel Docker runs with per-instance isolation;
eval.sh wraps the upstream SWE-bench harness; summarize.py aggregates reports.
Orchestrator owns JSONL writes under a lock to avoid racy concurrent appends;
agent writes its authoritative diff to CLAUDE_MEM_OUTPUT_DIR (/scratch in
container mode) and the orchestrator reads it back. Scaffolding only — no
Docker build or smoke test run yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(evals): OAuth credential mounting for Claude Max/Pro subscriptions
Skips per-call API billing by extracting OAuth creds from host Keychain
(macOS) or ~/.claude/.credentials.json (Linux) and bind-mounting them
read-only into each agent container. Creds are copied into HOME=$SCRATCH/.claude
at container start so the per-instance isolation model still holds.
Adds run-batch.py --auth {oauth,api-key,auto} (auto prefers OAuth, falls
back to API key). run-instance.sh accepts either ANTHROPIC_API_KEY or
CLAUDE_MEM_CREDENTIALS_FILE. smoke-test.sh runs one instance end-to-end
using OAuth for quick verification before batch runs.
Caveat surfaced in docstrings: Max/Pro has per-window usage limits and is
framed for individual developer use — batch evaluation may exhaust the
quota or raise compliance questions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(docker): basic claude-mem container for ad-hoc testing
Adds docker/claude-mem/ with a fresh spin-up image:
- Dockerfile: FROM node:20 (reproduces anthropics/claude-code .devcontainer
pattern — Anthropic ships the Dockerfile, not a pullable image); layers
Bun + uv + locally-built plugin/; runs as non-root node user
- entrypoint.sh: seeds OAuth creds from CLAUDE_MEM_CREDENTIALS_FILE into
$HOME/.claude/.credentials.json, then exec's the command (default: bash)
- build.sh: npm run build + docker build
- run.sh: interactive launcher; auto-extracts OAuth from macOS Keychain
(security find-generic-password) or ~/.claude/.credentials.json on Linux,
mounts host .docker-claude-mem-data/ at /home/node/.claude-mem so the
observations DB survives container exit
Validated end-to-end: PostToolUse hook fires, queue enqueues, worker's SDK
compression runs under subscription OAuth, observations row lands with
populated facts/concepts/files_read, Chroma sync triggers.
Also updates .gitignore/.dockerignore for the new runtime-output paths.
Built plugin artifacts refreshed by the build step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(evals/swebench): non-root user, OAuth mount, Lite dataset default
- Dockerfile.agent: switch to non-root `node` user (uid 1000); Claude Code
refuses --permission-mode bypassPermissions when euid==0, which made every
agent run exit 1 before producing a diff. Also move Bun + uv installs to
system paths so the non-root user can exec them.
- run-batch.py: add extract_oauth_credentials() that pulls from macOS
Keychain / Linux ~/.claude/.credentials.json into a temp file and bind-
mounts it at /auth/.credentials.json:ro with CLAUDE_MEM_CREDENTIALS_FILE.
New --auth {oauth,api-key,auto} flag. New --dataset flag so the batch can
target SWE-bench_Lite without editing the script.
- smoke-test.sh: default DATASET to princeton-nlp/SWE-bench_Lite (Lite
contains sympy__sympy-24152, Verified does not); accept DATASET env
override.
Caveat surfaced during testing: Max/Pro subscriptions have per-window usage
limits; running 5 instances in parallel with the "read every source file"
ingest prompt exhausted the 5h window within ~25 minutes (3/5 hit HTTP 429).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address PR #2076 review comments
- docker/claude-mem/run.sh: chmod 600 (not 644) on extracted OAuth creds
to match what `claude login` writes; avoids exposing tokens to other
host users. Verified readable inside the container under Docker
Desktop's UID translation.
- docker/claude-mem/Dockerfile: pin Bun + uv via --build-arg BUN_VERSION
/ UV_VERSION (defaults: 1.3.12, 0.11.7). Bun via `bash -s "bun-v<V>"`;
uv via versioned installer URL `https://astral.sh/uv/<V>/install.sh`.
- evals/swebench/smoke-test.sh: pipe JSON through stdin to `python3 -c`
so paths with spaces/special chars can't break shell interpolation.
- evals/swebench/run-batch.py: add --overwrite flag; abort by default
when predictions.jsonl for the run-id already exists, preventing
accidental silent discard of partial results.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: address coderabbit review on PR #2076
Actionable (4):
- Dockerfile uv install: wrap `chmod ... || true` in braces so the trailing
`|| true` no longer masks failures from `curl|sh`. The shell gives && and ||
equal precedence and evaluates them left to right, so `a && b || true`
groups as `(a && b) || true`. Applied to both docker/claude-mem/
and evals/swebench/Dockerfile.agent. Added `set -eux` to the RUN lines.
- docker/claude-mem/Dockerfile: drop unused `sudo` apt package (~2 MB).
- run-batch.py: name each agent container (`swebench-agent-<id>-<pid>-<tid>`)
and force-remove via `docker rm -f <name>` in the TimeoutExpired handler
so timed-out runs don't leave orphan containers.
Nitpicks (2):
- smoke-test.sh: collapse 3 python3 invocations into 1 — parse the instance
JSON once, print `repo base_commit`, and write problem.txt in the same
call.
- run-instance.sh: shallow clone via `--depth 1 --no-single-branch` +
`fetch --depth 1 origin $BASE_COMMIT`. Falls back to a full clone if the
server rejects the by-commit fetch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
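The masking fixed by the braces can be reproduced with stand-in commands. This is a minimal sketch, not the real Dockerfile line: it assumes `false` plays the failing `curl|sh` installer and `true` plays the `chmod` that is allowed to fail.

```shell
#!/usr/bin/env bash
# Stand-ins (assumptions, not the real Dockerfile commands):
#   false -> the failing `curl ... | sh` installer
#   true  -> the `chmod` whose failure is tolerated

# Unbraced: groups as `(false && true) || true`, so the installer
# failure is swallowed and the list exits 0.
unbraced_status=0
( false && true || true ) || unbraced_status=$?

# Braced: `|| true` covers only the chmod stand-in, so the installer
# failure propagates out of the list.
braced_status=0
( false && { true || true; } ) || braced_status=$?

echo "unbraced=$unbraced_status braced=$braced_status"
# prints: unbraced=0 braced=1
```

With `set -eux` on the RUN line, the braced form would presumably abort the image build at the failing installer instead of continuing with a missing binary.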
* fix: address second coderabbit review on PR #2076
Actionable (3):
- docker/claude-mem/run.sh: on macOS, fall back to ~/.claude/.credentials.json
when the Keychain lookup misses (some setups still have file-only creds).
Unified into a single creds_obtained gate so the error surface lists both
sources tried.
- docker/claude-mem/run.sh: drop `exec docker run` — `exec` replaces the shell
so the EXIT trap (`rm -f "$CREDS_FILE"`) never fires and the extracted
OAuth JSON leaks to disk until tmpfs cleanup. Run as a child instead so
the trap runs on exit.
- evals/swebench/smoke-test.sh: actually enforce the TIMEOUT env var. Pick
`timeout` or `gtimeout` (coreutils on macOS), fall back to uncapped with
a warning. Name the container so exit-124 from timeout can `docker rm -f`
it deterministically.
Nitpick from the same review (consolidated python3 calls in smoke-test.sh)
was already addressed in the prior commit ef621e00.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
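The `exec` versus EXIT-trap interaction is easy to confirm in isolation. This is a hedged sketch: the temp files stand in for the extracted OAuth JSON, and `sleep 0` stands in for `docker run`; neither comes from run.sh itself.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins: temp files play the extracted credentials file,
# `sleep 0` plays `docker run`.
leaked="$(mktemp)"
cleaned="$(mktemp)"

# With exec: the shell process is replaced by `sleep`, so the EXIT trap
# registered just before it never gets a chance to run.
bash -c 'trap "rm -f \"$1\"" EXIT; exec sleep 0' _ "$leaked"

# As a child: the shell outlives the command, records its status, and
# fires the trap when it exits.
bash -c 'trap "rm -f \"$1\"" EXIT; sleep 0; status=$?' _ "$cleaned"

[[ -e "$leaked" ]] && leaked_state="still on disk" || leaked_state="removed"
[[ -e "$cleaned" ]] && cleaned_state="still on disk" || cleaned_state="removed"
echo "exec: $leaked_state; child: $cleaned_state"
# prints: exec: still on disk; child: removed

rm -f "$leaked"   # tidy up the file the exec variant left behind
```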
* fix: address third coderabbit review on PR #2076
Actionable (1):
- evals/swebench/smoke-test.sh: the consolidated python heredoc had competing
stdin redirections — `<<'PY'` (script body) AND `< "$INSTANCE_JSON"` (data).
The heredoc won, so `json.load(sys.stdin)` saw an empty stream and the parse
would have failed at runtime. Pass INSTANCE_JSON as argv[2] and `open()` it
inside the script instead; the heredoc is now only the script body, which
is what `python3 -` needs.
Nitpicks (2):
- evals/swebench/smoke-test.sh: macOS Keychain lookup now falls through to
~/.claude/.credentials.json on miss (matches docker/claude-mem/run.sh).
- evals/swebench/run-batch.py: extract_oauth_credentials() no longer
early-returns on Darwin keychain miss; falls through to the on-disk creds
file so macOS setups with file-only credentials work in batch mode too.
Functional spot-check of the parse fix confirmed: REPO/BASE_COMMIT populated
and problem.txt written from a synthetic INSTANCE_JSON.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
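The competing-redirection failure can be shown without Python. In this sketch `cat` stands in for `python3 -` and a temp file stands in for `$INSTANCE_JSON`; the order of the two redirections (file first, heredoc last) is an assumption chosen to match the "heredoc won" outcome described above.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins: `cat` plays `python3 -`, the temp file plays
# $INSTANCE_JSON.
data_file="$(mktemp)"
printf '%s\n' '{"repo": "example/repo"}' > "$data_file"

# Both redirections target fd 0. Bash applies them left to right, so the
# heredoc (listed last) replaces the file and the JSON is never read.
seen="$(cat < "$data_file" <<'EOF'
script body
EOF
)"
echo "stdin carried: $seen"
# prints: stdin carried: script body

# Passing the path as an argument leaves stdin free for the heredoc.
via_arg="$(cat "$data_file")"
echo "argv path read: $via_arg"
# prints: argv path read: {"repo": "example/repo"}

rm -f "$data_file"
```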
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
153 lines
5.5 KiB
Bash
Executable File
#!/usr/bin/env bash
set -euo pipefail

# smoke-test.sh — runs ONE SWE-bench instance end-to-end against the agent
# container using OAuth credentials extracted from the host. Use this to
# verify the two-turn protocol + /claude-mem:mem-search slash resolution
# before kicking off a batch run.
#
# Usage:
#   evals/swebench/smoke-test.sh [INSTANCE_ID]
#
# Defaults to sympy__sympy-24152 (an easy SWE-bench_Lite instance) if no arg given.
#
# Outputs:
#   evals/swebench/runs/smoke/<INSTANCE_ID>/{ingest.jsonl,fix.jsonl,model_patch.diff}
#   evals/swebench/runs/smoke/predictions.jsonl

INSTANCE_ID="${1:-sympy__sympy-24152}"
DATASET="${DATASET:-princeton-nlp/SWE-bench_Lite}"
IMAGE="${IMAGE:-claude-mem/swebench-agent:latest}"
TIMEOUT="${TIMEOUT:-1800}"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
RUN_DIR="$REPO_ROOT/evals/swebench/runs/smoke/$INSTANCE_ID"
PREDICTIONS="$REPO_ROOT/evals/swebench/runs/smoke/predictions.jsonl"
mkdir -p "$RUN_DIR" "$(dirname "$PREDICTIONS")"

# --- Extract OAuth credentials ---
CREDS_FILE="$(mktemp -t claude-mem-creds.XXXXXX.json)"
trap 'rm -f "$CREDS_FILE"' EXIT

# Try macOS Keychain first (primary on Darwin), then fall through to the
# on-disk credentials file — matches docker/claude-mem/run.sh behavior.
creds_obtained=0
if [[ "$(uname)" == "Darwin" ]]; then
  if security find-generic-password -s 'Claude Code-credentials' -w > "$CREDS_FILE" 2>/dev/null \
    && [[ -s "$CREDS_FILE" ]]; then
    creds_obtained=1
  fi
fi
if [[ "$creds_obtained" -eq 0 && -f "$HOME/.claude/.credentials.json" ]]; then
  cp "$HOME/.claude/.credentials.json" "$CREDS_FILE"
  creds_obtained=1
fi
if [[ "$creds_obtained" -eq 0 ]]; then
  echo "ERROR: no Claude OAuth creds found (macOS Keychain or ~/.claude/.credentials.json)" >&2
  exit 1
fi
chmod 600 "$CREDS_FILE"

# --- Fetch instance data from HuggingFace via a small Python helper ---
INSTANCE_JSON="$(mktemp)"
trap 'rm -f "$CREDS_FILE" "$INSTANCE_JSON"' EXIT
python3 - "$INSTANCE_ID" "$DATASET" > "$INSTANCE_JSON" <<'PY'
import json, sys
from datasets import load_dataset
target = sys.argv[1]
dataset = sys.argv[2]
ds = load_dataset(dataset, split="test")
for row in ds:
    if row["instance_id"] == target:
        print(json.dumps({
            "instance_id": row["instance_id"],
            "repo": row["repo"],
            "base_commit": row["base_commit"],
            "problem_statement": row["problem_statement"],
        }))
        break
else:
    print(f"ERROR: instance {target} not found", file=sys.stderr)
    sys.exit(1)
PY

SCRATCH="$(mktemp -d -t claude-mem-smoke.XXXXXX)"
trap 'rm -f "$CREDS_FILE" "$INSTANCE_JSON"; rm -rf "$SCRATCH"' EXIT

# Parse the instance JSON once: print repo + base_commit to stdout, write the
# problem statement directly to $SCRATCH/problem.txt. INSTANCE_JSON is passed
# as argv so stdin is free for the `python3 -` heredoc script body (previously
# both were competing for stdin, which made json.load see the heredoc's EOF).
read -r REPO BASE_COMMIT < <(
  python3 - "$SCRATCH" "$INSTANCE_JSON" <<'PY'
import json, os, sys
scratch, instance_json = sys.argv[1], sys.argv[2]
with open(instance_json) as f:
    d = json.load(f)
open(os.path.join(scratch, "problem.txt"), "w").write(d["problem_statement"])
print(d["repo"], d["base_commit"])
PY
)

echo "=== Running $INSTANCE_ID ($REPO @ $BASE_COMMIT) ===" >&2
echo "Scratch: $SCRATCH" >&2
echo "Logs will land in: $RUN_DIR" >&2

# Pick a wall-clock timeout binary. Linux ships `timeout`; macOS needs
# `gtimeout` from coreutils (brew install coreutils). If neither is available,
# warn and run without a cap — the smoke test is manual anyway.
TIMEOUT_CMD=()
if command -v timeout >/dev/null 2>&1; then
  TIMEOUT_CMD=(timeout "$TIMEOUT")
elif command -v gtimeout >/dev/null 2>&1; then
  TIMEOUT_CMD=(gtimeout "$TIMEOUT")
else
  echo "WARN: no \`timeout\`/\`gtimeout\` on PATH; container runs uncapped" >&2
fi

# Name the container so we can force-remove it if the wall-clock timeout
# fires (SIGTERM from timeout leaves the container state open briefly).
CONTAINER_NAME="claude-mem-smoke-$INSTANCE_ID-$$"

# The ${TIMEOUT_CMD[@]+...} guard keeps `set -u` from rejecting the empty
# array on bash older than 4.4 (for example the stock 3.2 on macOS).
set +e
${TIMEOUT_CMD[@]+"${TIMEOUT_CMD[@]}"} docker run --rm \
  --name "$CONTAINER_NAME" \
  -e CLAUDE_MEM_OUTPUT_DIR=/scratch \
  -e CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json \
  -v "$SCRATCH:/scratch" \
  -v "$CREDS_FILE:/auth/.credentials.json:ro" \
  "$IMAGE" \
  "$INSTANCE_ID" "$REPO" "$BASE_COMMIT" /scratch/problem.txt /scratch/ignored-predictions.jsonl
DOCKER_EXIT=$?
set -e

if [[ "$DOCKER_EXIT" -eq 124 ]]; then
  # `timeout` signals TERM and returns 124 on timeout. Force-remove the
  # container in case docker hasn't reaped it yet.
  echo "ERROR: docker run exceeded ${TIMEOUT}s wall-clock; removing container" >&2
  docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true
fi

# Copy artifacts from scratch → RUN_DIR
for f in ingest.jsonl fix.jsonl model_patch.diff; do
  [[ -f "$SCRATCH/$f" ]] && cp "$SCRATCH/$f" "$RUN_DIR/$f"
done

# Emit authoritative prediction row
DIFF_FILE="$SCRATCH/model_patch.diff"
DIFF=""
[[ -f "$DIFF_FILE" ]] && DIFF="$(cat "$DIFF_FILE")"
jq -nc \
  --arg id "$INSTANCE_ID" \
  --arg patch "$DIFF" \
  --arg model "claude-opus-4-7+claude-mem" \
  '{instance_id:$id, model_patch:$patch, model_name_or_path:$model}' \
  >> "$PREDICTIONS"

echo "=== Done ===" >&2
echo "Diff size: $(wc -c < "$DIFF_FILE" 2>/dev/null || echo 0) bytes" >&2
echo "Predictions: $PREDICTIONS" >&2
echo "Verify mem-search invocation:" >&2
echo "  grep -o '\"name\":\"[^\"]*mem-search[^\"]*\"' $RUN_DIR/fix.jsonl || echo 'NOT INVOKED'" >&2
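The timeout selection and the exit-124 check can be exercised without Docker. A minimal sketch, assuming `sleep 2` as a stand-in for `docker run` and a 1-second cap instead of `$TIMEOUT`; the lowercase variable names are illustrative only. The `${arr[@]+"${arr[@]}"}` expansion is a common guard, since bash releases before 4.4 treat an empty array as unset under `set -u`.

```shell
#!/usr/bin/env bash
set -u

# Pick whichever timeout binary exists; an empty array means "run uncapped".
timeout_cmd=()
if command -v timeout >/dev/null 2>&1; then
  timeout_cmd=(timeout 1)
elif command -v gtimeout >/dev/null 2>&1; then
  timeout_cmd=(gtimeout 1)
fi

# `sleep 2` stands in for `docker run` (assumption). The guarded expansion
# yields nothing when the array is empty instead of tripping `set -u`.
status=0
${timeout_cmd[@]+"${timeout_cmd[@]}"} sleep 2 || status=$?

if [[ "$status" -eq 124 ]]; then
  echo "timed out (exit 124); a real run would now 'docker rm -f' the named container"
else
  echo "finished with exit $status (no timeout binary, or the command beat the cap)"
fi
```

On a host with `timeout` or `gtimeout` installed, the 2-second sleep exceeds the 1-second cap and the first branch runs; without either binary the sleep completes and the status is 0.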