claude-mem/evals/swebench/run-batch.py
Alex Newman 97c7c999b1 feat: basic claude-mem Docker container for easy spin-up (#2076)
* feat(evals): SWE-bench Docker scaffolding for claude-mem resolve-rate measurement

Adds evals/swebench/ scaffolding per .claude/plans/swebench-claude-mem-docker.md.
Agent image builds Claude Code 2.1.114 + locally-built claude-mem plugin;
run-instance.sh executes the two-turn ingest/fix protocol per instance;
run-batch.py orchestrates parallel Docker runs with per-instance isolation;
eval.sh wraps the upstream SWE-bench harness; summarize.py aggregates reports.

Orchestrator owns JSONL writes under a lock to avoid racy concurrent appends;
agent writes its authoritative diff to CLAUDE_MEM_OUTPUT_DIR (/scratch in
container mode) and the orchestrator reads it back. Scaffolding only — no
Docker build or smoke test run yet.
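
The lock-owned append protocol can be sketched as follows (illustrative names, not the exact helpers in run-batch.py):

```python
import json
import threading
from pathlib import Path

def append_jsonl_row(path: Path, row: dict, lock: threading.Lock) -> None:
    # Serialize outside the lock; hold it only for the actual append so
    # concurrent workers cannot interleave partial lines.
    line = json.dumps(row, ensure_ascii=False) + "\n"
    with lock:
        with path.open("a", encoding="utf-8") as fp:
            fp.write(line)

# Demo: eight threads appending concurrently to one file.
out = Path("demo-predictions.jsonl")
out.write_text("", encoding="utf-8")
lock = threading.Lock()
threads = [
    threading.Thread(
        target=append_jsonl_row,
        args=(out, {"instance_id": f"demo-{i}", "model_patch": ""}, lock),
    )
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
rows = [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]
```

Keeping a single writer process (plus the lock across its threads) sidesteps any reliance on OS-level append atomicity.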

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(evals): OAuth credential mounting for Claude Max/Pro subscriptions

Skips per-call API billing by extracting OAuth creds from host Keychain
(macOS) or ~/.claude/.credentials.json (Linux) and bind-mounting them
read-only into each agent container. Creds are copied into HOME=$SCRATCH/.claude
at container start so the per-instance isolation model still holds.

Adds run-batch.py --auth {oauth,api-key,auto} (auto prefers OAuth, falls
back to API key). run-instance.sh accepts either ANTHROPIC_API_KEY or
CLAUDE_MEM_CREDENTIALS_FILE. smoke-test.sh runs one instance end-to-end
using OAuth for quick verification before batch runs.

Caveat surfaced in docstrings: Max/Pro has per-window usage limits and is
framed for individual developer use — batch evaluation may exhaust the
quota or raise compliance questions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(docker): basic claude-mem container for ad-hoc testing

Adds docker/claude-mem/ with a fresh spin-up image:
- Dockerfile: FROM node:20 (reproduces anthropics/claude-code .devcontainer
  pattern — Anthropic ships the Dockerfile, not a pullable image); layers
  Bun + uv + locally-built plugin/; runs as non-root node user
- entrypoint.sh: seeds OAuth creds from CLAUDE_MEM_CREDENTIALS_FILE into
  $HOME/.claude/.credentials.json, then exec's the command (default: bash)
- build.sh: npm run build + docker build
- run.sh: interactive launcher; auto-extracts OAuth from macOS Keychain
  (security find-generic-password) or ~/.claude/.credentials.json on Linux,
  mounts host .docker-claude-mem-data/ at /home/node/.claude-mem so the
  observations DB survives container exit

Validated end-to-end: PostToolUse hook fires, queue enqueues, worker's SDK
compression runs under subscription OAuth, observations row lands with
populated facts/concepts/files_read, Chroma sync triggers.

Also updates .gitignore/.dockerignore for the new runtime-output paths.
Built plugin artifacts refreshed by the build step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(evals/swebench): non-root user, OAuth mount, Lite dataset default

- Dockerfile.agent: switch to non-root `node` user (uid 1000); Claude Code
  refuses --permission-mode bypassPermissions when euid==0, which made every
  agent run exit 1 before producing a diff. Also move Bun + uv installs to
  system paths so the non-root user can exec them.
- run-batch.py: add extract_oauth_credentials() that pulls from macOS
  Keychain / Linux ~/.claude/.credentials.json into a temp file and bind-
  mounts it at /auth/.credentials.json:ro with CLAUDE_MEM_CREDENTIALS_FILE.
  New --auth {oauth,api-key,auto} flag. New --dataset flag so the batch can
  target SWE-bench_Lite without editing the script.
- smoke-test.sh: default DATASET to princeton-nlp/SWE-bench_Lite (Lite
  contains sympy__sympy-24152, Verified does not); accept DATASET env
  override.

Caveat surfaced during testing: Max/Pro subscriptions have per-window usage
limits; running 5 instances in parallel with the "read every source file"
ingest prompt exhausted the 5h window within ~25 minutes (3/5 hit HTTP 429).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address PR #2076 review comments

- docker/claude-mem/run.sh: chmod 600 (not 644) on extracted OAuth creds
  to match what `claude login` writes; avoids exposing tokens to other
  host users. Verified readable inside the container under Docker
  Desktop's UID translation.
- docker/claude-mem/Dockerfile: pin Bun + uv via --build-arg BUN_VERSION
  / UV_VERSION (defaults: 1.3.12, 0.11.7). Bun via `bash -s "bun-v<V>"`;
  uv via versioned installer URL `https://astral.sh/uv/<V>/install.sh`.
- evals/swebench/smoke-test.sh: pipe JSON through stdin to `python3 -c`
  so paths with spaces/special chars can't break shell interpolation.
- evals/swebench/run-batch.py: add --overwrite flag; abort by default
  when predictions.jsonl for the run-id already exists, preventing
  accidental silent discard of partial results.
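
In sketch form, the abort-by-default gate looks like this (hypothetical helper name):

```python
from pathlib import Path

def may_write_predictions(path: Path, overwrite: bool) -> bool:
    """Return True if the run may proceed.

    A non-empty predictions file is never touched unless overwrite=True,
    in which case it is truncated explicitly up front rather than being
    clobbered mid-run.
    """
    if path.exists() and path.stat().st_size > 0:
        if not overwrite:
            return False  # abort: protect partial results from a prior run
        path.write_text("", encoding="utf-8")  # deliberate truncation
    return True
```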

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address coderabbit review on PR #2076

Actionable (4):
- Dockerfile uv install: wrap `chmod ... || true` in braces so the trailing
  `|| true` no longer masks failures from `curl|sh`. (`&&` and `||` have
  equal precedence and associate left-to-right in bash, so `a && b || c`
  parses as `(a && b) || c` and the `|| true` swallowed any failure in the
  pipeline.) Applied to both docker/claude-mem/ and
  evals/swebench/Dockerfile.agent. Added `set -eux` to the RUN lines.
- docker/claude-mem/Dockerfile: drop unused `sudo` apt package (~2 MB).
- run-batch.py: name each agent container (`swebench-agent-<id>-<pid>-<tid>`)
  and force-remove via `docker rm -f <name>` in the TimeoutExpired handler
  so timed-out runs don't leave orphan containers.
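
The timeout-cleanup pattern generalizes to any detached child resource; a hedged sketch, with a generic cleanup command standing in for `docker rm -f <name>`:

```python
import subprocess

def run_with_cleanup(cmd: list[str], timeout_s: float, cleanup_cmd: list[str]) -> str:
    """Run `cmd`; on timeout, run `cleanup_cmd` (generic sketch).

    subprocess.run only kills its direct child (here, the docker CLI in
    run-batch.py), so anything the child spawned out-of-band (the
    container) needs explicit reclamation by name.
    """
    try:
        subprocess.run(cmd, timeout=timeout_s, capture_output=True, check=False)
        return "completed"
    except subprocess.TimeoutExpired:
        subprocess.run(cleanup_cmd, capture_output=True, check=False, timeout=30)
        return "timed_out"
```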

Nitpicks (2):
- smoke-test.sh: collapse 3 python3 invocations into 1 — parse the instance
  JSON once, print `repo base_commit`, and write problem.txt in the same
  call.
- run-instance.sh: shallow clone via `--depth 1 --no-single-branch` +
  `fetch --depth 1 origin $BASE_COMMIT`. Falls back to a full clone if the
  server rejects the by-commit fetch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address second coderabbit review on PR #2076

Actionable (3):
- docker/claude-mem/run.sh: on macOS, fall back to ~/.claude/.credentials.json
  when the Keychain lookup misses (some setups still have file-only creds).
  Unified into a single creds_obtained gate so the error surface lists both
  sources tried.
- docker/claude-mem/run.sh: drop `exec docker run` — `exec` replaces the shell
  so the EXIT trap (`rm -f "$CREDS_FILE"`) never fires and the extracted
  OAuth JSON leaks to disk until tmpfs cleanup. Run as a child instead so
  the trap runs on exit.
- evals/swebench/smoke-test.sh: actually enforce the TIMEOUT env var. Pick
  `timeout` or `gtimeout` (coreutils on macOS), fall back to uncapped with
  a warning. Name the container so exit-124 from timeout can `docker rm -f`
  it deterministically.

Nitpick from the same review (consolidated python3 calls in smoke-test.sh)
was already addressed in the prior commit ef621e00.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address third coderabbit review on PR #2076

Actionable (1):
- evals/swebench/smoke-test.sh: the consolidated python heredoc had competing
  stdin redirections — `<<'PY'` (script body) AND `< "$INSTANCE_JSON"` (data).
  The heredoc won, so `json.load(sys.stdin)` saw an empty stream and the parse
  would have failed at runtime. Pass INSTANCE_JSON as argv[2] and `open()` it
  inside the script instead; the heredoc is now only the script body, which
  is what `python3 -` needs.
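
The fixed invocation shape (script body on stdin, data file via argv) can be illustrated with a sketch like this (`parse_instance` is a made-up name; field names follow the SWE-bench schema):

```python
import json
from pathlib import Path

def parse_instance(json_path: str, out_dir: str) -> tuple[str, str]:
    """Single-pass instance parse.

    The instance file path arrives as an argument and is opened here, so
    nothing competes with stdin; that is exactly why the shell-side fix
    hands the path to `python3 -` via argv instead of `<` redirection.
    """
    inst = json.loads(Path(json_path).read_text(encoding="utf-8"))
    (Path(out_dir) / "problem.txt").write_text(
        inst["problem_statement"], encoding="utf-8"
    )
    return inst["repo"], inst["base_commit"]
```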

Nitpicks (2):
- evals/swebench/smoke-test.sh: macOS Keychain lookup now falls through to
  ~/.claude/.credentials.json on miss (matches docker/claude-mem/run.sh).
- evals/swebench/run-batch.py: extract_oauth_credentials() no longer
  early-returns on Darwin keychain miss; falls through to the on-disk creds
  file so macOS setups with file-only credentials work in batch mode too.

Functional spot-check of the parse fix confirmed: REPO/BASE_COMMIT populated
and problem.txt written from a synthetic INSTANCE_JSON.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:34:30 -07:00

562 lines
19 KiB
Python
Executable File

#!/usr/bin/env python3
"""
Batch orchestrator for SWE-bench evaluation of Claude Code + claude-mem.

Iterates a list of SWE-bench Verified instances, launches a per-instance Docker
container (`claude-mem/swebench-agent:latest`) that runs the two-turn
ingest/fix protocol, and collects all resulting diffs into a single
`predictions.jsonl` compatible with the upstream SWE-bench harness.

Usage:
    python evals/swebench/run-batch.py \
        --run-id claude-mem-baseline-001 \
        --limit 3 \
        --max-concurrent 2

Rate-limit note: Anthropic API rate limits can bite quickly. The default
`--max-concurrent` is 4, but it is safer to START WITH 2 and raise the cap
only after observing no 429s in the logs.
"""
from __future__ import annotations
import argparse
import atexit
import json
import os
import platform
import shutil
import stat
import subprocess
import sys
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Any, Iterable
from datasets import load_dataset

# Hidden-from-agent fields per the plan. We MUST NOT pass these to the agent
# container — they are evaluator-only ground truth.
HIDDEN_AGENT_FIELDS = (
    "patch",
    "test_patch",
    "FAIL_TO_PASS",
    "PASS_TO_PASS",
    "environment_setup_commit",
    "version",
)


def extract_oauth_credentials() -> Path | None:
    """
    Extract Claude Code OAuth credentials (from a Max/Pro subscription) to a
    temp file the container can bind-mount. Returns the temp file path, or
    None if extraction failed / no creds present.

    macOS: creds live in the Keychain under service "Claude Code-credentials".
    Linux: creds live at ~/.claude/.credentials.json.

    CAVEAT: Anthropic Max/Pro subscriptions have usage limits (per ~5h window)
    and their ToS is framed around individual developer use. Running batch
    evaluation across parallel containers may exhaust the quota quickly or
    raise compliance concerns. This helper exists because the user explicitly
    requested it; the caller is responsible for the policy call.

    The token may age out mid-run; we mount read-only so refresh writes fail
    silently inside the container (the underlying token in the host
    Keychain/file is untouched).
    """
    temp = tempfile.NamedTemporaryFile(
        prefix="claude-mem-creds-",
        suffix=".json",
        delete=False,
    )
    temp_path = Path(temp.name)
    temp.close()
    # Clean up on process exit, even on crash.
    atexit.register(lambda: temp_path.unlink(missing_ok=True))

    # macOS: try Keychain first (primary storage on Darwin). On miss, fall
    # through to the on-disk credentials file — some macOS setups (older CLI,
    # migrated machines) only have the file form.
    if platform.system() == "Darwin":
        try:
            completed = subprocess.run(
                [
                    "security",
                    "find-generic-password",
                    "-s",
                    "Claude Code-credentials",
                    "-w",
                ],
                capture_output=True,
                text=True,
                check=False,
            )
            if completed.returncode == 0 and completed.stdout.strip():
                temp_path.write_text(completed.stdout.strip(), encoding="utf-8")
                temp_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
                return temp_path
            # else fall through to the on-disk credentials check below
        except FileNotFoundError:
            print(
                "WARN: `security` command not available; trying on-disk creds.",
                file=sys.stderr,
            )
            # fall through to the on-disk credentials check below

    # Both platforms (and macOS fallback): read the on-disk credentials file.
    creds_file = Path.home() / ".claude" / ".credentials.json"
    if creds_file.exists():
        temp_path.write_text(creds_file.read_text(encoding="utf-8"), encoding="utf-8")
        temp_path.chmod(stat.S_IRUSR | stat.S_IWUSR)
        return temp_path

    if platform.system() == "Darwin":
        print(
            "WARN: Claude Code-credentials not found in macOS Keychain and "
            "~/.claude/.credentials.json missing. Run `claude login` on the "
            "host first, or fall back to ANTHROPIC_API_KEY.",
            file=sys.stderr,
        )
    return None


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Run the claude-mem SWE-bench agent on a batch of instances.",
    )
    parser.add_argument(
        "--instance-ids",
        nargs="+",
        default=None,
        help="Optional explicit list of instance_ids to run.",
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=None,
        help="If set, process only the first N instances after filtering.",
    )
    parser.add_argument(
        "--max-concurrent",
        type=int,
        default=4,
        help="Max concurrent agent containers (default 4; start with 2 and raise after observing no 429s).",
    )
    parser.add_argument(
        "--run-id",
        type=str,
        required=True,
        help="Run identifier; used for output paths.",
    )
    parser.add_argument(
        "--out",
        type=str,
        default=None,
        help="Path to predictions.jsonl (default: evals/swebench/runs/<run_id>/predictions.jsonl).",
    )
    parser.add_argument(
        "--timeout",
        type=int,
        default=1800,
        help="Per-instance timeout in seconds (default 1800, matches upstream harness).",
    )
    parser.add_argument(
        "--image",
        type=str,
        default="claude-mem/swebench-agent:latest",
        help="Agent Docker image tag.",
    )
    parser.add_argument(
        "--dataset",
        type=str,
        default="princeton-nlp/SWE-bench_Verified",
        help="HuggingFace dataset name (e.g. princeton-nlp/SWE-bench_Lite, default Verified).",
    )
    parser.add_argument(
        "--auth",
        choices=["oauth", "api-key", "auto"],
        default="auto",
        help=(
            "Auth mode. 'oauth' extracts Claude Max/Pro creds from host "
            "Keychain (macOS) or ~/.claude/.credentials.json (Linux). "
            "'api-key' uses ANTHROPIC_API_KEY env. 'auto' prefers oauth, "
            "falls back to api-key."
        ),
    )
    parser.add_argument(
        "--overwrite",
        action="store_true",
        help=(
            "Truncate existing predictions.jsonl for this --run-id. "
            "Without this flag, the run aborts if predictions already exist "
            "(protects partial results from accidental re-runs)."
        ),
    )
    return parser.parse_args()


def select_instances(
    dataset: Iterable[dict[str, Any]],
    instance_ids: list[str] | None,
    limit: int | None,
) -> list[dict[str, Any]]:
    """Filter dataset rows by instance_ids (if given) and apply limit."""
    rows: list[dict[str, Any]] = list(dataset)
    if instance_ids:
        wanted = set(instance_ids)
        rows = [r for r in rows if r["instance_id"] in wanted]
        missing = wanted - {r["instance_id"] for r in rows}
        if missing:
            print(
                f"WARN: {len(missing)} requested instance_ids not found in dataset: "
                f"{sorted(missing)[:5]}{'...' if len(missing) > 5 else ''}",
                file=sys.stderr,
            )
    if limit is not None:
        rows = rows[:limit]
    return rows


def append_prediction_row(
    predictions_path: Path,
    instance_id: str,
    model_patch: str,
    model_name_or_path: str,
    lock: threading.Lock,
) -> None:
    """Append one JSONL prediction row under a lock (appends are NOT atomic across threads)."""
    row = {
        "instance_id": instance_id,
        "model_patch": model_patch,
        "model_name_or_path": model_name_or_path,
    }
    line = json.dumps(row, ensure_ascii=False) + "\n"
    with lock:
        with predictions_path.open("a", encoding="utf-8") as fp:
            fp.write(line)


def copy_log_if_exists(src: Path, dst: Path) -> None:
    """Copy a log file from the shared scratch volume into the run-log directory, if present."""
    if src.exists() and src.is_file():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)


def run_one_instance(
    instance: dict[str, Any],
    image: str,
    predictions_path: Path,
    predictions_dir: Path,
    run_dir: Path,
    timeout: int,
    predictions_lock: threading.Lock,
    model_name_or_path: str,
    oauth_creds_path: Path | None,
) -> tuple[str, str]:
    """
    Run the agent container for a single instance.

    Returns a (status, instance_id) tuple where status is one of:
    "succeeded", "failed", "timed_out".

    On ANY non-success (timeout, non-zero exit, missing diff), a prediction
    row with model_patch="" is still appended — the plan requires we never
    silently drop an instance.
    """
    instance_id: str = instance["instance_id"]
    repo: str = instance["repo"]
    base_commit: str = instance["base_commit"]
    problem_statement: str = instance["problem_statement"]

    instance_log_dir = run_dir / instance_id
    instance_log_dir.mkdir(parents=True, exist_ok=True)
    stderr_log_path = instance_log_dir / "stderr.log"

    # Per-instance scratch dir — MUST NOT be shared across containers.
    scratch_dir = Path(tempfile.mkdtemp(prefix=f"swebench-{instance_id}-"))
    problem_file = scratch_dir / "problem.txt"
    problem_file.write_text(problem_statement, encoding="utf-8")

    status: str = "failed"
    model_patch: str = ""
    # Uniquely named so the TimeoutExpired handler can kill it without racing
    # other instances on the host.
    container_name = f"swebench-agent-{instance_id}-{os.getpid()}-{threading.get_ident()}"

    try:
        # The orchestrator owns JSONL writes under `predictions_lock` to avoid
        # racy concurrent appends across containers — so we DO NOT mount the
        # predictions directory into the container. Instead, the agent writes
        # its authoritative diff to /scratch/model_patch.diff (via
        # CLAUDE_MEM_OUTPUT_DIR), plus ingest/fix logs to the same dir. The
        # 5th CLI arg to run-instance.sh is only used in standalone smoke-test
        # mode; here we point it at a throwaway path inside the container.
        cmd: list[str] = [
            "docker",
            "run",
            "--rm",
            "--name",
            container_name,
            "-e",
            "CLAUDE_MEM_OUTPUT_DIR=/scratch",
            "-v",
            f"{scratch_dir}:/scratch",
        ]
        if oauth_creds_path is not None:
            cmd += [
                "-e",
                "CLAUDE_MEM_CREDENTIALS_FILE=/auth/.credentials.json",
                "-v",
                f"{oauth_creds_path}:/auth/.credentials.json:ro",
            ]
        else:
            # Pay-per-call path.
            cmd += ["-e", "ANTHROPIC_API_KEY"]
        cmd += [
            image,
            instance_id,
            repo,
            base_commit,
            "/scratch/problem.txt",
            "/scratch/ignored-predictions.jsonl",
        ]

        try:
            completed = subprocess.run(
                cmd,
                timeout=timeout,
                capture_output=True,
                text=True,
                check=False,
            )
            # Persist stderr so post-mortem is possible even on success.
            stderr_log_path.write_text(
                f"=== STDOUT ===\n{completed.stdout}\n=== STDERR ===\n{completed.stderr}\n",
                encoding="utf-8",
            )
            if completed.returncode == 0:
                # Read the diff the agent left in the per-instance scratch
                # volume. The agent may also write its own prediction line,
                # but the authoritative row is written here, from the diff
                # file the agent left in /scratch. If no diff file exists,
                # fall back to an empty patch.
                diff_file = scratch_dir / "model_patch.diff"
                if diff_file.exists():
                    diff_text = diff_file.read_text(encoding="utf-8")
                    if diff_text.strip():
                        model_patch = diff_text
                        status = "succeeded"
                    else:
                        status = "failed"  # empty diff
                else:
                    # Container did not leave a diff file — treat as failure
                    # but still emit an empty-patch row below.
                    status = "failed"
            else:
                status = "failed"
        except subprocess.TimeoutExpired as exc:
            status = "timed_out"
            # subprocess.run killed the docker CLI, but the container may
            # still be running. Force-remove it by name so we don't leak
            # containers across the batch.
            subprocess.run(
                ["docker", "rm", "-f", container_name],
                capture_output=True,
                check=False,
                timeout=30,
            )
            stderr_log_path.write_text(
                f"TIMEOUT after {timeout}s (forced docker rm -f {container_name})\n"
                f"=== STDOUT (partial) ===\n{exc.stdout or ''}\n"
                f"=== STDERR (partial) ===\n{exc.stderr or ''}\n",
                encoding="utf-8",
            )

        # Copy per-turn logs left by the agent in the shared scratch volume.
        copy_log_if_exists(scratch_dir / "ingest.jsonl", instance_log_dir / "ingest.jsonl")
        copy_log_if_exists(scratch_dir / "fix.jsonl", instance_log_dir / "fix.jsonl")

        # Always write a row — never silently drop an instance.
        append_prediction_row(
            predictions_path=predictions_path,
            instance_id=instance_id,
            model_patch=model_patch,
            model_name_or_path=model_name_or_path,
            lock=predictions_lock,
        )
    except Exception as exc:  # pragma: no cover — defensive
        status = "failed"
        try:
            stderr_log_path.write_text(
                f"ORCHESTRATOR EXCEPTION: {exc!r}\n",
                encoding="utf-8",
            )
        except OSError:
            pass
        append_prediction_row(
            predictions_path=predictions_path,
            instance_id=instance_id,
            model_patch="",
            model_name_or_path=model_name_or_path,
            lock=predictions_lock,
        )
    finally:
        # Per-instance scratch must not leak across containers.
        shutil.rmtree(scratch_dir, ignore_errors=True)

    return status, instance_id


def main() -> int:
    args = parse_args()
    repo_root = Path(__file__).resolve().parents[2]
    if args.out:
        predictions_path = Path(args.out).resolve()
    else:
        predictions_path = (
            repo_root
            / "evals"
            / "swebench"
            / "runs"
            / args.run_id
            / "predictions.jsonl"
        )
    predictions_dir = predictions_path.parent
    run_dir = predictions_dir  # logs land in evals/swebench/runs/<run_id>/<instance_id>/
    predictions_dir.mkdir(parents=True, exist_ok=True)

    # Don't silently discard partial results from a prior run.
    if predictions_path.exists() and predictions_path.stat().st_size > 0:
        if not args.overwrite:
            print(
                f"ERROR: {predictions_path} already exists and is non-empty. "
                "Pass --overwrite to truncate, or pick a different --run-id.",
                file=sys.stderr,
            )
            return 1
        print(
            f"WARN: --overwrite set; truncating existing {predictions_path}",
            file=sys.stderr,
        )
        predictions_path.write_text("", encoding="utf-8")

    # Resolve auth: OAuth (Max/Pro subscription) or API key.
    oauth_creds_path: Path | None = None
    if args.auth in ("oauth", "auto"):
        oauth_creds_path = extract_oauth_credentials()
        if oauth_creds_path is not None:
            print(
                f"Auth: OAuth credentials extracted to {oauth_creds_path} "
                "(mounted read-only into each container). "
                "NOTE: Max/Pro has per-window usage limits; batch runs may exhaust them.",
                file=sys.stderr,
            )
        elif args.auth == "oauth":
            print(
                "ERROR: --auth=oauth requested but credentials extraction failed.",
                file=sys.stderr,
            )
            return 1
    if oauth_creds_path is None:
        if not os.environ.get("ANTHROPIC_API_KEY"):
            print(
                "ERROR: no auth available. Either run `claude login` on host "
                "(for OAuth) or set ANTHROPIC_API_KEY.",
                file=sys.stderr,
            )
            return 1
        print("Auth: ANTHROPIC_API_KEY (pay-per-call).", file=sys.stderr)

    print(f"Loading dataset {args.dataset} (split=test)...", file=sys.stderr)
    dataset = load_dataset(args.dataset, split="test")
    instances = select_instances(dataset, args.instance_ids, args.limit)
    total = len(instances)
    if total == 0:
        print("No instances selected; nothing to do.", file=sys.stderr)
        return 0

    # Scrub hidden-from-agent fields defensively. The agent container only
    # receives instance_id/repo/base_commit/problem_statement via CLI args +
    # the per-instance problem file — the hidden fields never leave this
    # process. This loop makes that invariant explicit.
    for row in instances:
        for key in HIDDEN_AGENT_FIELDS:
            row.pop(key, None)

    model_name_or_path = "claude-opus-4-7+claude-mem"
    print(
        f"Launching {total} instance(s) with max_concurrent={args.max_concurrent}, "
        f"timeout={args.timeout}s, image={args.image}",
        file=sys.stderr,
    )

    predictions_lock = threading.Lock()
    succeeded = 0
    failed = 0
    timed_out = 0
    with ThreadPoolExecutor(max_workers=args.max_concurrent) as executor:
        future_to_id = {
            executor.submit(
                run_one_instance,
                instance=instance,
                image=args.image,
                predictions_path=predictions_path,
                predictions_dir=predictions_dir,
                run_dir=run_dir,
                timeout=args.timeout,
                predictions_lock=predictions_lock,
                model_name_or_path=model_name_or_path,
                oauth_creds_path=oauth_creds_path,
            ): instance["instance_id"]
            for instance in instances
        }
        for future in as_completed(future_to_id):
            instance_id = future_to_id[future]
            try:
                status, _ = future.result()
            except Exception as exc:  # pragma: no cover — defensive
                status = "failed"
                print(
                    f"[{instance_id}] orchestrator future raised: {exc!r}",
                    file=sys.stderr,
                )
            if status == "succeeded":
                succeeded += 1
            elif status == "timed_out":
                timed_out += 1
            else:
                failed += 1
            print(
                f"[{instance_id}] {status} "
                f"({succeeded + failed + timed_out}/{total} done)",
                file=sys.stderr,
            )

    print(
        f"{total} total, {succeeded} succeeded, {failed} failed, {timed_out} timed out",
    )
    # Per plan: exit 0 even if some instances failed.
    return 0


if __name__ == "__main__":
    sys.exit(main())