The field of AI security evaluation has grown crowded with benchmarks that test whether language models can identify vulnerabilities, explain code, or suggest fixes. Fewer tools probe whether those models can actually exploit the bugs they find. ExploitBench sits in that narrower space—measuring not just detection, but the full exploitation chain: from reaching vulnerable code to achieving arbitrary code execution.
This fills a real gap. Most existing AI security benchmarks stop at "can the model spot the flaw?" ExploitBench asks a harder question: what happens after the model knows where the bug is? Can it trigger the condition? Build a working primitive? Escalate to code execution? The benchmark frames exploitation as a ladder, with distinct capability levels that models must climb.
What ExploitBench does differently
The project measures exploitation capability through a multi-stage grading system. Rather than a binary pass/fail, it evaluates whether an AI agent progresses through defined stages: reaching the vulnerable code, triggering the bug, constructing exploit primitives, and ultimately achieving arbitrary code execution. This granularity matters because it exposes where models fail—which rung of the ladder they can reach and which they cannot.
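To make the ladder idea concrete, here is a minimal sketch of staged grading. The stage names and the rule that a rung only counts if every lower rung was also reached are assumptions for illustration, not ExploitBench's actual grader.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical stage names, ordered from lowest to highest rung."""
    NONE = 0
    REACHED_CODE = 1
    TRIGGERED_BUG = 2
    BUILT_PRIMITIVE = 3
    CODE_EXECUTION = 4


def highest_stage(achieved: set[Stage]) -> Stage:
    """Return the highest rung reached without skipping a lower one.

    Assumption: the ladder is strict, so a gap stops the climb.
    """
    reached = Stage.NONE
    for stage in sorted(Stage):
        if stage is Stage.NONE:
            continue
        if stage not in achieved:
            break
        reached = stage
    return reached
```

Under these assumptions, an agent that reaches the code and triggers the bug but never builds a primitive grades as `Stage.TRIGGERED_BUG`, which is exactly the "where did it stall" signal a binary pass/fail would hide.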
ExploitBench drives evaluation through MCP (Model Context Protocol) servers running inside containers. The first shipped server, bench-v8, targets Chromium V8 vulnerabilities with a 16-capability bitmap covering primitives like addrof and fakeobj. Each container encapsulates a specific CVE with its environment, and the MCP interface provides a contract between the model and the graded environment. This design keeps the benchmark reproducible: the same container image produces the same conditions across runs.
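A capability bitmap of this kind can be pictured as a 16-bit integer with one flag per primitive. The sketch below is illustrative only: `addrof` and `fakeobj` come from the description above, while the bit positions and the other two flag names are placeholders, not bench-v8's real layout.

```python
from enum import IntFlag


class Capability(IntFlag):
    # ADDROF and FAKEOBJ are named in the text; their bit positions here
    # are assumed. The remaining names are hypothetical placeholders.
    ADDROF = 1 << 0
    FAKEOBJ = 1 << 1
    ARB_READ = 1 << 2   # hypothetical
    ARB_WRITE = 1 << 3  # hypothetical


def decode(bitmap: int) -> list[str]:
    """List the names of the capabilities set in a 16-bit bitmap."""
    if not 0 <= bitmap < (1 << 16):
        raise ValueError("bitmap must fit in 16 bits")
    return [cap.name for cap in Capability if bitmap & cap]
```

With this assumed layout, `decode(0b0011)` returns `["ADDROF", "FAKEOBJ"]`. Bits that have no named flag in the sketch are simply ignored.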
The tool supports multiple model providers without requiring code changes. Model IDs with the anthropic/ prefix use the native Anthropic SDK with cache_control support. The openai/, gemini/, and openrouter/ prefixes route through LiteLLM, which means any OpenAI-compatible gateway—vLLM, Ollama, OpenRouter, or a self-hosted proxy—works out of the box. This flexibility lets researchers test proprietary models, open-weight models through local gateways, or compare providers on identical workloads.
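Prefix-based routing like this is simple to picture. The following is a rough sketch of the idea, not the project's code: the function name and the backend labels it returns are made up for illustration, and only the four prefixes come from the description above.

```python
# Prefixes as described in the text; the backend labels are hypothetical.
NATIVE_PREFIXES = ("anthropic/",)
LITELLM_PREFIXES = ("openai/", "gemini/", "openrouter/")


def pick_backend(model_id: str) -> str:
    """Choose a client backend from the model ID's provider prefix."""
    if model_id.startswith(NATIVE_PREFIXES):
        return "anthropic-sdk"
    if model_id.startswith(LITELLM_PREFIXES):
        return "litellm"
    raise ValueError(f"unrecognized provider prefix in {model_id!r}")
```

Here `pick_backend("anthropic/some-model")` returns `"anthropic-sdk"` and `pick_backend("openrouter/some-model")` returns `"litellm"`. Rejecting unknown prefixes outright, rather than guessing a default, is a design choice of this sketch.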
Results live at exploitbench.ai with a leaderboard, capability heatmaps, and per-CVE drilldowns. The site pulls from a SQLite database that the benchmark populates, and the site's source lives in a separate website repository. For local analysis, the engine includes a FastAPI read backend for querying runs directly.
The trade-offs
ExploitBench is not a lightweight tool. The V8 evaluation images are approximately 70 GB per bug, though pre-built images on GHCR avoid the need to build them locally. Even so, running the full matrix means managing 41 V8 bugs across multiple models and seeds—a significant infrastructure commitment. The README explicitly recommends walking through a verification ladder: cache preflight, then a 20-turn smoke test, then full 300-turn runs, then audit, before scaling to all bugs and seeds. This isn't a "pip install and go" benchmark.
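For a sense of scale, a back-of-the-envelope calculation helps. The 41 bugs and roughly 70 GB per image come from the figures above; the model and seed counts in the usage note are hypothetical, and the storage number is a worst case that assumes every image is held locally with no shared layers.

```python
def matrix_size(bugs: int, models: int, seeds: int) -> int:
    """Number of individual runs in a full bug x model x seed matrix."""
    return bugs * models * seeds


def image_storage_gb(bugs: int, gb_per_image: int = 70) -> int:
    """Worst-case disk usage if every bug image is kept locally.

    Assumption: images share no layers, so sizes simply add up.
    """
    return bugs * gb_per_image
```

With three models and three seeds (made-up counts), `matrix_size(41, 3, 3)` gives 369 runs, and `image_storage_gb(41)` gives 2870 GB. Real disk usage may well be lower if the images share base layers.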
The scope is also deliberately narrow. It measures exploitation capability in a specific, well-defined environment—containerized CVEs with an MCP contract. This is a strength for reproducibility and fair comparison, but it means the benchmark doesn't capture broader red-teaming tasks like vulnerability discovery in unknown codebases or social engineering. Researchers looking for that kind of evaluation need different tools.
The project explicitly asks that no one perform reinforcement learning on the benchmark, since RL can pollute results for everyone. This is a methodological stance that limits how some teams might approach the problem—but it's clearly communicated, and the team points RL-interested parties toward Bugcrowd's separate environments.
What it ships with
The repository contains more than just the benchmark runner. Key components include:
- A CLI tool (exploitbench) for running benchmarks, auditing results, resuming failed runs, and querying the database
- Benchmark configurations in YAML format, including v8.yaml (full 41-bug matrix) and v8-small.yaml (14-bug subset matching the Claude Opus 4.6 baseline)
- Pre-built V8 images published to GHCR at ghcr.io/exploitbench/v8-r1, pulled automatically on first use
- An audit system with 11 transcript checks (C1–C11) that flag suspicious patterns like hardcoded addresses, off-workspace writes, or model refusal language
- A validation system that checks images against a five-point schema: manifest structure, MCP contract compliance, target starts, known PoC reproduction, and integrity posture
- Audit bundle generation that packages runs into sha256-manifested tarballs for sharing with external auditors
- A FastAPI backend for local JSON querying of benchmark results
- Import functionality for ingesting historical evaluation trees
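To give a feel for the transcript checks in the list above, here is a minimal sketch of two of the named patterns: hardcoded addresses and refusal language. The regex, the phrase list, and the flag labels are all assumptions, and the sketch does not map to any specific check among C1 to C11.

```python
import re

# Assumption: a "hardcoded address" looks like a long hex literal.
HARDCODED_ADDRESS = re.compile(r"\b0x[0-9a-fA-F]{8,16}\b")

# Hypothetical refusal phrases, matched case-insensitively.
REFUSAL_PHRASES = ("i can't help with", "i cannot assist", "i won't help")


def flag_transcript(text: str) -> list[str]:
    """Return labels for the suspicious patterns found in a transcript."""
    flags = []
    if HARDCODED_ADDRESS.search(text):
        flags.append("hardcoded-address")
    lowered = text.lower()
    if any(phrase in lowered for phrase in REFUSAL_PHRASES):
        flags.append("refusal-language")
    return flags
```

A heuristic this simple would produce false positives on legitimate hex constants, which is presumably why checks like these flag runs for review rather than fail them outright.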
The architecture documentation in docs/ and the runbook in docs/RUNBOOK.md cover the methodology and operational steps in detail.
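The sha256-manifested audit bundles mentioned in the component list rest on a simple idea: hash every file so an external auditor can confirm nothing changed in transit. The sketch below shows that idea using only the standard library; the function name and the manifest shape (relative path mapped to hex digest) are assumptions, not the project's actual bundle format.

```python
import hashlib
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative POSIX path) to its SHA-256."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest
```

An auditor who receives the tarball can rebuild the manifest locally and compare it with the shipped one; any mismatch points at the exact file that differs.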
If you want to try it
The project is written in Python and requires Docker for container management. You'll need API keys for whichever model providers you want to test—Anthropic, OpenAI, Gemini, or OpenRouter all work, and you can mix providers in a single benchmark run. The README includes a quick start that walks through creating a virtual environment, configuring API keys, running a smoke test with the mock LLM, and then a cheap real run with Haiku (about $1.50 and five minutes). Full runs with flagship models like Claude Opus 4.7 use a 300-turn budget and take significantly longer.
The actual installation commands are in the README, and the verification ladder there is worth following before running the full matrix.
ExploitBench serves a specific purpose: measuring how far AI agents can climb the exploitation ladder in well-defined, reproducible CVE environments. If you're evaluating AI security models and care about more than vulnerability detection, it's one of the few tools that actually measures exploitation capability at scale. The source is on GitHub.