EVMbench: Putting AI Agents on the Smart‑Contract Auditing Hot Seat
Why I’m suddenly obsessing over “smart contracts”
Look, I’ve been covering everything from the first consumer‑grade VR headset to the latest quantum‑ready CPUs, and I still get a little jittery when I hear the phrase “$100 billion of crypto assets sit behind code you can’t see.” It feels a bit like watching a massive dam built out of glass—beautiful, impressive, and terrifying if a crack shows up.
Those “cracks” are the vulnerabilities that attackers hunt for, and they’re not just theoretical. In the past year alone, a handful of exploits have siphoned off tens of millions of dollars from DeFi platforms that many of us thought were “battle‑tested.”
Enter AI. The same large‑language models that can now write a decent sonnet or suggest a new recipe are getting good—sometimes frighteningly good—at reading, writing, and executing code. If an AI can suggest a bug‑fix for a Rust library, why not let it hunt for hidden flaws in a Solidity contract?
That’s the premise behind EVMbench, a new benchmark released jointly by OpenAI and the crypto‑research firm Paradigm. It’s a sandbox where AI agents are asked to do three things: spot a vulnerability, patch it, and—if you’re feeling mischievous—exploit it. The goal? Give us a concrete yardstick for how far AI‑driven security tools have come, and, more importantly, how far they still have to go.
A quick refresher: smart contracts in plain English
If you’ve ever used a ride‑sharing app, you already understand the idea of a “contract” that runs automatically when conditions are met. In the blockchain world, a smart contract is a piece of code that lives on a public ledger and enforces those conditions without a middleman.
- Money moves when the contract says it should.
- Rules are immutable (unless the contract itself includes an upgrade mechanism).
- Everyone can read the code—but that doesn’t mean everyone can understand it.
Because these contracts often hold or move real value—think stablecoins, NFTs, or tokenized assets—their security is not a nice‑to‑have; it’s a make‑or‑break issue.
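The ride‑sharing analogy can be made concrete with a toy sketch. Real contracts are written in Solidity and run on-chain; the Python below (the `Escrow` class and every detail in it are invented for illustration) only models the core idea: code, not a middleman, decides when money moves.

```python
class Escrow:
    """Toy model of a contract: funds move only when coded conditions are met."""

    def __init__(self, buyer: str, seller: str, amount: int):
        self.buyer, self.seller, self.amount = buyer, seller, amount
        self.delivered = False
        self.released = False

    def confirm_delivery(self, caller: str) -> None:
        # The rule is enforced by code, not by a trusted intermediary.
        if caller != self.buyer:
            raise PermissionError("only the buyer can confirm delivery")
        self.delivered = True

    def release_funds(self) -> tuple[str, int]:
        # Money moves exactly when the contract says it should.
        if not self.delivered:
            raise RuntimeError("cannot release before delivery is confirmed")
        self.released = True
        return (self.seller, self.amount)

escrow = Escrow("alice", "bob", 100)
escrow.confirm_delivery("alice")
print(escrow.release_funds())  # ('bob', 100)
```

The catch, of course, is the "immutable" part: if `release_funds` had a bug, there would be no customer-support line to call.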
AI as both the lock‑picker and the locksmith
I’ve watched the security community wrestle with a paradox for years: the same tools that help defenders can also empower attackers. Machine‑learning‑based fuzzers, static analysis tools, and now LLM‑driven code assistants are all double‑edged swords.
What makes EVMbench compelling is that it deliberately measures AI in all three roles—detect, patch, and exploit—so we can see where the balance tilts. Think of it as a triathlon for AI agents: the “swim” is spotting the problem, the “bike” is fixing it without breaking anything else, and the “run” is trying to break it all over again.
Inside the sandbox: how EVMbench is built
1. A curated set of 120 vulnerabilities
Paradigm’s auditors mined 40 real‑world audit reports, primarily from the Code4rena competition series, and distilled 120 high‑severity bugs. Most of these are the kind of “re‑entrancy” or “unchecked external call” issues that have historically led to big losses. A handful come from the Tempo L1 blockchain—a newer, high‑throughput chain focused on stablecoin payments. Including Tempo contracts nudges the benchmark toward a use case that’s gaining traction: AI‑driven stablecoin payments.
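To make "re‑entrancy" concrete: the classic bug is paying out before updating internal state, so a malicious callback can withdraw the same balance repeatedly. Here is a toy Python model of the pattern (real versions involve Solidity fallback functions and external calls; all names here are invented):

```python
class VulnerableBank:
    """Toy bank with the classic re-entrancy flaw: it pays out *before*
    zeroing the balance, so a callback can re-enter withdraw()."""

    def __init__(self):
        self.balances = {}
        self.vault = 0  # total funds held by the contract

    def deposit(self, who: str, amount: int) -> None:
        self.balances[who] = self.balances.get(who, 0) + amount
        self.vault += amount

    def withdraw(self, who: str, callback) -> None:
        amount = self.balances.get(who, 0)
        if amount > 0 and self.vault >= amount:
            self.vault -= amount   # funds leave here...
            callback(self)         # ...the external call can re-enter...
            self.balances[who] = 0 # ...before the balance is zeroed

def make_attacker(max_reentries: int):
    state = {"n": 0}
    def attack(bank: "VulnerableBank") -> None:
        # Re-enter withdraw() until the vault is drained.
        if state["n"] < max_reentries:
            state["n"] += 1
            bank.withdraw("attacker", attack)
    return attack

bank = VulnerableBank()
bank.deposit("attacker", 10)
bank.deposit("victim", 90)
bank.withdraw("attacker", make_attacker(9))
print(bank.vault)  # 0 -- the attacker's 10 units were withdrawn ten times
```

The fix is the "checks-effects-interactions" ordering: zero the balance before making the external call.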
Why this matters: By grounding the test set in actual audit findings, the benchmark avoids the “toy‑problem” trap where models ace contrived examples but stumble on production code.
2. Three task modes, each with its own scoring logic
| Mode | What the agent does | How we score it |
|---|---|---|
| Detect | Audits a repository, flags known bugs | Recall of ground‑truth vulnerabilities (higher recall = higher score) |
| Patch | Submits a modified contract that should still work | Automated test suite + exploit checks must pass; no compilation errors |
| Exploit | Sends transactions to a sandboxed blockchain to drain funds | Transaction replay and on‑chain verification; success = points |
The Rust‑based harness that powers the whole thing spins up a fresh Anvil (local Ethereum testnet) for every exploit run, ensuring deterministic results and no accidental spillover to a live network.
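Based on the scoring description above, detect-mode recall could be sketched like this. This is a hypothetical helper, not the benchmark's actual grader, and the bug identifiers are invented:

```python
def detect_score(ground_truth: set[str], flagged: set[str]) -> float:
    """Detect-mode score as described in the table: recall of known
    ground-truth bugs. Findings outside the ground truth earn no credit --
    a limitation the benchmark authors acknowledge."""
    if not ground_truth:
        return 0.0
    return len(ground_truth & flagged) / len(ground_truth)

truth = {"reentrancy-withdraw", "unchecked-call-sweep", "overflow-mint"}
flagged = {"reentrancy-withdraw", "overflow-mint", "novel-access-bug"}
print(detect_score(truth, flagged))  # 2 of 3 known bugs found -> 0.666...
```

Note that `novel-access-bug` contributes nothing here even if it were a genuine vulnerability, which is exactly the lower-bound issue discussed in the limitations section.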
3. Guardrails against cheating
The OpenAI team didn’t just hand over a list of bugs and call it a day. They wrote custom graders, red‑teamed the environments, and even threw in “automated task auditing agents” to sniff out loopholes where a clever model might game the system (e.g., by submitting a contract that simply aborts every transaction).
Side note: This mirrors the cat‑and‑mouse game we see in Capture‑the‑Flag (CTF) competitions, where organizers constantly patch the challenge to keep it fair.
The headline numbers: GPT‑5.3‑Codex leads the pack
When we talk about “frontier agents,” we’re talking about the most recent, high‑capacity models that OpenAI has made available through its Codex CLI. Here’s a quick rundown of the results that OpenAI highlighted in the release:
| Model | Detect recall | Patch success | Exploit score |
|---|---|---|---|
| GPT‑5.3‑Codex (latest) | 48 % | 34 % | 72.2 % |
| GPT‑5 (released 6 months earlier) | 31 % | 19 % | 31.9 % |
| GPT‑4.5‑Codex (baseline) | 27 % | 15 % | 24.3 % |
A few observations jump out:
- Exploit mode is where the AI shines. The objective is crystal clear: keep trying until the contract is emptied. The model can iterate quickly, try variations, and learn from the sandbox feedback.
- Detect and patch lag behind. Spotting a bug is one thing; fixing it without breaking the contract’s intended behavior is another. The patch scores suggest that the models still struggle to preserve functional invariants while removing subtle vulnerabilities.
- Rapid progress. Exploit performance more than doubled from GPT‑5 (31.9 %) to GPT‑5.3 (72.2 %). That’s a steep curve, and it mirrors the broader trend we’ve seen in LLMs, where a few months of additional training data and architecture tweaks translate into large gains on niche tasks.
The blind spots: where EVMbench falls short
No benchmark is perfect, and the authors are candid about the limitations.
Real‑world complexity is higher
The 120 bugs are high‑severity, but they’re drawn from competitions where participants already know they’re being judged. In the wild, contracts undergo multiple layers of review, and many vulnerabilities are hidden behind complex upgrade patterns, cross‑chain calls, or obscure op‑codes that simply don’t appear in a Code4rena dataset.
“Detect” only measures recall of known bugs
If an AI flags a genuine issue that human auditors missed, the current scoring system gives it no credit—it is treated as a false positive. This is a classic problem in security research: the ground truth is often incomplete. It means the detect scores are a lower bound on true capability.
Timing and network effects are abstracted away
Exploit tasks run on a clean Anvil instance, not a fork of mainnet. Real attacks often rely on front‑running, maximal extractable value (MEV), or precise block‑timestamp manipulation—behaviors that are impossible to capture in a deterministic replay environment.
Single‑chain focus
The benchmark only supports a single EVM‑compatible chain at a time. Multi‑chain DeFi protocols that stitch together assets across Ethereum, Polygon, and Arbitrum present a whole new attack surface that isn’t represented here.
Why this matters for developers, auditors, and the rest of us
1. A yardstick for defensive AI tools
If you’re a security team at a DeFi startup, you can now point to a concrete number: “Our AI‑assistant can detect 48 % of the known high‑severity bugs in EVMbench.” That’s more actionable than a vague claim that “our model is good at smart‑contract analysis.” It also gives you a baseline to compare against human auditors.
2. A warning for attackers
The exploit scores suggest that a competent LLM can autonomously craft a fund‑draining transaction in a sandbox with a success rate above 70 %. That’s a signal that threat actors could soon automate large‑scale probing of vulnerable contracts, lowering the barrier to entry for sophisticated attacks.
3. Incentives for the community
OpenAI is coupling the release with a $10 M API‑credits grant for projects focused on cyber defense. The idea is to lower the cost of integrating high‑capacity models into open‑source security tools. If you’re maintaining a popular Solidity library, you could apply for credits to run nightly AI‑driven audits on every PR.
4. A call for better benchmarks
EVMbench is a solid first step, but the community will need follow‑ups that address the limitations listed above—multi‑chain scenarios, MEV‑aware exploits, and a more flexible “detect” scoring that rewards novel findings. Think of it as the first episode of a series; the sequel will need to be bigger, messier, and more realistic.
My personal take: the “AI‑as‑co‑pilot” model feels right
When I first tried the Codex CLI on a simple ERC‑20 contract, the model suggested a patch that simply added a `require(msg.sender == owner)` guard. It “fixed” the re‑entrancy issue but broke the token’s transfer logic for everyone else. That was a classic case of over‑fitting to the test: the model saw the vulnerability, but didn’t understand the contract’s business intent.
What EVMbench forces the model to do—preserve functionality while removing the bug—is exactly the kind of human‑in‑the‑loop problem we face every day. It tells me that AI can be a powerful co‑pilot, but the pilot still needs to be vigilant.
In my own workflow, I’m already experimenting with a lightweight version of the benchmark: I feed my contracts through an open‑source LLM, let it suggest patches, then run the same Rust harness locally to verify that the patched contract still passes my unit tests. The process adds about 10 minutes to my CI pipeline, but the peace of mind is worth it.
Looking ahead: what could the next version of EVMbench look like?
- Dynamic state modeling – Introduce scenarios where the exploit depends on transaction ordering or gas price manipulation.
- Cross‑chain bridges – Add contracts that interact with other EVM chains via trusted relayers, exposing a new class of “bridge‑hacking” bugs.
- Human‑in‑the‑loop scoring – Allow auditors to review AI‑found vulnerabilities and flag them as true positives, feeding back into a more nuanced recall metric.
- Open‑source leaderboard – Publish a public leaderboard where anyone can submit a model (or a fine‑tuned version) and see how it stacks up. Competition tends to accelerate progress, as we saw with the ImageNet challenge for computer vision.
If the community rallies around these ideas, we could end up with a benchmark that not only measures AI capability but also shapes the security practices of the next generation of blockchain developers.
TL;DR
- EVMbench is a new, Rust‑powered benchmark that asks AI agents to detect, patch, and exploit 120 real‑world smart‑contract bugs.
- GPT‑5.3‑Codex scores a solid 72 % on the exploit task, but still lags behind on detection (48 % recall) and patching (34 % success).
- The benchmark is a useful yardstick for both defenders and attackers, but it doesn’t capture the full messiness of live DeFi ecosystems.
- OpenAI is backing the effort with a $10 M API‑credit grant, encouraging developers to embed AI‑driven auditing into their pipelines.
- Future iterations should broaden the attack surface (multi‑chain, MEV, bridge contracts) and refine the scoring to reward novel findings.
If you’re building or maintaining smart contracts, it’s worth giving EVMbench a spin—or at least borrowing its methodology for your own internal audits. The AI tools are getting better, but the stakes are high, and a little extra scrutiny never hurts.
Sources
- OpenAI & Paradigm. Introducing EVMbench: Making smart contracts safer by evaluating AI agents’ ability to detect, patch, and exploit vulnerabilities in blockchain environments. PDF. https://cdn.openai.com/evmbench/evmbench.pdf (accessed Feb 18 2026).
- Paradigm. Paradigm – Research & Investment. https://www.paradigm.xyz (accessed Feb 18 2026).
- Tempo. Tempo – High‑throughput L1 for stablecoin payments. https://tempo.xyz (accessed Feb 18 2026).
- Code4rena. Code4rena Auditing Competitions. https://code4rena.com (accessed Feb 18 2026).