cs.CR cs.AI

ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents

May 13, 2026

AI Summary (gemma4:e4b)

The paper introduces ExploitBench, a capability-graded benchmark that measures the progressive stages of exploitation. It shows that the publicly deployed frontier models tested routinely reach vulnerable code and trigger crashes, whereas full arbitrary code execution against hardened targets remains an emerging capability.

Abstract


Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target. However, existing LLM security benchmarks treat a crash as exploitation success. That single binary outcome collapses the hard parts of exploitation: the transition from triggering a bug to constructing reusable primitives and control. We present ExploitBench, a capability-graded benchmark that decomposes exploitation into 16 measurable flags, from coverage and crash through sandbox primitives, arbitrary read/write, control-flow hijack, and arbitrary code execution. Each capability is verified by a deterministic oracle that uses a per-run randomized challenge-response for primitives, differential execution against ground-truth binaries to measure progress, and a signal-handler proof for code execution. We instantiate ExploitBench on 41 V8 bugs because V8 is both widely deployed and exploitation-hardened. We report three arms: <model, env> as the primary measurement of model-environment capability, <model, env, adaptive coaching> as a secondary arm that adds adaptive coaching to test whether targeted feedback shifts outcomes, and <model, env, harness> as an ablation that swaps in the model's native CLI to check whether vendor-side optimizations increase exploitation capabilities. Our results show a sharp capability split between publicly deployed frontier models and the private frontier. Across the 8 publicly deployed models tested, reaching the vulnerable code and triggering a crash is routine, but arbitrary code execution is not. The private model shows arbitrary code execution on approximately half. Overall, results suggest that exploit construction against hardened targets is an emerging frontier capability.
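
The per-run randomized challenge-response idea from the abstract can be illustrated with a small sketch. This is not the ExploitBench oracle: the `ReadChallenge` class, the `highest_rung` helper, the flag names in `LADDER`, and the `bytearray` standing in for target memory are all hypothetical, chosen only to show why a fresh secret per run separates a working read primitive from a crash or a replayed answer.

```python
import secrets
from typing import Optional

# Stages loosely following the abstract, lowest to highest. The benchmark
# defines 16 flags; this shortened list and its names are assumptions.
LADDER = ["coverage", "crash", "sandbox_primitive", "arbitrary_read",
          "control_flow_hijack", "arbitrary_code_execution"]


class ReadChallenge:
    """Hypothetical oracle for a single read-primitive flag."""

    def __init__(self, memory: bytearray, secret_len: int = 16):
        # A fresh secret and location per run, so a memorized or replayed
        # answer from an earlier run does not verify.
        self.secret = secrets.token_bytes(secret_len)
        self.address = secrets.randbelow(len(memory) - secret_len + 1)
        memory[self.address:self.address + secret_len] = self.secret

    def verify(self, response: bytes) -> bool:
        # Constant-time comparison of the agent's answer with the secret.
        return secrets.compare_digest(response, self.secret)


def highest_rung(flags: set) -> Optional[str]:
    """Return the highest ladder stage present in `flags`, or None."""
    reached = [stage for stage in LADDER if stage in flags]
    return reached[-1] if reached else None


memory = bytearray(4096)  # stand-in for the target's memory
challenge = ReadChallenge(memory)

# A working read primitive returns the bytes at the challenged address.
good = bytes(memory[challenge.address:challenge.address + 16])
# A crash-only attempt has nothing to read and can only guess.
bad = bytes(16)

flags = {"coverage", "crash"}
if challenge.verify(good):
    flags.add("arbitrary_read")
```

Under these assumptions, a run that only crashes the target stays at the `crash` stage, while a run that answers the challenge correctly is credited with the higher stage, which is the distinction a binary crash metric cannot express.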
