Papers similar to 2605.28334v2

~ similar to 2605.28334v2· 20 results

cs.CRcs.AIRecentMay 25, 2026

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

Yihe Fan, Changyi Li, Lichen Xu, Xudong Pan +3 more

The paper introduces CyberEvolver, a self-evolving agent framework that iteratively revises its own operational scaffold based on failed execution attempts, significantly improving cybersecurity agent…

View →

eess.SYcs.AIcs.CRRecentMar 20, 2026

An Agentic Multi-Agent Architecture for Cybersecurity Risk Management

Ravish Gupta, Saket Kumar, Shreeya Sharma, Maulik Dang +1 more

The paper introduces a novel six-agent AI architecture for cybersecurity risk assessment, demonstrating high accuracy and speed compared to human experts, though its performance is ultimately limited…

View →

cs.CRRecentApr 22, 2026

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen +3 more

The paper introduces AgentFlow, a novel framework that uses a typed graph DSL and feedback-driven optimization to automatically synthesize and improve multi-agent harnesses for discovering security vu…

View →

cs.CRcs.AIcs.LGEmpiricalRecentJun 12, 2026

AgentCyberRange: Benchmarking Frontier AI Systems in Realistic Cyber Ranges

Fengyu Liu, Jiarun Dai, Yihe Fan, Wuyuao Mai +10 more

The paper introduces AgentCyberRange, an open, multi-range infrastructure for measuring autonomous cyber attack capability in realistic cyber ranges, and evaluates six frontier AI systems.

View →

cs.AIcs.CRcs.IRRecentMay 3, 2026

CyberAId: AI-Driven Cybersecurity for Financial Service Providers

George Fatouros, Georgios Makridis, John Soldatos, Dimosthenis Kyriazis +17 more

The paper proposes CyberAId, a hybrid multi-agent system designed to enhance cybersecurity for financial institutions by integrating specialized LLM subagents with existing SIEM/XDR telemetry, address…

View →

cs.CRcs.AIcs.LGRecentJun 3, 2026

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang +12 more

The paper introduces CyberGym-E2E, a large-scale, end-to-end benchmark designed to comprehensively evaluate AI agents' capabilities across the entire lifecycle of real-world software vulnerability dis…

View →

cs.CRcs.AIRecentMar 30, 2026

Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems

Yicheng Cai, Mitchell John DeStefano, Guodong Dong, Pulkit Handa +4 more

This paper proposes a set of design principles and a conceptual benchmark (SOC-bench) to systematically evaluate the blue team operational capabilities of multi-agent AI systems in autonomous Security…

View →

cs.CRcs.AIRecentMay 11, 2026

Engineering Robustness into Personal Agents with the AI Workflow Store

Roxana Geambasu, Mariana Raykova, Pierre Tholoniat, Trishita Tiwari +2 more

The paper argues that current 'on-the-fly' AI agent design lacks necessary software engineering rigor and proposes an 'AI Workflow Store' to provide hardened, reusable, and reliable agent workflows.

View →

cs.CRcs.AIcs.CLRecentApr 18, 2026

Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks

Tyler H. Merves, Michael H. Conaway, Joseph M. Escobar, Hakan T. Otal +1 more

This study provides a comprehensive benchmark of 10 frontier LLMs on 200 offensive cybersecurity tasks, finding that environment tooling and model selection are the primary performance drivers, with C…

View →

cs.CRcs.LGcs.SERecentApr 23, 2026

Strategic Heterogeneous Multi-Agent Architecture for Cost-Effective Code Vulnerability Detection

Zhaohui Geoffrey Wang

The paper proposes a novel '3+1' heterogeneous multi-agent architecture using cloud LLMs and a local verifier to achieve high-accuracy, cost-effective code vulnerability detection, significantly outpe…

View →

cs.CRcs.AIRecentApr 20, 2026

Towards Optimal Agentic Architectures for Offensive Security Tasks

Isaac David, Arthur Gervais

The paper empirically evaluates various agentic architectures for offensive security tasks, finding that while broader coordination improves coverage, the optimal architecture is non-monotonic and dep…

View →

cs.CRcs.AIcs.MARecentMar 23, 2026

STRIATUM-CTF: A Protocol-Driven Agentic Framework for General-Purpose CTF Solving

James Hugglestone, Samuel Jacob Chacko, Dawson Stoller, Ryan Schmidt +1 more

The paper introduces STRIATUM-CTF, a modular agentic framework that uses a standardized context protocol to enable LLMs to perform multi-step, stateful reasoning for general-purpose CTF solving, achie…

View →

cs.CRcs.AIRecentApr 29, 2026

SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting with Tri-Context Personalization

Yair Meidan, Omri Haller, Yulia Moshan, Shahaf David +3 more

SecMate is a multi-agent virtual customer assistant for cybersecurity troubleshooting that significantly improves resolution rates (from 50% to over 90%) by integrating device, user, and service-speci…

View →

cs.MAcs.CRcs.LGRecentApr 25, 2026

Architecture Matters for Multi-Agent Security

Ben Hagag, William L. Anderson, Christian Schroeder de Witt, Sarah Scheffler

This paper empirically demonstrates that the architectural design of multi-agent systems significantly impacts their security, finding that coordination mechanisms can introduce vulnerabilities greate…

View →

cs.CRcs.AIcs.CLRecentJun 3, 2026

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

Nicholas Saban

The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…

View →

cs.AIcs.CLcs.CRRecentApr 9, 2026

ACIArena: Toward Unified Evaluation for Agent Cascading Injection

Hengyu An, Minxi Li, Jinghuai Zhang, Naen Xu +5 more

The paper introduces ACIArena, a unified and comprehensive evaluation framework designed to systematically test the robustness of Multi-Agent Systems against complex Agent Cascading Injection attacks.

View →

cs.CRcs.AIcs.CLRecentMay 21, 2026

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

Aaditya Pai

The paper identifies a critical vulnerability, the Camouflage Detection Gap (CDG), where standard LLM injection detectors fail dramatically when malicious payloads mimic the target domain's language a…

View →

cs.AIcs.CRRecentMay 5, 2026

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…

View →

cs.CRcs.AIRecentMay 13, 2026

ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents

Seunghyun Lee, David Brumley

The paper introduces ExploitBench, a capability-graded benchmark that measures the progressive stages of exploitation, demonstrating that while current frontier models can easily trigger bugs, achievi…

View →

cs.CRRecentApr 20, 2026

TitanCA: Lessons from Orchestrating LLM Agents to Discover 100+ CVEs

Ting Zhang, Yikun Li, Chengran Yang, Ratnadira Widyasari +14 more

TitanCA presents a novel, multi-agent LLM orchestration framework that significantly improves vulnerability discovery by reducing false positives and identifying numerous zero-day vulnerabilities.

View →