~ similar to 2604.01637v1· 20 results
The paper proposes an end-to-end LLM framework that automates SOC operations by integrating ensemble-based threat detection, syntax-constrained query generation, and evidence-grounded incident resolut…
FORGE is a multi-agent system that integrates vulnerability exploitation, prioritization, and detection engineering into a single pipeline, achieving high-fidelity, multi-level exploitation and genera…
This study empirically measures the consistency and success rate of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation capabilit…
This study empirically measures the consistency and effectiveness of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation rates am…
Hao Wang, Niels Mündler, Mark Vero, Jingxuan He +2 more
The paper introduces SecPI, a fine-tuning pipeline that teaches reasoning language models (RLMs) to autonomously internalize structured security reasoning, significantly improving secure code generati…
The paper analyzes LLM vulnerability detection using mechanistic interpretability, finding that models primarily rely on safety detectors rather than direct vulnerability signature recognition.
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…
Hanzhi Liu, Chaofan Shou, Xiaonan Liu, Hongbo Wen +3 more
The paper introduces AgentFlow, a novel framework that uses a typed graph DSL and feedback-driven optimization to automatically synthesize and improve multi-agent harnesses for discovering security vu…
The paper introduces an efficient, lightweight LLM framework for smart contract auditing that decouples the audit process into multiple components, achieving high accuracy while significantly reducing…
The paper introduces a challenging benchmark for LLM agents to perform unsupervised threat hunting on raw Windows event logs, finding that current frontier models perform poorly and are not ready for…
Bushra Sabir, Shigang Liu, Seung Ick Jang, Sharif Abuadbba +5 more
The paper evaluates multi-LLM strategies for secure code generation, finding that hybrid pipelines combining ensembling, static analysis, and patching achieve the strongest security performance, outpe…
LLM-FACETS introduces an open-source, privacy-preserving framework designed to enable non-technical domain experts and compliance officers to audit and evaluate the transparency and accountability of…
The paper introduces ExploitBench, a capability-graded benchmark that measures the progressive stages of exploitation, demonstrating that while current frontier models can easily trigger bugs, achievi…
Pengyu Sun, Qishu Jin, Enhao Huang, Zifeng Kang +3 more
VIPER-MCP is a novel, end-to-end automated framework that detects and dynamically confirms the exploitability of taint-style vulnerabilities in Model Context Protocol (MCP) servers, achieving high-fid…
VulGD is a dynamic, open-access graph database that aggregates cybersecurity data from multiple sources and uses LLM embeddings to improve vulnerability representation and risk assessment.
The study evaluates how safety alignment affects autonomous security agents using a comprehensive trace-based benchmark, finding that while less-restricted models show gains, these effects are not uni…
OpenSOC-AI is a lightweight framework that uses parameter-efficient fine-tuning of a small LLM to automate threat classification and severity assessment from raw security logs, significantly improving…
Vivek Dahiya, Sunny Nehra, Vipul Dholariya, Bhavik Shangari +1 more
The paper evaluates frontier LLMs on cybersecurity tasks using dual-mode benchmarks and concludes that general-purpose models are insufficient, advocating for specialized, vertical foundation models.
This paper introduces Swiss-Bench 003, an expanded evaluation framework assessing LLM reliability and adversarial security across eight dimensions using 808 Swiss-specific items, revealing that self-g…
Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari +4 more
RuleForge is an automated system that generates and validates detection rules for web vulnerabilities from structured CVE templates, significantly improving detection accuracy and reducing false posit…