Papers similar to 2604.13764v1

~ similar to 2604.13764v1· 20 results

cs.CRcs.CLRecentMay 14, 2026

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets

The paper introduces a comprehensive taxonomy and auditing framework to assess the collective coverage of existing LLM attack benchmarks, revealing significant and systematic gaps in current testing m…

View →

cs.CRcs.SERecentApr 23, 2026

CrossCommitVuln-Bench: A Dataset of Multi-Commit Python Vulnerabilities Invisible to Per-Commit Static Analysis

Arunabh Majumdar

The paper introduces CrossCommitVuln-Bench, a benchmark dataset demonstrating that many real-world Python vulnerabilities are introduced across multiple commits, making them invisible to standard per-…

View →

cs.CRcs.SERecentMay 29, 2026

How to Compare the Security of Code Written by Humans to LLM-generated Code

Rebecca Balebako, Jasmine Egl

The paper proposes an automated, standardized framework to empirically compare the security quality of code generated through human-only, LLM-only, and hybrid collaboration methods.

View →

cs.CRcs.AIRecentMar 31, 2026

Security in LLM-as-a-Judge: A Comprehensive SoK

Aiman Al Masoud, Antony Anju, Marco Arazzi, Mert Cihangiroglu +5 more

This paper provides the first comprehensive Systematization of Knowledge (SoK) on the security aspects of LLM-as-a-Judge (LaaJ) systems, identifying key vulnerabilities and proposing a taxonomy for fu…

View →

cs.CRRecentMar 25, 2026

Bridging Code Property Graphs and Language Models for Program Analysis

Ahmed Lekssays

The paper introduces codebadger, a Model Context Protocol (MCP) server that integrates Joern's Code Property Graph (CPG) with LLMs, enabling large language models to perform large-scale, semantic prog…

View →

cs.CRcs.SERecentMar 24, 2026

Does Teaming-Up LLMs Improve Secure Code Generation? A Comprehensive Evaluation with Multi-LLMSecCodeEval

Bushra Sabir, Shigang Liu, Seung Ick Jang, Sharif Abuadbba +5 more

The paper evaluates multi-LLM strategies for secure code generation, finding that hybrid pipelines combining ensembling, static analysis, and patching achieve the strongest security performance, outpe…

View →

cs.CRcs.PLcs.SERecentApr 28, 2026

Symbolic Execution Meets Multi-LLM Orchestration: Detecting Memory Vulnerabilities in Incomplete Rust CVE Snippets

Zeyad Abdelrazek, Young Lee

The paper introduces a novel multi-LLM orchestration system combined with symbolic execution to successfully detect memory vulnerabilities in uncompilable, incomplete Rust CVE code snippets, achieving…

View →

cs.CRcs.LGRecentMay 26, 2026

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

Hwiwon Lee, Jiawei Liu, Dongjun Kim, Ziqi Zhang +2 more

The paper introduces SEC-bench Pro, a rigorous benchmark for evaluating LLM-based bug hunting on complex software, finding that even advanced agents struggle with long-horizon security tasks.

View →

cs.CRcs.SCRecentMay 25, 2026

Heimdall: Formally Verified Automated Migration of Legacy eBPF Programs to Rust

Vishnu Asutosh Dasu, Monika Santra, Md Rafi Ur Rashid, Ashish Kumar +2 more

The paper introduces Heimdall, an automated pipeline that uses LLMs and formal verification to safely and automatically migrate legacy, potentially buggy eBPF programs written in C to memory-safe Rust…

View →

cs.CRRecentApr 25, 2026

AsmRAG: LLM-Driven Malware Detection by Retrieving Functionally Similar Assembly Code

ElMouatez Billah Karbab

AsmRAG is a novel framework that improves malware detection by treating it as an evidence-based retrieval task using a code-specialized LLM, achieving high accuracy while providing transparent forensi…

View →

cs.CRRecentMay 6, 2026

Sealing the Audit-Runtime Gap for LLM Skills

Tingda Shen, Yebo Feng, Konglin Zhu, Xiaojun Jia +2 more

The paper introduces SIGIL, a novel framework that cryptographically seals the entire lifecycle of LLM skills, ensuring verifiable integrity from publication through runtime execution to prevent suppl…

View →

cs.CRcs.SERecentApr 5, 2026

LLM-Enabled Open-Source Systems in the Wild: An Empirical Study of Vulnerabilities in GitHub Security Advisories

Fariha Tanjim Shifat, Hariswar Baburaj, Ce Zhou, Jaydeb Sarker +1 more

The paper analyzes GitHub security advisories for LLM-integrated open-source systems, finding that while most vulnerabilities map to existing code-level weaknesses, the architectural risks like Supply…

View →

cs.CRcs.AIcs.CLRecentApr 2, 2026

RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale

Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari +4 more

RuleForge is an automated system that generates and validates detection rules for web vulnerabilities from structured CVE templates, significantly improving detection accuracy and reducing false posit…

View →

cs.CRcs.SERecentMay 3, 2026

QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing

Mohd Ruhul Ameen, Md Takrim Ul Alam, Akif Islam

QASecClaw, a multi-agent LLM system, significantly improves the accuracy of Static Application Security Testing (SAST) by using specialized LLM agents to filter out false positives, achieving an F1 sc…

View →

cs.CRRecentApr 18, 2026

False Security Confidence in Benign LLM Code Generation

Xiaolei Ren

The paper introduces False Security Confidence (FSC), a new metric to measure the inherent prevalence of security vulnerabilities in code generated by LLMs that are otherwise functionally correct, eve…

View →

cs.CRcs.SERecentMay 20, 2026

FuzzingBrain V2: A Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction

Ze Sheng, Zhicheng Chen, Qingxiao Xu, Kewen Zhu +1 more

FuzzingBrain V2 is a multi-agent LLM system that significantly improves automated vulnerability discovery by ensuring all reported bugs are fuzzer-reproducible and handling complex cross-function depe…

View →

cs.CRRecentMay 8, 2026

Longitudinal Analyses of SAST Tools: A CodeQL Case Study

Jean-Charles Noirot Ferrand, Kyle Domico, Yohan Beugin, Patrick McDaniel

This study conducts a large-scale longitudinal analysis of CodeQL, finding that while the tool is effective at detecting vulnerabilities, its detection capabilities are not guaranteed to be stable acr…

View →

cs.CRcs.AIRecentJun 3, 2026

From Attack Simulation to SIEM Rule: Deterministic Detection-as-Code Synthesis with Probe-Level Traceability

Alexandre Cristovão Maiorano

The paper introduces a deterministic method to automatically synthesize initial SIEM detection rules (Sigma rules) from attack simulation findings, ensuring full traceability back to the specific orig…

View →

cs.CRcs.LGcs.SERecentApr 21, 2026

Evaluating LLM-Generated Obfuscated XSS Payloads for Machine Learning-Based Detection

Divyesh Gabbireddy, Suman Saha

This paper proposes a structured pipeline using LLMs to generate and evaluate obfuscated XSS payloads, demonstrating that while LLMs can generate samples, they currently struggle to ensure payloads ma…

View →

cs.SEcs.CRRecentApr 9, 2026

Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs

Kevin Lira, Baldoino Fonseca, Davy Baía, Márcio Ribeiro +1 more

This study assesses the effectiveness and cost of four modern LLMs in detecting vulnerabilities that span multiple functions (interprocedural dependencies), finding that Gemini 3 Flash offers strong c…

View →