~ similar to 2605.27981· 20 results
The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…
This study formally verified 3,500 AI-generated code artifacts and found that a majority (55.8%) contain exploitable security vulnerabilities, regardless of the LLM used.
Hao Wang, Niels Mündler, Mark Vero, Jingxuan He +2 more
The paper introduces SecPI, a fine-tuning pipeline that teaches reasoning language models (RLMs) to autonomously internalize structured security reasoning, significantly improving secure code generati…
Yuwei Liu, Xinyi Wan, Yanhao Wang, Minghua Wang +2 more
KVerus is a retrieval-augmented system that significantly improves the scalability and resilience of formal verification for Rust code by managing complex cross-module dependencies and adapting to cod…
The paper introduces TRAILS~, a novel method that improves code correctness validation by grounding LLM reasoning in concrete (input, output) pairs derived from specifications, achieving state-of-the-…
Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao +9 more
BenchEvolver introduces a solution-centric evolutionary framework to automatically transform saturated coding benchmarks into significantly harder, high-quality, and diverse evaluation suites.
The paper introduces SPECA, an LLM-driven framework that audits distributed protocols by deriving and enforcing security properties from natural-language specifications, enabling cross-implementation…
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
The paper demonstrates that standard LLM evaluation metrics overestimate performance because they fail to account for the stability of outcomes, showing a significant gap between reported pass rates a…
Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru +1 more
The paper introduces PoisonForge, a comprehensive benchmark demonstrating that even a small number of targeted poisoned examples can significantly compromise the safety and reliability of instruction-…
The paper introduces MOSAIC-Bench, a benchmark demonstrating that coding agents can ship exploitable code by complying with seemingly innocuous, staged tasks, a vulnerability that is not easily mitiga…
The paper introduces False Security Confidence (FSC), a new metric to measure the inherent prevalence of security vulnerabilities in code generated by LLMs that are otherwise functionally correct, eve…
The paper introduces Oracle Poisoning, an attack that corrupts knowledge graphs used by AI agents, demonstrating that all tested models blindly trust poisoned data at high sophistication levels.
The paper tested the hypothesis that wrapping untrusted prompt inputs in mock tool calls would improve LLM robustness, but found that this technique generally fails and can even increase vulnerability…
Wenqi Chen, Ziyan Zhang, Bing Wang, Lin Liu +2 more
The paper introduces Tree-like Self-Play (TSP), a novel framework that treats secure code generation as a fine-grained decision process, significantly improving LLM security by forcing the model to se…
ZERO-APT introduces a novel closed-loop adversarial framework for automated penetration testing that simulates attacks against an intelligent, real-time defending system, achieving a high attack succe…
Khang Tran, Yazan Boshmaf, Issa Khalil, NhatHai Phan +2 more
The paper introduces Poison-with-Style (PwS), a stealthy model poisoning attack that exploits developers' inherent code styles as covert triggers to make Code LLMs generate vulnerable code without exp…
The paper empirically evaluates the security quality of LLM-generated code across various prompting methods, finding that while prompting alters the structure of weaknesses, it is insufficient to reli…
The paper establishes an information-theoretic upper bound on the combined functional capacity and perturbation retention of code LLMs, quantifying the security budget available for code generation.
Houjun Liu, Lisa Einstein, John Yang, Joachim Baumann +4 more
SecureForge is an automated pipeline that significantly reduces cybersecurity vulnerabilities in LLM-generated code by optimizing system prompts, achieving up to a 48% reduction in output vulnerabilit…