~ similar to 2605.29277· 20 results
Fuwei Zhang, Yanzhao Zhang, Mingxin Li, Dingkun Long +4 more
This paper introduces CORE-Bench, a comprehensive benchmark for code retrieval in agentic coding.
The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…
The paper introduces TRAILS~, a novel method that improves code correctness validation by grounding LLM reasoning in concrete (input, output) pairs derived from specifications, achieving state-of-the-…
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
Code2LoRA introduces a hypernetwork framework to efficiently inject repository-specific knowledge into code language models using LoRA adapters, supporting both static and evolving codebases.
The paper proposes an automated, standardized framework to empirically compare the security quality of code generated through human-only, LLM-only, and hybrid collaboration methods.
Ningzhi Tang, Chaoran Chen, Gelei Xu, Yiyu Shi +4 more
This study analyzes over 20,000 real-world coding sessions to show that AI coding agents frequently fail users through subtle misalignment, requiring constant manual correction even when major system…
The paper introduces an execution-grounded, cross-language framework that significantly improves the reliability of LLM-driven code vulnerability analysis by ensuring that all proposed fixes are confi…
The paper introduces MOSAIC-Bench, a benchmark demonstrating that coding agents can ship exploitable code by complying with seemingly innocuous, staged tasks, a vulnerability that is not easily mitiga…
Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao +9 more
BenchEvolver introduces a solution-centric evolutionary framework to automatically transform saturated coding benchmarks into significantly harder, high-quality, and diverse evaluation suites.
This paper proposes an empirical methodology to automate web application trustworthiness assessment by leveraging Large Language Models (LLMs) to verify adherence to secure coding practices, showing t…
The paper introduces a hybrid system, HYBRIDSOURCETRACKER (HST), that combines vector search and Winnowing fingerprinting to achieve scalable, high-precision provenance tracking for code generated by…
Hao Wang, Niels Mündler, Mark Vero, Jingxuan He +2 more
The paper introduces SecPI, a fine-tuning pipeline that teaches reasoning language models (RLMs) to autonomously internalize structured security reasoning, significantly improving secure code generati…
The paper proposes CoDe-R, a two-stage framework that significantly improves the accuracy and re-executability of decompiled code generated by LLMs, achieving a new SOTA in the lightweight regime.
The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…
This paper identifies the 'Format-Reliability Gap'—where LLMs know about code vulnerabilities but generate insecure code anyway—and proposes a localized, per-vulnerability steering vector fix that sig…
Han Dai, Soumyakant Priyadarshan, Abdullah Imran, Ruoyu Wang +1 more
SCRIBE is a novel framework that enables reliable source-level patching of binaries by performing 'binary-aware' recompilation, successfully resolving syntactic and semantic inaccuracies inherent in d…
Chengtao Gan, Zhiqiang Liu, Long Jin, Yushan Zhu +2 more
CRAFTQA introduces a novel adaptive, code-driven framework that significantly enhances complex structured data reasoning by dynamically generating custom code functions beyond predefined operations.
Nils Loose, Joseph Bienhüls, Kristoffer Hempel, Felix Mächtle +1 more
The paper evaluates code language model-based detection of vulnerability-fixing commits (VFCs) using a unified benchmark and concludes that code changes alone are insufficient for accurate detection,…
Hwiwon Lee, Jiawei Liu, Dongjun Kim, Ziqi Zhang +2 more
The paper introduces SEC-bench Pro, a rigorous benchmark for evaluating LLM-based bug hunting on complex software, finding that even advanced agents struggle with long-horizon security tasks.