Papers similar to 2605.29277 · 20 results

cs.IR · Empirical · Recent · Jun 10, 2026

CORE-Bench: A Comprehensive Benchmark for Code Retrieval in the Era of Agentic Coding

Fuwei Zhang, Yanzhao Zhang, Mingxin Li, Dingkun Long +4 more

This paper introduces CORE-Bench, a comprehensive benchmark for code retrieval in agentic coding.

cs.SE · cs.AI · Recent · May 28, 2026

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

Vedant Padwal

The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…

cs.SE · cs.AI · Recent · May 28, 2026

Inferring Code Correctness from Specification

Tambon Florian, Papadakis Mike

The paper introduces TRAILS, a novel method that improves code correctness validation by grounding LLM reasoning in concrete (input, output) pairs derived from specifications, achieving state-of-the-…

cs.SE · cs.AI · Recent · May 31, 2026

FVSpec: Real-World Property-Based Tests as Lean Challenges

Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…

cs.SE · cs.AI · cs.CL · Recent · Jun 4, 2026

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

Liliana Hotsko, Yinxi Li, Yuntian Deng, Pengyu Nie

Code2LoRA introduces a hypernetwork framework to efficiently inject repository-specific knowledge into code language models using LoRA adapters, supporting both static and evolving codebases.

cs.CR · cs.SE · Recent · May 29, 2026

How to Compare the Security of Code Written by Humans to LLM-generated Code

Rebecca Balebako, Jasmine Egl

The paper proposes an automated, standardized framework to empirically compare the security quality of code generated through human-only, LLM-only, and hybrid collaboration methods.

cs.SE · cs.AI · cs.HC · Recent · May 28, 2026

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

Ningzhi Tang, Chaoran Chen, Gelei Xu, Yiyu Shi +4 more

This study analyzes over 20,000 real-world coding sessions to show that AI coding agents frequently fail users through subtle misalignment, requiring constant manual correction even when major system…

cs.SE · cs.AI · cs.CR · Recent · Apr 12, 2026

Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis

Jugal Gajjar

The paper introduces an execution-grounded, cross-language framework that significantly improves the reliability of LLM-driven code vulnerability analysis by ensuring that all proposed fixes are confi…

cs.CR · cs.AI · cs.SE · Recent · May 5, 2026

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

Jonathan Steinberg, Oren Gal

The paper introduces MOSAIC-Bench, a benchmark demonstrating that coding agents can ship exploitable code by complying with seemingly innocuous, staged tasks, a vulnerability that is not easily mitiga…

cs.SE · cs.AI · cs.CL · Recent · May 31, 2026

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao +9 more

BenchEvolver introduces a solution-centric evolutionary framework to automatically transform saturated coding benchmarks into significantly harder, high-quality, and diverse evaluation suites.

cs.CR · Recent · Mar 24, 2026

Leveraging Large Language Models for Trustworthiness Assessment of Web Applications

Oleksandr Yarotskyi, José D'Abruzzo Pereira, João R. Campos

This paper proposes an empirical methodology to automate web application trustworthiness assessment by leveraging Large Language Models (LLMs) to verify adherence to secure coding practices, showing t…

cs.SE · cs.AI · cs.IR · Recent · May 27, 2026

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Andrea Gurioli, Davide D'Ascenzo, Federico Pennino, Maurizio Gabbrielli +1 more

The paper introduces a hybrid system, HYBRIDSOURCETRACKER (HST), that combines vector search and Winnowing fingerprinting to achieve scalable, high-precision provenance tracking for code generated by…

cs.CR · cs.AI · Recent · Apr 4, 2026

SecPI: Secure Code Generation with Reasoning Models via Security Reasoning Internalization

Hao Wang, Niels Mündler, Mark Vero, Jingxuan He +2 more

The paper introduces SecPI, a fine-tuning pipeline that teaches reasoning language models (RLMs) to autonomously internalize structured security reasoning, significantly improving secure code generati…

cs.SE · cs.AI · cs.CR · Recent · Apr 14, 2026

CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference

Qiang Zhang, Zhongnian Li

The paper proposes CoDe-R, a two-stage framework that significantly improves the accuracy and re-executability of decompiled code generated by LLMs, achieving a new SOTA in the lightweight regime.

cs.CR · cs.SE · Recent · May 4, 2026

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

Richard J. Young, Gregory D. Moody

The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…

cs.CR · cs.LG · Recent · Apr 17, 2026

Surgical Repair of Insecure Code Generation in LLMs

Gustavo Sandoval, Brendan Dolan-Gavitt, Siddharth Garg

This paper identifies the 'Format-Reliability Gap'—where LLMs know about code vulnerabilities but generate insecure code anyway—and proposes a localized, per-vulnerability steering vector fix that sig…

cs.CR · cs.SE · Recent · May 4, 2026

SCRIBE: Practical Static Binary Patching via Binary-Aware Recompilation of Decompiled Code

Han Dai, Soumyakant Priyadarshan, Abdullah Imran, Ruoyu Wang +1 more

SCRIBE is a novel framework that enables reliable source-level patching of binaries by performing 'binary-aware' recompilation, successfully resolving syntactic and semantic inaccuracies inherent in d…

cs.CL · Recent · Jun 1, 2026

CRAFTQA: A Code-Driven Adaptive Framework for Complex Structured Data Reasoning

Chengtao Gan, Zhiqiang Liu, Long Jin, Yushan Zhu +2 more

CRAFTQA introduces a novel adaptive, code-driven framework that significantly enhances complex structured data reasoning by dynamically generating custom code functions beyond predefined operations.

cs.SE · cs.CR · cs.LG · Recent · May 13, 2026

Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

Nils Loose, Joseph Bienhüls, Kristoffer Hempel, Felix Mächtle +1 more

The paper evaluates code language model-based detection of vulnerability-fixing commits (VFCs) using a unified benchmark and concludes that code changes alone are insufficient for accurate detection,…

cs.CR · cs.LG · Recent · May 26, 2026

SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?

Hwiwon Lee, Jiawei Liu, Dongjun Kim, Ziqi Zhang +2 more

The paper introduces SEC-bench Pro, a rigorous benchmark for evaluating LLM-based bug hunting on complex software, finding that even advanced agents struggle with long-horizon security tasks.
