~ similar to 2606.06492· 19 results
Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun +2 more
The paper introduces Code-QA-Bench, a novel framework that rigorously separates genuine code reasoning from mere documentation memorization in repository-level code understanding benchmarks.
Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao +9 more
BenchEvolver introduces a solution-centric evolutionary framework to automatically transform saturated coding benchmarks into significantly harder, high-quality, and diverse evaluation suites.
The paper introduces codebadger, a Model Context Protocol (MCP) server that integrates Joern's Code Property Graph (CPG) with LLMs, enabling large language models to perform large-scale, semantic prog…
Fuwei Zhang, Yanzhao Zhang, Mingxin Li, Dingkun Long +4 more
This paper introduces CORE-Bench, a comprehensive benchmark for code retrieval in agentic coding.
Marko Kojic, Ivan Bondyrev, Aral de Moor, Joseph Shtok +5 more
Mellum 2 is an open-weight 12B Mixture-of-Experts (MoE) language model specialized for software engineering, achieving performance competitive with larger models while maintaining the efficiency of a…
The paper introduces LLM4CodeRE, a domain-adaptive LLM framework that significantly improves bidirectional code reverse engineering by unifying assembly-to-source and source-to-assembly translation.
The paper introduces TRAILS~, a novel method that improves code correctness validation by grounding LLM reasoning in concrete (input, output) pairs derived from specifications, achieving state-of-the-…
The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…
The paper introduces REBench, a comprehensive, standardized benchmark dataset designed to enable fair and rigorous evaluation of Large Language Models (LLMs) on complex binary reverse engineering task…
FPMoE introduces a sparse Mixture-of-Experts (MoE) architecture to improve functional code generation across multiple functional programming languages, achieving state-of-the-art performance with fewe…
Yuwei Liu, Xinyi Wan, Yanhao Wang, Minghua Wang +2 more
KVerus is a retrieval-augmented system that significantly improves the scalability and resilience of formal verification for Rust code by managing complex cross-module dependencies and adapting to cod…
The paper proposes CoDe-R, a two-stage framework that significantly improves the accuracy and re-executability of decompiled code generated by LLMs, achieving a new SOTA in the lightweight regime.
The paper proposes and evaluates a novel embedding model for bidirectional function association between source code and decompiled/stripped code, significantly outperforming existing models.
The paper proposes projectional decoding, a novel framework that integrates a partial graph model alongside text generation to ensure the semantic validity of LLM-generated software artifacts.
The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
The paper introduces CHECKMATE, a novel framework that uses code evolution to automatically generate and optimize algorithms for complex combinatorial problems, outperforming state-of-the-art solvers.
Han Dai, Soumyakant Priyadarshan, Abdullah Imran, Ruoyu Wang +1 more
SCRIBE is a novel framework that enables reliable source-level patching of binaries by performing 'binary-aware' recompilation, successfully resolving syntactic and semantic inaccuracies inherent in d…
Chang Liu, Noah Fleischmann, Nicolò Altamura, Edward Raff +2 more
The paper introduces ASSEMBLAGE-DEEPHISTORY, a novel, comprehensive binary dataset that unifies cross-compiler builds, historical versions, and vulnerability labels into a single, queryable structure.
Jiasheng Zheng, Boxi Cao, Boxi Yu, Yuzhong Zhang +5 more
The paper introduces Atomic Decomposition and Recombination (ADR), a novel framework that generates genuinely novel and challenging verifiable code tasks, significantly improving the scalability of Re…