~ similar to 2606.11864· 20 results
Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun +2 more
The paper introduces Code-QA-Bench, a novel framework that rigorously separates genuine code reasoning from mere documentation memorization in repository-level code understanding benchmarks.
The paper proposes and evaluates a novel embedding model for bidirectional function association between source code and decompiled/stripped code, significantly outperforming existing models.
The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…
Code2LoRA introduces a hypernetwork framework to efficiently inject repository-specific knowledge into code language models using LoRA adapters, supporting both static and evolving codebases.
Ningzhi Tang, Chaoran Chen, Gelei Xu, Yiyu Shi +4 more
This study analyzes over 20,000 real-world coding sessions to show that AI coding agents frequently fail users through subtle misalignment, requiring constant manual correction even when major system…
CoHyDE introduces an iterative co-training framework that jointly optimizes an LLM rewriter and a dense encoder, significantly improving tool retrieval accuracy for LLM agents, especially on vague que…
The paper introduces a hybrid system, HYBRIDSOURCETRACKER (HST), that combines vector search and Winnowing fingerprinting to achieve scalable, high-precision provenance tracking for code generated by…
Zhen Chen, Yibing Liu, Weihao Xie, Yu Liang +2 more
The paper proposes formulating RAG design as an architecture search problem and introduces RAISE, a comprehensive framework and benchmark for systematically optimizing RAG hyperparameters.
This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…
The paper introduces Entity-Collision, a rigorous protocol that separates genuine retrieval lift from simple lexical overlap, demonstrating that embedder performance depends critically on the query ty…
Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao +9 more
BenchEvolver introduces a solution-centric evolutionary framework to automatically transform saturated coding benchmarks into significantly harder, high-quality, and diverse evaluation suites.
Yongjie Wang, Xinyue Zhang, Kunhong Yao, Zhiwei Zeng +3 more
The paper introduces the concept of Search-Time Contamination (STC), demonstrating that deep research agents can leak information from public benchmarks via web search, leading to an overestimation of…
The paper introduces TRAILS~, a novel method that improves code correctness validation by grounding LLM reasoning in concrete (input, output) pairs derived from specifications, achieving state-of-the-…
The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
The paper proposes CoDe-R, a two-stage framework that significantly improves the accuracy and re-executability of decompiled code generated by LLMs, achieving a new SOTA in the lightweight regime.
LongAttnComp introduces a novel, two-stage fine-tuning framework for context compression that significantly improves long-context reasoning performance, matching or exceeding full-context accuracy on…
Siyuan Qi, Xinyuan Wang, Yingxuan Yang, Haochuan Guo +4 more
DynaTree introduces a two-stage framework that pre-constructs a reusable retrieval tree offline using coordinated agents, allowing for efficient, structure-aware, and highly effective time-sensitive n…
The paper introduces VibeSearchBench, a new benchmark designed to evaluate long-horizon, proactive search capabilities, demonstrating that current state-of-the-art LLM agents are still significantly i…
Critic-R introduces a novel framework that uses a critic model to provide natural language introspective feedback, significantly improving the performance of agentic search systems by optimizing retri…
Taein Kim, David Jiang, Yuepeng Hu, Yuqi Jia +1 more
The paper presents a large-scale study demonstrating that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems, necessitating changes in how tool diversity is m…