cs.IR · Empirical

CORE-Bench: A Comprehensive Benchmark for Code Retrieval in the Era of Agentic Coding

Fuwei Zhang, Yanzhao Zhang, Mingxin Li, Dingkun Long, Lexiang Hu, Pengjun Xie, Zhao Zhang, Fuzhen Zhuang

Jun 10, 2026

AI Summary (llama-3.1-8b-instruct)

This paper introduces CORE-Bench, a benchmark for code retrieval in agentic coding. It covers code understanding, issue-to-edit localization, and broader context retrieval, going beyond the docstring-to-function and snippet-level matching that existing benchmarks mainly evaluate.

Keywords

code retrieval, agentic coding, benchmark, code understanding, issue-to-edit localization, broader context retrieval

Before reading this…

Code retrieval, Agentic coding, Benchmarking

Applications

→Code retrieval in agentic coding
→Code understanding
→Issue-to-edit localization
→Broader context retrieval

Abstract

Code retrieval is becoming central to coding agents, but agentic coding requires more than matching a natural-language query to an isolated snippet. Given a user request, a coding agent needs to navigate a concrete repository state, locate relevant files and functions, gather supporting context, and filter similar in-repository distractors. Existing code retrieval benchmarks mainly evaluate docstring-to-function or snippet-level matching, thereby missing this requirement-driven repository search problem. To address this gap, we introduce CORE-Bench, a comprehensive benchmark for code retrieval in the era of agentic coding. CORE-Bench evaluates code retrieval ability at three levels: code understanding, issue-to-edit localization, and broader context retrieval. Built from curated code-search tasks and SWE-bench-series instances, CORE-Bench contains over 180K queries and 106K broader-context relevance labels. Experiments with representative embedding models show a sharp drop from traditional code search to code retrieval in agentic coding settings. Simple supervised fine-tuning of existing embedding models significantly improves performance in this setting, suggesting substantial room for further progress.
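
To make the repository-level setting concrete, the following is a minimal sketch of how an embedding-based retriever might be scored when a query has several relevant targets and must be separated from similar in-repository distractors. It is an illustration under stated assumptions, not CORE-Bench's evaluation code: the metric (recall@k), the helper names, the file and function identifiers, and the hand-written three-dimensional vectors are all hypothetical stand-ins for real model embeddings and benchmark labels.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query_vec, candidates):
    """Return candidate ids sorted by descending similarity to the query."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates.items()]
    # Sort by score (high to low), breaking ties by id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cid for _, cid in scored]


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


# Hypothetical repository snapshot: toy vectors, not real model output.
candidates = {
    "auth/login.py::check_password": [0.9, 0.1, 0.0],  # edit target
    "auth/login.py::check_token": [0.8, 0.3, 0.1],     # similar distractor
    "auth/session.py::refresh": [0.7, 0.6, 0.1],       # supporting context
    "utils/strings.py::slugify": [0.0, 0.2, 0.9],      # unrelated code
}
query_vec = [1.0, 0.0, 0.0]  # stands in for an embedded issue description
relevant = {"auth/login.py::check_password", "auth/session.py::refresh"}

ranked = rank_candidates(query_vec, candidates)
for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(ranked, relevant, k):.2f}")
```

In this toy case the distractor outranks the supporting-context function, so recall only reaches 1.0 at k = 3. That is the kind of failure that snippet-level matching does not expose, because each query there has a single isolated target.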