~ similar to 2606.02384· 20 results
While restricting a model to the theoretical Markov boundary can significantly improve prediction, the practical process of discovering and using this boundary is often computationally infeasible and…
TabChange proposes a novel framework to generate natural and minimally altered counterfactual instances in tabular data by precisely controlling attribute modifications based on their relationship str…
Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao +9 more
BenchEvolver introduces a solution-centric evolutionary framework to automatically transform saturated coding benchmarks into significantly harder, high-quality, and diverse evaluation suites.
This paper introduces BBOmix, an open-source benchmark for unsupervised representation learning on real-world biological data.
Yunkai Lou, Longbin Lai, Shunyang Li, Zhengping Qian +1 more
SpecDB is a novel system that uses LLMs to synthesize highly customized, purpose-built relational databases, achieving performance comparable to commercial systems while significantly reducing code si…
Leo Luo, Haining Xie, Siqi Shen, Zhipeng Ma +7 more
SIRIUS-SQL introduces a robust multi-candidate text-to-SQL system that addresses weaknesses in candidate generation, error handling, and selection, achieving state-of-the-art performance on complex be…
The paper introduces MUSE, a comprehensive benchmark that evaluates Text-to-CAD generation by assessing complex assemblies based on functionality, manufacturability, and assemblability, moving beyond…
FuzzPilot is a controller for AFL++ that validates candidate mutation recipes by running short micro-campaigns, demonstrating a mechanism to manage fuzzing plateaus, though initial results on a satura…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
ParaTool introduces a novel framework that shifts tool representations from bulky context documentation to dedicated, loadable parameters, enabling efficient and robust tool calling in LLMs.
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
This study benchmarks four local LLMs for natural-language-to-SQL querying in biopharma manufacturing, finding that general-purpose code-tuned models like Llama 3.1 8B and Qwen 2.5 Coder 7B outperform…
The paper introduces REBench, a comprehensive, standardized benchmark dataset designed to enable fair and rigorous evaluation of Large Language Models (LLMs) on complex binary reverse engineering task…
The paper systematically evaluates various tabular representation learning techniques to automatically extract robust features from NetFlow data for network intrusion detection, finding that supervise…
Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more
The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…
The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…
Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang +5 more
The paper introduces DOMINO, a novel inductive framework that synthesizes domain-specific data for LLMs using only reference examples, significantly improving performance on challenging, implicitly de…
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun +2 more
The paper introduces Code-QA-Bench, a novel framework that rigorously separates genuine code reasoning from mere documentation memorization in repository-level code understanding benchmarks.