ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2606.02384· 20 results

cs.LGcs.AIstat.MERecentMay 28, 2026

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

Shu Wan, Abhinav Gorantla, Huan Liu, K. Selçuk Candan

While restricting a model to the theoretical Markov boundary can significantly improve prediction, the practical process of discovering and using this boundary is often computationally infeasible and…

View →
cs.LGcs.AIRecentMay 30, 2026

TabChange: Precise Attribute Changes in Tabular Data

Arjun Dahal, Yu Lei, Raghu N. Kacker, Richard Kuhn

TabChange proposes a novel framework to generate natural and minimally altered counterfactual instances in tabular data by precisely controlling attribute modifications based on their relationship str…

View →
cs.SEcs.AIcs.CLRecentMay 31, 2026

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao +9 more

BenchEvolver introduces a solution-centric evolutionary framework to automatically transform saturated coding benchmarks into significantly harder, high-quality, and diverse evaluation suites.

View →
cs.LGRecentJun 3, 2026

BBOmix: A Tabular Benchmark for Hyperparameter Optimization of Unsupervised Biological Representation Learning

Luca Thale-Bombien, Jan Ewald, Ralf König, Aaron Klein

This paper introduces BBOmix, an open-source benchmark for unsupervised representation learning on real-world biological data.

View →
cs.DBcs.AIRecentMay 29, 2026

SpecDB: LLM-Generated Customized Databases via Feature-Oriented Decomposition

Yunkai Lou, Longbin Lai, Shunyang Li, Zhengping Qian +1 more

SpecDB is a novel system that uses LLMs to synthesize highly customized, purpose-built relational databases, achieving performance comparable to commercial systems while significantly reducing code si…

View →
cs.AIRecentMay 31, 2026

SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback

Leo Luo, Haining Xie, Siqi Shen, Zhipeng Ma +7 more

SIRIUS-SQL introduces a robust multi-candidate text-to-SQL system that addresses weaknesses in candidate generation, error handling, and selection, achieving state-of-the-art performance on complex be…

View →
cs.AIRecentMay 27, 2026

MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

Xiaoyu Dong, Zhi Li, Xiao-Ming Wu

The paper introduces MUSE, a comprehensive benchmark that evaluates Text-to-CAD generation by assessing complex assemblies based on functionality, manufacturability, and assemblability, moving beyond…

View →
cs.SEcs.CRRecentMay 25, 2026

FuzzPilot: Plateau-Triggered Recipe Validation for Structured Text Fuzzing

Zhiyi Yao

FuzzPilot is a controller for AFL++ that validates candidate mutation recipes by running short micro-campaigns, demonstrating a mechanism to manage fuzzing plateaus, though initial results on a satura…

View →
cs.AIRecentMay 27, 2026

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more

The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…

View →
cs.CLRecentJun 1, 2026

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more

The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…

View →
cs.AIcs.SERecentMay 28, 2026

ParaTool: Shifting Tool Representations from Context to Parameters

Zekai Yu, Qi Meng, Qizhi Chu, Yu Hao +2 more

ParaTool introduces a novel framework that shifts tool representations from bulky context documentation to dedicated, loadable parameters, enabling efficient and robust tool calling in LLMs.

View →
cs.AIRecentMay 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…

View →
cs.CLRecentMay 31, 2026

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta

This study benchmarks four local LLMs for natural-language-to-SQL querying in biopharma manufacturing, finding that general-purpose code-tuned models like Llama 3.1 8B and Qwen 2.5 Coder 7B outperform…

View →
cs.CRcs.LGcs.SERecentApr 30, 2026

REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)

Jun Yeon Won, Xin Jin, Shiqing Ma, Zhiqiang Lin

The paper introduces REBench, a comprehensive, standardized benchmark dataset designed to enable fair and rigorous evaluation of Large Language Models (LLMs) on complex binary reverse engineering task…

View →
cs.LGcs.CRRecentMay 4, 2026

Evaluating Tabular Representation Learning for Network Intrusion Detection

Muhammad Usman Butt, Andreas Hotho, Daniel Schlör

The paper systematically evaluates various tabular representation learning techniques to automatically extract robust features from NetFlow data for network intrusion detection, finding that supervise…

View →
cs.AIRecentMay 28, 2026

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields

Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more

The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…

View →
cs.AIcs.LGcs.SERecentMay 27, 2026

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt +2 more

The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…

View →
cs.AIRecentMay 28, 2026

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang +5 more

The paper introduces DOMINO, a novel inductive framework that synthesizes domain-specific data for LLMs using only reference examples, significantly improving performance on challenging, implicitly de…

View →
cs.SEcs.AIRecentMay 31, 2026

FVSpec: Real-World Property-Based Tests as Lean Challenges

Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…

View →
cs.SEcs.AIRecentMay 28, 2026

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun +2 more

The paper introduces Code-QA-Bench, a novel framework that rigorously separates genuine code reasoning from mere documentation memorization in repository-level code understanding benchmarks.

View →