~ similar to 2605.29955· 20 results
The paper introduces COVCAL, a risk-controlled method that precisely determines when a partial formalization signal from an autoformalizer can be trusted to certify the correctness of natural-language…
The paper introduces MAVEN, a lightweight symbolic reasoning scaffold that significantly improves the generalization and end-to-end success rate of large language models in complex, multi-step tool-ca…
The paper introduces new benchmarks for complex asynchronous planning and demonstrates that general constraint satisfaction formalizers (like CP-SAT) significantly outperform direct LLM planning or tr…
MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…
Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang +5 more
The paper introduces Crafter, a multi-agent harness that significantly improves the generation of editable, publication-quality scientific figures from diverse inputs, addressing the limitations of ex…
The paper introduces Iteris, an agentic research system, demonstrating its capability to generate numerical evidence, constructions, and proof drafts for open problems in computational mathematics, re…
The paper introduces FormInv, a measurement protocol that reveals significant semantic inconsistencies in existing mathematical reasoning benchmarks, showing that standard accuracy metrics fail to cap…
The paper introduces COMPOSE, a dual-graph framework that generates plausible future mathematical theorems by simultaneously conditioning a language model on both the scientific citation context and t…
AutoVerifier is an LLM-based agentic framework that automates the end-to-end verification of complex technical claims, enabling non-experts to generate evidence-backed intelligence assessments.
The paper introduces Croissant Tasks, a declarative metadata format designed to achieve conceptual reproducibility in machine learning by abstracting problem specifications from brittle implementation…
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
The paper proposes a federated formal verification architecture that treats verification as a polyglot proof system, successfully validating it on complex production subsystems like a Raft consensus m…
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
The paper introduces I-WebGenBench, a framework and benchmark that converts static scientific papers into executable, interactive web systems, allowing users to dynamically explore the paper's mechani…
This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…
The paper introduces MUSE, a comprehensive benchmark that evaluates Text-to-CAD generation by assessing complex assemblies based on functionality, manufacturability, and assemblability, moving beyond…
Tong Liu, Cheng Qian, Matej Cief, Yuan He +3 more
This paper analyzes tool-calling in LLM agents, demonstrating that evaluation results are highly sensitive to implementation details and proposing new techniques to significantly improve the efficienc…
Chenyu Zhou, Xinyun Lu, Jiangyue Zhao, Jianghao Lin +2 more
The paper introduces OR-Space, a novel full-lifecycle workspace benchmark designed to rigorously evaluate industrial optimization agents by simulating real-world, multi-stage OR workflows that go beyo…
AXIOM is a trust-first neuro-symbolic execution architecture that ensures verifiable mathematical reasoning by strictly separating language model interpretation from deterministic computation, achievi…
The paper introduces Hyperparam, a set of lightweight JavaScript libraries designed to enable direct, model-aware querying of unstructured data (like agent traces) within client-side AI applications.