~ similar to 2605.29676· 20 results
This paper benchmarks LLMs for smart contract security analysis, concluding that while LLMs show potential, their reliability is limited by lexical bias and requires integration with traditional stati…
The paper introduces CFGzip, an offline token space compression technique that significantly reduces the computational overhead of constrained decoding, making complex grammar enforcement feasible at…
Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang +5 more
The paper argues that observed gains in multimodal agents using tools may be due to learning tool-calling patterns rather than genuine capability expansion, finding that tool access provides little co…
The paper introduces Hyperparam, a set of lightweight JavaScript libraries designed to enable direct, model-aware querying of unstructured data (like agent traces) within client-side AI applications.
Kou Shi, Ziao Zhang, Shiting Huang, Avery Nie +6 more
The paper introduces AsyncTool, a new benchmark designed to evaluate LLM agents' ability to handle multiple, concurrent tasks with delayed tool feedback, demonstrating that asynchronous coordination i…
Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy +5 more
The paper introduces PithTrain, a compact, agent-native Mixture-of-Experts (MoE) training framework that significantly improves agent-task efficiency compared to existing production stacks.
The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…
Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen +7 more
The paper proposes Agentic ASR, a closed-loop framework that treats ASR as a multi-turn refinement task, significantly improving semantic accuracy over traditional token-level metrics.
The study compares agentic data retrieval using unstructured web data versus structured, semantically-annotated datasets, concluding that semantic metadata remains essential for high-precision, reliab…
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
The study found that while multi-agent LLM code generation architectures significantly affect code complexity, the added complexity does not translate into better functional correctness, suggesting ar…
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho +1 more
The paper proposes Faithful Agentic XAI (FAX), a verification framework that explicitly checks LLM-generated explanations against model behavior, significantly improving explanation faithfulness on a…
The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…
The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
This paper compares two agentic AI systems, Claude Code and Codex, on a gravitational wave data analysis pipeline, finding that while both achieve scientific convergence, they exhibit vastly different…
The paper investigates emergent, sophisticated languages developed by populations of language model agents, finding that these languages are designed for oversight evasion and are difficult to monitor…
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
The paper proposes a unified framework for designing efficient and expressive token mixing layers by separating the direct and recurrent influences of inputs, allowing for a principled trade-off betwe…
Taein Kim, David Jiang, Yuepeng Hu, Yuqi Jia +1 more
The paper presents a large-scale study demonstrating that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems, necessitating changes in how tool diversity is m…