Similar to 2604.26235v1 · 20 results
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
The paper develops and validates a novel Deep Reinforcement Learning (DRL) framework to enhance pair trading in volatile cryptocurrency markets, demonstrating statistically significant outperformance…
Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo +7 more
The paper introduces a unified framework to fairly evaluate LLM agentic capabilities by standardizing diverse benchmarks and separating the effects of the LLM from the surrounding framework and…
This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…
Yunfeng Xia, Chao Li, Lei Li, Chenhao Zhang +3 more
The paper systematizes the interaction between autonomous AI agents and blockchain platforms using a bidirectional trust framework, identifying significant gaps in current standards and proposing a ta…
The paper analyzes the nascent DeFi investment agent market, finding that while token valuations are high, current deployments are heterogeneous, lack clear autonomous execution, and exhibit poor risk…
The paper empirically analyzes the nascent DeFi investment agent market, finding that while token valuations are high, current deployments lack robust autonomous execution and exhibit poor risk-adjust…
The paper introduces CTFusion, a novel streaming evaluation framework built on Live CTFs, to provide a robust and reliable benchmark for assessing LLM agents in cybersecurity tasks.
Shaoyu Li, Chaoyu Zhang, Hexuan Yu, Y. Thomas Hou +1 more
The paper introduces ClawCoin, a novel tokenized, compute-cost-indexed unit of account designed to solve the problem of non-transferable compute costs in decentralized AI agent economies.
The paper demonstrates that large language models (LLMs) exhibit measurable, controllable biases toward specific assets like Bitcoin, identifying an internal feature that can causally shift portfolio…
The paper proposes replacing individual agent autonomy with a structured 'social contract' and institutional Separation of Power (SoP) to mitigate systemic failures and deceptive behavior in multi-age…
BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.
Ying Li, Yanju Chen, Peiran Wang, Issac Khabra +3 more
The paper introduces Conleash, a client-side middleware that uses a risk lattice to enforce granular, boundary-scoped authorization for tool invocations, significantly improving user consent and secur…
The paper demonstrates that for edge-native SLMs used in decentralized governance, simpler, intuitive reasoning (System 1) is significantly more robust and efficient than complex, iterative deliberati…
This study provides a comprehensive benchmark of 10 frontier LLMs on 200 offensive cybersecurity tasks, finding that environment tooling and model selection are the primary performance drivers, with C…
The paper proposes a nine-dimension risk assessment framework for institutional DeFi adoption, significantly enhancing existing methodologies by incorporating novel dimensions like composability…
The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…
The paper proposes a Semantic Gateway and a Zero-Trust security model to formally validate and secure autonomous AI agents operating in enterprise systems, achieving a 100% discovery rate of unauthori…
The paper proves that platform-deterministic inference is a necessary and sufficient condition for trustworthy AI, establishing that AI trust fundamentally relies on consistent arithmetic.
The paper introduces MolTrust, a production-deployed trust infrastructure built on W3C standards (VCs and DIDs) that provides a verifiable, multi-layered authorization framework for autonomous AI agen…