Papers similar to 2604.26235v1

~ similar to 2604.26235v1· 20 results

cs.SEcs.AIcs.CLRecentMay 29, 2026

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more

The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…

View →

cs.AREmpiricalRecentJul 3, 2026

ArchEval: Measuring AI Agents as Computer Architects

Chenyu Wang, Zishen Wan, Jeffrey Ma, Shvetank Prakash +7 more

This paper introduces ArchEval, a benchmark and platform for evaluating LLM agents on computer architecture design and optimization.

View →

cs.LGcs.NEq-fin.STRecentJun 3, 2026

Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning

Damian Lebiedź, Robert Ślepaczuk

The paper develops and validates a novel Deep Reinforcement Learning (DRL) framework to enhance pair trading in volatile cryptocurrency markets, demonstrating statistically significant outperformance…

View →

cs.AIRecentMay 27, 2026

A Unified Framework for the Evaluation of LLM Agentic Capabilities

Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo +7 more

The paper introduces a unified framework to fairly evaluate LLM agentic capabilities by standardizing diverse benchmarks and separating the effects of the LLM model from the surrounding framework and…

View →

cs.AIcs.CLRecentMay 28, 2026

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Lorenz Kutschka, Bernhard Geiger

This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…

View →

cs.CRRecentMay 9, 2026

Toward Web 4.0: Bidirectional Trust between AI Agents and Blockchain

Yunfeng Xia, Chao Li, Lei Li, Chenhao Zhang +3 more

The paper systematizes the interaction between autonomous AI agents and blockchain platforms using a bidirectional trust framework, identifying significant gaps in current standards and proposing a ta…

View →

cs.AIcs.CRRecentMay 27, 2026

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

Jay Yu, Amy Zhao, Danning Sui

The paper analyzes the nascent DeFi investment agent market, finding that while token valuations are high, current deployments are heterogeneous, lack clear autonomous execution, and exhibit poor risk…

View →

cs.AIcs.CRRecentMay 27, 2026

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

Jay Yu, Amy Zhao, Danning Sui

The paper empirically analyzes the nascent DeFi investment agent market, finding that while token valuations are high, current deployments lack robust autonomous execution and exhibit poor risk-adjust…

View →

cs.LGcs.CRRecentMay 12, 2026

CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

Dongjun Lee, Ga-eun Bae, Insu Yun

The paper introduces CTFusion, a novel streaming evaluation framework built on Live CTFs, to provide a robust and reliable benchmark for assessing LLM agents in cybersecurity tasks.

View →

cs.MAcs.CRRecentApr 21, 2026

ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies

Shaoyu Li, Chaoyu Zhang, Hexuan Yu, Y. Thomas Hou +1 more

The paper introduces ClawCoin, a novel tokenized, compute-cost-indexed unit of account designed to solve the problem of non-transferable compute costs in decentralized AI agent economies.

View →

q-fin.GNcs.CYcs.LGRecentJun 1, 2026

Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation

Wenbin Wu

The paper demonstrates that large language models (LLMs) exhibit measurable, controllable biases toward specific assets like Bitcoin, identifying an internal feature that can causally shift portfolio…

View →

cs.MAcs.AIcs.CRRecentMar 26, 2026

From Logic Monopoly to Social Contract: Separation of Power and the Institutional Foundations for Autonomous Agent Economies

Anbang Ruan

The paper proposes replacing individual agent autonomy with a structured 'social contract' and institutional Separation of Power (SoP) to mitigate systemic failures and deceptive behavior in multi-age…

View →

cs.SEEmpiricalRecentJul 8, 2026

Rethinking Code Performance Benchmarks for LLMs

Nhat Minh Le, Yisen Xu, Zhijie Wang, Tse-Hsun +1 more

This paper evaluates the performance of large language models on popular benchmarks and finds that only a small percentage of the performant implementations are significantly faster than canonical sol…

View →

cs.AIRecentJun 1, 2026

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

Shannon Serrao, Soumitra Chatterjee, Dorina Strori, Abhishek Sharma +1 more

BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.

View →

cs.CRcs.AIcs.SERecentMay 12, 2026

Options, Not Clicks: Lattice Refinement for Consent-Driven MCP Authorization

Ying Li, Yanju Chen, Peiran Wang, Issac Khabra +3 more

The paper introduces Conleash, a client-side middleware that uses a risk lattice to enforce granular, boundary-scoped authorization for tool invocations, significantly improving user consent and secur…

View →

cs.MAcs.AIcs.CYEmpiricalRecentJul 2, 2026

SovereignNegotiation-Bench: Evaluating User-Owned Personal Agents In Delegated Bargaining Under Privacy, Consent, Evidence, And Institutional Pressure

Dylan Zongmin Liu

This paper introduces SovereignNegotiation-Bench, a benchmark for evaluating user-owned personal negotiation agents based on agreement success, user utility, privacy, consent, evidence grounding, conc…

View →

cs.AIcs.CLcs.CRRecentApr 18, 2026

The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

Syed Muhammad Aqdas Rizvi

The paper demonstrates that for edge-native SLMs used in decentralized governance, simpler, intuitive reasoning (System 1) is significantly more robust and efficient than complex, iterative deliberati…

View →