ArXivCSExplorer
Built by Teycir Ben Soltane

Similar to 2604.26235v1 · 20 results

cs.SE, cs.AI, cs.CL · Recent · May 29, 2026

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more

The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…

cs.LG, cs.NE, q-fin.ST · Recent · Jun 3, 2026

Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning

Damian Lebiedź, Robert Ślepaczuk

The paper develops and validates a novel Deep Reinforcement Learning (DRL) framework to enhance pair trading in volatile cryptocurrency markets, demonstrating statistically significant outperformance…

cs.AI · Recent · May 27, 2026

A Unified Framework for the Evaluation of LLM Agentic Capabilities

Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo +7 more

The paper introduces a unified framework to fairly evaluate LLM agentic capabilities by standardizing diverse benchmarks and separating the effects of the LLM from the surrounding framework and…

cs.AI, cs.CL · Recent · May 28, 2026

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Lorenz Kutschka, Bernhard Geiger

This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…

cs.CR · Recent · May 9, 2026

Toward Web 4.0: Bidirectional Trust between AI Agents and Blockchain

Yunfeng Xia, Chao Li, Lei Li, Chenhao Zhang +3 more

The paper systematizes the interaction between autonomous AI agents and blockchain platforms using a bidirectional trust framework, identifying significant gaps in current standards and proposing a ta…

cs.AI, cs.CR · Recent · May 27, 2026

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

Jay Yu, Amy Zhao, Danning Sui

The paper analyzes the nascent DeFi investment agent market, finding that while token valuations are high, current deployments are heterogeneous, lack clear autonomous execution, and exhibit poor risk…

cs.AI, cs.CR · Recent · May 27, 2026

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

Jay Yu, Amy Zhao, Danning Sui

The paper empirically analyzes the nascent DeFi investment agent market, finding that while token valuations are high, current deployments lack robust autonomous execution and exhibit poor risk-adjust…

cs.LG, cs.CR · Recent · May 12, 2026

CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

Dongjun Lee, Ga-eun Bae, Insu Yun

The paper introduces CTFusion, a novel streaming evaluation framework built on Live CTFs, to provide a robust and reliable benchmark for assessing LLM agents in cybersecurity tasks.

cs.MA, cs.CR · Recent · Apr 21, 2026

ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies

Shaoyu Li, Chaoyu Zhang, Hexuan Yu, Y. Thomas Hou +1 more

The paper introduces ClawCoin, a novel tokenized, compute-cost-indexed unit of account designed to solve the problem of non-transferable compute costs in decentralized AI agent economies.

q-fin.GN, cs.CY, cs.LG · Recent · Jun 1, 2026

Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation

Wenbin Wu

The paper demonstrates that large language models (LLMs) exhibit measurable, controllable biases toward specific assets like Bitcoin, identifying an internal feature that can causally shift portfolio…

cs.MA, cs.AI, cs.CR · Recent · Mar 26, 2026

From Logic Monopoly to Social Contract: Separation of Power and the Institutional Foundations for Autonomous Agent Economies

Anbang Ruan

The paper proposes replacing individual agent autonomy with a structured 'social contract' and institutional Separation of Power (SoP) to mitigate systemic failures and deceptive behavior in multi-age…

cs.AI · Recent · Jun 1, 2026

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

Shannon Serrao, Soumitra Chatterjee, Dorina Strori, Abhishek Sharma +1 more

BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.

cs.CR, cs.AI, cs.SE · Recent · May 12, 2026

Options, Not Clicks: Lattice Refinement for Consent-Driven MCP Authorization

Ying Li, Yanju Chen, Peiran Wang, Issac Khabra +3 more

The paper introduces Conleash, a client-side middleware that uses a risk lattice to enforce granular, boundary-scoped authorization for tool invocations, significantly improving user consent and secur…

cs.AI, cs.CL, cs.CR · Recent · Apr 18, 2026

The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

Syed Muhammad Aqdas Rizvi

The paper demonstrates that for edge-native SLMs used in decentralized governance, simpler, intuitive reasoning (System 1) is significantly more robust and efficient than complex, iterative deliberati…

cs.CR, cs.AI, cs.CL · Recent · Apr 18, 2026

Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks

Tyler H. Merves, Michael H. Conaway, Joseph M. Escobar, Hakan T. Otal +1 more

This study provides a comprehensive benchmark of 10 frontier LLMs on 200 offensive cybersecurity tasks, finding that environment tooling and model selection are the primary performance drivers, with C…

cs.DC, cs.CR, cs.CY · Recent · May 6, 2026

Toward a Risk Assessment Framework for Institutional DeFi: A Nine-Dimension Approach

Eva Oberholzer, Valeriy Zamaraiev

The paper proposes a novel nine-dimension risk assessment framework for institutional DeFi adoption, significantly enhancing existing methodologies by incorporating novel dimensions like composability…

cs.SE, cs.AI · Recent · May 31, 2026

FVSpec: Real-World Property-Based Tests as Lean Challenges

Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…

cs.CR, cs.AI · Recent · Apr 28, 2026

From CRUD to Autonomous Agents: Formal Validation and Zero-Trust Security for Semantic Gateways in AI-Native Enterprise Systems

Ignacio Peyrano

The paper proposes a Semantic Gateway and a Zero-Trust security model to formally validate and secure autonomous AI agents operating in enterprise systems, achieving a 100% discovery rate of unauthori…

cs.AI, cs.CR · Recent · Mar 26, 2026

On the Foundations of Trustworthy Artificial Intelligence

TJ Dunham

The paper proves that platform-deterministic inference is a necessary and sufficient condition for trustworthy AI, establishing that AI trust fundamentally relies on consistent arithmetic.

cs.CR, cs.AI · Recent · May 7, 2026

From Specification to Deployment: Empirical Evidence from a W3C VC + DID Trust Infrastructure for Autonomous Agents

Lars Kersten Kroehl

The paper introduces MolTrust, a production-deployed trust infrastructure built on W3C standards (VCs and DIDs) that provides a verifiable, multi-layered authorization framework for autonomous AI agen…
