Papers similar to 2605.30680

~ similar to 2605.30680 · 20 results

cs.AI · Recent · May 28, 2026

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Kai-Chen Cheng, Haejun Han, David Q. Sun

The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…

cs.SE cs.AI cs.LG · Recent · May 29, 2026

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

Nazmus Ashrafi

The study found that while multi-agent LLM code generation architectures significantly affect code complexity, the added complexity does not translate into better functional correctness, suggesting ar…

cs.GT cs.CR cs.LG · Recent · May 8, 2026

Differentially Private Auditing Under Strategic Response

Florian A. D. Burnat

This paper analyzes differential privacy auditing as a bilevel game, showing that naive audit designs fail to detect true harm when developers strategically respond, and proposes an optimal, single-le…

cs.MA cs.AI cs.LG · Recent · May 28, 2026

Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

Víctor Gallego

The paper introduces an outer-loop AI agent that autonomously redesigns LLM policy-synthesis pipelines for multi-agent social dilemmas, demonstrating that the optimal pipeline structure depends critic…

cs.MA cs.AI cs.GT · Recent · May 28, 2026

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Francisco León Zúñiga Bolívar

The study extends cooperative bias testing across diverse, next-generation LLMs, finding that provider identity is a stronger predictor of cooperative equilibrium than model generation, and that noise…

cs.AI cs.LG cs.SE · Recent · May 27, 2026

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt +2 more

The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…

cs.CR cs.AI cs.SE · Recent · May 5, 2026

MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents

Jonathan Steinberg, Oren Gal

The paper introduces MOSAIC-Bench, a benchmark demonstrating that coding agents can ship exploitable code by complying with seemingly innocuous, staged tasks, a vulnerability that is not easily mitiga…

cs.AI · Recent · Jun 1, 2026

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more

The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…

cs.CL cs.AI cs.LG · Recent · May 28, 2026

Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits

Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang +3 more

The paper proposes BaSE, a multi-armed bandit approach, to optimally allocate a fixed budget of LLM calls across parallel evolutionary search trajectories, significantly improving mean fitness and rel…

cs.GT cs.AI cs.MA · Recent · May 29, 2026

Social welfare optimisation under institutional reward and punishment

Van An Nguyen, Vuong Khang Huynh, Huu Loi Bui, Hai Anh Ha +7 more

This paper introduces a welfare-centric framework for designing institutional incentives, showing that optimizing for total social welfare often requires different incentive levels than those optimize…

cs.MA cs.AI · Recent · May 29, 2026

Safe Equilibrium Policy Optimization for Strategic Agent Policies

Karthika Arumugam, Kiran Kumar Manku, Amit Dhanda

The paper introduces Safe Equilibrium Policy Optimization (SEPO) to train language models for multi-agent strategic tasks, achieving improved safety and robustness across various game domains.

cs.CR cs.AI · Recent · Mar 18, 2026

Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare

Saikat Maiti

The paper proposes and validates a comprehensive four-layer Zero Trust security architecture designed to mitigate critical vulnerabilities in autonomous AI agents handling Protected Health Information…

cs.AI cs.CL cs.ET · Recent · Jun 1, 2026

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more

The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.

cs.SE cs.AI cs.CL · Recent · May 18, 2026

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

Yubin Qu, Ying Zhang, Yanjun Zhang, Gelei Deng +3 more

The paper introduces OverEager-Gen, a new benchmark that measures 'overeager actions'—where coding agents perform unauthorized tasks beyond a benign request—and finds that removing explicit consent de…

cs.AI · Recent · May 30, 2026

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more

The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…

cs.CR cs.AI cs.CL · Recent · May 27, 2026

SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang +3 more

The paper introduces SNARE, a novel adaptive benchmarking pipeline that systematically measures overeager behavior in coding agents, finding that the agent framework accounts for the majority of the v…

cs.AI · Recent · May 31, 2026

"Skill issues": data-centric optimization of lakehouse agents

Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue

The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…

q-fin.RM cs.AI cs.CR · Recent · May 6, 2026

The Insurability Frontier of AI Risk: Mapping Threats to Affirmative Coverage, Silent Exposures, and Exclusions

Alex Leung, Rex Zhang, Ervin Ling, Kentaroh Toyoda +1 more

This paper maps the emerging insurability frontier of AI risk by coding 55 AI threat classes against 26 insurance products, identifying four tiers of coverage: affirmative, silent, excluded, and outsi…

cs.CR · Recent · Mar 20, 2026

Constraint Migration: A Formal Theory of Throughput in AI Cybersecurity Pipelines

Surasak Phetmanee

The paper develops a formal theory to analyze how throughput changes in AI-enhanced cybersecurity pipelines when stage capacities are perturbed by multipliers.
