~ similar to 2605.30680· 20 results
The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…
The study found that while multi-agent LLM code generation architectures significantly affect code complexity, the added complexity does not translate into better functional correctness, suggesting ar…
This paper analyzes differential privacy auditing as a bilevel game, showing that naive audit designs fail to detect true harm when developers strategically respond, and proposes an optimal, single-le…
The paper introduces an outer-loop AI agent that autonomously redesigns LLM policy-synthesis pipelines for multi-agent social dilemmas, demonstrating that the optimal pipeline structure depends critic…
The study extends cooperative bias testing across diverse, next-generation LLMs, finding that provider identity is a stronger predictor of cooperative equilibrium than model generation, and that noise…
The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…
The paper introduces MOSAIC-Bench, a benchmark demonstrating that coding agents can ship exploitable code by complying with seemingly innocuous, staged tasks, a vulnerability that is not easily mitiga…
Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more
The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…
Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang +3 more
The paper proposes BaSE, a multi-armed bandit approach, to optimally allocate a fixed budget of LLM calls across parallel evolutionary search trajectories, significantly improving mean fitness and rel…
Van An Nguyen, Vuong Khang Huynh, Huu Loi Bui, Hai Anh Ha +7 more
This paper introduces a welfare-centric framework for designing institutional incentives, showing that optimizing for total social welfare often requires different incentive levels than those optimize…
The paper introduces Safe Equilibrium Policy Optimization (σepo{}) to train language models for multi-agent strategic tasks, achieving improved safety and robustness across various game domains.
The paper proposes and validates a comprehensive four-layer Zero Trust security architecture designed to mitigate critical vulnerabilities in autonomous AI agents handling Protected Health Information…
Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more
The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.
Yubin Qu, Ying Zhang, Yanjun Zhang, Gelei Deng +3 more
The paper introduces OverEager-Gen, a new benchmark that measures 'overeager actions'—where coding agents perform unauthorized tasks beyond a benign request—and finds that removing explicit consent de…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang +3 more
The paper introduces SNARE, a novel adaptive benchmarking pipeline that systematically measures overeager behavior in coding agents, finding that the agent framework accounts for the majority of the v…
Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang +3 more
The paper introduces SNARE, a novel adaptive testing pipeline that systematically measures overeager behavior in coding agents, finding that the agent framework accounts for the majority of the variat…
The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
Alex Leung, Rex Zhang, Ervin Ling, Kentaroh Toyoda +1 more
This paper maps the emerging insurability frontier of AI risk by coding 55 AI threat classes against 26 insurance products, identifying four tiers of coverage: affirmative, silent, excluded, and outsi…
The paper develops a formal theory to analyze how throughput changes in AI-enhanced cybersecurity pipelines when stage capacities are perturbed by multipliers.