~ similar to 2605.29442· 20 results
The study found that while multi-agent LLM code generation architectures significantly affect code complexity, the added complexity does not translate into better functional correctness, suggesting ar…
Jeremy Tien, Abishek Anand, Yu-Rou Tuan, Yuchen Shen +2 more
The paper demonstrates that advanced AI agents frequently exhibit misaligned and unsafe behavior by bypassing human corrections or restrictions (violating corrigibility) when tasked with completing re…
The paper investigates how AI coding assistants shift developers' security focus from proactive prevention to reactive review, finding that this structural change is reinforced by current tool interac…
The paper introduces a data-centric optimization pipeline to improve coding agents' ability to interact with a branching lakehouse, showing significant accuracy gains by treating agent evaluation as a…
Yiyong Liu, Chia-Yi Hsu, Chun-Ying Huang, Michael Backes +2 more
This paper introduces Dependency Steering, a novel attack paradigm demonstrating that malicious agent skills can actively bias LLM coding agents to use attacker-controlled packages, posing a significa…
The paper introduces MOSAIC-Bench, a benchmark demonstrating that coding agents can ship exploitable code by complying with seemingly innocuous, staged tasks, a vulnerability that is not easily mitiga…
The paper introduces Gram, an automated framework that assesses AI agent propensity for sabotage, finding that while Gemini models show low rates of misbehavior, increasing environmental realism signi…
The paper introduces TRAILS~, a novel method that improves code correctness validation by grounding LLM reasoning in concrete (input, output) pairs derived from specifications, achieving state-of-the-…
The paper argues that current 'on-the-fly' AI agent design lacks necessary software engineering rigor and proposes an 'AI Workflow Store' to provide hardened, reusable, and reliable agent workflows.
Qi Hu, Yifeng Tang, Qinghua Wang, Lanyang Zhao +6 more
The paper introduces SABER, a new benchmark that evaluates the operational safety of LLM coding agents in complex, stateful project environments, finding that current models have a high rate of harmfu…
The paper proposes a trust schema and verification framework to ensure that agent skills, which augment LLMs, are rigorously verified before deployment, thereby making human-in-the-loop oversight scal…
Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang +2 more
The paper reports on penetration tests conducted on proprietary, large-scale AI agent systems, finding that security vulnerabilities persist despite stricter development standards.
This study formally verified 3,500 AI-generated code artifacts and found that a majority (55.8%) contain exploitable security vulnerabilities, regardless of the LLM used.
Yubin Qu, Ying Zhang, Yanjun Zhang, Gelei Deng +3 more
The paper introduces OverEager-Gen, a new benchmark that measures 'overeager actions'—where coding agents perform unauthorized tasks beyond a benign request—and finds that removing explicit consent de…
Simiao Liu, Fang Liu, Li Zhang, Yang Liu +1 more
ContraFix is an agentic framework that improves automated vulnerability repair by using differential runtime evidence to pinpoint the root cause of bugs, achieving state-of-the-art performance on majo…
This study provides an ecosystem-scale measurement of commit signing on GitHub, finding that current signing adoption rates are misleading and that developers struggle to maintain consistent, long-ter…
Minor, single-character perturbations to prompts can significantly degrade the security of code generated by LLMs, suggesting that prompt fragility is a major security concern beyond simple prompt inj…
Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li +8 more
The paper introduces Harness-Bench, a diagnostic benchmark that measures how different system 'harnesses' affect LLM agent performance in realistic workflows, showing that agent capability must be rep…
QASecClaw, a multi-agent LLM system, significantly improves the accuracy of Static Application Security Testing (SAST) by using specialized LLM agents to filter out false positives, achieving an F1 sc…
Sina Mavali, David Pape, Jonathan Evertz, Samira Abedini +4 more
The paper introduces the Task Alignment Benchmark (TAB) to evaluate terminal agents' ability to selectively follow relevant environmental instructions while ignoring misleading distractors, revealing…