Papers similar to 2606.04967

~ similar to 2606.04967· 20 results

cs.CRcs.AIRecentMay 11, 2026

Engineering Robustness into Personal Agents with the AI Workflow Store

Roxana Geambasu, Mariana Raykova, Pierre Tholoniat, Trishita Tiwari +2 more

The paper argues that current 'on-the-fly' AI agent design lacks necessary software engineering rigor and proposes an 'AI Workflow Store' to provide hardened, reusable, and reliable agent workflows.

View →

cs.HCcs.AIcs.IRRecentMay 28, 2026

From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration

Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

The paper introduces an ontology-driven framework, From Prompts to Context, to explicitly model and structure the often-opaque context of human-Generative AI collaborations, thereby improving traceabi…

View →

cs.SEcs.AIRecentMay 31, 2026

Bridging Requirements and Architecture: Multi-Agent Orchestration with External Knowledge and Hierarchical Memory

Ruiyin Li, Yiran Zhang, Xiyu Zhou, Yangxiao Cai +5 more

The paper introduces MAAD, a multi-agent framework that autonomously transforms software requirements into comprehensive, multi-view architectural blueprints, significantly improving completeness and…

View →

cs.CRcs.AIcs.SERecentMay 11, 2026

Comment and Control: Hijacking Agentic Workflows via Context-Grounded Evolution

Neil Fendley, Zhengyu Liu, Aonan Guan, Jiacheng Zhong +1 more

The paper introduces JAW, a novel framework that demonstrates how adversaries can hijack agentic workflows on automation platforms like GitHub Actions by manipulating inputs based on context-grounded…

View →

cs.SEcs.AIRecentMay 27, 2026

Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution

Swanand Rao

Tool Forge is a validation-carrying toolchain that converts natural language capability intent into governed, sandbox-verified tool artifacts, significantly improving agent efficiency and reliability.

View →

cs.CRRecentMay 8, 2026

Demystifying and Detecting Agentic Workflow Injection Vulnerabilities in GitHub Actions

Shenao Wang, Xinyi Hou, Zhao Liu, Yanjie Zhao +4 more

This paper introduces Agentic Workflow Injection (AWI), a new class of vulnerability in LLM-powered GitHub Actions, and presents TaintAWI, a novel taint-analysis tool that identifies hundreds of explo…

View →

cs.HCcs.CRRecentMay 22, 2026

From Preventive to Reactive: How AI Coding Assistants Transform Developers' Security Awareness

Faisal Haque Bappy, Tahrim Hossain, Sidratul Muntaher Meheraj, Annoor Sharara Akhand +4 more

The paper investigates how AI coding assistants shift developers' security focus from proactive prevention to reactive review, finding that this structural change is reinforced by current tool interac…

View →

cs.CRRecentMar 25, 2026

AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols

Shenghan Zheng, Qifan Zhang

The paper introduces a comprehensive security framework, AgentRFC, to systematically analyze and test the security conformance of various AI agent protocols, identifying critical design gaps, especial…

View →

cs.SEcs.AIcs.LGRecentMay 29, 2026

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

Nazmus Ashrafi

The study found that while multi-agent LLM code generation architectures significantly affect code complexity, the added complexity does not translate into better functional correctness, suggesting ar…

View →

cs.CRcs.AIRecentApr 3, 2026

Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

Zhiyuan Li, Jingzheng Wu, Xiang Ling, Xing Cui +1 more

This paper provides the first comprehensive security analysis of the Agent Skills framework, identifying severe structural vulnerabilities that require fundamental architectural changes rather than si…

View →

cs.CRcs.SERecentMay 4, 2026

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

Richard J. Young, Gregory D. Moody

The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…

View →

cs.AIcs.LGRecentMay 30, 2026

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Yifan Bao, Xinyu Xi, Xinyu Liu, Wen Ge +7 more

MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…

View →

cs.SEcs.AIRecentMay 31, 2026

FVSpec: Real-World Property-Based Tests as Lean Challenges

Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…

View →

cs.CRcs.AIRecentMay 26, 2026

Lessons from Penetration Tests on Large-Scale Agent Systems

Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang +2 more

The paper reports on penetration tests conducted on proprietary, large-scale AI agent systems, finding that security vulnerabilities persist despite stricter development standards.

View →

cs.CRcs.AIRecentApr 30, 2026

Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

Luyao Xu, Xiang Chen

This paper provides a systematic, layered review of security risks and defense strategies for autonomous agent frameworks, using OpenClaw as a case study to address the current lack of integrated rese…

View →

cs.AIcs.CRRecentMay 5, 2026

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…

View →

cs.CRcs.AIcs.LGRecentMay 22, 2026

An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods

Mohammed Kharma, Ahmed Sabbah, Mohammad Alkhanafseh, Mohammad Hammoudeh +1 more

The paper empirically evaluates the security quality of LLM-generated code across various prompting methods, finding that while prompting alters the structure of weaknesses, it is insufficient to reli…

View →

cs.CRcs.AIcs.SERecentJun 3, 2026

Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications

Yutao Shi, Xiaohan Zhang, Xiangjing Zhang, Xihua Shen +4 more

This paper investigates Description-Code Inconsistency (DCI) in Model Context Protocol (MCP) servers, finding that 9.93% of real-world tools exhibit inconsistencies that create security blind spots.

View →

cs.CRRecentMay 9, 2026

When LLMs Team Up: A Coordinated Attack Framework for Automated Cyber Intrusions

Minfeng Qi, Tianqing Zhu, Zijie Xu, Congcong Zhu +2 more

The paper introduces CAESAR, a novel multi-agent framework that coordinates LLM agents across five specialized roles to improve success rates and stability in complex, multi-stage cyber intrusion task…

View →

cs.AIRecentMay 27, 2026

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li +8 more

The paper introduces Harness-Bench, a diagnostic benchmark that measures how different system 'harnesses' affect LLM agent performance in realistic workflows, showing that agent capability must be rep…

View →