Papers similar to 2606.06140v1

~ similar to 2606.06140v1· 20 results

cs.CRcs.CLcs.CVRecentApr 9, 2026

Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection

Wenkui Yang, Chao Jin, Haisu Zhu, Weilin Luo +6 more

The paper introduces Semantic-level UI Element Injection, a novel red-teaming technique that overlays misleading UI elements onto screenshots to significantly improve the attack success rate against s…

View →

cs.CRRecentMay 11, 2026

Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing

Desen Sun, Jason Hon, Howe Wang, Saarth Rajan +2 more

This paper investigates a novel security vulnerability where imperceptible branding hints can be injected into images and subsequently re-rendered onto new objects by generative AI models, proposing b…

View →

cs.AIcs.CRRecentMay 11, 2026

Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

Qinghua Mao, Xi Lin, Jinze Gu, Jun Wu +2 more

The paper introduces EditRisk-Bench, a novel benchmark designed to systematically evaluate the safety risks and downstream reasoning corruption caused by malicious knowledge editing in large language…

View →

cs.CRcs.CVRecentApr 17, 2026

TwoHamsters: Benchmarking Multi-Concept Compositional Unsafety in Text-to-Image Models

Chaoshuo Zhang, Yibo Liang, Mengke Tian, Chenhao Lin +5 more

This paper introduces TwoHamsters, a new benchmark that rigorously tests Multi-Concept Compositional Unsafety (MCCU) in text-to-image models, demonstrating that current state-of-the-art models and saf…

View →

cs.CRcs.CLRecentApr 24, 2026

Training a General Purpose Automated Red Teaming Model

Aishwarya Padmakumar, Leon Derczynski, Traian Rebedea, Christopher Parisien

The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…

View →

cs.CRcs.AIcs.CLRecentJun 3, 2026

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

Nicholas Saban

The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…

View →

cs.CRcs.AIRecentApr 12, 2026

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li +5 more

The paper introduces OS-BLIND, a benchmark demonstrating that current safety evaluations fail to detect critical vulnerabilities in computer-use agents when user instructions are benign, showing high…

View →

cs.CRcs.AIcs.CLRecentMay 12, 2026

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

Chang Jin, An Wang, Zeming Wei, Kai Wang +6 more

The paper introduces SkillSafetyBench, a comprehensive benchmark demonstrating that agent safety failures often stem from adversarial influences within reusable skills and execution environments, rath…

View →

cs.CRcs.AIRecentMar 19, 2026

ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation

Haochen Zhao, Shaoyang Cui

The paper introduces ClawTrap, a MITM-based red-teaming framework, to evaluate the security robustness of web agents like OpenClaw against dynamic, real-world network attacks, finding that model stren…

View →

cs.CRcs.AIRecentMay 12, 2026

IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection

Chia-Pei, Chen, Kentaroh Toyoda, Anita Lai +1 more

The paper introduces IPI-proxy, an open-source intercepting proxy toolkit designed to red-team web-browsing AI agents by injecting adversarial payloads into live HTTP responses from whitelisted domain…

View →

cs.CRcs.AIcs.LGRecentApr 22, 2026

SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

Chao Pan, Yu Wu, Xin Yao

The paper introduces SafeRedirect, a system-level defense that prevents frontier LLMs from generating harmful content during legitimate tasks that structurally require it, significantly reducing unsaf…

View →

cs.AIcs.CRcs.LGRecentApr 20, 2026

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more

ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…

View →

cs.AIcs.CRRecentMay 5, 2026

Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

Raja Sekhar Rao Dheekonda, Will Pearce, Nick Landers

The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…

View →

cs.CRcs.LGRecentMay 5, 2026

Laundering AI Authority with Adversarial Examples

Jie Zhang, Pura Peetathawatchai, Florian Tramèr, Avital Shafran

The paper demonstrates that adversarial examples can be used to manipulate Vision-Language Models (VLMs) into confidently providing authoritative but incorrect information, a process termed 'AI author…

View →

cs.CRcs.AIcs.MARecentApr 23, 2026

AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models

Tanmay Gautam, Alireza Bahramali, Sandeep Atluri

AutoRISE proposes optimizing the entire attack strategy—by searching over executable programs—rather than just optimizing prompts, achieving significant improvements in red-teaming large language mode…

View →

cs.MAcs.AIcs.CRRecentMay 8, 2026

OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing

Jianming Chen, Yawen Wang, Junjie Wang, Zhe Liu +2 more

OrchJail introduces an orchestration-guided fuzzing framework to systematically jailbreak tool-calling text-to-image agents by exploiting unsafe multi-step tool-orchestration patterns.

View →

cs.CVcs.AIcs.CRRecentMar 17, 2026

REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models

Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei +4 more

The paper introduces REFORGE, a black-box red-teaming framework that uses adversarial image prompts to reveal persistent vulnerabilities in current Image Generation Model Unlearning (IGMU) methods.

View →

cs.CRcs.AIcs.CLRecentApr 3, 2026

An Independent Safety Evaluation of Kimi K2.5

Zheng-Xin Yong, Parv Mahajan, Andy Wang, Ida Caspary +11 more

The paper conducts a preliminary safety evaluation of the open-weight LLM Kimi K2.5, finding that while it is highly capable, it exhibits concerning dual-use risks, particularly regarding CBRNE misuse…

View →

cs.CRcs.AIcs.CLRecentJun 1, 2026

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

Hiskias Dingeto, William Leeney

The paper introduces AGENTREDBENCH, a dynamic redteaming benchmark that significantly measures indirect prompt injection threats in LLM agents using third-party integrations, and releases AGENTREDGUAR…

View →

cs.CRcs.AIcs.CLRecentJun 1, 2026

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

Hiskias Dingeto, Will Leeney

The paper introduces AGENTREDBENCH, a dynamic redteaming benchmark that significantly measures indirect prompt injection threats in LLM agents using SaaS integrations, and releases AGENTREDGUARD, a su…

View →