~ similar to 2605.00699v3· 20 results
The paper introduces retraining-free frameworks (Meow2X and TRNE) that mechanistically localize and suppress toxicity within language models by analyzing activation differences, achieving safety impro…
Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An +2 more
The paper introduces T-MAP, a trajectory-aware evolutionary search method, to discover and generate multi-step adversarial prompts that exploit vulnerabilities in autonomous LLM agents through tool ex…
Jack Sanderson, Yihan Wang, Xiaoqian Lu, Gautam Kamath +1 more
The paper introduces the threat model of sequential data poisoning, demonstrating that multiple, collaborating attackers can exploit compound vulnerabilities in LLM post-training pipelines that are in…
Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru +1 more
The paper introduces PoisonForge, a comprehensive benchmark demonstrating that even a small number of targeted poisoned examples can significantly compromise the safety and reliability of instruction-…
Xiaozhe Zhang, Chaozhuo Li, Hui Liu, Shaocheng Yan +3 more
The EvoSafety framework enhances LLM safety by externalizing attack and defense mechanisms, enabling persistent, transferable, and model-agnostic robustness against adversarial prompts.
Wenkui Yang, Chao Jin, Haisu Zhu, Weilin Luo +6 more
The paper introduces Semantic-level UI Element Injection, a novel red-teaming technique that overlays misleading UI elements onto screenshots to significantly improve the attack success rate against s…
Chaoshuo Zhang, Yibo Liang, Mengke Tian, Chenhao Lin +5 more
This paper introduces TwoHamsters, a new benchmark that rigorously tests Multi-Concept Compositional Unsafety (MCCU) in text-to-image models, demonstrating that current state-of-the-art models and saf…
Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more
ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…
This paper surveys the risks associated with world models, proposing a unified threat model and demonstrating adversarial attacks that show world models require rigorous safety standards comparable to…
The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…
Zhongjie Ba, Liang Yi, Peng Cheng, Qingcao Li +2 more
The paper introduces ToxiAlert-Bench, a large-scale audio dataset that uniquely annotates both textual and paralinguistic sources of toxicity, and proposes a dual-head neural network that significantl…
The paper evaluates Language Model Agents (LMAs) for red-teaming by benchmarking their ability to perform lateral movement, finding that expert-defined action plans are most effective, though all moda…
Doguhuan Yeke, Yanming Zhou, Leo Y. Lin, Hongyu Cai +2 more
The paper introduces RoboJailBench, the first standardized evaluation framework for assessing adversarial jailbreak attacks and defenses in embodied AI systems like robots.
The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…
The paper introduces Obsessive Experience Poisoning (OEP), a low-privilege black-box attack that poisons self-evolving LLM agents by generating locally correct but harmful experiences, causing dangero…
This paper addresses the lack of specialized NLP tools for detecting toxicity in real-time video game chat by creating a large, fine-grained dataset and developing a superior, domain-specific detector…
Wenjie Xiao, Xuehai Tang, Biyu Zhou, Songlin Hu +1 more
RouteGuard is a novel detector that identifies skill poisoning in LLM agents by monitoring structured internal attention shifts, achieving high detection rates on critical skill-injection attacks.
The paper proposes MemPoison, a novel memory poisoning attack that injects triggerable backdoors into LLM agents' long-term memory through dialogue interactions, achieving high success rates by bypass…
The paper introduces MemPoison, a novel memory poisoning attack that successfully injects triggerable backdoors into LLM agents' long-term memory through conversational interactions, achieving high at…
Wei Zou, Mingwen Dong, Miguel Romero Calvo, Shuaichen Chang +6 more
The paper introduces eTAMP, a novel attack that poisons LLM web agents' memory using only environmental observations, demonstrating cross-site and cross-session compromise without direct memory access…