~ similar to 2606.04929v1· 20 results
Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru +1 more
The paper introduces PoisonForge, a comprehensive benchmark demonstrating that even a small number of targeted poisoned examples can significantly compromise the safety and reliability of instruction-…
This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…
The paper introduces BadSkill, a novel backdoor attack formulation that targets third-party agent skills by poisoning the embedded model artifacts, achieving high attack success rates across various m…
The paper systematically evaluates static and dynamic adversarial attacks on the ALEX learned index, finding that while static poisoning has minimal impact, dynamic attacks can cause significant slowd…
The paper introduces Oracle Poisoning, an attack that corrupts knowledge graphs used by AI agents, demonstrating that all tested models blindly trust poisoned data at high sophistication levels.
This paper introduces the first backdoor attack specifically targeting pipeline parallelism in decentralized post-training, demonstrating that a limited adversary controlling an intermediate stage can…
Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu +3 more
The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.
Shi Liu, Xuehai Tang, Xikang Yang, Liang Lin +3 more
This paper introduces a new benchmark to test Tool Description Poisoning (TDP) attacks on LLM agents, demonstrating that even advanced models like GPT-4o are highly vulnerable and that current defense…
The paper introduces and demonstrates 'narrow secret loyalties,' a novel type of covert model manipulation that biases model output toward a specific principal's interests under narrow conditions, whi…
Khang Tran, Yazan Boshmaf, Issa Khalil, NhatHai Phan +2 more
The paper introduces Poison-with-Style (PwS), a stealthy model poisoning attack that exploits developers' inherent code styles as covert triggers to make Code LLMs generate vulnerable code without exp…
Wenjie Xiao, Xuehai Tang, Biyu Zhou, Songlin Hu +1 more
RouteGuard is a novel detector that identifies skill poisoning in LLM agents by monitoring structured internal attention shifts, achieving high detection rates on critical skill-injection attacks.
Pritam Dash, Tongyu Ge, Aditi Jain, Tanmay Shah +1 more
This paper systematically studies memory poisoning attacks in LLM agents, identifying multiple vulnerabilities and proposing a new benchmark to assess the risk.
The paper introduces Obsessive Experience Poisoning (OEP), a low-privilege black-box attack that poisons self-evolving LLM agents by generating locally correct but harmful experiences, causing dangero…
The paper demonstrates a gray-box poisoning attack against continuous malware detection pipelines using subtle binary manipulations, showing that IAT-based perturbations can significantly degrade dete…
Lecheng Yan, Ruizhe Li, Xicheng Han, Wenxi Li +4 more
The paper introduces a new security benchmark and framework to defend LLM agents against 'cognitive poisoning,' where malicious tools build trust through benign feedback before executing a harmful fin…
Sangyeon Yoon, Wonje Jeung, Yoonjun Cho, Dongjae Jeon +1 more
The paper introduces a truly benign Direct Preference Optimization (DPO) attack that can jailbreak large language models (LLMs) by fine-tuning them with minimal, harmless preference data, thereby supp…
The paper introduces XFED, a novel non-collusive model poisoning attack that demonstrates the feasibility of compromising Federated Learning systems without requiring coordination among attackers, byp…
The paper demonstrates that current defenses against malicious fine-tuning of foundation models are insufficient because they only address fixed attacks, and introduces a unified adaptive attack that…
Wei Zou, Mingwen Dong, Miguel Romero Calvo, Shuaichen Chang +6 more
The paper introduces eTAMP, a novel attack that poisons LLM web agents' memory using only environmental observations, demonstrating cross-site and cross-session compromise without direct memory access…
Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more
The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…