Papers similar to 2606.04929v1

~ similar to 2606.04929v1· 20 results

cs.CRcs.AIcs.LGRecentMay 22, 2026

PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs

Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru +1 more

The paper introduces PoisonForge, a comprehensive benchmark demonstrating that even a small number of targeted poisoned examples can significantly compromise the safety and reliability of instruction-…

View →

cs.CRcs.AIcs.LGRecentMay 24, 2026

Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses,Evaluation, and Future Directions

Wenjuan Li, Yitao Liu, Runze Chen, Rajkumar Buyya

This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…

View →

cs.CRcs.AIRecentApr 10, 2026

BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

Guiyao Tie, Jiawen Shi, Pan Zhou, Lichao Sun

The paper introduces BadSkill, a novel backdoor attack formulation that targets third-party agent skills by poisoning the embedded model artifacts, achieving high attack success rates across various m…

View →

cs.CRcs.DBRecentApr 27, 2026

Poisoning Learned Index Structures: Static and Dynamic Adversarial Attacks on ALEX

Allen Jue

The paper systematically evaluates static and dynamic adversarial attacks on the ALEX learned index, finding that while static poisoning has minimal impact, dynamic attacks can cause significant slowd…

View →

cs.CRcs.AIRecentMay 10, 2026

Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise AI Agent Reasoning

Ben Kereopa-Yorke, Guillermo Diaz, Holly Wright, Reagan Johnston +2 more

The paper introduces Oracle Poisoning, an attack that corrupts knowledge graphs used by AI agents, demonstrating that all tested models blindly trust poisoned data at high sophistication levels.

View →

cs.CRcs.LGRecentMar 31, 2026

Backdoor Attacks on Decentralised Post-Training

Oğuzhan Ersoy, Nikolay Blagoev, Jona te Lintelo, Stefanos Koffas +2 more

This paper introduces the first backdoor attack specifically targeting pipeline parallelism in decentralized post-training, demonstrating that a limited adversary controlling an intermediate stage can…

View →

cs.CRcs.AIRecentMay 7, 2026

LoopTrap: Termination Poisoning Attacks on LLM Agents

Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu +3 more

The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.

View →

cs.CRcs.AIRecentMay 22, 2026

When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

Shi Liu, Xuehai Tang, Xikang Yang, Liang Lin +3 more

This paper introduces a new benchmark to test Tool Description Poisoning (TDP) attacks on LLM agents, demonstrating that even advanced models like GPT-4o are highly vulnerable and that current defense…

View →

cs.CRcs.AIRecentMay 7, 2026

Narrow Secret Loyalty Dodges Black-Box Audits

Alfie Lamerton, Fabien Roger

The paper introduces and demonstrates 'narrow secret loyalties,' a novel type of covert model manipulation that biases model output toward a specific principal's interests under narrow conditions, whi…

View →

cs.CRcs.LGRecentMay 26, 2026

Poison with Style: A Practical Poisoning Attack on Code Large Language Models

Khang Tran, Yazan Boshmaf, Issa Khalil, NhatHai Phan +2 more

The paper introduces Poison-with-Style (PwS), a stealthy model poisoning attack that exploits developers' inherent code styles as covert triggers to make Code LLMs generate vulnerable code without exp…

View →

cs.CRcs.AIRecentApr 24, 2026

RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

Wenjie Xiao, Xuehai Tang, Biyu Zhou, Songlin Hu +1 more

RouteGuard is a novel detector that identifies skill poisoning in LLM agents by monitoring structured internal attention shifts, achieving high detection rates on critical skill-injection attacks.

View →

cs.CRcs.AIRecentJun 3, 2026

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Pritam Dash, Tongyu Ge, Aditi Jain, Tanmay Shah +1 more

This paper systematically studies memory poisoning attacks in LLM agents, identifying multiple vulnerabilities and proposing a new benchmark to assess the risk.

View →

cs.CRcs.AIcs.LGRecentMay 18, 2026

OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences

Kaixiang Wang, Jiong Lou, Zhaojiacheng Zhou, Jie Li

The paper introduces Obsessive Experience Poisoning (OEP), a low-privilege black-box attack that poisons self-evolving LLM agents by generating locally correct but harmful experiences, causing dangero…

View →

cs.CRcs.LGRecentMay 6, 2026

Gray-Box Poisoning of Continuous Malware Ingestion Pipelines

Jan Dolejš, Martin Jureček, Róbert Lórencz

The paper demonstrates a gray-box poisoning attack against continuous malware detection pipelines using subtle binary manipulations, showing that IAT-based perturbations can significantly degrade dete…

View →

cs.CRcs.CLRecentMay 17, 2026

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

Lecheng Yan, Ruizhe Li, Xicheng Han, Wenxi Li +4 more

The paper introduces a new security benchmark and framework to defend LLM agents against 'cognitive poisoning,' where malicious tools build trust through benign feedback before executing a harmful fin…

View →

cs.CRcs.AIRecentMay 9, 2026

Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

Sangyeon Yoon, Wonje Jeung, Yoonjun Cho, Dongjae Jeon +1 more

The paper introduces a truly benign Direct Preference Optimization (DPO) attack that can jailbreak large language models (LLMs) by fine-tuning them with minimal, harmless preference data, thereby supp…

View →

cs.CRcs.AIcs.DCRecentApr 10, 2026

XFED: Non-Collusive Model Poisoning Attack Against Byzantine-Robust Federated Classifiers

Israt Jahan Mouri, Muhammad Ridowan, Muhammad Abdullah Adnan

The paper introduces XFED, a novel non-collusive model poisoning attack that demonstrates the feasibility of compromising Federated Learning systems without requiring coordination among attackers, byp…

View →

cs.CRcs.AIcs.LGRecentMay 14, 2026

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

Itay Zloczower, Eyal Lenga, Gilad Gressel, Yisroel Mirsky

The paper demonstrates that current defenses against malicious fine-tuning of foundation models are insufficient because they only address fixed attacks, and introduces a unified adaptive attack that…

View →

cs.CRcs.AIRecentApr 3, 2026

Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents

Wei Zou, Mingwen Dong, Miguel Romero Calvo, Shuaichen Chang +6 more

The paper introduces eTAMP, a novel attack that poisons LLM web agents' memory using only environmental observations, demonstrating cross-site and cross-session compromise without direct memory access…

View →

cs.CRcs.CLRecentApr 9, 2026

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more

The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…

View →