Papers similar to 2604.21700v1

20 results

cs.CR · Recent · May 10, 2026

BadDLM: Backdooring Diffusion Language Models with Diverse Targets

Shengfang Zhai, Xiaoyang Ji, Yuling Shi, Haoran Gao +5 more

The paper introduces BadDLM, a unified framework that demonstrates a new class of backdoor vulnerabilities in Diffusion Language Models (DLMs) by exploiting their forward masking process across divers…

cs.CR, cs.CL · Recent · May 14, 2026

MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma +1 more

The paper introduces MetaBackdoor, a novel class of LLM backdoor attacks that exploits positional encoding (length-based triggers) rather than requiring modifications to the textual content.

cs.CR, cs.LG · Recent · May 26, 2026

Poison with Style: A Practical Poisoning Attack on Code Large Language Models

Khang Tran, Yazan Boshmaf, Issa Khalil, NhatHai Phan +2 more

The paper introduces Poison-with-Style (PwS), a stealthy model poisoning attack that exploits developers' inherent code styles as covert triggers to make Code LLMs generate vulnerable code without exp…

cs.CR, cs.CL · Recent · Apr 14, 2026

Compiling Activation Steering into Weights via Null-Space Constraints for Stealthy Backdoors

Rui Yin, Tianxu Han, Naen Xu, Changjiang Li +7 more

The paper proposes a novel method to inject reliable, sustained backdoors into LLMs by compiling an activation steering vector into model weights, ensuring the backdoor only activates upon a specific…

cs.CR, cs.AI · Recent · Apr 10, 2026

BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

Guiyao Tie, Jiawen Shi, Pan Zhou, Lichao Sun

The paper introduces BadSkill, a novel backdoor attack formulation that targets third-party agent skills by poisoning the embedded model artifacts, achieving high attack success rates across various m…

cs.CR · Recent · May 8, 2026

Cross-Modal Backdoors in Multimodal Large Language Models

Runhe Wang, Li Bai, Haibo Hu, Songze Li

The paper proposes a novel cross-modal backdoor attack that exploits the vulnerability of lightweight connectors in multimodal LLMs, demonstrating high attack success rates across different modalities…

cs.CR, cs.AI, cs.LG · Recent · May 18, 2026

Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

John T. Halloran, Noopur S. Bhatt

The paper proposes Open-Book Benign Rewriting (OBBR), a novel defense mechanism that uses LLM rewriting with benign samples to neutralize data poisoning attacks against LLMs, significantly improving s…

cs.CR, cs.AI, cs.CL · Recent · May 29, 2026

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more

This paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a dynamic defense mechanism that traces and sanitizes untrusted control content i…

cs.CR, cs.AI, cs.CL · Recent · May 28, 2026

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Travis Lelle

The paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing the backdoor generalizes at the token feature level, and proposes robust behavioral and weight-level detectors f…

cs.CR, cs.AI · Recent · May 4, 2026

On the Privacy of LLMs: An Ablation Study

Karima Makhlouf, Lamiaa Basyoni, Syed Khaderi, Gabriel Marquez +3 more

This paper conducts a structured ablation study using a unified threat model to evaluate how various system factors (like model architecture and retrieval configuration) influence different types of p…

cs.CR, cs.AI, cs.IR · Recent · Jun 2, 2026

Patcher: Post-Hoc Patching of Backdoored Large Language Models

Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu +1 more

Patcher is a post-hoc defense framework that repairs backdoored large language models by localizing hidden triggers and patching the model using only a single reported failure case.

cs.CR, cs.AI, cs.LG · Recent · May 26, 2026

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Zedian Shao, Charles Fleming, Teodora Baluta

The paper introduces 'covert control attacks,' a novel and stealthy data poisoning method that teaches LLMs an information hiding scheme, allowing malicious instructions to be encoded and decoded and…

cs.CR, cs.AI · Recent · Apr 1, 2026

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

Anubhab Sahu, Diptisha Samanta, Reza Soosahabi

The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…

cs.CR, cs.AI, cs.LG · Recent · May 24, 2026

Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

Wenjuan Li, Yitao Liu, Runze Chen, Rajkumar Buyya

This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…

cs.CR, cs.AI, cs.LG · Recent · May 20, 2026

Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs

Yifei Wang, Tianlin Li, Xiaohan Zhang, Yida Yang +2 more

This paper introduces a novel class of backdoor attacks that exploit the numerical side effects of LLM inference optimization, achieving high success rates while maintaining clean accuracy.

cs.CL, cs.AI, cs.CR · Recent · May 8, 2026

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

Sachin Kumar

The paper compares two sparse autoencoder architectures, finding that Differential SAEs (Diff-SAE) significantly outperform Crosscoders in isolating backdoor-related features in language models.

cs.CR · Recent · Apr 4, 2026

AttackEval: A Systematic Empirical Study of Prompt Injection Attack Effectiveness Against Large Language Models

Jackson Wang

AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…

cs.CR, cs.AI, cs.LG · Recent · May 22, 2026

An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods

Mohammed Kharma, Ahmed Sabbah, Mohammad Alkhanafseh, Mohammad Hammoudeh +1 more

The paper empirically evaluates the security quality of LLM-generated code across various prompting methods, finding that while prompting alters the structure of weaknesses, it is insufficient to reli…
