~ similar to 2605.12529v1· 20 results
The paper compares two sparse autoencoder architectures, finding that Differential SAEs (Diff-SAE) significantly outperform Crosscoders in isolating backdoor-related features in language models.
Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang +2 more
The paper introduces BadStyle, a novel backdoor attack framework that generates natural, stealthy poisoned samples using LLMs to compromise various LLMs with high success rates and robust activation.
Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma +1 more
The paper introduces MetaBackdoor, a novel class of LLM backdoor attacks that exploits positional encoding (length-based triggers) rather than requiring modifications to the textual content.
Shuhao Zhang, Yuli Chen, Jiale Han, Bo Cheng +1 more
The paper proposes Adaptive Stealing (AS), a novel and more robust watermark stealing algorithm that dynamically selects optimal attack perspectives to significantly increase the efficiency of comprom…
Kai Wang, Jiale Zhang, Chengcheng Zhu, Chuang Ma +1 more
The paper proposes Hydra, a framework to stabilize and control the injection of multiple, conflicting backdoor triggers into text-to-image diffusion models, ensuring high attack reliability while main…
The paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing the backdoor generalizes at the token feature level, and proposes robust behavioral and weight-level detectors f…
This paper demonstrates that LoRA adapters can be backdoored via data poisoning, showing that the resulting backdoor generalizes at the token feature level, and proposes robust behavioral and weight-l…
This paper introduces Back-Reveal, an attack demonstrating that backdoored LLM agents can systematically exfiltrate sensitive user data by embedding semantic triggers into tool-use mechanisms.
Leyi Qi, Yiming Li, Siyuan Liang, Zhengzhong Tu +1 more
The paper proposes Cert-LAS, a novel certified method for verifying model ownership in text-to-image diffusion models, which is robust against malicious signal removal attacks.
Yuqing Nie, Chong Wang, Guosheng Xu, Guoai Xu +3 more
MATRIX is a novel, robust code watermarking framework that encodes watermarks using constrained parity-check matrix equations, achieving high detection accuracy and improved robustness for code proven…
Cong Kong, Xin Cheng, Zhaoxia Yin, Shuai Li +2 more
VertMark introduces a novel, unified, and training-free framework to embed robust watermarks into vertical domain pre-trained language models (VPLMs) for copyright protection across multiple specializ…
Karima Makhlouf, Lamiaa Basyoni, Syed Khaderi, Gabriel Marquez +3 more
This paper conducts a structured ablation study using a unified threat model to evaluate how various system factors (like model architecture and retrieval configuration) influence different types of p…
The paper introduces a systematic benchmark to test LLMs' ability to recover Indicators of Compromise (IoCs) from JavaScript code, finding that while LLMs handle simple obfuscation well, encryption-ba…
Rui Yin, Tianxu Han, Naen Xu, Changjiang Li +7 more
The paper proposes a novel method to inject reliable, sustained backdoors into LLMs by compiling an activation steering vector into model weights, ensuring the backdoor only activates upon a specific…
The paper proposes a novel cross-modal backdoor attack that exploits the vulnerability of lightweight connectors in multimodal LLMs, demonstrating high attack success rates across different modalities…
Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen +1 more
The paper proposes SAFESEAL, a novel key-conditioned watermarking framework that embeds robust, provider-specific watermarks into LLM outputs with minimal semantic distortion, effectively protecting i…
Kun Wang, Meng Chen, Junhao Wang, Yuli Wu +5 more
STEP introduces a novel, black-box, retraining-free detector that profiles audio samples using dual perturbation branches to detect backdoor attacks by exploiting the characteristic instability of hid…
This paper introduces the first backdoor attack specifically targeting pipeline parallelism in decentralized post-training, demonstrating that a limited adversary controlling an intermediate stage can…
This paper proposes a multi-layered defense strategy combining pre-output monitoring, calibrated canary detection, and cumulative information-flow tracking to prevent LLM agents from exfiltrating sens…
The paper introduces SHADOWMASK, the first systematic backdoor attack targeting Masked Diffusion Language Models (MDLMs), demonstrating near-100% attack success while preserving clean model utility.