~ similar to 2605.14605v2· 20 results
This paper demonstrates that existing open-weight LLM safeguards are vulnerable to simple, non-gradient-based attacks like abliteration and prefilling, significantly increasing the attack success rate…
The paper proposes a universal robustification framework to enhance drift-adaptive malware detectors against combined concept drift and adversarial attacks, significantly reducing attack success rates…
The paper introduces Indirect Harm Optimization (IHO), a novel black-box, adaptive, and efficient attack method that significantly improves jailbreak success rates against LLMs, aiming to provide a st…
This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…
The paper proposes a unified closed-loop threat taxonomy to systematically analyze and defend foundation models by explicitly framing the bidirectional security interactions between data and models.
Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu +3 more
The paper investigates how dynamic adversarial fine-tuning (R2D2) reorganizes the internal mechanisms (refusal geometry) of safety-aligned language models, finding that it shifts the optimal refusal c…
Ahmed Sabbah, Mohammed Kharma, Radi Jarrar, Samer Zein +1 more
This study longitudinally evaluates the adversarial robustness of Android malware detection systems over a decade, finding that temporal separation significantly degrades robustness due to concept dri…
Mengnan Zhao, Lihe Zhang, Tianhang Zheng, Bo Wang +1 more
This paper reinterprets catastrophic overfitting (CO) in Fast Adversarial Training (FAT) as a weak backdoor mechanism, proposing backdoor-inspired strategies to mitigate this generalization failure.
The paper introduces Landseer, a modular framework designed to systematically evaluate and compose multiple machine learning defenses to address complex, real-world security requirements.
The paper introduces Fine-Tuning Integrity (FTI), a security goal that uses Succinct Model Difference Proofs (SMDPs) to cryptographically prove that a fine-tuned model update adheres to specific struc…
Yunrui Yu, Xuxiang Feng, Pengda Qin, Pengyang Wang +4 more
The paper introduces Dummy-Aware Weighted Attack (DAWA), a novel evaluation method that significantly reduces the reported robustness of Dummy Classes-based defenses by simultaneously targeting both t…
The paper introduces ParDef, a generalized defense mechanism that effectively mitigates various types of parameter attacks on deep neural networks while maintaining high performance.
The paper systematically evaluates various defense mechanisms against persistent memory attacks on LLM agents, finding that only tool-gating at the memory layer (Memory Sandbox) effectively mitigates…
The paper introduces XFED, a novel non-collusive model poisoning attack that demonstrates the feasibility of compromising Federated Learning systems without requiring coordination among attackers, byp…
The study demonstrates that LLM safety alignment is non-monotonic across model generations, showing that Gemma 3 exhibits unexpectedly high vulnerability to adversarial attacks compared to both its pr…
The study demonstrates that safety alignment in LLMs is non-monotonic across model generations, showing that Gemma 3 exhibits a significantly higher attack success rate than both its predecessor and s…
The paper demonstrates that simpler, shallower Deep Neural Network architectures with reduced features and ReLU activations can inherently improve the robustness of ML-NIDS against gradient-based adve…
Sangyeon Yoon, Wonje Jeung, Yoonjun Cho, Dongjae Jeon +1 more
The paper introduces a truly benign Direct Preference Optimization (DPO) attack that can jailbreak large language models (LLMs) by fine-tuning them with minimal, harmless preference data, thereby supp…
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…
Priyal Deep, Shane Emmons, Amy Fox, Kyle Bacon +3 more
The paper evaluates prompt injection defenses and finds that only external output filtering, implemented in application code, reliably prevents secret leaks from LLMs, demonstrating that model-based d…