Papers similar to 2605.25304v1

~ similar to 2605.25304v1· 20 results

cs.LGcs.AIstat.MLRecentMay 28, 2026

CB-SLICE: Concept-Based Interpretable Error Slice Discovery

Yael Konforti, Mateo Espinosa Zarlenga, Elaf Almahmoud, Mateja Jamnik

CB-SLICE is a novel concept-based method for discovering model error slices that leverages Concept Bottleneck Models (CBMs) to provide fine-grained, faithful explanations directly linked to the root c…

View →

cs.CRcs.LGRecentMay 19, 2026

Awakening the Hydra: Stabilizing Multi-Concept Backdoor Injection in Text-to-Image Diffusion Models

Kai Wang, Jiale Zhang, Chengcheng Zhu, Chuang Ma +1 more

The paper proposes Hydra, a framework to stabilize and control the injection of multiple, conflicting backdoor triggers into text-to-image diffusion models, ensuring high attack reliability while main…

View →

cs.CRcs.AIRecentMay 17, 2026

When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack

Zehan Sun, Dingfan Chen, Songze Li

This paper demonstrates that LLM cascade systems, designed for efficiency, are vulnerable to targeted adversarial attacks that simultaneously degrade both performance and cost-efficiency.

View →

cs.CRcs.AIcs.LGRecentMay 22, 2026

Adversarial Vulnerability Under Temporal Concept Drift: A Longitudinal Study of Android Malware Detection

Ahmed Sabbah, Mohammed Kharma, Radi Jarrar, Samer Zein +1 more

This study longitudinally evaluates the adversarial robustness of Android malware detection systems over a decade, finding that temporal separation significantly degrades robustness due to concept dri…

View →

cs.LGcs.AIcs.CRRecentApr 18, 2026

Channel-Level Semantic Perturbations: Unlearnable Examples for Diverse Training Paradigms

Bo Wang, Jia Ni, Mengnan Zhao, Zhan Qin +1 more

This paper systematically investigates unlearnable examples (UEs) across diverse training paradigms, finding that existing UEs fail under pretraining-finetuning (PF) settings, and proposes Shallow Sem…

View →

cs.CRcs.AIRecentMar 17, 2026

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

Taiwo Onitiju, Iman Vakilinia

The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…

View →

cs.LGcs.CRRecentMar 31, 2026

Dummy-Aware Weighted Attack (DAWA): Breaking the Safe Sink in Dummy Class Defenses

Yunrui Yu, Xuxiang Feng, Pengda Qin, Pengyang Wang +4 more

The paper introduces Dummy-Aware Weighted Attack (DAWA), a novel evaluation method that significantly reduces the reported robustness of Dummy Classes-based defenses by simultaneously targeting both t…

View →

cs.CRcs.LGRecentMay 28, 2026

Dissecting the Black Box: Circuit-Level Analysis of LLM Vulnerability Detection

Syafiq Al Atiiq, Chun Zhou, Christian Gehrmann

The paper analyzes LLM vulnerability detection using mechanistic interpretability, finding that models primarily rely on safety detectors rather than direct vulnerability signature recognition.

View →

cs.CRcs.AIcs.CLRecentApr 6, 2026

Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities

Charafeddine Mouzouni

The paper systematically maps LLM agent vulnerabilities by testing 10,000 prompt variations, finding that 'goal reframing' language is the primary trigger for exploitation, rather than broad adversari…

View →

cs.CRcs.CVRecentApr 17, 2026

TwoHamsters: Benchmarking Multi-Concept Compositional Unsafety in Text-to-Image Models

Chaoshuo Zhang, Yibo Liang, Mengke Tian, Chenhao Lin +5 more

This paper introduces TwoHamsters, a new benchmark that rigorously tests Multi-Concept Compositional Unsafety (MCCU) in text-to-image models, demonstrating that current state-of-the-art models and saf…

View →

cs.LGcs.CRRecentApr 5, 2026

Towards Unveiling Vulnerabilities of Large Reasoning Models in Machine Unlearning

Aobo Chen, Chenxu Zhao, Chenglin Miao, Mengdi Huai

The paper proposes a novel bi-level exact unlearning attack targeting Large Reasoning Models (LRMs) that forces incorrect final answers while generating misleading reasoning traces, highlighting new s…

View →

cs.CRRecentMay 14, 2026

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

Xiangtao Meng, Wenyu Chen, Chuanchao Zang, Xinyu Gao +4 more

This paper systematically measures and explains how sequential model defenses can conflict, finding that 38.9% of ordered defense sequences cause measurable risk exacerbation due to anti-aligned param…

View →

cs.CVcs.AIcs.CRRecentMar 17, 2026

REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models

Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei +4 more

The paper introduces REFORGE, a black-box red-teaming framework that uses adversarial image prompts to reveal persistent vulnerabilities in current Image Generation Model Unlearning (IGMU) methods.

View →

cs.CRcs.AIcs.LGRecentMay 14, 2026

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

Itay Zloczower, Eyal Lenga, Gilad Gressel, Yisroel Mirsky

The paper demonstrates that current defenses against malicious fine-tuning of foundation models are insufficient because they only address fixed attacks, and introduces a unified adaptive attack that…

View →

cs.CRRecentJun 2, 2026

Operationalizing Cyber Attack Prediction: A Gap-Prioritized Framework with Dataset and Model Selection Guidelines

Aminu Muhammad Auwal

This paper proposes a gap-prioritization framework to bridge the gap between theoretical cyber attack prediction research and practical operational deployment by identifying critical implementation hu…

View →

cs.LGcs.AIRecentJun 1, 2026

GJDNet: Robust Graph Neural Networks via Joint Disentangled Learning Against Adversarial Attacks

Canyixing Cui, Tao Wu, Xingping Xian, Xiao-Ke Xu +2 more

GJDNet proposes a joint disentanglement framework to enhance the robustness of Graph Neural Networks against adversarial attacks by simultaneously stabilizing node representations and decision boundar…

View →

cs.AIRecentMay 28, 2026

Harnessing non-adversarial robustness in large language models

Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov +5 more

The paper proposes a debiasing fine-tuning technique to efficiently enhance the robustness of Large Language Models against semantically similar but textually altered prompts.

View →

cs.CRcs.AIcs.CLRecentMay 7, 2026

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun +3 more

The paper introduces Safety Bottleneck Regularization (SBR), a novel defense mechanism that anchors LLM safety by constraining the unembedding layer, effectively preventing harmful fine-tuning (HFT) e…

View →