Similar to 2605.24817v1 · 20 results
Zekun Fei, Zihao Wang, Weijie Liu, Ruiqi He +3 more
Misrouter introduces an input-only adversarial framework to exploit the routing mechanisms of Mixture-of-Experts (MoE) LLMs, enabling unsafe behavior induction against remotely hosted, black-box servi…
The paper analyzes the routing behavior of Mixtral MoE under benign and harmful prompts using activation and gradient signals, finding that safety-relevant routing is subtle, depth-dependent, and dist…
Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more
The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…
Yitong Sun, Yao Huang, Teng Li, Ranjie Duan +4 more
MESA is a targeted alignment framework that decentralizes safety responsibilities across multiple experts in Mixture-of-Experts (MoE) LLMs using Optimal Transport theory, thereby improving safety robu…
The paper introduces KBF, a low-cost black-box auditing protocol that fingerprints LLM APIs by analyzing stable numerical recall near the knowledge boundary, successfully detecting numerous model subs…
The paper introduces KBF, a novel black-box auditing protocol that fingerprints LLM APIs by analyzing stable numerical recall near the knowledge boundary, effectively detecting model substitutions and…
The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…
Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more
The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple unsafe categories.
Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more
The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple distinct unsafe categorie…
The paper introduces a comprehensive taxonomy and auditing framework to assess the collective coverage of existing LLM attack benchmarks, revealing significant and systematic gaps in current testing m…
The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…
Hanzhi Liu, Chaofan Shou, Hongbo Wen, Yanju Chen +2 more
This paper systematically analyzes the threat posed by malicious third-party API routers in the LLM supply chain, finding that a significant number of routers actively perform payload injection, crede…
Jona te Lintelo, Lichao Wu, Marina Krček, Sengim Karayalçin +1 more
MASCing is a novel framework that enables flexible, non-retraining reconfiguration of Mixture-of-Experts (MoE) models for specific safety objectives by applying activation steering masks to control ex…
Haochun Tang, Yuliang Yan, Jiahua Lu, Huaxiao Liu +1 more
The paper introduces R$^2$A, an adversarial attack that uses suffix optimization to mislead black-box LLM routers into consistently selecting expensive, high-capability models.
The paper introduces GuardPhish, a large-scale dataset and evaluation framework, demonstrating that even high-performing open-source LLMs can generate actionable phishing content despite accurate inte…
Karima Makhlouf, Lamiaa Basyoni, Syed Khaderi, Gabriel Marquez +3 more
This paper conducts a structured ablation study using a unified threat model to evaluate how various system factors (like model architecture and retrieval configuration) influence different types of p…
Wenjie Xiao, Xuehai Tang, Biyu Zhou, Songlin Hu +1 more
RouteGuard is a novel detector that identifies skill poisoning in LLM agents by monitoring structured internal attention shifts, achieving high detection rates on critical skill-injection attacks.
The paper proposes a graph-based framework for detecting attacks in LLM agent tool-call traffic, finding that content-level embeddings are crucial for high accuracy and that tree ensembles on these em…
The paper proposes an embarrassingly simple detector that flags model extraction attacks by testing whether the aggregate distribution of incoming LLM queries deviates from the historical distribut…
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…