~ similar to 2605.14514v1· 20 results
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…
The paper empirically evaluates domain-adapted and general-purpose LLMs for structured threat modelling (STRIDE on 5G security), finding that domain adaptation and model size do not guarantee reliable…
The paper systematically evaluates various defense mechanisms against persistent memory attacks on LLM agents, finding that only tool-gating at the memory layer (Memory Sandbox) effectively mitigates…
The paper introduces Landseer, a modular framework designed to systematically evaluate and compose multiple machine learning defenses to address complex, real-world security requirements.
This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…
Honghao Liu, Chengjin Xu, Xuhui Jiang, Cehao Yang +4 more
The paper demonstrates that confronting Large Reasoning Models (LRMs) with conflicting objectives, such as contradictory choices or conflicting alignment values, significantly increases their vulnerab…
Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim +1 more
The paper introduces CyBiasBench, a comprehensive benchmark that quantifies the inherent, agent-specific bias in LLM agents' attack selection patterns in cybersecurity scenarios.
Xiaozhe Zhang, Chaozhuo Li, Hui Liu, Shaocheng Yan +3 more
The EvoSafety framework enhances LLM safety by externalizing attack and defense mechanisms, enabling persistent, transferable, and model-agnostic robustness against adversarial prompts.
The paper proposes a local perturbation theory showing that cross-domain interference in multi-domain RL occurs via a low-dimensional shared conflict subspace, which can be selectively mitigated by sh…
The paper introduces a challenging benchmark for LLM agents to perform unsupervised threat hunting on raw Windows event logs, finding that current frontier models perform poorly and are not ready for…
This paper demonstrates that LLM cascade systems, designed for efficiency, are vulnerable to targeted adversarial attacks that simultaneously degrade both performance and cost-efficiency.
The paper proposes the Layered Attack Surface Model (LASM), a structural taxonomy that maps security threats and defenses across the complex, multi-layered architecture of AI agents, revealing signifi…
Michael S. Lee, Yash Maurya, Drew Rein, Bert Herring +12 more
The paper introduces ROK-FORTRESS, a novel bilingual, culturally adversarial benchmark that demonstrates that LLM safety behavior in high-stakes scenarios is significantly shaped by the interaction be…
The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…
Siyuan Li, Zehao Liu, Xi Lin, Qinghua Mao +5 more
CoopGuard is a novel stateful, multi-round defense framework using cooperative agents to significantly reduce the success rate of evolving adversarial attacks against Large Language Models.
Yongxiang Li, Moxin Li, Zhixin Ma, Fengbin Zhu +3 more
This paper introduces the concept of 'Sleeper Attack,' demonstrating that adversarial content can persist across multiple interactions with an LLM agent, posing a more subtle and difficult-to-detect s…
Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu +1 more
Patcher is a post-hoc defense framework that repairs backdoored large language models by localizing hidden triggers and patching the model using only a single reported failure case.
Jiaqing Li, Zhibo Zhang, Shide Zhou, Yuxi Li +2 more
The paper introduces TrojanMerge, a framework demonstrating that model merging can be exploited to systematically compromise the safety alignment of multiple individually safe LLMs.
Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more
The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple unsafe categories.