~ similar to 2606.02041· 20 results
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving state-of-the-art safety performance with massiv…
The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…
GLiGuard introduces a compact, schema-conditioned bidirectional encoder that achieves state-of-the-art performance in LLM content moderation across multiple safety dimensions while drastically reducin…
The paper introduces TWGuard, a linguistic context-optimized safety guardrail model, demonstrating that tailoring AI safety mechanisms to specific local linguistic contexts significantly improves perf…
Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more
The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple unsafe categories.
Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more
The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple distinct unsafe categorie…
The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…
Yunhao Feng, Xiaohu Du, Xinhao Deng, Yifan Ding +12 more
BraveGuard is a self-evolving defense framework that significantly improves the safety monitoring of computer-use agents by generating guard model supervision from open-world threat discovery and real…
Yunhao Feng, Yifan Ding, Xiaohu Du, Ming Wen +12 more
BraveGuard is a self-evolving defense framework that improves the safety of computer-use agents by training guard models on open-world, multi-step threat trajectories rather than static benchmarks.
Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more
This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…
The paper introduces Involuntary In-Context Learning (IICL), an effective few-shot pattern completion attack that can bypass safety alignments in large language models, achieving a 24.0% bypass rate a…
Yuanbo Zhou, Changjia Zhu, Junyu Wang, Xu He +4 more
The paper introduces the Prompt Overflow Attack, demonstrating that guardrail models inspecting truncated or segmented inputs fail to detect malicious instructions that are only actionable when the fu…
The paper introduces Opir, an efficient family of encoder-based multi-task guardrail models that provides competitive safety classification performance across various tasks while maintaining a signifi…
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
Li Huang, Zhongxin Liu, Yifan Wu, Tao Yin +5 more
DeepGuard introduces a novel multi-layer semantic aggregation framework to enhance secure code generation by collecting vulnerability cues from multiple upper layers of LLMs, significantly improving s…
Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang +1 more
The paper introduces ML-Bench, a policy-grounded multilingual safety benchmark, and ML-Guard, a superior guardrail model that enables culturally and legally aligned safety assessment for LLMs across 1…
The paper introduces SONAR, a prompt sanitization framework that uses natural language inference metrics to identify and remove malicious instructions injected into LLM prompts, achieving near-zero at…
GLiNER Guard (GLiGuard) introduces a unified, efficient encoder family that simultaneously performs safety classification and PII detection in a single forward pass, offering a practical, low-cost alt…
The paper evaluates prompt-injection defenses for educational LLM tutors, demonstrating that optimal security requires balancing adversarial robustness, usability, and latency, and proposing a compreh…
Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang +5 more
The paper introduces TurnGate, a response-aware defense mechanism that detects the earliest turn in a multi-turn dialogue where the accumulated interaction enables a harmful action, significantly impr…