Papers similar to 2605.07982v1

~ similar to 2605.07982v1· 20 results

cs.CRRecentMay 6, 2026

GLiNER Guard: Unified Encoder Family for Production LLM Safety and Privacy

Bogdan Minko, Sabrina Sadiekh, Evgeniy Kokuykin

GLiNER Guard (GLiGuard) introduces a unified, efficient encoder family that simultaneously performs safety classification and PII detection in a single forward pass, offering a practical, low-cost alt…

View →

cs.AIcs.CLcs.CRRecentMay 27, 2026

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving state-of-the-art safety performance with massiv…

View →

cs.AIcs.CLcs.CRRecentMay 27, 2026

Robust and Efficient Guardrails with Latent Reasoning

Siddharth Sai, Xiaofei Wen, Muhao Chen

The paper introduces COLAGUARD, a novel guardrail model that efficiently transfers multi-step safety reasoning into a continuous latent space, achieving high safety performance with massive improvemen…

View →

cs.CLRecentJun 1, 2026

SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li +3 more

SentGuard introduces a novel sentence-level streaming guardrail that operates in parallel with LLM generation, achieving high detection rates of unsafe content early in the response while maintaining…

View →

cs.SEcs.AIcs.CRRecentApr 10, 2026

DeepGuard: Secure Code Generation via Multi-Layer Semantic Aggregation

Li Huang, Zhongxin Liu, Yifan Wu, Tao Yin +5 more

DeepGuard introduces a novel multi-layer semantic aggregation framework to enhance secure code generation by collecting vulnerability cues from multiple upper layers of LLMs, significantly improving s…

View →

cs.LGcs.AIcs.CLRecentMay 28, 2026

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Ihor Stepanov, Aleksandr Smechov

The paper introduces Opir, an efficient family of encoder-based multi-task guardrail models that provides competitive safety classification performance across various tasks while maintaining a signifi…

View →

cs.CLcs.CRRecentMay 1, 2026

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang +1 more

The paper introduces ML-Bench, a policy-grounded multilingual safety benchmark, and ML-Guard, a superior guardrail model that enables culturally and legally aligned safety assessment for LLMs across 1…

View →

cs.CRcs.CLRecentJun 4, 2026

Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

Minseok Choi, Seungbin Yang, Dongjin Kim, Subin Kim +4 more

Membrane introduces a self-evolving guardrail using Contrastive Safety Memory (CSM) that generalizes across topical jailbreak variants, achieving superior safety performance while minimizing benign re…

View →

cs.CRcs.AIRecentMay 24, 2026

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Lixing Lin, Juli You, Yue Li, Luyun Lin +3 more

Reflect-Guard enhances LLM safety classifiers by integrating logical self-reflection, significantly improving detection of sophisticated adversarial jailbreak prompts.

View →

cs.CRcs.CLRecentApr 17, 2026

TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts

Hua-Rong Chu, Kuan-Chun Wang, Yao-Te Huang

The paper introduces TWGuard, a linguistic context-optimized safety guardrail model, demonstrating that tailoring AI safety mechanisms to specific local linguistic contexts significantly improves perf…

View →

cs.LGcs.AIcs.CERecentMay 3, 2026

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

Sadia Asif, Mohammad Mohammadi Amiri

The paper introduces RefusalGuard, a novel fine-tuning framework that preserves the geometric structure of safety-relevant representations in LLMs, thereby mitigating the degradation of refusal behavi…

View →

cs.CRcs.CLRecentMay 29, 2026

Triaging Threats to Specialized Guardrails

Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more

The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple unsafe categories.

View →

cs.CRcs.CLRecentMay 29, 2026

Triaging Threats to Specialized Guardrails

Wenjie Jacky Mo, Xiaofei Wen, Rui Cai, Boyu Zhu +5 more

The paper introduces RouteGuard, a router-expert framework, to improve the robustness and generalization of safety guardrails by specializing threat detection across multiple distinct unsafe categorie…

View →

cs.CRcs.CLRecentMay 13, 2026

Model-Agnostic Lifelong LLM Safety via Externalized Attack-Defense Co-Evolution

Xiaozhe Zhang, Chaozhuo Li, Hui Liu, Shaocheng Yan +3 more

The EvoSafety framework enhances LLM safety by externalizing attack and defense mechanisms, enabling persistent, transferable, and model-agnostic robustness against adversarial prompts.

View →

cs.CRRecentMay 6, 2026

You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Stjepan Picek +1 more

The paper introduces NeWTral, a framework that restores safety alignment to specialized LLM adapters without sacrificing their domain-specific knowledge, achieving a significant reduction in attack su…

View →

cs.CRcs.AIRecentMay 17, 2026

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

Nanxi Li, Zhengyue Zhao, Chaowei Xiao

The paper introduces Latent Policy Guardrail (LPG), a novel framework that efficiently enforces dynamic safety policies for LLMs by compressing complex policy deliberation into a small set of latent t…

View →

cs.AIcs.CLRecentJun 1, 2026

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu +7 more

SafeSteer proposes a localized on-policy distillation method that restricts safety alignment to specific safety tokens, thereby achieving strong safety performance with minimal degradation to general…

View →

cs.CRcs.AIcs.CLRecentApr 8, 2026

TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen

The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…

View →

cs.CRcs.AIcs.LGRecentMar 29, 2026

Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

Alexandre Cristovão Maiorano

The paper evaluates prompt-injection defenses for educational LLM tutors, demonstrating that optimal security requires balancing adversarial robustness, usability, and latency, and proposing a compreh…

View →