ArXivCSExplorer
Built with and by Teycir Ben Soltane

~ similar to 2605.30049 · 20 results

cs.CV · cs.AI · cs.CR · Recent · Mar 25, 2026

When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm

Ye Leng, Junjie Chu, Mingjie Li, Chenhao Lin +4 more

The paper argues that although multimodal large language models (MLLMs) offer superior semantic understanding for image generation, this enhanced capability significantly increases safety risks, partic…

cs.CV · cs.CR · Recent · May 11, 2026

What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

Chenyu Zhang

The paper proposes AHV-D&S, a novel training-free inference-time safeguard that detects and suppresses risky content in Diffusion Transformers (DiTs) by quantifying token sensitivity across attention…

cs.AI · cs.CL · Recent · Jun 1, 2026

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

Hao Li, Jingkun An, Zijun Song, Pengyu Zhu +7 more

SafeSteer proposes a localized on-policy distillation method that restricts safety alignment to specific safety tokens, thereby achieving strong safety performance with minimal degradation to general…

cs.CR · cs.CV · Recent · Apr 17, 2026

TwoHamsters: Benchmarking Multi-Concept Compositional Unsafety in Text-to-Image Models

Chaoshuo Zhang, Yibo Liang, Mengke Tian, Chenhao Lin +5 more

This paper introduces TwoHamsters, a new benchmark that rigorously tests Multi-Concept Compositional Unsafety (MCCU) in text-to-image models, demonstrating that current state-of-the-art models and saf…

cs.CR · Recent · May 6, 2026

GLiNER Guard: Unified Encoder Family for Production LLM Safety and Privacy

Bogdan Minko, Sabrina Sadiekh, Evgeniy Kokuykin

GLiNER Guard (GLiGuard) introduces a unified, efficient encoder family that simultaneously performs safety classification and PII detection in a single forward pass, offering a practical, low-cost alt…

cs.RO · cs.AI · Recent · Jun 4, 2026

RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation

Qi Lan, Yining Tang, Yu Shen, Yi Zhou +3 more

RiskFlow is a novel framework that generates realistic and safety-critical multi-agent traffic scenarios by reformulating trajectory generation as a single-pass transport problem in the action space.

cs.CR · cs.AI · cs.CL · Recent · May 7, 2026

Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun +3 more

The paper introduces Safety Bottleneck Regularization (SBR), a novel defense mechanism that anchors LLM safety by constraining the unembedding layer, effectively preventing harmful fine-tuning (HFT) e…

cs.CL · cs.AI · Recent · May 31, 2026

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto

The paper shows that safety failures in low-resource languages are due to a failure in the model's safety decision calibration, not a lack of underlying knowledge, and proposes a recalibration method…

cs.CR · Recent · May 6, 2026

You Snooze, You Lose: Automatic Safety Alignment Restoration through Neural Weight Translation

Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera, Stjepan Picek +1 more

The paper introduces NeWTral, a framework that restores safety alignment to specialized LLM adapters without sacrificing their domain-specific knowledge, achieving a significant reduction in attack su…

cs.AI · cs.CR · Recent · May 18, 2026

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao +5 more

This paper introduces the concept of Safety Geometry Collapse, demonstrating that multimodal inputs degrade the safety separation of LLMs, and proposes ReGap, a training-free method that adaptively co…

cs.CR · cs.SE · Recent · Mar 19, 2026

CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer

Yue Zhao, Yujia Gong, Ruigang Liang, Shenchen Zhu +3 more

The paper introduces Cross-Model Neuron Transfer (CNT), a post-hoc method that efficiently transfers safety-oriented functionalities between different large language models by transferring minimal sub…

cs.CL · cs.CR · Recent · May 8, 2026

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

Urchade Zaratiana, Mary Newhauser, George Hurn-Maloney, Ash Lewis

GLiGuard introduces a compact, schema-conditioned bidirectional encoder that achieves state-of-the-art performance in LLM content moderation across multiple safety dimensions while drastically reducin…

cs.CL · cs.AI · Recent · May 28, 2026

DLM-SWAI: Steering Diffusion Language Models Before They Unmask

Hyeseon An, Yo-Sub Han

The paper introduces DLM-SWAI, a training-free method that effectively steers diffusion language models (DLMs) toward desired textual styles or properties by biasing the token distribution at each den…

cs.LG · cs.AI · cs.CR · Recent · Apr 8, 2026

When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

Ismail Hossain, Sai Puppala, Jannatul Ferdaus, Md Jahangir Alam +3 more

The paper demonstrates that fine-tuning safety guard models on benign data can catastrophically collapse their safety alignment, proposing Fisher-Weighted Safety Subspace Regularization (FW-SSR) to ac…

cs.CR · cs.AI · cs.LG · Recent · Apr 22, 2026

SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

Chao Pan, Yu Wu, Xin Yao

The paper introduces SafeRedirect, a system-level defense that prevents frontier LLMs from generating harmful content during legitimate tasks that structurally require it, significantly reducing unsaf…

cs.AI · Recent · May 28, 2026

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

Zhihao Liu, Yifan Wu, Jian Lou, Di Wang +2 more

The paper proposes a novel zeroth-order optimization framework to enhance the robustness of LLM safety alignment, showing that few refinement steps can significantly improve safety while maintaining u…

cs.CR · Recent · Apr 30, 2026

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Jona te Lintelo, Lichao Wu, Marina Krček, Sengim Karayalçin +1 more

MASCing is a novel framework that enables flexible, non-retraining reconfiguration of Mixture-of-Experts (MoE) models for specific safety objectives by applying activation steering masks to control ex…

cs.CL · cs.AI · cs.CV · Recent · Jun 1, 2026

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu +3 more

The paper introduces PaSBench-Video, a comprehensive streaming video benchmark designed to rigorously test multimodal LLMs' ability to issue proactive safety warnings, finding that current models stru…

cs.LG · cs.AI · cs.CE · Recent · May 3, 2026

RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs

Sadia Asif, Mohammad Mohammadi Amiri

The paper introduces RefusalGuard, a novel fine-tuning framework that preserves the geometric structure of safety-relevant representations in LLMs, thereby mitigating the degradation of refusal behavi…

cs.LG · cs.CR · Recent · Jun 2, 2026

When Autoregressive Consistency Hurts Safety Alignment

Bochen Lyu, Yiyang Jia, Xiaohao Cai, Zhanxing Zhu

The paper argues that shallow safety alignment in LLMs is due to autoregressive consistency, a mechanism that allows small harmful inputs to redirect the model's generation to unsafe outputs, necessit…
