Kui Ren
10 indexed papers
Research Timeline
STEP introduces a novel, black-box, retraining-free detector that profiles audio samples using dual perturbation branches to detect backdoor attacks by exploiting the characteristic instability of hidden triggers.
The paper proposes the Expected Safety Impact (ESI) framework to identify safety-critical parameters in LLMs, introducing targeted tuning methods (SET and SPA) to enhance safety and preserve alignment during model adaptation.
This paper systematically investigates unlearnable examples (UEs) across diverse training paradigms, finding that existing UEs fail under pretraining-finetuning (PF) settings, and proposes Shallow Semantic Camouflage (SSC) to maintain unlearnability.
The paper proposes the first general defense framework to make all union-preserving Differential Privacy (DP) protocols, specifically those based on shuffle-DP, resilient against poisoning attacks.
The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.
The paper proposes W-IR, a novel watermarking framework that simultaneously achieves high certified robustness against adversarial attacks and effectively mitigates identity leakage in watermarked images.
RouteScan introduces a non-intrusive framework that audits the safety of Mixture-of-Experts (MoE) LLMs by analyzing low-level GPU expert routing telemetry, achieving high accuracy even on unseen harmful prompts.
LoRA-Key introduces a user-centric watermarking framework that attaches a recoverable ownership key to LoRA modules via a standalone Watermark LoRA, providing lightweight, plug-and-play copyright protection without requiring per-LoRA retraining.
The paper introduces ConsisGuard, a framework that addresses the 'deliberation-to-enforcement gap' in LLM guardrails by ensuring that the reasoning process is faithfully and consistently translated into the final safety decision.
The paper proposes TRACE, a novel agentic jailbreaking framework that successfully bypasses safety mechanisms of advanced LLM agents by decomposing malicious tasks and disguising harmful subtasks within task-aware, iteratively evolved scenarios.
Papers
ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails
Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi +8 more