Tianhang Zheng
7 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
The paper proposes the Expected Safety Impact (ESI) framework to identify safety-critical parameters in LLMs, introducing targeted tuning methods (SET and SPA) to enhance safety and preserve alignment during model adaptation.
This paper reinterprets catastrophic overfitting (CO) in Fast Adversarial Training (FAT) as a weak backdoor mechanism, proposing backdoor-inspired strategies to mitigate this generalization failure.
The paper proposes a Distribution-aware Dynamic Guidance (DDG) strategy to mitigate catastrophic overfitting and the robustness-accuracy trade-off inherent in Fast Adversarial Training (FAT) by dynamically adjusting perturbation budgets and supervision signals based on sample confidence.
RouteScan introduces a non-intrusive framework that audits the safety of Mixture-of-Experts (MoE) LLMs by analyzing low-level GPU expert routing telemetry, achieving high accuracy even on unseen harmful prompts.
The paper introduces AgentDoG 1.5, a lightweight and scalable alignment framework that significantly improves AI agent safety and security for complex, open-world agentic scenarios.
The paper introduces AgentDoG 1.5, a lightweight and scalable alignment framework that significantly improves AI agent safety and security for complex open-world agent deployments.
The paper proposes TRACE, a novel agentic jailbreaking framework that successfully bypasses safety mechanisms of advanced LLM agents by decomposing malicious tasks and disguising harmful subtasks within task-aware, iteratively evolved scenarios.
Papers
TRACE: Task-Aware Adaptive Self-Evolving Agentic Jailbreaking
Churui Zeng, Weiwei Qi, Kedong Xiu, Tianhang Zheng +4 more
The paper proposes TRACE, a novel agentic jailbreaking framework that successfully bypasses safety mechanisms of advanced LLM agents by decomposing malicious tasks and disguising harmful subtasks with…