Xuanli He
2 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation to the model's core reasoning capabilities.
The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in sensitive CBRN domains.
Papers
Segment-Level Coherence for Robust Harmful Intent Probing in LLMs
Xuanli He, Bilgehan Sel, Faizan Ali, Jenny Bao +2 more
The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in…