Xuanli He

2 indexed papers

Top categories

NLP ×2, Crypto ×2, AI ×1

Frequent co-authors

Bilgehan Sel (2×)

Jerry Wei (2×)

Faizan Ali (1×)

Jenny Bao (1×)

Hoagy Cunningham (1×)

Alwin Peng (1×)

Research Timeline

2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation to the model's core reasoning capabilities.

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in sensitive CBRN domains.

Papers

cs.CL, cs.CR | Recent | Apr 16, 2026

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

Xuanli He, Bilgehan Sel, Faizan Ali, Jenny Bao +2 more

cs.CR, cs.AI, cs.CL | Recent | Mar 30, 2026