Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:
ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Home/Authors/Xuanli He

Xuanli He

2 indexed papers

Recent (6 mo)
2
With code
0
Influential cites
0
Benchmarked
0

Publications per year

2
26

Top categories

NLP×2Crypto×2AI×1

Frequent co-authors

Bilgehan Sel2×
Jerry Wei2×
Faizan Ali1×
Jenny Bao1×
Hoagy Cunningham1×
Alwin Peng1×

Research Timeline

2026
Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation to the model's core reasoning capabilities.

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in sensitive CBRN domains.

Highlighted terms show continued research focus across papers

Papers

cs.CLcs.CRRecentApr 16, 2026

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

Xuanli He, Bilgehan Sel, Faizan Ali, Jenny Bao +2 more

The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in…

View →
cs.CRcs.AIcs.CLRecentMar 30, 2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin +1 more

The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation t…

View →