Built by Teycir Ben Soltane
ArXivCSExplorer

Jerry Wei

3 indexed papers

Recent (6 mo): 3
With code: 0
Influential cites: 0
Benchmarked: 0

Publications per year

2026: 3

Top categories

Crypto ×3, AI ×2, NLP ×2, ML ×1

Frequent co-authors

Xuanli He (2×)
Bilgehan Sel (2×)
Daniel Zhu (1×)
Zihan Wang (1×)
Xuchan Bao (1×)
Faizan Ali (1×)

Research Timeline

2026
Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation to the model's core reasoning capabilities.

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in sensitive CBRN domains.

Jailbroken Frontier Models Retain Their Capabilities

The paper demonstrates that advanced jailbreaks do not impose a significant 'jailbreak tax' on highly capable frontier language models, which retain near-native performance.


Papers

cs.LG, cs.AI, cs.CR · Recent · Apr 30, 2026

Jailbroken Frontier Models Retain Their Capabilities

Daniel Zhu, Zihan Wang, Xuchan Bao, Jerry Wei

The paper demonstrates that advanced jailbreaks do not impose a significant 'jailbreak tax' on highly capable frontier language models, which retain near-native performance.

cs.CL, cs.CR · Recent · Apr 16, 2026

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

Xuanli He, Bilgehan Sel, Faizan Ali, Jenny Bao +2 more

The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in…

cs.CR, cs.AI, cs.CL · Recent · Mar 30, 2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin +1 more

The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation t…
