Jerry Wei

3 indexed papers

Top categories

Crypto ×3, AI ×2, NLP ×2, ML ×1

Frequent co-authors

Xuanli He (2×)

Bilgehan Sel (2×)

Daniel Zhu (1×)

Zihan Wang (1×)

Xuchan Bao (1×)

Faizan Ali (1×)

Research Timeline

2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

The paper introduces Trojan-Speak, an adversarial fine-tuning method that successfully bypasses advanced LLM safety classifiers (like Anthropic's Constitutional Classifiers) with minimal degradation to the model's core reasoning capabilities.

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

The paper introduces a robust streaming probing objective that requires multiple evidence tokens to support a prediction, significantly improving the detection of harmful intent in LLMs, especially in sensitive CBRN domains.

Jailbroken Frontier Models Retain Their Capabilities

The paper demonstrates that advanced jailbreaks do not impose a significant 'jailbreak tax' on highly capable frontier language models: the jailbroken models retain near-native performance.

Papers

cs.LG, cs.AI, cs.CR | Recent | Apr 30, 2026

Jailbroken Frontier Models Retain Their Capabilities

Daniel Zhu, Zihan Wang, Xuchan Bao, Jerry Wei

The paper demonstrates that advanced jailbreaks do not impose a significant 'jailbreak tax' on highly capable frontier language models: the jailbroken models retain near-native performance.

cs.CL, cs.CR | Recent | Apr 16, 2026