Alwin Peng

1 indexed paper

Top categories

Crypto ×1 · AI ×1 · NLP ×1

Frequent co-authors

Bilgehan Sel ×1

Xuanli He ×1

Ming Jin ×1

Jerry Wei ×1

Research Timeline

2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

The paper introduces Trojan-Speak, an adversarial fine-tuning method that bypasses LLM safety classifiers such as Anthropic's Constitutional Classifiers with minimal degradation of the model's core reasoning capabilities.

Papers

cs.CR · cs.AI · cs.CL · Recent · Mar 30, 2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin +1 more
