N. Benjamin Erichson
2 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is crucial for fair comparison and understanding attack success.
D-Judge introduces a semantics-preserving output rewriting defense that disrupts multi-turn jailbreak attacks by misaligning the feedback signal used by an attacker's judge model.
Papers
D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting
Huanli Gong, Zhipeng Wei, Yu Fu, Haz Sameen Shahgir +3 more
D-Judge introduces a semantics-preserving output rewriting defense that disrupts multi-turn jailbreak attacks by misaligning the feedback signal used by an attacker's judge model.