Albert No
3 indexed papers
Research Timeline
The paper introduces a truly benign Direct Preference Optimization (DPO) attack that can jailbreak large language models (LLMs) by fine-tuning them with minimal, harmless preference data, thereby suppressing refusal behavior even for malicious prompts.
The paper argues that confidence-based decoding, which is optimized via training mask alignment, steers Masked Diffusion Models (MDMs) away from the logical flow that complex reasoning requires, leading to catastrophic failures on challenging inputs.
The paper introduces REFT, a novel method that diversifies rollouts by sampling the first token after the reasoning marker, significantly improving performance in Reinforcement Learning with Verifiable Rewards (RLVR) without altering the core RLVR pipeline.
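For readers unfamiliar with the confidence-based decoding mentioned in the MDM summary above, here is a minimal sketch of one common greedy form of the idea: at each step, the masked position whose top prediction has the highest probability is filled in first. This is an illustration under stated assumptions, not the procedure from the paper. `toy_model`, `confidence_decode`, and `MASK` are hypothetical names, and the "model" is a seeded random stand-in rather than a real MDM.

```python
import math
import random

MASK = None  # hypothetical placeholder for a still-masked position


def toy_model(tokens, vocab_size=5):
    """Stand-in for an MDM: returns one probability distribution per position.

    A real model would condition on the unmasked tokens; this toy version
    only reseeds its random generator from how many positions are filled.
    """
    rng = random.Random(sum(t is not MASK for t in tokens))
    table = []
    for _ in tokens:
        logits = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]  # numerically stable softmax
        total = sum(exps)
        table.append([e / total for e in exps])
    return table


def confidence_decode(length, model=toy_model):
    """Greedy confidence-based decoding: unmask one position per step."""
    tokens = [MASK] * length
    order = []  # records the order in which positions were filled
    while any(t is MASK for t in tokens):
        probs = model(tokens)
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        # Choose the masked position whose best token has the highest probability.
        pos = max(masked, key=lambda i: max(probs[i]))
        tokens[pos] = probs[pos].index(max(probs[pos]))
        order.append(pos)
    return tokens, order


if __name__ == "__main__":
    decoded, fill_order = confidence_decode(6)
    print("tokens:", decoded)
    print("fill order:", fill_order)
```

The `order` list makes the relevant point visible: positions are filled in whatever order the model is most confident about, which need not match the left-to-right or step-by-step order a reasoning chain would follow.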
Papers
The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models