Zhanxing Zhu
2 indexed papers
Research Timeline
This paper investigates the non-monotonic role of sample difficulty in Reinforcement Learning with Verifiable Reward (RLVR), finding that medium-difficulty problems provide the most balanced and beneficial learning signals for LLMs.
The paper argues that shallow safety alignment in LLMs is due to autoregressive consistency, a mechanism that allows small harmful inputs to redirect the model's generation to unsafe outputs, necessitating adversarial safety training.
Papers
When Autoregressive Consistency Hurts Safety Alignment