Jian Lou
2 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
The paper proposes mitigating the progressive degradation of safety in language models caused by many-shot jailbreak attacks by appending a single, fixed safety demonstration at inference time.
The paper proposes a novel zeroth-order optimization framework to enhance the robustness of LLM safety alignment, showing that few refinement steps can significantly improve safety while maintaining utility.
Papers
Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization
Zhihao Liu, Yifan Wu, Jian Lou, Di Wang +2 more
The paper proposes a novel zeroth-order optimization framework to enhance the robustness of LLM safety alignment, showing that few refinement steps can significantly improve safety while maintaining u…