Che Wang
2 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
The paper proposes SALO, a novel detector that monitors the dynamic, layer-wise activation pattern (Refusal Trajectory) to improve jailbreak detection robustness compared to traditional methods relying on static terminal representations.
The paper proposes a novel online learning algorithm that achieves an interval regret bound scaling with gradient variation, providing strong theoretical guarantees for non-stationary environments.
Papers
Online Learning with Gradient-Variation Interval Regret
The paper proposes a novel online learning algorithm that achieves an interval regret bound scaling with gradient variation, providing strong theoretical guarantees for non-stationary environments.