Gjergji Kasneci
2 indexed papers
Research Timeline
This paper systematically audits the safety implications of activation steering vectors, finding that these vectors significantly influence the success rate of jailbreak attacks by overlapping with latent refusal directions.
The paper introduces CoRP, a gradient-free operator that consolidates the benefits of ensemble-based post-training methods into a single, deployable model update, significantly improving performance with minimal computational overhead.
Papers
Consolidating Rewarded Perturbations for LLM Post-Training