Prakhar Gupta
2 indexed papers
Research Timeline
The paper proposes a novel safety fine-tuning method that uses the target model's own rollouts to identify and train on the hardest prompts, significantly reducing jailbreak success rates while maintaining usability.
The paper introduces Rate Matching Consistency Training (RMCT), a novel method that improves model robustness against extraneous input cues without forcing the model to ignore those cues, thus preserving monitorability.
Papers
Consistency Training while Mitigating Obfuscation via Rate Matching