Rahul Gupta
4 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial prompts, leading to enhanced safety robustness.
SWAN introduces a novel, training-free framework that embeds watermarks directly into the semantic structure of a sentence using Abstract Meaning Representation (AMR), achieving superior robustness against paraphrasing compared to existing methods.
PReMISE introduces a framework to audit and improve the quality of rubrics used to guide LLM judges, demonstrating that it can significantly increase judge accuracy and reduce the exploitability of responses.
The paper introduces SkillHarm, a comprehensive benchmark and automated framework for evaluating skill-based attacks across the entire agent skill-use lifecycle, demonstrating that current agents remain highly vulnerable to both fixed-payload and self-mutating poisoning attacks.
Papers
SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction
Yuting Ning, Zhehao Zhang, Yash Kumar Lal, Boyu Gou +7 more
The paper introduces SkillHarm, a comprehensive benchmark and automated framework for evaluating skill-based attacks across the entire agent skill-use lifecycle, demonstrating that current agents rema…