Subhadip Mitra
4 indexed papers
Publications per year
Top categories
Research Timeline
The study demonstrates that safety alignment in LLMs is non-monotonic across model generations, showing that Gemma 3 exhibits a significantly higher attack success rate than both its predecessor and successor.
The paper introduces a quality-diversity evolutionary framework that discovers diverse, interpretable vulnerabilities in large language models by evolving attack strategies at the semantic level, revealing systematic, model-specific weaknesses.
The study demonstrates that LLM safety alignment is non-monotonic across model generations, showing that Gemma 3 exhibits unexpectedly high vulnerability to adversarial attacks compared to both its predecessors and successors.
The paper introduces a quality-diversity evolutionary framework that evolves interpretable attack strategies, successfully discovering distinct and systematic vulnerabilities in major LLMs like GPT-4o-mini and Gemini.
Papers
Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs
The study demonstrates that safety alignment in LLMs is non-monotonic across model generations, showing that Gemma 3 exhibits a significantly higher attack success rate than both its predecessor and s…