Michael Backes
7 indexed papers
Publications per year
Top categories
Frequent co-authors
Research Timeline
The paper analyzes that while multimodal large language models (MLLMs) offer superior semantic understanding for image generation, this enhanced capability significantly increases safety risks, particularly in generating unsafe content and creating harder-to-detect fake images compared to traditional diffusion models.
The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and defense mechanisms.
This paper presents HarmfulSkillBench, a large-scale benchmark demonstrating that even small percentages of publicly available skills can be misused for harmful actions, significantly lowering LLM refusal rates when integrated into agent workflows.
The paper proposes SafeReview, a co-evolutionary adversarial training framework that significantly improves the robustness of LLM-based peer review systems against sophisticated adversarial hidden prompts.
The PopQuiz Attack is a novel black-box membership inference attack that successfully tests whether large language models memorize specific training data by framing the target data as multiple-choice quiz questions.
This paper introduces Dependency Steering, a novel attack paradigm demonstrating that malicious agent skills can actively bias LLM coding agents to use attacker-controlled packages, posing a significant, hard-to-detect software supply chain risk.
The paper introduces BadBone, a stealthy and adaptive backdoor attack that compromises a backbone model specifically to target downstream tasks utilizing prompt learning, demonstrating high attack success rates against state-of-the-art defenses.
Papers
BadBone: Backdoor Attacks Against Backbone Models in Visual Prompt Learning
Ziqing Yang, Rui Wen, Xinlei He, Yun Shen +2 more
The paper introduces BadBone, a stealthy and adaptive backdoor attack that compromises a backbone model specifically to target downstream tasks utilizing prompt learning, demonstrating high attack suc…