Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models
The PopQuiz Attack is a novel black-box membership inference attack that successfully tests whether large language models memorize specific training data by framing the target data as multiple-choice quiz questions.
Abstract
More Like ThisLarge language models (LLMs) show strong performance across many applications, but their ability to memorize and potentially reveal training data raises serious privacy concerns. We introduce the PopQuiz Attack, a black-box membership inference attack that tests whether a model can recall specific training examples. The core idea is to turn target data into quiz-style multiple-choice questions and infer membership from the model's answers. Across six widely used LLMs (GPT-3.5, GPT-4o, LLaMA2-7b, LLaMA2-13b, Mistral-7b, and Vicuna-7b) and four datasets, our method achieves an average ROC-AUC of 0.873 and outperforms existing approaches by 20.6%. We further analyze factors affecting attack success, including query complexity, data type, data structure, and training settings. We also evaluate instruction-based, filter-based, and differential privacy-based defenses, which reduce performance but do not eliminate the risk. Our results highlight persistent privacy vulnerabilities in modern LLMs.