Similar to 2605.30448 · 20 results
This paper investigates the phenomenon of 'copying' in Distribution Matching Distillation (DMD), finding that high-dimensional distillation causes student models to spontaneously reproduce the teacher…
The paper introduces $(l, b)$-inextractability, a new formal measure that demonstrates that standard indistinguishability properties are insufficient for guaranteeing protection against data extractio…
Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more
The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…
Zibo Diao, Jingchu Gai, Xinyue Ai, Zhang Zhang +2 more
The paper introduces Lossless Anti-Distillation Sampling (LADS), a novel sampling scheme that makes harvested data correlated for malicious distillers while ensuring benign users receive statistically…
Guang Yang, Amir Ghasemian, Fengchen Liu, Zhong Wang +2 more
The paper proposes interaction-layer antidistillation watermarks by embedding behavioral markers into the system prompt, which successfully track knowledge distillation even when paraphrasing attacker…
Max Hartman, Vidhata Jayaraman, Moulik Choraria, Yash Savani +1 more
The paper introduces TraceGuard, a detectability-aware antidistillation method that identifies and poisons 'thought anchors'—sparsely critical sentences—to degrade student model learning without makin…
Yiwei Zhang, Jeremiah Birrell, Reza Ebrahimi, Rouzbeh Behnia +2 more
The paper proposes WARDEN, a distributionally robust adversarial training framework that significantly reduces LLM vulnerability to adversarial attacks by dynamically reweighting hard adversarial exam…
The paper introduces Involuntary In-Context Learning (IICL), an effective few-shot pattern completion attack that can bypass safety alignments in large language models, achieving a 24.0% bypass rate a…
The paper identifies a universal, statistically predictable distribution (Mandelbrot) governing LLM outputs, enabling a highly efficient, model-agnostic scoring primitive for provenance and quality as…
The paper demonstrates that encoding harmful prompts as genuine mathematical problems, rather than just using mathematical formatting, effectively bypasses the safety filters of large language models.
Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang +1 more
The OISD framework improves language model reasoning by distilling on-policy predictive signals from the final output layer to intermediate representations, leading to substantial improvements on math…
Zeyuan Chen, Yihan Ma, Xinyue Shen, Michael Backes +1 more
The PopQuiz Attack is a novel black-box membership inference attack that successfully tests whether large language models memorize specific training data by framing the target data as multiple-choice…
Yanjiang Liu, Jie Lou, Xinyan Guan, Yuqiu Ji +6 more
The paper introduces Lookahead Group Reward to combat Supervision Fidelity Decay (SFD) in on-policy distillation, significantly improving student model performance on long reasoning tasks.
This paper introduces the Data-Model Compatibility (DMC) metric to quantify how suitable a dataset is for reasoning distillation, showing that optimizing data selection using DMC significantly improve…
The paper proposes Distribution-Aligned Self-Distillation (DASD) to improve self-distillation by dynamically filtering high-perplexity tokens, thereby preserving useful logical knowledge while suppres…
Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang +4 more
OmniOPD introduces a logit-free, chunk-level distillation framework that improves on standard On-Policy Distillation by using semantic similarity and peak-entropy scheduling, achieving state-of-the-ar…
The paper introduces Behavioral Canaries, a novel auditing mechanism that detects unauthorized use of private retrieved context data during Reinforcement Learning Fine-Tuning (RLFT) by inducing detect…
The paper proposes a local perturbation theory showing that cross-domain interference in multi-domain RL occurs via a low-dimensional shared conflict subspace, which can be selectively mitigated by sh…
Tianrun Yu, Kaixiang Zhao, Chih-Chun Chen, Amanda Hughes +4 more
LARK introduces a novel learnability-grounded approach for selecting reasoning trajectories, significantly improving the efficiency of reasoning distillation by prioritizing trajectories that the stud…