Efficient ML
Model compression, pruning, quantization, distillation
20 papers indexed
Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models
Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong +4 more
The paper proposes a compression pipeline combining few-step distillation and low-bit quantization to significantly reduce the deployment cost and parameter footprint of large dual-expert video diffus…
Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction
Yujia Tong, Yuxi Wang, Yunyang Wan, Tian Zhang +2 more
This paper investigates whether model compression techniques (like quantization and pruning) preserve a Large Language Model's ability to quantify its own uncertainty, finding that accuracy-only evalu…
VertMark: A Unified Training-Free Robust Watermarking Framework for Vertical Domain Pre-trained Language Models
Cong Kong, Xin Cheng, Zhaoxia Yin, Shuai Li +2 more
VertMark introduces a novel, unified, and training-free framework to embed robust watermarks into vertical domain pre-trained language models (VPLMs) for copyright protection across multiple specializ…
Pruning and Distilling Mixture-of-Experts into Dense Language Models
Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim +2 more
The paper introduces a systematic framework to convert large Mixture-of-Experts (MoE) models into memory-efficient, fully dense architectures, achieving superior performance compared to traditional pr…
Token Optimization Strategies for LLM-Based Oracle-to-PostgreSQL Migration
This paper formalizes token optimization as a multi-objective constrained transformation problem for LLM-based Oracle-to-PostgreSQL migration, demonstrating that adaptive routing offers the best balan…
Multi-Teacher Knowledge Distillation via Teacher-Informed Mixture Priors
Luyang Fang, Yongkai Chen, Jiazhang Cai, Ping Ma +1 more
The paper proposes Multi-Teacher Bayesian Knowledge Distillation (MT-BKD), a framework that uses Bayesian inference and teacher-informed priors to improve model compression, enhance predictive accurac…
CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models
Fengze Yang, Bo Yu, Xuewen Luo, Cathy Liu +1 more
CIVIC is a path-consistent compact visual inference framework that achieves genuine hardware efficiency in Vision-Language Models by maintaining contiguous sequence representations across all inferenc…
OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning
Geng Li, Guohao Chen, Ting Chen, Shilin Shan +5 more
OccamToken introduces a training-free, adaptive token pruning framework that replaces fixed token budgets with relative evidence testing against a register-based reference, significantly improving VLM…
MiCU: End-to-End Smart Home Command Understanding with Large Language Model
Haowei Han, Kexin Hu, Weiwei Cai, Debiao Zhang +5 more
The paper introduces MiCU, a domain-specific LLM that significantly improves smart home command understanding, especially for ambiguous commands, by synthesizing training data and optimizing the model…
AttnDiff: Attention-based Differential Fingerprinting for Large Language Models
Haobo Zhang, Zhenhua Xu, Junxian Li, Shangfeng Sheng +2 more
AttnDiff introduces a data-efficient white-box framework that extracts intrinsic attention-based fingerprints to verify the provenance and detect unauthorized derivation of large language models (LLMs…
Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
The paper introduces Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes deep learning training in non-smooth loss landscapes by dynamically damping updates based on local geometric ins…
Trust Region On-Policy Distillation
Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li +1 more
The paper introduces Trust Region On-Policy Distillation (TrOPD), a robust method that stabilizes the on-policy distillation of large language models by restricting training to regions where teacher s…
Neural Network Compression by Approximate Differential Equivalence
The paper proposes a novel neural network compression technique that aggregates neurons with similar functional dynamics, achieving significant model size reduction while maintaining high accuracy.
VEDAL: Variational Error-Driven Asynchronous Learning for 3D Gaussian Splatting Pruning
Aoduo Li, Jiancheng Li, Huan Ye, Hongjian Xu +4 more
VEDAL introduces a variational, error-driven asynchronous learning framework to efficiently prune 3D Gaussian Splatting, achieving high compression ratios with minimal loss in novel view synthesis qua…
Locality-Aware Redundancy Pruning for LLM Depth Compression
Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo +2 more
The paper proposes Locality-Aware Redundancy Pruning (LoRP), a training-free method that prunes LLM layers by exploiting localized inter-layer redundancy, leading to improved efficiency while maintain…
SecRL-Prune: Structured Reinforcement Learning-Based Pruning of CodeLLMs for Preserving Adversarial Code Mutation
The paper introduces SecRL-Prune, a structured reinforcement learning framework that effectively prunes CodeLLMs while preserving their critical ability to generate adversarial, functionality-preservi…
When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression
This paper systematically analyzes combining dimensionality reduction and quantization to compress text embeddings, showing that this combined approach achieves substantial compression (e.g., 0.1% siz…
Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text
The paper introduces eXTC, a novel framework that combines structured prompt optimization, knowledge distillation, and reinforcement learning to create a highly performant and fully interpretable text…
ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression
Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong +2 more
ConMoE proposes a train-free method for compressing Mixture-of-Experts (MoE) models by consolidating the large expert pool into a smaller set of reusable prototypes and deterministically remapping all…
STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing
The paper introduces STARFISH, a novel healing method that efficiently recovers significant accuracy in heavily pruned neural networks by optimizing the pruned model to match the original network's in…