Similar to 2606.01062 · 20 results
Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim +2 more
The paper introduces a systematic framework to convert large Mixture-of-Experts (MoE) models into memory-efficient, fully dense architectures, achieving superior performance compared to traditional pr…
Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong +2 more
The paper introduces RA-MoE, a novel fine-tuning framework that leverages the internal routing structure of Mixture-of-Experts (MoE) models to improve performance on multilingual downstream tasks by a…
Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong +2 more
ConMoE proposes a training-free method for compressing Mixture-of-Experts (MoE) models by consolidating the large expert pool into a smaller set of reusable prototypes and deterministically remapping all…
Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig +1 more
DOT-MoE introduces a novel framework that treats the decomposition of dense layers into Mixture of Experts (MoE) as a Differentiable Optimal Transport problem, achieving superior efficiency while pres…
Zheng Yuan, Chuang Zhou, Linhao Luo, Siyu An +3 more
MoG proposes a novel Mixture of Experts framework for graph-based RAG, which uses hub graphs to guide the sparse activation of domain-specific expert graphs, significantly improving retrieval accuracy…
Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma +1 more
dMoE proposes a block-level Mixture-of-Experts (MoE) framework for Diffusion Large Language Models (dLLMs) that aggregates token-level expert distributions into a unified block-level distribution, sig…
The paper introduces ProbMoE, a probabilistic routing framework that tackles the non-differentiability of top-$k$ routing in Mixture-of-Experts (MoE) models, achieving strong performance with improved…
The paper proposes an aggressive, parameter-efficient method to prune non-essential experts from Mixture-of-Experts (MoE) LLMs, significantly compressing the model while maintaining high machine trans…
Zekun Fei, Zihao Wang, Weijie Liu, Ruiqi He +3 more
Misrouter introduces an input-only adversarial framework to exploit the routing mechanisms of Mixture-of-Experts (MoE) LLMs, enabling unsafe behavior induction against remotely hosted, black-box servi…
Zhiyao Xu, Aoxue Liu, Zhanjie Ding, Dan Zhao +2 more
The paper proposes Task-Aware Coactivation Grouping (TACG) to significantly reduce communication costs in multi-task MoE inference by grouping experts based on task-specific co-activation patterns, ou…
Jona te Lintelo, Lichao Wu, Marina Krček, Sengim Karayalçin +1 more
MASCing is a novel framework that enables flexible reconfiguration of Mixture-of-Experts (MoE) models for specific safety objectives, without retraining, by applying activation steering masks to control ex…
Yitong Sun, Yao Huang, Teng Li, Ranjie Duan +4 more
MESA is a targeted alignment framework that decentralizes safety responsibilities across multiple experts in Mixture-of-Experts (MoE) LLMs using Optimal Transport theory, thereby improving safety robu…
Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy +5 more
The paper introduces PithTrain, a compact, agent-native Mixture-of-Experts (MoE) training framework that significantly improves agent-task efficiency compared to existing production stacks.
This paper proposes a router redesign for Mixture-of-Experts (MoE) models that uses Manifold Power Iteration to align router rows with the principal singular directions of their associated experts.
The paper analyzes the routing behavior of Mixtral MoE under benign and harmful prompts using activation and gradient signals, finding that safety-relevant routing is subtle, depth-dependent, and dist…
Marko Kojic, Ivan Bondyrev, Aral de Moor, Joseph Shtok +5 more
Mellum 2 is an open-weight 12B Mixture-of-Experts (MoE) language model specialized for software engineering, achieving performance competitive with larger models while maintaining the efficiency of a…
Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu +3 more
The paper introduces Confidence-Adaptive SwiGLU ($κ$-SwiGLU), a novel gating mechanism for Mixture-of-Experts (MoE) models that dynamically adjusts the gate sharpness based on token-level routing conf…
FPMoE introduces a sparse Mixture-of-Experts (MoE) architecture to improve functional code generation across multiple functional programming languages, achieving state-of-the-art performance with fewe…
The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…
The paper proposes EPIC, an efficient, parallel decoding framework that significantly speeds up constraining diffusion language model outputs with Context-Free Grammars (CFGs).