Papers similar to 2606.01062

~ similar to 2606.01062· 20 results

cs.CLcs.AIcs.LGRecentMay 27, 2026

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim +2 more

The paper introduces a systematic framework to convert large Mixture-of-Experts (MoE) models into memory-efficient, fully dense architectures, achieving superior performance compared to traditional pr…

View →

cs.CLcs.AIRecentMay 27, 2026

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong +2 more

The paper introduces RA-MoE, a novel fine-tuning framework that leverages the internal routing structure of Mixture-of-Experts (MoE) models to improve performance on multilingual downstream tasks by a…

View →

cs.AIRecentMay 28, 2026

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong +2 more

ConMoE proposes a train-free method for compressing Mixture-of-Experts (MoE) models by consolidating the large expert pool into a smaller set of reusable prototypes and deterministically remapping all…

View →

cs.LGcs.AIRecentJun 1, 2026

DOT-MoE: Differentiable Optimal Transport for MoEfication

Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig +1 more

DOT-MoE introduces a novel framework that treats the decomposition of dense layers into Mixture of Experts (MoE) as a Differentiable Optimal Transport problem, achieving superior efficiency while pres…

View →

cs.CLRecentMay 29, 2026

MoG: Mixture of Experts for Graph-based Retrieval-Augmented Generation

Zheng Yuan, Chuang Zhou, Linhao Luo, Siyu An +3 more

MoG proposes a novel Mixture of Experts framework for graph-based RAG, which uses hub graphs to guide the sparse activation of domain-specific expert graphs, significantly improving retrieval accuracy…

View →

cs.CLRecentMay 29, 2026

dMoE: dLLMs with Learnable Block Experts

Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma +1 more

dMoE proposes a block-level Mixture-of-Experts (MoE) framework for Diffusion Large Language Models (dLLMs) that aggregates token-level expert distributions into a unified block-level distribution, sig…

View →

cs.LGcs.AIRecentJun 1, 2026

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng

The paper introduces ProbMoE, a probabilistic routing framework that tackles the non-differentiability of top-$k$ routing in Mixture-of-Experts (MoE) models, achieving strong performance with improved…

View →

cs.CLcs.AIcs.LGRecentMay 27, 2026

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

Liu O. Martin, Lucas Bandarkar, Nanyun Peng

The paper proposes an aggressive, parameter-efficient method to prune non-essential experts from Mixture-of-Experts (MoE) LLMs, significantly compressing the model while maintaining high machine trans…

View →

cs.CRRecentMay 6, 2026

Misrouter: Exploiting Routing Mechanisms for Input-Only Attacks on Mixture-of-Experts LLMs

Zekun Fei, Zihao Wang, Weijie Liu, Ruiqi He +3 more

Misrouter introduces an input-only adversarial framework to exploit the routing mechanisms of Mixture-of-Experts (MoE) LLMs, enabling unsafe behavior induction against remotely hosted, black-box servi…

View →

cs.LGcs.AIRecentMay 31, 2026

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

Zhiyao Xu, Aoxue Liu, Zhanjie Ding, Dan Zhao +2 more

The paper proposes Task-Aware Coactivation Grouping (TACG) to significantly reduce communication costs in multi-task MoE inference by grouping experts based on task-specific co-activation patterns, ou…

View →

cs.CRRecentApr 30, 2026

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Jona te Lintelo, Lichao Wu, Marina Krček, Sengim Karayalçin +1 more

MASCing is a novel framework that enables flexible, non-retraining reconfiguration of Mixture-of-Experts (MoE) models for specific safety objectives by applying activation steering masks to control ex…

View →

cs.LGcs.AIcs.CLRecentMay 30, 2026

MESA: Improving MoE Safety Alignment via Decentralized Expertise

Yitong Sun, Yao Huang, Teng Li, Ranjie Duan +4 more

MESA is a targeted alignment framework that decentralizes safety responsibilities across multiple experts in Mixture-of-Experts (MoE) LLMs using Optimal Transport theory, thereby improving safety robu…

View →

cs.LGcs.AIcs.CLRecentMay 29, 2026

PithTrain: A Compact and Agent-Native MoE Training System

Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy +5 more

The paper introduces PithTrain, a compact, agent-native Mixture-of-Experts (MoE) training framework that significantly improves agent-task efficiency compared to existing production stacks.

View →

cs.LGcs.AIcs.CLEmpiricalRecentJun 10, 2026

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin

This paper proposes a new router redesign for Mixture-of-Experts models using Manifold Power Iteration to align router rows with the principal singular directions of associated experts.

View →

cs.AIcs.CRRecentMay 22, 2026

Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts

Md Nurul Absar Siddiky

The paper analyzes the routing behavior of Mixtral MoE under benign and harmful prompts using activation and gradient signals, finding that safety-relevant routing is subtle, depth-dependent, and dist…

View →

cs.CLRecentMay 29, 2026

Mellum2 Technical Report

Marko Kojic, Ivan Bondyrev, Aral de Moor, Joseph Shtok +5 more

Mellum 2 is an open-weight 12B Mixture-of-Experts (MoE) language model specialized for software engineering, achieving performance competitive with larger models while maintaining the efficiency of a…

View →

cs.LGcs.CLRecentMay 30, 2026

Confidence-Adaptive SwiGLU for Mixture-of-Experts

Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu +3 more

The paper introduces Confidence-Adaptive SwiGLU ($κ$-SwiGLU), a novel gating mechanism for Mixture-of-Experts (MoE) models that dynamically adjusts the gate sharpness based on token-level routing conf…

View →

cs.PLcs.AIcs.CLRecentMay 27, 2026

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong

FPMoE introduces a sparse Mixture-of-Experts (MoE) architecture to improve functional code generation across multiple functional programming languages, achieving state-of-the-art performance with fewe…

View →

cs.LGcs.AIcs.DCRecentMay 27, 2026

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare +8 more

The paper systematically analyzes the benefits and limits of Attention-FFN Disaggregation (AFD) for Mixture-of-Experts (MoE) LLM serving, demonstrating that AFD is crucial for achieving high throughpu…

View →

cs.CLcs.AIRecentMay 30, 2026

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

Hyundong Jin, Yo-Sub Han

The paper proposes EPIC, an efficient and parallel decoding framework that significantly speeds up the process of constraining diffusion language model outputs using Context-Free Grammars (CFG).

View →