Papers similar to 2606.02502

19 results

cs.CV, cs.LG · Recent · Jun 1, 2026

ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou

ProtoAda introduces a prototype-guided, format-aware adaptive tuning framework to improve multimodal continual instruction tuning by ensuring task assignment and parameter updates respect heterogeneou…

cs.CL, cs.AI · Recent · May 27, 2026

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong +2 more

The paper introduces RA-MoE, a novel fine-tuning framework that leverages the internal routing structure of Mixture-of-Experts (MoE) models to improve performance on multilingual downstream tasks by a…

cs.CL, cs.AI, cs.CV · Recent · May 28, 2026

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui +3 more

The paper quantifies the exact parametric memory capacity of LLMs using LoRA and proposes a new optimization strategy, MemFT, to enhance memory fidelity.

cs.CL · Recent · May 29, 2026

Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

Sanchit Ahuja, Terra Blevins

The paper introduces and evaluates five parameter alignment strategies that significantly mitigate catastrophic forgetting when continually pretraining multilingual expert language models across multi…

cs.CL · Recent · May 29, 2026

dMoE: dLLMs with Learnable Block Experts

Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma +1 more

dMoE proposes a block-level Mixture-of-Experts (MoE) framework for Diffusion Large Language Models (dLLMs) that aggregates token-level expert distributions into a unified block-level distribution, sig…

cs.CL · Recent · May 29, 2026

MADS: Model-Aware Diverse Core Set Selection for Instruction Tuning

Yi Bai, Wenhao Zhang, Yao Chen, Jiao Xue +2 more

The paper proposes MADS, a Model-Aware Diverse Core Set Selection method that uses LLM internal activation states to select a small, diverse core set of instructions, significantly improving model per…

cs.LG, cs.AI · Recent · May 30, 2026

Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling

Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more

The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…

cs.AI, cs.CL · Recent · May 29, 2026

UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling

Kaiyu Huang, Xingyu Wang, Mingze Kong, Zhubo Shi +5 more

UniScale proposes a unified framework that jointly optimizes model routing and test-time scaling to achieve a superior, fine-grained quality-cost trade-off for large language model inference.

cs.LG, cs.AI · Recent · May 28, 2026

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

Kajetan Schweighofer, Conor F. Hayes, Roberto Dailey, Risto Miikkulainen +1 more

This paper introduces Anchored Weight Decay (AWD), a regularization technique that effectively prevents prior-task forgetting during LLM fine-tuning with Evolution Strategies (ES), positioning ES as a…

cs.CL · Recent · May 29, 2026

TRACE: Discovering Task-Specific Parameter via Adaptation-Aware Probing for Continual Fine-Tuning

Xiaosong Han, Ke Chen, Xindi Dai, Di Liang +6 more

TRACE proposes a novel method to mitigate catastrophic forgetting in continual LLM fine-tuning by identifying and isolating a small, task-specific subset of essential parameters for each task.

cs.CL · Recent · May 29, 2026

Mellum2 Technical Report

Marko Kojic, Ivan Bondyrev, Aral de Moor, Joseph Shtok +5 more

Mellum 2 is an open-weight 12B Mixture-of-Experts (MoE) language model specialized for software engineering, achieving performance competitive with larger models while maintaining the efficiency of a…

cs.LG, cs.AI · Recent · May 29, 2026

PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

Daize Dong, Junlin Chen, Haolong Jia, Jiawei Wu +8 more

The paper proposes Predictive Routing Replay (PR2) to stabilize reinforcement learning on Mixture of Experts (MoE) LLMs by predicting and incorporating short-horizon router evolution during training a…

cs.CR · Recent · Apr 30, 2026

MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Jona te Lintelo, Lichao Wu, Marina Krček, Sengim Karayalçin +1 more

MASCing is a novel framework that enables flexible, non-retraining reconfiguration of Mixture-of-Experts (MoE) models for specific safety objectives by applying activation steering masks to control ex…

cs.CL, cs.AI · Recent · Jun 1, 2026

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

Elia Cunegatti, Marcus Vukojevic, Erik Nielsen, Giovanni Iacca

The paper proposes SubFit, a novel compression technique that achieves superior LLM compression by replacing non-contiguous, submodule-level components (Attention and FeedForward) with lightweight res…

cs.AI · Recent · May 28, 2026

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong +2 more

ConMoE proposes a train-free method for compressing Mixture-of-Experts (MoE) models by consolidating the large expert pool into a smaller set of reusable prototypes and deterministically remapping all…

cs.LG, cs.AI, cs.CL · Recent · May 27, 2026

Parallax: Parameterized Local Linear Attention for Language Modeling

Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf +2 more

The paper introduces Parallax, a scalable and numerically stable parameterized Local Linear Attention mechanism that significantly improves LLM performance and efficiency compared to existing methods…

cs.CV, cs.AI, cs.CL · Recent · May 31, 2026

On the Limits of Token Reduction for Efficient Unified Vision Language Training

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…

cs.CV, cs.AI · Recent · May 29, 2026

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

Alicja Dobrzeniecka, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik +1 more

The paper proposes Dynamic Adapter Routing (DAR), a novel method that significantly improves continual multimodal retrieval by adaptively selecting and merging specialized adapters.

cs.CL · Recent · May 29, 2026

ElasticMem: Latent Memory as a Learnable Resource for LLM Agents

Tao Feng, Chongrui Ye, Tianyang Luo, Jingjun Xu +4 more

ElasticMem introduces a novel framework that treats memory as an elastic latent resource, allowing LLM agents to adaptively manage and inject variable-budget memories for improved performance in long-…
