~ similar to 2606.03014· 20 results
Zhiyao Xu, Aoxue Liu, Zhanjie Ding, Dan Zhao +2 more
The paper proposes Task-Aware Coactivation Grouping (TACG) to significantly reduce communication costs in multi-task MoE inference by grouping experts based on task-specific co-activation patterns, ou…
The paper evaluates multi-agent LLM oracle systems for prediction market resolution, finding that independent aggregation with confidence-weighted voting significantly outperforms single-model baselin…
The paper introduces DEFT, a novel Mixture-of-Experts DRL architecture, to intelligently schedule dynamic cloud workflows with varying deadlines, significantly improving performance over existing sing…
Gangmuk Lim, Wanyu Zhao, Brighten Godfrey, Jiaxin Shan +2 more
Lodestar is a novel online learning-based request routing system that significantly improves LLM inference efficiency by dynamically assigning incoming requests to the optimal GPU instance to minimize…
The paper proposes Multi-Agent Computer Use (MACU) systems, which significantly improve performance on complex, long-horizon tasks by enabling parallel execution and dynamic task decomposition compare…
The paper introduces Rotary GPU, an exploratory execution approach demonstrating that large Mixture-of-Experts models can be run locally on consumer GPUs with limited VRAM, achieving usable decode thr…
Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen +2 more
The paper introduces StreamMA, a streaming multi-agent reasoning system that significantly reduces latency and improves effectiveness by passing reasoning steps to downstream agents as they are genera…
This paper proposes a new router redesign for Mixture-of-Experts models using Manifold Power Iteration to align router rows with the principal singular directions of associated experts.
The paper proposes using an LLM aggregator that analyzes complete reasoning traces, demonstrating that trace-level synthesis is superior to traditional consensus methods like majority voting for solvi…
Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang +3 more
The paper proposes BaSE, a multi-armed bandit approach, to optimally allocate a fixed budget of LLM calls across parallel evolutionary search trajectories, significantly improving mean fitness and rel…
RACE-Sched is an asynchronous agentic framework that successfully integrates low-latency, real-time scheduling decisions with advanced, long-horizon reasoning provided by Large Language Models.
This paper systematically analyzes the complex design space of hybrid multi-agent systems combining on-device and cloud AI models, finding that the optimal architecture is highly task-dependent and th…
MOSAIC introduces a structured agentic framework that treats automated data science as a staged, context-grounded model selection problem, improving performance and traceability over traditional AutoM…
The paper proposes scheduling LLM agent workloads at the conversation level rather than the turn level, significantly reducing latency and improving energy efficiency by transforming unpredictable mul…
Jiarui Feng, Hanqing Zeng, Karish Grover, Ruizhong Qiu +10 more
The paper proposes DAG-MoE, a novel sparse Mixture-of-Experts framework that replaces standard weighted-sum aggregation with structural aggregation to enhance model performance and enable multi-step r…
Yannan Wang, Longli Yang, Zhen Liu, Abhishek Kumar +1 more
CoMIC is a cloud-edge framework that enables resource-constrained LLM agents to successfully complete complex, long-horizon tasks by collaboratively sharing and refining memory and insights between lo…
Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim +11 more
The paper introduces K-BrowseComp, a new web-browsing agent benchmark of 400 problems grounded in Korean contexts, demonstrating that current frontier LLMs struggle significantly with complex, context…
Ruihang Lai, Hao Kang, Haozhan Tang, Akaash R. Parthasarathy +5 more
The paper introduces PithTrain, a compact, agent-native Mixture-of-Experts (MoE) training framework that significantly improves agent-task efficiency compared to existing production stacks.
Shuoming Zhang, Qiuchu Yu, Yangyu Zhang, Ruiyuan Xu +5 more
KLineage introduces a novel method to teach LLMs when and how to apply GPU kernel optimizations by reverse-engineering expert kernel lineages, resulting in superior optimization skills compared to exi…
Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma +1 more
dMoE proposes a block-level Mixture-of-Experts (MoE) framework for Diffusion Large Language Models (dLLMs) that aggregates token-level expert distributions into a unified block-level distribution, sig…