~ similar to 2605.30324· 20 results
The paper quantifies the cost of privacy in language identification and generation using differentially private (DP) methods, finding that the cost is surprisingly mild, particularly absent under appr…
Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal +5 more
This paper introduces a new scaling law for sparse language models trained with limited data, demonstrating that sparsity can significantly improve performance and delay data saturation during multi-e…
Zhenting Qi, Susanna Maria Baby, Stefanie Anna Baby, Kan Yuan +4 more
The paper investigates the limits of self-evolution in LLM reasoning under closed-loop settings, finding that while self-improvement is significant, it consistently falls short of perfect oracle super…
The paper proposes a unified framework for designing efficient and expressive token mixing layers by separating the direct and recurrent influences of inputs, allowing for a principled trade-off betwe…
The paper introduces Reasoning in Memory (RiM), a latent reasoning method that replaces autoregressive token generation with fixed memory blocks to enable compute-efficient internal working memory for…
The paper demonstrates that positional encodings are not necessary for transformers to achieve universal computation, showing that the inherent mechanism of sliding context windows already provides su…
Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang +5 more
The paper introduces DOMINO, a novel inductive framework that synthesizes domain-specific data for LLMs using only reference examples, significantly improving performance on challenging, implicitly de…
This paper proposes Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely.
This paper proposes Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely.
Zhenlin Hu, Yan Wang, Zhen Bi, Zihao Xue +6 more
The paper introduces StreamSynth, a sequential setting for synthetic data generation, and proposes SynLearner, a framework that enables LLMs to improve synthesis performance by accumulating and transf…
The paper establishes information-theoretic lower bounds for stochastic optimization using low-bit gradients by reducing the problem to compressed Gaussian mean estimation, yielding sharp bounds on co…
The paper analyzes a fragment of Higher-Order Datalog, showing that restricting recursion to a linear form shifts its expressive power from time complexity to space complexity, specifically capturing…
Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more
The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…
The paper introduces and evaluates bounded behavioral indistinguishability, showing that while LoRA distillation improves semantic similarity, it does not guarantee that the student model is behaviora…
The paper proposes $D^3$, a dynamic graph-constrained scheduling framework that optimizes LLM training order by modeling sample interactions as a dynamic influence graph.
The paper introduces and evaluates five parameter alignment strategies that significantly mitigate catastrophic forgetting when continually pretraining multilingual expert language models across multi…
Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui +3 more
The paper quantifies the exact parametric memory capacity of LLMs using LoRA and proposes a new optimization strategy, MemFT, to enhance memory fidelity.
The paper provides a unified algebraic framework to determine the formal language expressivity of recurrent neural language models, resolving conflicts in existing literature by linking expressivity t…
This paper proposes using offline reinforcement learning (RL) as an efficient alternative to online RL for post-training code-generating LLMs, demonstrating its effectiveness, especially for smaller m…
Weak self-training on synthetic data can amplify a language model's existing capabilities, but this effect is strictly dependent on the compatibility between the source and student models, not on the…