~ similar to 2605.29387 · 17 results
The study finds that specific, interpretable neuron populations (Rosetta Neurons) exhibit predictable, scale-dependent changes in selectivity and specialization as neural models grow larger.
The paper introduces and analyzes several novel data appraisal metrics, including the Vendi Score and matrix spectral functions, demonstrating that efficient optimization techniques make these metrics…
This paper develops a unified spectral analysis framework to explain how knowledge transfer (KT) works across different machine learning regimes, such as Knowledge Distillation and Weak-to-Strong gene…
This paper re-examines the role of temperature ($\tau$) in LLM distillation, demonstrating that while Reverse KL (RKL) is often preferred, Forward KL (FKL) significantly outperforms RKL at higher tempe…
The paper introduces a novel, per-token feature derived from how sampling temperature reshapes the token distribution, demonstrating it is a significantly stronger predictor of LLM creativity than sta…
Yuxin Wang, Yuanzhe Hu, Xiaokun Zhong, Xiaopeng Wang +6 more
This paper analyzes the multi-regime behavior of Scientific Machine Learning (SciML) models, finding that optimization effectiveness is regime-specific and that failure modes require a unified, regime…
Senmiao Wang, Tiantian Fang, Haoran Zhang, Yushun Zhang +3 more
This paper proposes a preconditioning layer for stable weight conditioning in LLM training.
The paper introduces a Jacobian-based spectral audit to evaluate neural operators, demonstrating that standard prediction error metrics fail to capture crucial local dynamical structures and operator…
The paper demonstrates that the location and nature of state encoding in sequence models are not fixed architectural traits but are highly dependent on the specific task, showing that the encoding pro…
Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal +5 more
This paper introduces a new scaling law for sparse language models trained with limited data, demonstrating that sparsity can significantly improve performance and delay data saturation during multi-e…
The paper challenges the conclusion that LLMs lack reasoning by demonstrating that reported performance drops on GSM-Symbolic are often statistically weak and partially attributable to dataset biases,…
Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more
The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…
ProbScale is a novel framework that combines neural scaling laws and language model probing to identify highly efficient, task-specific subnetworks within pre-trained Small Language Models, achieving…
The paper introduces a comprehensive benchmark to test if physics foundation models learn generalizable dynamics, finding that their performance is highly conditional and not universally general.
The paper introduces Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes deep learning training in non-smooth loss landscapes by dynamically damping updates based on local geometric ins…
This paper analyzes the poor performance of Meta-learning for Training-data Selection (MTS) and proposes that increasing the batch size and incorporating informative features can significantly improve…