Papers similar to 2606.03990

17 results

cs.LG · cs.AI · stat.ML · Recent · May 28, 2026

On the Optimizer Dependence of Neural Scaling Laws

The scaling exponent in neural scaling laws is not fixed but systematically depends on the optimizer used, with preconditioned optimizers generally yielding steeper scaling.

cs.LG · cs.AI · Recent · Jun 1, 2026

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures

Yongzhong Xu

The paper tracks the developmental emergence of attention circuits in 1B-class language models, finding that the formation of induction and attention-sink circuits are distinct, temporally separated t…

cs.NE · cs.LG · Recent · Jun 2, 2026

Quadratic integrate-and-fire neurons exhibit less fragmented loss landscapes and outperform leaky integrate-and-fire neurons in spike-based gradient descent

Carlo Wenig, Raoul-Martin Memmesheimer, Christian Klos

The paper demonstrates that quadratic integrate-and-fire (QIF) neurons are superior to leaky integrate-and-fire (LIF) neurons for gradient descent training in spiking neural networks because their con…

cs.CL · cs.AI · cs.LG · Recent · Jun 1, 2026

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

Sourav Das

ProbeScale is a novel framework that combines neural scaling laws and language model probing to identify highly efficient, task-specific subnetworks within pre-trained Small Language Models, achieving…

cs.LG · cs.CL · Recent · May 30, 2026

Task Structure Reverses Layerwise State Encoding in Sequence Models

Yuhang Jiang

The paper demonstrates that the location and nature of state encoding in sequence models are not fixed architectural traits but are highly dependent on the specific task, showing that the encoding pro…

cs.AI · Recent · May 27, 2026

Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

Simardeep Singh, Paras Chopra

This paper demonstrates that large language models spontaneously develop geometric structures corresponding to human perceptual domains (like color or pitch) within their internal layers, suggesting t…

cs.LG · cs.CL · Recent · Jun 1, 2026

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

Lei Yang, Siyu Ding, Deyi Xiong

The paper proposes a local perturbation theory showing that cross-domain interference in multi-domain RL occurs via a low-dimensional shared conflict subspace, which can be selectively mitigated by sh…

cs.CV · cs.AI · Recent · Jun 1, 2026

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

This paper investigates the application of Parameter-Efficient Fine-Tuning (PEFT) methods, specifically adapters and LoRA, to large pretrained models for instance segmentation, demonstrating that thes…

cs.LG · cs.AI · cs.CV · Recent · May 30, 2026

On the Difficulty of Learning a Meta-network for Training Data Selection

Zilin Du, Junqi Zhao, Boyang Albert Li

This paper analyzes the poor performance of Meta-learning for Training-data Selection (MTS) and proposes that increasing the batch size and incorporating informative features can significantly improve…

cs.LG · cs.AI · Recent · May 31, 2026

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal +5 more

This paper introduces a new scaling law for sparse language models trained with limited data, demonstrating that sparsity can significantly improve performance and delay data saturation during multi-e…

cs.LG · Recent · Jun 1, 2026

Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization

Yung-Chin Chen, Chung Peng Lee, Ze-Wei Liou, Naveen Verma

The paper argues that large activation spikes in LLMs are structural vector biases, and proposes a novel quantization framework (INSERTQUANT) to eliminate these spikes, enabling robust low-bit quantiz…

q-bio.NC · cs.AI · Recent · May 27, 2026

Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images

Joséphine Raugel, Maximilian Seitzer, Marc Szafraniec, Huy V. Vo +5 more

While backpropagated gradients can predict human brain activity in the visual cortex, their spatial and temporal organization fundamentally diverges from the expected patterns of a biologically plausi…

cs.CL · cs.AI · cs.CV · Recent · May 28, 2026

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui +3 more

The paper quantifies the exact parametric memory capacity of LLMs using LoRA and proposes a new optimization strategy, MemFT, to enhance memory fidelity.

cs.LG · cs.AI · Recent · May 30, 2026

Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling

Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal +5 more

The paper introduces Sparse Memory-Efficient Training (SMET), a method that stabilizes and optimizes Dynamic Sparse Training (DST) for large language models, enabling stable and memory-efficient spars…

cs.LG · cs.AI · physics.comp-ph · Recent · May 27, 2026

Unveiling Multi-regime Patterns in SciML: Distinct Failure Modes and Regime-specific Optimization

Yuxin Wang, Yuanzhe Hu, Xiaokun Zhong, Xiaopeng Wang +6 more

This paper analyzes the multi-regime behavior of Scientific Machine Learning (SciML) models, finding that optimization effectiveness is regime-specific and that failure modes require a unified, regime…

cs.LG · cs.AI · cs.CV · Recent · May 28, 2026

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das

The paper introduces and analyzes several novel data appraisal metrics, including the Vendi Score and matrix spectral functions, demonstrating that efficient optimization techniques make these metrics…

cs.CL · Recent · May 31, 2026

Before and After Temperature: A Distributional View of Creative LLM Generation

V. S. Raghu Parupudi, Harsha Ponnada, Aditi Kaushal, S. Shria Parupudi +2 more

The paper introduces a novel, per-token feature derived from how sampling temperature reshapes the token distribution, demonstrating it is a significantly stronger predictor of LLM creativity than sta…
