Papers similar to 2606.01532

Similar to 2606.01532 · 18 results

cs.CL · cs.AI · Recent · May 28, 2026

Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

Pierre-Antoine Lequeu, Camille Barboule, Benjamin Piwowarski

The paper proposes explicitly disentangling positional and semantic representations in Transformer encoders, demonstrating that this separation allows for a clearer understanding of how positional inf…

cs.LG · cs.AI · Recent · May 29, 2026

Positional versus Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization

Felipe Urrutia, Juan José Alegría, Cinthia Sanchez Macias, Jorge Salas +2 more

The paper analyzes the distinct computational roles of positional versus symbolic attention heads in Transformers, demonstrating that symbolic mechanisms generalize more reliably to longer sequences t…

cs.LG · cs.AI · cs.CC · Recent · May 28, 2026

Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't

Anej Svete, William Merrill, Ryan Cotterell, Ashish Sabharwal

The paper analyzes the expressivity of padded transformers, proving that their computational power is primarily determined by model depth and numeric precision, rather than attention type or width.

cs.LG · cs.CL · Recent · May 30, 2026

Task Structure Reverses Layerwise State Encoding in Sequence Models

Yuhang Jiang

The paper demonstrates that the location and nature of state encoding in sequence models are not fixed architectural traits but are highly dependent on the specific task, showing that the encoding pro…

cs.LG · cs.CL · Recent · May 29, 2026

Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing

Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen

The paper proposes a unified framework for designing efficient and expressive token mixing layers by separating the direct and recurrent influences of inputs, allowing for a principled trade-off betwe…

cs.LG · cs.CL · eess.SP · Recent · May 31, 2026

Beyond Sinusoids: A Morlet Wavelet Framework for Transformer Positional Encoding

Athanasios Zeris

The paper introduces Morlet Positional Encoding (MoPE), a novel wavelet-based positional encoding that models position and locality simultaneously, outperforming standard sinusoidal and RoPE methods.

cs.CL · cs.AI · Recent · May 27, 2026

Periodic RoPE for Infinite Context LLMs

Simin Huo

The paper proposes Periodic RoPE (P-RoPE) combined with a dual-layer attention mechanism to overcome the positional encoding limitations of LLMs, enabling theoretically infinite context understanding.

cs.DS · cs.AI · cs.CL · Recent · May 28, 2026

On Language Generation in the Limit with Bounded Memory

Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas

The paper analyzes language generation and identification in the limit under bounded memory, showing that memory constraints significantly alter learnability, particularly affecting achievable density…

cs.AI · Recent · May 28, 2026

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh

Moment-KV introduces a novel momentum-based technique to compress the Key-Value (KV) cache during the decoding phase of LLM generation, significantly improving fidelity in long-generation tasks.

cs.LG · cs.AI · Recent · Jun 1, 2026

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures

Yongzhong Xu

The paper tracks the developmental emergence of attention circuits in 1B-class language models, finding that the formation of induction and attention-sink circuits are distinct, temporally separated t…

cs.AI · cs.LG · Recent · Jun 1, 2026

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov +2 more

The paper analyzes the failure modes of aggressive 2-bit quantization in large reasoning models, proposing lightweight controls like FP16 planning and loop rescue to restore accuracy and achieve pract…

cs.CV · cs.AI · cs.CL · Recent · May 31, 2026

On the Limits of Token Reduction for Efficient Unified Vision Language Training

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…

cs.CL · Recent · May 29, 2026

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren +1 more

The paper proposes Speculative Pipeline Decoding (SPD), a novel framework that uses pipeline parallelism to accelerate LLM inference by processing multiple tokens in parallel, achieving higher speedup…

cs.LG · cs.AI · Recent · May 30, 2026

Richer Representations for Neural Algorithmic Reasoning via Auxiliary Reconstruction

Jiafu Huang, Chao Peng, Chenyang Xu, Zhengfeng Yang +6 more

The paper proposes using an auxiliary reconstruction task, specifically one that captures intra-state feature dependencies, to improve the quality of state representations learned by the encoder in ne…

cs.LG · cs.AI · Recent · Jun 1, 2026

Variational Learning for Insertion-based Generation

Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying +3 more

The paper introduces the Insertion Process (IP), a novel stochastic generative model that learns variable-length, non-monotonic sequence generation by explicitly modeling the insertion order of tokens…

cs.AI · Recent · May 31, 2026

Emergent Ordinal Geometry in Transformers Trained on Local Comparisons

Nishit Singh

The paper demonstrates that Transformers trained on local comparisons implicitly learn a global, one-dimensional ordinal structure, mirroring the human ability to perform transitive inference.

cs.LG · cs.AI · Empirical · Comprehensive · Recent · Jun 4, 2026

Pretraining Recurrent Networks without Recurrence

Akarsh Kumar, Phillip Isola

This paper proposes Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely.
