Similar to 2605.30415 · 20 results
The paper introduces CosmicFish-HRM, a compact language model that achieves adaptive reasoning by dynamically allocating computational effort through a Hierarchical Reasoning Module (HRM), showing tha…
Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more
The paper introduces a multilingual benchmark (MentalMap) to test if LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…
Zhenting Qi, Susanna Maria Baby, Stefanie Anna Baby, Kan Yuan +4 more
The paper investigates the limits of self-evolution in LLM reasoning under closed-loop settings, finding that while self-improvement is significant, it consistently falls short of perfect oracle super…
The paper investigates compositional abilities in LLMs and humans using the Personal Relation Task, finding that LLMs excel at the structured (Intensional) task while humans are better at the real-wor…
The paper proposes Continuous Reasoning for Vision-Language-Action (VLA) models, arguing that effective reasoning must be a shared, verifiable internal latent space rather than discrete text tokens, l…
The paper argues that purported anthropomorphic attributes of LLMs are not unique to language models but are substrate-dependent, demonstrating this by training a neural network on the game Age of Emp…
This survey provides a comprehensive analysis of Reasoning Language Model (RLM) adoption across 28 scientific disciplines, revealing significant disparities in RLM maturity across different scientific…
Xudong Zhang, Jian Yang, Shengkai Wang, Jiangpeng Tian +4 more
The paper proposes a dual-interventional framework to characterize how linguistic structures and contextual cues influence LLMs' spatial reasoning for navigation, finding that topological information…
The paper investigates anthropomorphic reflection markers (like 'hmm' or 'wait') in LLM reasoning and finds that these markers are often surface cues, not necessary for strong reasoning performance.
Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong +7 more
The paper argues that current embodied planning benchmarks prioritize superficial language prediction over true physical reasoning, introducing new benchmarks and a large-scale dataset to demonstrate…
Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more
The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…
The paper proposes an unsupervised Reinforcement Learning approach that enforces cross-lingual self-consistency to significantly enhance the multilingual reasoning capabilities of large language model…
This paper analyzes the internal decision-making process of large language models by tracking how the answer score changes across multiple internal computational steps (trajectories), finding that mod…
The study finds that institutional experience may leave detectable, yet suppressible, traces in language that shape Large Language Model moral reasoning, particularly when institutional stakes are amb…
COFT is a training-free decoding method that significantly reduces societal biases in large language model chain-of-thought reasoning by applying token-level fairness control at decode time.
The paper investigates whether modestly sized open-source language models can grasp the semantics of rare Paired-Focus constructions, finding that understanding emerges later in training and correlate…
This paper localizes the attention heads within LLMs responsible for specific reasoning steps, finding that specialized heads handle factual retrieval while higher layers manage global information int…
The paper introduces a new quantitative metric, Contextual Alternative Choice (CAC), to rigorously test language models' syntactic and functional understanding of determiners, showing that current mod…
This paper investigates how different types of compressed reasoning data (Explicit, Composed, Implicit CoT) affect LLM performance during post-training, finding that the choice of compression and subs…
The paper introduces a diagnostic framework to decompose multilingual LLM performance variance, showing that language identity and model-benchmark interactions are key drivers of performance gaps.