~ similar to 2605.31494· 19 results
The paper introduces Multi-Response Training (MRT) to combat the 'mode lottery' problem in language model fine-tuning, showing that retaining multiple valid responses significantly improves distributi…
Bowen Wei, Nan Wang, Yuqing Zhou, Jinhao Pan +1 more
The paper proposes COSE, a method that uses an LLM's intrinsic confidence as an uncertainty signal to improve self-evolutionary training, achieving state-of-the-art performance on general reasoning an…
Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing +1 more
The paper introduces 'reward bias substitution,' demonstrating that single-axis mitigations of reward model biases merely shift optimization pressure to correlated proxies, and proposes augmenting eva…
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng +4 more
The paper proposes EKSFT, a selective fine-tuning method that masks high-entropy or high-KL divergence tokens during Supervised Fine-Tuning (SFT) to prevent distribution shift and improve subsequent R…
The paper challenges the conclusion that LLMs lack reasoning by demonstrating that reported performance drops on GSM-Symbolic are often statistically weak and partially attributable to dataset biases,…
The paper introduces Expected Value Alignment (EVA), a novel reward modeling procedure that allows continuous scoring of intermediate reasoning steps in formal mathematics verification while maintaini…
The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…
Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou +8 more
EchoRL proposes a lightweight module to exploit valuable learning signals from advantage-degenerated rollouts in Reinforcement Learning with Verifiable Rewards (RLVR), significantly improving LLM post…
The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…
Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li +2 more
The paper introduces SAVE, a framework that uses on-policy feedback and the value function to self-supervise and improve reward models, significantly enhancing RLHF performance across multiple benchma…
The paper introduces a novel, per-token feature derived from how sampling temperature reshapes the token distribution, demonstrating it is a significantly stronger predictor of LLM creativity than sta…
This paper introduces RREDCoT, a method for approximating optimal reward redistribution in Chain-of-Thought reasoning language models without additional generation.
This paper introduces RREDCoT, a method for approximating optimal reward redistribution in Chain-of-Thought reasoning language models without additional generation.
Weak self-training on synthetic data can amplify a language model's existing capabilities, but this effect is strictly dependent on the compatibility between the source and student models, not on the…
The paper proposes using Maximum Independent Set (MIS) algorithms on similarity graphs to select a maximally diverse and non-redundant subset of prompts for LLM benchmarking, achieving consistent rank…
The paper proposes Sensitivity-Uncertainty Alignment (SUA), a framework that measures the misalignment between a model's prediction instability and its stated uncertainty to improve model reliability.
The paper introduces and evaluates five parameter alignment strategies that significantly mitigate catastrophic forgetting when continually pretraining multilingual expert language models across multi…
The paper introduces Logit-aware Final-block Quantization (LFQ), an enhancement to block-wise quantization that quantizes the final Transformer block using a cross-entropy loss to significantly boost…