Yutaka Matsuo
2 indexed papers
Research Timeline
This paper investigates how different types of compressed reasoning data (Explicit, Composed, Implicit CoT) affect LLM performance during post-training, finding that the choice of compression and subsequent fine-tuning method significantly impacts generalization and data scaling.
The paper introduces ReMax, a novel objective function that naturally encourages stochastic exploration in policy gradient reinforcement learning by evaluating expected maximum returns over multiple samples, and proposes RePPO for efficient optimization.
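To make the idea of "expected maximum returns over multiple samples" concrete, the following is a minimal sketch, not the paper's implementation. It estimates E[max(R_1, ..., R_k)] by Monte Carlo for two toy return distributions with the same mean. The function name `expected_max_return`, the two toy policies, and the parameters `k`, `n_trials`, and `seed` are all assumptions introduced here for illustration.

```python
import random


def expected_max_return(sample_return, k, n_trials=10_000, seed=0):
    """Monte Carlo estimate of E[max(R_1, ..., R_k)] for k i.i.d. returns.

    `sample_return` is a hypothetical callable that takes a random.Random
    instance and returns one sampled episode return.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Best return among k independent attempts ("retries").
        total += max(sample_return(rng) for _ in range(k))
    return total / n_trials


# Two toy "policies" (assumptions for illustration), both with mean return 0.5.
def deterministic(rng):
    return 0.5  # always the same return


def stochastic(rng):
    return 1.0 if rng.random() < 0.5 else 0.0  # succeeds half the time


if __name__ == "__main__":
    for k in (1, 4):
        det = expected_max_return(deterministic, k)
        sto = expected_max_return(stochastic, k)
        print(f"k={k}: deterministic={det:.3f}, stochastic={sto:.3f}")
```

With a single sample (k = 1) the two toy policies score about the same, since their means are equal. With k retries the stochastic one has an analytic expected maximum of 1 - 0.5^k, so it scores higher for k > 1. This is one way to see why an objective of this shape can favor stochastic behavior.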
Papers
Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying