Papers similar to 2605.30478

~ similar to 2605.30478· 19 results

cs.CLcs.SERecentMay 29, 2026

Combinatorial Synthesis: Scaling Code RLVR via Atomic Decomposition and Recombination

Jiasheng Zheng, Boxi Cao, Boxi Yu, Yuzhong Zhang +5 more

The paper introduces Atomic Decomposition and Recombination (ADR), a novel framework that generates genuinely novel and challenging verifiable code tasks, significantly improving the scalability of Re…

View →

cs.AIRecentMay 31, 2026

Before the Model Learns the Bug:Fuzzing RLVR Verifiers

Jaideep Ray

The paper introduces a verifier-fuzzing framework to detect and analyze failure modes in Reinforcement Learning with Verifiable Rewards (RLVR) where bugs in the reward verifier can be exploited by the…

View →

cs.CRcs.AIRecentApr 10, 2026

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

Weiyang Guo, Zesheng Shi, Zeen Zhu, Yuan Zhou +2 more

This paper introduces a novel backdoor attack (ACB) against Reinforcement Learning with Verifiable Rewards (RLVR), demonstrating that poisoning the training data can implant a backdoor that significan…

View →

cs.AIRecentMay 27, 2026

Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

Mingze Wu, Abhinav Anand, Shweta Verma, Mira Mezini

This paper proposes using offline reinforcement learning (RL) as an efficient alternative to online RL for post-training code-generating LLMs, demonstrating its effectiveness, especially for smaller m…

View →

cs.LGcs.AIcs.SERecentMay 30, 2026

Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

Yongxi Zhou, Lai Yun Choi, Jiaxi Wen, Wenbo Ye

The paper demonstrates that standard LLM evaluation metrics overestimate performance because they fail to account for the stability of outcomes, showing a significant gap between reported pass rates a…

View →

cs.CVcs.AIRecentMay 28, 2026

Reinforcement Learning with Robust Rubric Rewards

Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu +14 more

The paper introduces $ ext{RLR}^3$, a novel framework that extends verifiable rewards in Reinforcement Learning to handle partially verifiable, multi-criteria vision-language tasks by integrating robu…

View →

cs.LGcs.AIRecentMay 27, 2026

Label-Free Reinforcement Learning via Cross-Model Entropy

Matt Gorbett, Hossein Shirazi

The paper introduces Cross-Model Entropy (CME), a novel label-free reward signal that uses an independent verifier model to assess the quality of a generator's output, significantly improving LLM perf…

View →

cs.LGcs.AIRecentMay 29, 2026

EchoRL: Reinforcement Learning via Rollout Echoing

Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou +8 more

EchoRL proposes a lightweight module to exploit valuable learning signals from advantage-degenerated rollouts in Reinforcement Learning with Verifiable Rewards (RLVR), significantly improving LLM post…

View →

cs.CRcs.AIRecentJun 2, 2026

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

Wenqi Chen, Ziyan Zhang, Bing Wang, Lin Liu +2 more

The paper introduces Tree-like Self-Play (TSP), a novel framework that treats secure code generation as a fine-grained decision process, significantly improving LLM security by forcing the model to se…

View →

cs.SEcs.AIRecentMay 28, 2026

Inferring Code Correctness from Specification

Tambon Florian, Papadakis Mike

The paper introduces TRAILS~, a novel method that improves code correctness validation by grounding LLM reasoning in concrete (input, output) pairs derived from specifications, achieving state-of-the-…

View →

cs.AIcs.ARcs.CLRecentJun 2, 2026

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

Prashanth Vijayaraghavan, Apoorva Nitsure, Luyao Shi, Ehsan Degan +1 more

StepPRM-RTL is a novel framework that enhances LLM-based RTL code generation for digital hardware designs.

View →

cs.AIRecentMay 31, 2026

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

Shihao Ji, Haotao Tan, Zihui Song, Mingyu Li

The paper introduces Expected Value Alignment (EVA), a novel reward modeling procedure that allows continuous scoring of intermediate reasoning steps in formal mathematics verification while maintaini…

View →

cs.SEcs.CRRecentMay 5, 2026

KVerus: Scalable and Resilient Formal Verification Proof Generation for Rust Code

Yuwei Liu, Xinyi Wan, Yanhao Wang, Minghua Wang +2 more

KVerus is a retrieval-augmented system that significantly improves the scalability and resilience of formal verification for Rust code by managing complex cross-module dependencies and adapting to cod…

View →

cs.SEcs.AIRecentMay 31, 2026

FVSpec: Real-World Property-Based Tests as Lean Challenges

Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds

The paper introduces FVSpec, a large-scale benchmark that translates thousands of real-world Python property-based tests into formal Lean 4 specifications to evaluate AI models for formal software ver…

View →

cs.CRcs.SERecentMay 29, 2026

How to Compare the Security of Code Written by Humans to LLM-generated Code

Rebecca Balebako, Jasmine Egl

The paper proposes an automated, standardized framework to empirically compare the security quality of code generated through human-only, LLM-only, and hybrid collaboration methods.

View →

cs.LGcs.AIcs.CLRecentMay 27, 2026

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

Kunhao Zheng, Pierre Chambon, Juliette Decugis, Jonas Gehring +3 more

The paper demonstrates that extrapolative weight averaging can effectively navigate and extend the correctness-efficiency frontier in code RL, leading to improved performance on complex programming ta…

View →

cs.AIRecentJun 1, 2026

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

Bin Chen, Xinye Liao, Yiming Liu, Xin Liao +1 more

The paper proposes Credit-Attenuated Privileged Feedback (CAPF), a training-time mechanism that uses verifier-side information to guide LLM search agents, significantly improving their performance on…

View →

cs.LGcs.AIRecentMay 31, 2026

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

Yixiu Mao, Yun Qu, Qi Wang, Heming Zou +1 more

The paper introduces Group Prioritized Off-Policy Optimization (POPO), a novel framework that efficiently accelerates RL finetuning for LLM reasoning by leveraging effective off-policy training batche…

View →

cs.AIcs.CLcs.LGRecentMay 27, 2026

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Soeun Kim, Albert No

The paper introduces REFT, a novel method that diversifies rollouts by sampling the first token after the reasoning marker, significantly improving performance in Reinforcement Learning with Verifiabl…

View →