The paper introduces a verifier-fuzzing framework for detecting and analyzing a failure mode of Reinforcement Learning with Verifiable Rewards (RLVR): bugs in the reward verifier that the model being trained can learn to exploit.
Abstract
Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer checkers, JSON tool-call validators, and code unit-test harnesses. That makes the reward partly a software artifact: if the verifier is wrong, optimization can learn the bug. We study this failure mode with a lightweight verifier-fuzzing framework that generates adversarial completions, compares buggy and stricter reference verifiers, logs paired decisions, and reports false-positive, false-negative, disagreement, exploit, and uncertainty metrics.
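To make the paired-verifier comparison concrete, the following is a minimal sketch of the idea, not the paper's implementation. It assumes a JSON tool-call task with a hypothetical tool named `get_weather`; the function names (`buggy_verifier`, `reference_verifier`, `fuzz`), the hand-written completions, and the injected substring bug are all illustrative assumptions. The exploit and uncertainty metrics are omitted because the abstract does not define them.

```python
import json

TOOL = "get_weather"  # hypothetical tool name, assumed for illustration


def buggy_verifier(completion: str) -> bool:
    """Assumed buggy reward check: a substring match instead of real parsing."""
    return f'"name": "{TOOL}"' in completion


def reference_verifier(completion: str) -> bool:
    """Stricter reference: the completion must parse as JSON and match the schema."""
    try:
        call = json.loads(completion)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or call.get("name") != TOOL:
        return False
    args = call.get("arguments")
    return isinstance(args, dict) and isinstance(args.get("city"), str)


def fuzz(completions, candidate, reference):
    """Run both verifiers on each completion, log paired decisions, and count errors.

    The reference verifier is treated as ground truth, so a false positive is a
    completion the candidate accepts and the reference rejects.
    """
    log = [(text, candidate(text), reference(text)) for text in completions]
    false_pos = sum(1 for _, cand, ref in log if cand and not ref)
    false_neg = sum(1 for _, cand, ref in log if ref and not cand)
    return {
        "log": log,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "disagreement_rate": (false_pos + false_neg) / len(log),
    }


# Hand-written stand-ins for generated adversarial completions.
completions = [
    '{"name": "get_weather", "arguments": {"city": "Paris"}}',  # valid call
    'Sure! "name": "get_weather" is the tool to call.',         # prose, not JSON
    '{"name": "get_weather", "arguments": {"city": "Paris"}',   # missing closing brace
    '{"name":"get_weather","arguments":{"city":"Paris"}}',      # valid, compact spacing
    '{"name": "delete_files", "arguments": {}}',                # wrong tool
]

report = fuzz(completions, buggy_verifier, reference_verifier)
for text, cand, ref in report["log"]:
    print(f"buggy={cand!s:5} reference={ref!s:5} {text}")
print({key: value for key, value in report.items() if key != "log"})
```

In this sketch the second and third completions are the kind of input a policy could learn to produce: the buggy check rewards them even though neither is a valid tool call. The fourth shows the opposite error, where a correct call is denied reward because of its whitespace.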