Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems