Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design | ArxivCSExplorer