MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization