When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL