Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation