On the Limits of Token Reduction for Efficient Unified Vision Language Training