The paper introduces TVIR, a new benchmark and multi-agent framework for deep research, to evaluate and improve the generation of factually reliable, text-visual interleaved reports.
Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
The paper introduces Ptah, a multi-agent harness designed to improve ver…
Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward
The paper proposes DecomposeR, a planner-centric framework that structures deep…
Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence
The paper proposes EAGLE, a novel evidence-aligned multi-agent framework, demons…
MosaicLeaks: Privacy Risks in Querying-in-the-Open for Deep Research Agents
The paper introduces MosaicLeaks, a benchmark demonstrating that deep research a…
Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reaso…
The paper proposes CSMR, a cognitive scheduling framework that allows a language…
Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context…
The paper proposes In-Context Visual Contrastive Optimization (IC-VCO) to rigoro…
Deep Research as Rubric for Reinforcement Learning
The paper proposes Deep Research as Rubric (DR-rubric), a novel evidence-driven…
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
ROVER is a lightweight, learnable plugin that efficiently routes and integrates…