~ similar to 2606.02321· 19 results
The paper proposes a zero-shot reason-then-retrieve pipeline using Qwen3.5-27B to solve the challenging task of composed video retrieval (CoVR-R), achieving high performance on both validation and bli…
Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more
The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao +3 more
The paper proposes using Vision-Language Models (VLMs) as 'teachers' to guide Video Generation Models (VGMs) during test-time optimization, significantly improving video reasoning capabilities.
Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun +4 more
The paper proposes CSMR, a cognitive scheduling framework that allows a language model to dynamically decide when to acquire task-relevant visual evidence, significantly improving multimodal reasoning…
ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…
The paper proposes a question-aware evidence ledger pipeline that significantly improves video relational reasoning by explicitly guiding the model to extract necessary evidence for complex spatial, t…
Reasmory introduces a structured programming framework that uses explicit 3D memory and a Domain-Specific Language (DSL) to reliably enhance Vision-Language Models' spatial reasoning capabilities, ach…
Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu +2 more
The paper introduces MOV-Bench, a challenging benchmark for multi-hop audio-visual reasoning, and proposes AOP-Agent, an agentic framework that significantly improves open-source Omni-LLMs' ability to…
The paper introduces CERA, a novel contrastive retrieval framework that improves RAG factuality and interpretability by using subjectivity-based hard negative selection and an auxiliary attention alig…
Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang +2 more
The paper introduces TRON, an online, rule-verifiable environment substrate that generates an unbounded stream of fresh, controllable visual reasoning training instances, significantly improving RL pe…
Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more
The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…
The paper introduces pause-and-think-T, a reasoning-centric dataset and benchmark that enables compact Vision-Language Models to perform visually grounded, context-aware action suggestion, matching la…
The paper introduces an Integrated, cross-Architecture Reasoning (IAR) framework to provide a unified and robust method for interpreting the opaque reasoning processes within Large Language Models.
Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen +6 more
The paper introduces Plan, a structured agentic behavior that decomposes multi-hop questions into ordered sub-questions before retrieval, and proposes a self-bootstrapping paradigm to train it without…
Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen +3 more
This paper proposes a post-training framework called Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT) to teach language models to reason by analogy.
The paper introduces LinTree, a method that explicitly structures the search history of LLM reasoning traces using parent pointers, significantly improving task performance and search efficiency compa…
Shihang Zhang, Mingjin Kuai, Ye Wei, Zhen Zhang +1 more
GIRL-DETR introduces Gradient-Isolated Reinforcement Learning to enhance temporal localization in lightweight Video Moment Retrieval models, achieving high accuracy by decoupling feature representatio…
V-LynX is a framework that enhances Video LLMs by integrating new modalities into their existing token interface, achieving state-of-the-art performance across diverse video understanding tasks.