~ similar to 2605.31212· 19 results
Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen +11 more
The paper introduces CV-Arena, a large-scale open benchmark for instructional computer vision, demonstrating that professional-grade image editing requires advanced capabilities in physical reasoning…
This study investigated the stability and prompt-responsiveness of AI tools in classifying the cognitive demand of math tasks, finding that few-shot prompting was a more reliable performance booster t…
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao +3 more
The paper proposes using Vision-Language Models (VLMs) as 'teachers' to guide Video Generation Models (VGMs) during test-time optimization, significantly improving video reasoning capabilities.
Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more
The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.
Yeil Jeong, Youngjin Yoo, Seobin Sohn, Hyejin Han +3 more
The paper introduces TeachObs, a comprehensive, human-validated benchmark for multimodal teaching observation, and evaluates frontier LLMs, finding that no single model consistently outperforms others…
The paper introduces UniKE, a benchmark showing that successful knowledge edits in text-only multimodal models do not reliably transfer to image generation, revealing a significant modality gap.
Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng +3 more
The paper introduces StemBind, a diagnostic benchmark that separates perception, rule induction, and answer selection in abstract visual reasoning, revealing that the primary failure point for MLLMs i…
The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…
This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…
DenseSteer is a training-free inference-time framework that improves the math reasoning capabilities of small language models by steering their internal representations toward a 'Dense Reasoning' patt…
The paper introduces Knowledge-Intensive Video Generation (KIVI) as a challenging benchmark for evaluating video models on factuality and practical usefulness, showing that current state-of-the-art sy…
The paper introduces BRo-JEPA, a latent world model that successfully learns modular arithmetic (like addition modulo 10) by explicitly imposing the circular structure of the problem into the latent s…
The paper introduces SPAWN, a training-free method that allows users to inject specified visual concepts into existing autoregressive world models, enabling controllable scene composition beyond the i…
Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat +4 more
The paper introduces AutoformBot, a multi-agent system that successfully autoformalizes a large corpus of open-access graduate-level mathematics textbooks into a verified library in Lean 4, demonstrat…
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu +5 more
The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks, and proposes LL-Score, an MLLM-based evaluator that better aligns quali…
Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more
The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…
The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…
The paper introduces the Image Reconstruction Game, a benchmark showing that the quality of the descriptive model is the primary determinant of image reconstruction success, while the generator's role…
Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more
The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…