Similar to 2605.30794 · 19 results
Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more
The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…
The paper introduces MUSE, a comprehensive benchmark that evaluates Text-to-CAD generation by assessing complex assemblies based on functionality, manufacturability, and assemblability, moving beyond…
The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…
Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more
The paper introduces MentalMap, a multilingual benchmark that tests whether LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…
Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao +5 more
The paper introduces BilliardPhys-Bench, a new benchmark that demonstrates that current multimodal LLMs struggle with complex physical reasoning and predicting object dynamics in simulated environment…
The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…
Fan Wu, Lishuai Dong, Cuiyun Gao, Yujia Chen +3 more
The paper introduces WebIGBench, a novel benchmark designed to rigorously evaluate multimodal LLMs' ability to generate code for complex, interactive webpages, addressing the limitations of existing s…
Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin +5 more
The paper introduces InsightVQA, a large-scale benchmark dataset designed for hierarchical visual question answering that assesses complex emotion understanding and cognitive reasoning beyond simple e…
Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao +3 more
The paper proposes using Vision-Language Models (VLMs) as 'teachers' to guide Video Generation Models (VGMs) during test-time optimization, significantly improving video reasoning capabilities.
Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu +3 more
The paper introduces PetroBench, a comprehensive benchmark for evaluating Large Language Models across various domains of petroleum engineering, finding that models perform better on subjective tasks…
The paper introduces PlanarBench, a novel benchmark that tests LLM spatial reasoning by requiring models to draw planar graphs as ASCII art from an edge list, finding that edge count is a stronger difficulty predicto…
The paper introduces Brain-IT-VQA, a novel framework that significantly improves visual question answering from fMRI signals, and presents NSD-VQA, a new, highly controlled dataset for this task.
This pilot study evaluates curator-guided multilingual art description using a small, on-premise VLM (Qwen2.5-VL-3B-Instruct) for German, Romanian, and Serbian, finding that language-specific adapters…
Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more
The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…
Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more
The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.
This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…
Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more
The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…
Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen +11 more
The paper introduces CV-Arena, a large-scale open benchmark for instructional computer vision, demonstrating that professional-grade image editing requires advanced capabilities in physical reasoning…
Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun +4 more
The paper proposes CSMR, a cognitive scheduling framework that allows a language model to dynamically decide when to acquire task-relevant visual evidence, significantly improving multimodal reasoning…