Papers similar to 2605.30794

19 results

cs.AI · Recent · May 28, 2026

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields

Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more

The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…

cs.AI · Recent · May 27, 2026

MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

Xiaoyu Dong, Zhi Li, Xiao-Ming Wu

The paper introduces MUSE, a comprehensive benchmark that evaluates Text-to-CAD generation by assessing complex assemblies based on functionality, manufacturability, and assemblability, moving beyond…

cs.CL, cs.AI · Recent · May 30, 2026

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

Ravil Mussabayev, Rustam Mussabayev

The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…

cs.AI · Recent · May 27, 2026

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more

The paper introduces a multilingual benchmark (MentalMap) to test if LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…

cs.AI, physics.app-ph · Recent · May 29, 2026

BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs

Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao +5 more

The paper introduces BilliardPhys-Bench, a new benchmark that demonstrates that current multimodal LLMs struggle with complex physical reasoning and predicting object dynamics in simulated environment…

cs.CL, cs.RO · Recent · May 29, 2026

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…

cs.SE, cs.AI · Recent · May 29, 2026

Benchmarking Multimodal LLMs on Code Generation for Complex Interactive Webpages

Fan Wu, Lishuai Dong, Cuiyun Gao, Yujia Chen +3 more

The paper introduces WebIGBench, a novel benchmark designed to rigorously evaluate multimodal LLMs' ability to generate code for complex, interactive webpages, addressing the limitations of existing s…

cs.CV · Recent · Jun 1, 2026

InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark

Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin +5 more

The paper introduces InsightVQA, a large-scale benchmark dataset designed for hierarchical visual question answering that assesses complex emotion understanding and cognitive reasoning beyond simple e…

cs.CV · Recent · Jun 1, 2026

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao +3 more

The paper proposes using Vision-Language Models (VLMs) as 'teachers' to guide Video Generation Models (VGMs) during test-time optimization, significantly improving video reasoning capabilities.

cs.AI · Recent · May 27, 2026

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu +3 more

The paper introduces PetroBench, a comprehensive benchmark for evaluating Large Language Models across various domains of petroleum engineering, finding that models perform better on subjective tasks…

cs.CL, cs.AI · Recent · Jun 1, 2026

PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing

Oleksandr Nikitin

PlanarBench introduces a novel benchmark to test LLM spatial reasoning by requiring them to draw planar graphs as ASCII art from an edge list, finding that edge count is a stronger difficulty predicto…

cs.CV, cs.AI, q-bio.NC · Recent · May 28, 2026

Brain-IT-VQA: From Brain Signals to Answers

Roman Beliy, Matias Cosarinsky, Oliver Heinimann, Navve Wasserman +1 more

The paper introduces Brain-IT-VQA, a novel framework that significantly improves visual question answering from fMRI signals, and presents NSD-VQA, a new, highly controlled dataset for this task.

cs.MM, cs.AI, cs.CL · Recent · May 29, 2026

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana +1 more

This pilot study evaluates curator-guided multilingual art description using a small, on-premise VLM (Qwen2.5-VL-3B-Instruct) for German, Romanian, and Serbian, finding that language-specific adapters…

cs.CV, cs.AI · Recent · May 28, 2026

VLM3: Vision Language Models Are Native 3D Learners

Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more

The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…

cs.CV, cs.AI, cs.GR · Recent · May 31, 2026

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more

The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.

cs.AI, cs.LG · Recent · May 29, 2026

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Yunpeng Zhou

This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…

cs.CL, cs.CV · Recent · Jun 1, 2026

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more

The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…

cs.CV, cs.AI · Recent · May 30, 2026

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen +11 more

The paper introduces CV-Arena, a large-scale open benchmark for instructional computer vision, demonstrating that professional-grade image editing requires advanced capabilities in physical reasoning…

cs.AI · Recent · May 27, 2026

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun +4 more

The paper proposes CSMR, a cognitive scheduling framework that allows a language model to dynamically decide when to acquire task-relevant visual evidence, significantly improving multimodal reasoning…
