~ similar to 2606.02580· 19 results
Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more
The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.
Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more
The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…
The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…
Reasmory introduces a structured programming framework that uses explicit 3D memory and a Domain-Specific Language (DSL) to reliably enhance Vision-Language Models' spatial reasoning capabilities, ach…
The paper introduces the Image Reconstruction Game, a benchmark showing that the quality of the descriptive model is the primary determinant of image reconstruction success, while the generator's role…
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu +5 more
The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks, and proposes LL-Score, an MLLM-based evaluator that better aligns quali…
Yusuke Ohtsubo, Kota Dohi, Koichiro Yawata, Koki Takeshita +1 more
The paper proposes a visual program synthesis framework using a VLM to generate accurate training data for semiconductor inspection, mitigating the sim-to-real gap by applying input binarization to st…
Inhee Lee, Sangwon Baik, Sungjoo Kim, Hyeonwoo Kim +2 more
SimuScene introduces a novel compositional 3D reconstruction pipeline that integrates physics simulation directly into the shape and layout estimation process to generate stable, simulation-ready 3D s…
The paper reframes industrial visual sim-to-real transfer as a domain-gap problem categorized by the availability of explicit object geometry (CAD), arguing that the required prior evidence dictates t…
The paper proposes a disentangled representation framework to significantly improve few-shot layout-to-image generation by separating semantic identity from local visual details, thereby mitigating re…
The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…
PatchPoison introduces a lightweight dataset-poisoning method that injects small, high-frequency adversarial patches into multi-view image datasets to systematically corrupt feature matching and degra…
GeM-NR proposes a novel, training-free framework to achieve general multi-view image editing, enabling consistent edits that drastically change both the geometry and appearance of a nonrigid scene.
CubePart is a generative framework that enables the creation of complex 3D meshes by explicitly controlling and generating individual, semantically defined parts based on open-vocabulary text prompts.
Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao +3 more
The paper proposes Astra, an agentic framework that equips Vision-Language Models (VLMs) with the ability to perform spatial reasoning by actively generating and utilizing imagined visual evidence fro…
Yizhuo Lu, Changde Du, Qiongyi Zhou, Liuyun Jiang +1 more
The paper proposes MindDiffuser, a two-stage framework that significantly improves image reconstruction from brain activity by combining semantic guidance from text-to-image models with structural ref…
Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler +1 more
Real2SAM2Real introduces a framework that uses explicit 3D caches, derived from 3D lifting models, to provide robust geometric guidance to Video Diffusion Models, significantly improving spatiotempora…
Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen +6 more
The paper addresses the challenge of multi-turn view planning for VLMs by proposing an iterative framework that uses self-exploration and view graph distillation, significantly improving planning perf…
The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…