~ similar to 2606.02350· 17 results
Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen +6 more
The paper addresses the challenge of multi-turn view planning for VLMs by proposing an iterative framework that uses self-exploration and view graph distillation, significantly improving planning perf…
The paper introduces MetricScenes, a new large-scale, in-the-wild dataset, and demonstrates that fine-tuning existing geometry models on this dataset significantly mitigates the scale-collapse problem…
Reasmory introduces a structured programming framework that uses explicit 3D memory and a Domain-Specific Language (DSL) to reliably enhance Vision-Language Models' spatial reasoning capabilities, ach…
RayDer introduces a unified, feed-forward transformer that simplifies self-supervised novel view synthesis (NVS) by consolidating camera estimation, scene reconstruction, and rendering into a single,…
Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma +2 more
The paper proposes GASP, a framework that injects fundamental geometric priors directly into Vision-Language Models (VLMs) using ground-truth video geometry, significantly enhancing 3D spatial reasoni…
Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang +1 more
The paper proposes a lightweight post-processing framework that enhances identity continuity in thermal pedestrian tracking by leveraging scene-level spatial-temporal consistency, achieving improved t…
Jingyun Liang, Min Wei, Shikai Li, Yizeng Han +4 more
The paper proposes a novel render-free framework that conditions video diffusion models directly on compressed 3D human mesh tokens, enabling robust 3D-aware human motion control without relying on re…
The paper introduces TouchSafeBench, a physics-grounded benchmark, to evaluate collision grounding—the ability to predict robot-human collisions—and finds that current Vision-Language Models (VLMs) ar…
Minseok Joo, Dogyun Park, Taehoon Lee, Kyujin Lee +1 more
The paper proposes COVRAG, a depth-based memory retrieval framework that maximizes the coverage of target-view regions to significantly improve long-term geometric consistency in autoregressive long v…
The paper introduces the Image Reconstruction Game, a benchmark showing that the quality of the descriptive model is the primary determinant of image reconstruction success, while the generator's role…
CIPER proposes a unified transformer framework to simultaneously perform cross-view image retrieval and precise 3-DoF pose estimation, overcoming the limitations of cascaded, separate methods.
Places in the Wild introduces a massive, high-resolution RAW photograph dataset of 67,574 images captured in situ across 810 locations, providing unprecedented detail for ecologically valid vision res…
The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…
The paper introduces a novel two-stage framework to achieve robust, category-agnostic object localization in-context (ICL) by optimizing attention and minimizing localization error using reinforcement…
Tianyi Xie, Haotian Zhang, Jinhyung Park, Zi Wang +16 more
This paper presents GRAIL, a digital generation pipeline that synthesizes human-object interactions for humanoid robots.
Xuanyi Liu, Deyi Ji, Liqun Liu, Lanyun Zhu +7 more
CamGeo is a novel framework that improves sparse camera-conditioned image-to-video generation by distilling rich 3D geometric priors into the diffusion backbone, resulting in geometrically consistent…
Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu +2 more
The paper introduces MOV-Bench, a challenging benchmark for multi-hop audio-visual reasoning, and proposes AOP-Agent, an agentic framework that significantly improves open-source Omni-LLMs' ability to…