~ similar to 2606.02481· 17 results
The paper argues that benchmarking Vision-Language Models (VLMs) for urban perception must treat human disagreement and non-response as key measurement outcomes, rather than assuming perfect consensus…
FLORO is a multimodal geospatial foundation model that learns transferable remote sensing representations from a small, diverse corpus, achieving strong performance across various sensor types and res…
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu +5 more
The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks, and proposes LL-Score, an MLLM-based evaluator that better aligns quali…
The paper proposes a real-time, predictive, and task-aware foveated imaging system that dynamically allocates limited sensor bandwidth to task-relevant regions of interest, significantly improving per…
The paper proposes a disentangled representation framework to significantly improve few-shot layout-to-image generation by separating semantic identity from local visual details, thereby mitigating re…
The paper introduces CAFOSat, a large-scale, strongly annotated, and infrastructure-aware dataset designed to improve the accuracy of mapping Concentrated Animal Feeding Operations (CAFOs) from high-r…
The paper proposes a unified framework to systematically redefine instance matching for Panoptic Quality evaluation, moving beyond the standard One-to-One matching to accommodate complex scenarios lik…
This paper proposes using color statistics, specifically through novel color transformations, to detect AI-generated synthetic images by exploiting the color-imitation weaknesses of current generative…
The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…
The paper introduces 3D aesthetic portrait planning, a method that pre-calculates optimal human pose, camera, and lighting configurations within a 3D scene to generate visually compelling and physical…
Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou +4 more
The paper introduces ERGeoBench, a comprehensive diagnostic benchmark designed to evaluate the fine-grained capabilities of multimodal large language models (MLLMs) for embodied geo-localization acros…
The paper introduces MetricScenes, a new large-scale, in-the-wild dataset, and demonstrates that fine-tuning existing geometry models on this dataset significantly mitigates the scale-collapse problem…
TROPHIES introduces a unified framework to jointly reconstruct dynamic humans, static scenes, and camera poses from multi-view videos, achieving globally consistent and physically plausible 4D reconst…
Steffen Knoblauch, Hao Li, Gengchen Mai, Konstantin Klemmer +2 more
The paper advocates for a paradigm shift toward joint Spatial Representation Learning (SRL) that unifies raster imagery and structured vector data into a single embedding space for developing more sem…
ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…
Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben +1 more
The paper introduces TECCI, a novel and challenging benchmark dataset of 7550 image-edit pairs, and demonstrates that current state-of-the-art text-guided image editing models struggle significantly w…
CIPER proposes a unified transformer framework to simultaneously perform cross-view image retrieval and precise 3-DoF pose estimation, overcoming the limitations of cascaded, separate methods.