ArXivCSExplorer
Built by Teycir Ben Soltane

Similar to 2606.02481 · 17 results

cs.CV, cs.AI · Recent · May 30, 2026

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

Rashid Mushkani

The paper argues that benchmarking Vision-Language Models (VLMs) for urban perception must treat human disagreement and non-response as key measurement outcomes, rather than assuming perfect consensus…

cs.CV, cs.AI · Recent · May 27, 2026

FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales

Jorge L. Rodriguez, Victor Angulo Morales, Areej Alwahas, Mariana Elias Lara +5 more

FLORO is a multimodal geospatial foundation model that learns transferable remote sensing representations from a small, diverse corpus, achieving strong performance across various sensor types and res…

cs.CV · Recent · Jun 1, 2026

LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu +5 more

The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks, and proposes LL-Score, an MLLM-based evaluator that better aligns quali…

cs.CV · Recent · Jun 1, 2026

Policy-based Foveated Imaging and Perception

Howard Xiao, Jan Ackermann, Boyang Deng, Gordon Wetzstein

The paper proposes a real-time, predictive, and task-aware foveated imaging system that dynamically allocates limited sensor bandwidth to task-relevant regions of interest, significantly improving per…

cs.CV, cs.AI, cs.LG · Recent · May 29, 2026

Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation

Nan Bao, Yifan Zhao, Wenzhuang Wang, Jia Li

The paper proposes a disentangled representation framework to significantly improve few-shot layout-to-image generation by separating semantic identity from local visual details, thereby mitigating re…

cs.CV, cs.AI, cs.LG · Recent · May 30, 2026

CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery

Oishee Bintey Hoque, Nibir Chandra Mandal, Mandy L Wilson, Samarth Swarup +2 more

The paper introduces CAFOSat, a large-scale, strongly annotated, and infrastructure-aware dataset designed to improve the accuracy of mapping Concentrated Animal Feeding Operations (CAFOs) from high-r…

cs.CV, cs.AI · Recent · May 29, 2026

Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation

Erik Großkopf, Soumya Snigdha Kundu, Hendrik Möller, Nicolas Münster +8 more

The paper proposes a unified framework to systematically redefine instance matching for Panoptic Quality evaluation, moving beyond the standard One-to-One matching to accommodate complex scenarios lik…

cs.CV · Recent · Jun 1, 2026

Chroma Clues: Leveraging Color Statistics to Detect Synthetic Images

Lea Uhlenbrock, Davide Cozzolino, Christian Riess

This paper proposes using color statistics, specifically through novel color transformations, to detect AI-generated synthetic images by exploiting the color-imitation weaknesses of current generative…

cs.CV, cs.AI, cs.LG · Recent · Jun 1, 2026

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…

cs.GR, cs.AI, cs.CV · Recent · May 28, 2026

Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes

Ruixiang Jiang, Chang Wen Chen

The paper introduces 3D aesthetic portrait planning, a method that pre-calculates optimal human pose, camera, and lighting configurations within a 3D scene to generate visually compelling and physical…

cs.CV, cs.AI · Recent · May 29, 2026

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou +4 more

The paper introduces ERGeoBench, a comprehensive diagnostic benchmark designed to evaluate the fine-grained capabilities of multimodal large language models (MLLMs) for embodied geo-localization acros…

cs.CV · Recent · Jun 1, 2026

Honey, I Shrunk the Arc de Triomphe!

Yuanbo Xiangli, Hanyu Chen, Xueqing Tsang, Noah Snavely

The paper introduces MetricScenes, a new large-scale, in-the-wild dataset, and demonstrates that fine-tuning existing geometry models on this dataset significantly mitigates the scale-collapse problem…

cs.CV · Recent · Jun 1, 2026

TROPHIES: Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos

Jinpeng Liu, Yukang Xu, Yutong Li, Xingyu Liu

TROPHIES introduces a unified framework to jointly reconstruct dynamic humans, static scenes, and camera poses from multi-view videos, achieving globally consistent and physically plausible 4D reconst…

cs.AI · Recent · Jun 1, 2026

Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models

Steffen Knoblauch, Hao Li, Gengchen Mai, Konstantin Klemmer +2 more

The paper advocates for a paradigm shift toward joint Spatial Representation Learning (SRL) that unifies raster imagery and structured vector data into a single embedding space for developing more sem…

cs.CV, cs.AI · Recent · May 27, 2026

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Guannan Lv, Ren Nie, Hongjian Dou, Tingting Gao

ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…

cs.CV, cs.AI, cs.CL · Recent · May 31, 2026

TECCI: Tricky Edits of Collected and Curated Images

Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben +1 more

The paper introduces TECCI, a novel and challenging benchmark dataset of 7550 image-edit pairs, and demonstrates that current state-of-the-art text-guided image editing models struggle significantly w…

cs.CV, cs.RO · Recent · Jun 3, 2026

CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

Yurim Jeon, Dongseong Seo, Seung-Woo Seo

CIPER proposes a unified transformer framework to simultaneously perform cross-view image retrieval and precise 3-DoF pose estimation, overcoming the limitations of cascaded, separate methods.
