~ similar to 2606.02022· 19 results
The paper proposes a unified framework to systematically redefine instance matching for Panoptic Quality evaluation, moving beyond the standard One-to-One matching to accommodate complex scenarios lik…
Ziying Chen, Yang Cao, He Sun, Beining Yang +1 more
The paper proposes a novel geometric embedding hashing method to recover object correspondences (vector links) between two embedding clouds generated by different black-box encoders using only a small…
The paper introduces MetricScenes, a new large-scale, in-the-wild dataset, and demonstrates that fine-tuning existing geometry models on this dataset significantly mitigates the scale-collapse problem…
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu +5 more
The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks, and proposes LL-Score, an MLLM-based evaluator that better aligns quali…
The paper proposes FedSAP, a framework that stabilizes federated prototype learning by delaying global alignment and enforcing inter-class structure, significantly improving representation quality und…
Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang +2 more
This paper formalizes the binding problem using information theory and develops a probing method to measure binding information in deep learning representations, demonstrating that binding is crucial…
The paper introduces CalArena, a large-scale, standardized benchmark covering nearly 2000 experiments to comprehensively evaluate post-hoc calibration methods, finding that smooth calibration function…
The paper proposes AlignG, a method that learns context-conditioned predicate semantics by using prototype feedback to adapt relation representations based on image-specific evidence, significantly im…
David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin +2 more
The paper proposes a novel global multi-modal alignment framework to robustly learn video representations from noisy and complementary sensor data, significantly improving driver distraction detection…
The paper argues that benchmarking Vision-Language Models (VLMs) for urban perception must treat human disagreement and non-response as key measurement outcomes, rather than assuming perfect consensus…
Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang +1 more
The paper proposes a lightweight post-processing framework that enhances identity continuity in thermal pedestrian tracking by leveraging scene-level spatial-temporal consistency, achieving improved t…
The paper proposes MoEIoU, a novel mixture-of-experts based regression loss that adaptively models bounding-box localization errors, achieving superior convergence and accuracy in object detection.
CIPER proposes a unified transformer framework to simultaneously perform cross-view image retrieval and precise 3-DoF pose estimation, overcoming the limitations of cascaded, separate methods.
Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao +2 more
GiPL proposes a novel two-branch framework combining iterative pseudo-label self-training and generative data augmentation to significantly improve Cross-Domain Few-Shot Object Detection by better uti…
Debopam Sanyal, Anantharaman Iyer, Alind Khare, Trisha Jain +4 more
KLAS introduces a novel framework that uses KL divergence to automatically select optimal pairs of pretrained models for stitching, significantly improving the accuracy-efficiency tradeoff of resultin…
The paper introduces AdvScene, a novel scene-grounded framework that measures the real-world 'scene robustness' of adversarial patches by characterizing their operational envelope across varying viewp…
The paper introduces PInVerify, an offline embodied benchmark for Active Instance Verification (AIV), a task requiring agents to actively select viewpoints to confirm if a candidate object matches a f…
ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…
TROPHIES introduces a unified framework to jointly reconstruct dynamic humans, static scenes, and camera poses from multi-view videos, achieving globally consistent and physically plausible 4D reconst…