~ similar to 2606.03976· 17 results
The paper identifies specific attention heads in LLMs responsible for 'cultural binding'—associating cultural items with appropriate identities—and demonstrates that this capability is pre-trained and…
The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…
Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more
The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…
The paper argues that benchmarking Vision-Language Models (VLMs) for urban perception must treat human disagreement and non-response as key measurement outcomes, rather than assuming perfect consensus…
The paper proposes a disentangled representation framework to significantly improve few-shot layout-to-image generation by separating semantic identity from local visual details, thereby mitigating re…
Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng +3 more
The paper introduces StemBind, a diagnostic benchmark that separates perception, rule induction, and answer selection in abstract visual reasoning, revealing that the primary failure point for MLLMs i…
BayesNCL introduces a probabilistic gating mechanism to resolve the optimization conflict in Contrastive Learning, leading to highly disentangled and semantically consistent representations.
The paper proposes AlignG, a method that learns context-conditioned predicate semantics by using prototype feedback to adapt relation representations based on image-specific evidence, significantly im…
The paper formally addresses the challenging question of cross-domain transferability of latent predictive models by proposing a structured framework that quantifies the relationship between source an…
The paper demonstrates that Transformers trained on local comparisons implicitly learn a global, one-dimensional ordinal structure, mirroring the human ability to perform transitive inference.
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu +5 more
The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks, and proposes LL-Score, an MLLM-based evaluator that better aligns quali…
This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…
Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang +8 more
VLA-Trace is a diagnostic framework that analyzes Vision-Language-Action (VLA) models by tracing their internal representations and external behaviors, revealing that while these models are good at vi…
Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele +2 more
PARCEL introduces a novel visual tokenization architecture that combines spatial pooling anchors with conditioned elastic queries, efficiently reducing the computational cost of large Vision-Language…
The paper introduces Brain-IT-VQA, a novel framework that significantly improves visual question answering from fMRI signals, and presents NSD-VQA, a new, highly controlled dataset for this task.
ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…
Ziying Chen, Yang Cao, He Sun, Beining Yang +1 more
The paper proposes a novel geometric embedding hashing method to recover object correspondences (vector links) between two embedding clouds generated by different black-box encoders using only a small…