~ similar to 2605.29610· 19 results
Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma +2 more
The paper proposes GASP, a framework that injects fundamental geometric priors directly into Vision-Language Models (VLMs) using ground-truth video geometry, significantly enhancing 3D spatial reasoni…
The paper introduces a novel two-stage framework to achieve robust, category-agnostic object localization in-context (ICL) by optimizing attention and minimizing localization error using reinforcement…
Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more
The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…
Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more
The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.
The paper proposes a disentangled representation framework to significantly improve few-shot layout-to-image generation by separating semantic identity from local visual details, thereby mitigating re…
The paper proposes FedSAP, a framework that stabilizes federated prototype learning by delaying global alignment and enforcing inter-class structure, significantly improving representation quality und…
ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…
Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen +2 more
SSR3D-LLM introduces a structured spatial reasoning interface for unified 3D-LLMs, allowing fine-grained object grounding by generating and processing sequential latent spatial steps.
The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…
ProtoAda introduces a prototype-guided, format-aware adaptive tuning framework to improve multimodal continual instruction tuning by ensuring task assignment and parameter updates respect heterogeneou…
Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao +2 more
GiPL proposes a novel two-branch framework combining iterative pseudo-label self-training and generative data augmentation to significantly improve Cross-Domain Few-Shot Object Detection by better uti…
CORE-MTL proposes a representation-centric framework that uses causal orthogonal representations to disentangle task-relevant structure from nuisance variation in multi-task learning, achieving superi…
The paper proposes an agentic pipeline for spatial reasoning by introducing a dynamic cognitive map and Spatial Assertion Codes (SAC), achieving state-of-the-art performance on complex spatial tasks.
BayesNCL introduces a probabilistic gating mechanism to resolve the optimization conflict in Contrastive Learning, leading to highly disentangled and semantically consistent representations.
The paper proposes BRACS, a training-free steering framework that adaptively corrects visual grounding failures in large vision-language models, significantly reducing object hallucination without sac…
The paper proposes pretraining a Perceiver-style in-context learner on synthetic data to solve Multiple Instance Learning (MIL) tasks efficiently in the low-label regime.
Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele +2 more
PARCEL introduces a novel visual tokenization architecture that combines spatial pooling anchors with conditioned elastic queries, efficiently reducing the computational cost of large Vision-Language…
Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang +1 more
The paper introduces PAR3D, a unified part-aware 3D-MLLM framework, to enhance 3D scene understanding by enabling models to reason about and ground both whole objects and their fine-grained parts.