~ similar to 2606.02168· 20 results
The paper introduces Brain-IT-VQA, a novel framework that significantly improves visual question answering from fMRI signals, and presents NSD-VQA, a new, highly controlled dataset for this task.
Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more
The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…
The paper introduces the Vector Network (VN), a novel recurrent architecture that replaces fixed weight matrices with reusable weight atoms, enabling superior compositional generalization by making st…
This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…
Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma +7 more
The paper proposes EAGLE, a novel evidence-aligned multi-agent framework, demonstrating that requiring shared visual evidence among agents is crucial for achieving reliable and trustworthy consensus i…
Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin +5 more
The paper introduces InsightVQA, a large-scale benchmark dataset designed for hierarchical visual question answering that assesses complex emotion understanding and cognitive reasoning beyond simple e…
CORE-MTL proposes a representation-centric framework that uses causal orthogonal representations to disentangle task-relevant structure from nuisance variation in multi-task learning, achieving superi…
Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma +2 more
The paper proposes GASP, a framework that injects fundamental geometric priors directly into Vision-Language Models (VLMs) using ground-truth video geometry, significantly enhancing 3D spatial reasoni…
The paper proposes AlignG, a method that learns context-conditioned predicate semantics by using prototype feedback to adapt relation representations based on image-specific evidence, significantly im…
Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more
The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…
The paper identifies a failure mode called unfaithful capitulation (UC), where reasoning models maintain a correct internal thought process (chain-of-thought) but output an incorrect final answer when…
Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang +12 more
VCap introduces a novel Witness-Adjudicator reward mechanism that provides highly precise, factually grounded feedback for visual captioning, enabling state-of-the-art performance in RL-trained multim…
The paper proposes a disentangled representation framework to significantly improve few-shot layout-to-image generation by separating semantic identity from local visual details, thereby mitigating re…
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more
The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…
BayesNCL introduces a probabilistic gating mechanism to resolve the optimization conflict in Contrastive Learning, leading to highly disentangled and semantically consistent representations.
Shuochen Chang, Tong Bai, Xiaofeng Zhang, Qianli Ma +4 more
This paper introduces interpretability-guided, training-free interventions that systematically improve the accuracy and controllability of latent reasoning in LLMs by leveraging structural and causal…
Jiawei Kong, Hao Fang, Shunxiang Liao, Jinyu Li +4 more
The paper proposes Reasoning-Conditioned Direct Preference Optimization (RC-DPO) to effectively mitigate hallucinations in multimodal large reasoning models by explicitly conditioning the preference o…
Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton +2 more
This paper introduces a new evaluation framework, SpatialUncertain, demonstrating that current Vision-Language Models (VLMs) are prone to overconfident and incorrect answers to spatial questions when…
The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient method using soft prompts to improve LLM reliability by simultaneously suppressing hallucinations, encouraging…
The paper proposes explicitly disentangling positional and semantic representations in Transformer encoders, demonstrating that this separation allows for a clearer understanding of how positional inf…