~ similar to 2605.31266 · 17 results
The paper introduces Staged Executable Inverse Graphics (SEIG), an agentic framework that uses general-purpose Vision-Language Models (VLMs) to reconstruct editable 3D scenes directly into executable…
The paper proposes AlignG, a method that learns context-conditioned predicate semantics by using prototype feedback to adapt relation representations based on image-specific evidence, significantly im…
Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin +2 more
This paper systematically analyzes how different architectural components of Large Vision-Language Models (LVLMs) contribute to hallucination robustness, finding that joint enhancement of visual fidel…
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu +5 more
The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks, and proposes LL-Score, an MLLM-based evaluator that better aligns quali…
Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben +1 more
The paper introduces TECCI, a novel and challenging benchmark dataset of 7550 image-edit pairs, and demonstrates that current state-of-the-art text-guided image editing models struggle significantly w…
Haolin Deng, Xin Zou, Zhiwei Jin, Chen Chen +2 more
The paper proposes In-Context Visual Contrastive Optimization (IC-VCO) to rigorously mitigate multimodal hallucinations in Vision-Language Models by optimizing contrastive learning within a shared mul…
The paper introduces GPIC, a massive, permissively licensed, and safety-filtered image corpus of 28 trillion pixels, designed to serve as a stable and accessible benchmark for large-scale visual gener…
Places in the Wild introduces a massive, high-resolution RAW photograph dataset of 67,574 images captured in situ across 810 locations, providing unprecedented detail for ecologically valid vision res…
The paper proposes Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework that effectively removes specific concepts from text-to-image models while minimizing the unintended degra…
BayesNCL introduces a probabilistic gating mechanism to resolve the optimization conflict in contrastive learning, leading to highly disentangled and semantically consistent representations.
Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang +2 more
The paper introduces CrossMPI, a novel cross-modal prompt injection attack that uses image-only perturbations to steer the interpretation of both textual and visual inputs in Large Vision-Language Mod…
The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…
The paper introduces Diversity-inducing Initialization (DivIn), a novel method that improves image diversity by re-weighting the initial noise selection based on the guidance potential, thereby mitiga…
The paper proposes VRPO, a reinforcement learning-based optimization strategy that replaces static alignment losses in diffusion models, significantly improving both convergence and image fidelity.
Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei +4 more
The paper introduces REFORGE, a black-box red-teaming framework that uses adversarial image prompts to reveal persistent vulnerabilities in current Image Generation Model Unlearning (IGMU) methods.
The paper proposes AHV-D&S, a novel training-free inference-time safeguard that detects and suppresses risky content in Diffusion Transformers (DiTs) by quantifying token sensitivity across attention…
ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…