cs.AI

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

Fengze Yang, Bo Yu, Xuewen Luo, Cathy Liu, Chenxi Liu

May 27, 2026

AI Summary (gemma4:e4b)

CIVIC is a path-consistent compact visual inference framework that achieves genuine hardware efficiency in Vision-Language Models by maintaining contiguous sequence representations across all inference stages, significantly reducing memory and latency.

Abstract


Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to the large number of visual tokens produced by high-resolution inputs. Current token reduction methods save FLOPs in theory, but post-hoc pruning introduces structural overhead and fails to yield proportional wall-clock acceleration. Enforcing a contiguous compact pathway avoids this overhead but risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By keeping the sequence representation compact across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC translates sequence reductions into genuine hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency gains without degrading accuracy across multimodal reasoning and visual grounding benchmarks.
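As a rough illustration of why a shorter sequence shrinks the KV-cache, the sketch below uses the standard size formula for a decoder-only transformer, in which cache size grows linearly with sequence length. The layer count, head sizes, and token counts are hypothetical values chosen for the arithmetic; they are not taken from the paper or from Qwen3-VL, and the function name `kv_cache_bytes` is an assumption of this example.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers storing both K and V at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model configuration (NOT the real Qwen3-VL settings).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

# Hypothetical token counts: a short text prompt plus visual tokens.
text_tokens = 64
visual_tokens_full = 3072
visual_tokens_compact = 1024  # assumes one third of the visual tokens are kept

full = kv_cache_bytes(text_tokens + visual_tokens_full, **cfg)
compact = kv_cache_bytes(text_tokens + visual_tokens_compact, **cfg)
ratio = compact / full

print(f"full cache:    {full / 2**20:.1f} MiB")
print(f"compact cache: {compact / 2**20:.1f} MiB")
print(f"ratio:         {ratio:.3f}")
```

Because the cache is linear in sequence length, the ratio depends only on the token counts, not on the model dimensions. When visual tokens dominate the sequence, retaining about a third of them puts the cache close to a third of its original size, provided the compact sequence is what actually reaches the cache.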