~ similar to 2606.05115· 15 results
The paper introduces a novel production-based evaluation showing that child-directed speech (CDS) significantly improves a BabyLM's ability to generate grammatically correct language, even if standard…
Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao +2 more
The paper introduces Multi-temporal Referring Segmentation (MTRS), a new task requiring models to segment language-described temporal changes, and proposes MTRefSeg-R1, a specialized framework that ac…
The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…
KidsNanny is a two-stage multimodal content moderation pipeline that achieves high accuracy and efficiency in detecting child safety threats, particularly excelling in text-embedded content.
Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more
The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…
Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more
The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…
Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more
The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.
This paper proposes Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely.
This paper proposes Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely.
Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more
The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…
The paper introduces pause-and-think-T, a reasoning-centric dataset and benchmark that enables compact Vision-Language Models to perform visually grounded, context-aware action suggestion, matching la…
Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai +4 more
The paper introduces ChildEval, a large-scale benchmark designed to systematically evaluate how well large language models can infer and follow complex, child-specific preferences during long-context…
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more
The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…
This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…
Jingjie Lin, Bingbing Wang, Zihan Wang, Zhengda Jin +3 more
The paper introduces RefMem-Bench, a new benchmark for measuring reflective memory in long-horizon dialogue, and proposes REMIND, a framework that significantly improves models' ability to synthesize…