ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2606.05115· 15 results

cs.CLRecentMay 31, 2026

Child-directed speech facilitates production, not comprehension, in BabyLMs

Bastian Bunzeck, Sina Zarrieß

The paper introduces a novel production-based evaluation showing that child-directed speech (CDS) significantly improves a BabyLM's ability to generate grammatically correct language, even if standard…

View →
cs.CVcs.AIRecentMay 31, 2026

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao +2 more

The paper introduces Multi-temporal Referring Segmentation (MTRS), a new task requiring models to segment language-described temporal changes, and proposes MTRefSeg-R1, a specialized framework that ac…

View →
cs.AIcs.CLRecentJun 1, 2026

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Yiheng Shu, Bernal Jiménez Gutiérrez, Saisri Padmaja Jonnalagedda, Yuguang Yao +2 more

The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…

View →
cs.CVcs.CRRecentMar 17, 2026

KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety

Viraj Panchal, Tanmay Talsaniya, Parag Patel, Meet Patel

KidsNanny is a two-stage multimodal content moderation pipeline that achieves high accuracy and efficiency in detecting child safety threats, particularly excelling in text-embedded content.

View →
cs.CVcs.AIRecentMay 31, 2026

Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more

The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…

View →
cs.CVcs.AIRecentJun 1, 2026

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more

The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…

View →
cs.CVcs.AIRecentMay 28, 2026

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more

The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.

View →
cs.LGcs.AIEmpiricalComprehensiveRecentJun 4, 2026

Pretraining Recurrent Networks without Recurrence

Akarsh Kumar, Phillip Isola

This paper proposes Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely.

View →
cs.LGcs.AIEmpiricalComprehensiveRecentJun 4, 2026

Pretraining Recurrent Networks without Recurrence

Akarsh Kumar, Phillip Isola

This paper proposes Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely.

View →
cs.CVcs.CLRecentMay 30, 2026

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more

The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…

View →
cs.CVcs.AIRecentMay 30, 2026

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

Shivam Singh, Saptarshi Majumdar, Pratik Prabhanjan, Zicheng Liu +1 more

The paper introduces pause-and-think-T, a reasoning-centric dataset and benchmark that enables compact Vision-Language Models to perform visually grounded, context-aware action suggestion, matching la…

View →
cs.CLcs.AIRecentMay 27, 2026

ChildEval: When large language models meet children's personalities

Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai +4 more

The paper introduces ChildEval, a large-scale benchmark designed to systematically evaluate how well large language models can infer and follow complex, child-specific preferences during long-context…

View →
cs.CVRecentJun 1, 2026

Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more

The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…

View →
cs.AIcs.LGRecentMay 29, 2026

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Yunpeng Zhou

This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…

View →
cs.CLcs.AIRecentMay 31, 2026

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

Jingjie Lin, Bingbing Wang, Zihan Wang, Zhengda Jin +3 more

The paper introduces RefMem-Bench, a new benchmark for measuring reflective memory in long-horizon dialogue, and proposes REMIND, a framework that significantly improves models' ability to synthesize…

View →