~ similar to 2606.00174· 17 results
The paper introduces semantic motion anchors, a method that bridges the gap between spoken text and gesture meaning by providing structured, semantically grounded supervision, significantly improving…
EVA-Net proposes a two-stage framework that uses action videos as semantic priors to achieve strong subject-independent EEG motor decoding, significantly outperforming text-based methods.
Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou +4 more
The paper introduces ERGeoBench, a comprehensive diagnostic benchmark designed to evaluate the fine-grained capabilities of multimodal large language models (MLLMs) for embodied geo-localization acros…
The paper proposes CTRL-STEER, a closed-loop framework that adaptively adjusts intervention strength to stabilize concept regulation and improve task success in Vision-Language-Action models without r…
The paper proposes UF-AMA, a unified framework that achieves state-of-the-art cross-domain emotion recognition by adaptively aligning and fusing multimodal physiological signals like EEG and eye-track…
Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen +7 more
The paper proposes Agentic ASR, a closed-loop framework that treats ASR as a multi-turn refinement task, significantly improving semantic accuracy over traditional token-level metrics.
The paper introduces REST-ASMR, a novel multimodal dataset combining PPG and behavioral responses to ASMR and nature videos, and demonstrates that a deep learning model can accurately predict ASMR tin…
Yuefeng Peng, Mingzhe Li, Kejing Xia, Renhao Zhang +1 more
This paper presents the first systematic study of membership inference attacks (MIAs) against Vision-Language-Action (VLA) models, demonstrating that these models are highly vulnerable to privacy brea…
Xiangyi Chen, Zelun Wang, Xinyi Li, Yi-Ping Hsu +2 more
The paper proposes PrefixMem, a dedicated encoder for Semantic IDs (SIDs), demonstrating that structured, prefix-conditioned representations significantly improve the accuracy and recall of generative…
Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang +8 more
VLA-Trace is a diagnostic framework that analyzes Vision-Language-Action (VLA) models by tracing their internal representations and external behaviors, revealing that while these models are good at vi…
Sarmistha Das, Vaibhav Vishal, Shreyas Guha, Amaan Ali +2 more
This paper introduces a Hybrid Mixture-of-Experts (HybridMoE) framework and a specialized corpus (Varnika) to significantly improve language models' ability to understand and retain figurative, cultur…
The paper introduces Partial Information Decomposition (PID) to quantitatively separate unique, redundant, and synergistic contributions of different modalities (e.g., vision, language) in multimodal…
Kıvanç Kuzey Dikici, Serdar Kara, Semih Çağlar, Eray Tüzün +1 more
SERSEM introduces a selective entropy-weighted scoring framework to significantly improve Membership Inference Attacks (MIAs) against code LLMs by focusing on human-centric coding anomalies rather tha…
The paper introduces Q-ALIGN DT, a novel framework that improves conditioned sequence models by enforcing alignment between the input return-to-go (RTG) signal and the output policy's expected Q-value…
Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more
The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.
Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye +36 more
Qwen-VLA introduces a unified embodied foundation model that extends vision-language understanding to continuous action generation, enabling robust, multi-task generalization across diverse robotic ta…
Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang +5 more
The paper introduces SpatialAct, a challenging benchmark that reveals a significant 'reasoning-to-action gap,' showing that current VLMs struggle to maintain coherent spatial understanding and perform…