20 results for “Multimodal parsing”
CS papers onlyHybrid search: Keyword + semantic, ranked by combined score.ⓘ
Want pure semantic search? Try claim verification →
Sarmistha Das, Vaibhav Vishal, Shreyas Guha, Amaan Ali +2 more
This paper introduces a Hybrid Mixture-of-Experts (HybridMoE) framework and a specialized corpus (Varnika) to significantly improve language models' ability to understand and retain figurative, cultur…
The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…
The paper introduces Partial Information Decomposition (PID) to quantitatively separate unique, redundant, and synergistic contributions of different modalities (e.g., vision, language) in multimodal…
The paper systematically compares multimodal transformer and LLM approaches for document type classification, finding that specialized multimodal Transformers outperform LLM-based models, especially w…
The paper introduces COMET, a novel PLS-SVD framework, to analyze the audio-text modality gap in CLAP models, showing that shared concepts are captured by a small subset of axes, and proposes a spectr…
Guanghao Zhu, Zeyu Liu, Zhitian Hou, Pengkai Wang +8 more
The paper introduces PMC-InterCPT, a refined biomedical interleaved corpus that enhances multimodal continued pretraining by integrating figure-referencing body text alongside captions, leading to imp…
This paper proposes a lightweight encoder-based MEL solution called FAST-MEL that meets three objectives: high linking accuracy, computational efficiency, and storage efficiency.
Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo +21 more
The paper introduces Dr. DocBench, a difficulty-aware, comprehensive benchmark designed to rigorously test expert-level and challenging document parsing capabilities for VLMs, demonstrating that curre…
The paper enhances French parsing accuracy by integrating data from a syntactic lexicon and applying word clustering methods to verbs within a Probabilistic Context-Free Grammar framework.
The paper proposes CYKNN, a novel recurrent neural network architecture that directly encodes the CYK parsing algorithm, demonstrating superior performance over large language models on syntactic pars…
This paper evaluates biases in multimodal speech recognition by testing how pairing different faces with the same audio affects transcription accuracy, finding significant quality-of-service drops acr…
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more
PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…
Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang +1 more
The paper introduces PAR3D, a unified part-aware 3D-MLLM framework, to enhance 3D scene understanding by enabling models to reason about and ground both whole objects and their fine-grained parts.
KidsNanny is a two-stage multimodal content moderation pipeline that achieves high accuracy and efficiency in detecting child safety threats, particularly excelling in text-embedded content.
The paper proposes a novel multimodal framework for session-based music recommendation that jointly models audio, lyric, and semantic content signals within a unified LLM-based sequential reasoning sy…
Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu +21 more
MOSS-Audio is a unified audio-language model designed for comprehensive understanding of speech, environmental sounds, and music, achieving strong performance across various audio-grounded tasks.
This survey provides a comprehensive taxonomy and vulnerability-centric analysis of adversarial attacks targeting Multimodal Large Language Models (MLLMs), offering an explanatory framework for enhanc…
The paper introduces a Conflict-aware Penalty (CP) and Statistical Loss (SL) framework to stabilize and balance the training of multimodal sentiment analysis models, achieving state-of-the-art perform…
Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang +2 more
The paper proposes AsyMoE, a novel Mixture of Experts architecture for Large Vision-Language Models that explicitly models the inherent asymmetry between visual and linguistic modalities, achieving si…
The paper introduces TalkTag, an LLM-based tool that automates fine-grained morphosyntactic error annotation for spoken-language transcripts, providing a scalable alternative to labor-intensive manual…