"Multimodal parser" | ArxivCSExplorer

20 results for “Multimodal parser”

CS papers only

Hybrid search: Keyword + semantic, ranked by combined score.ⓘ

Want pure semantic search? Try claim verification →

cs.CLRecentJun 1, 2026

When Meaning Travels: A Granular Lens on Hybrid-MoE's Role in Idiomatic Understanding for Language Models

Sarmistha Das, Vaibhav Vishal, Shreyas Guha, Amaan Ali +2 more

This paper introduces a Hybrid Mixture-of-Experts (HybridMoE) framework and a specialized corpus (Varnika) to significantly improve language models' ability to understand and retain figurative, cultur…

View →

cs.CLcs.AIRecentMay 30, 2026

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

Ravil Mussabayev, Rustam Mussabayev

The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…

View →

cs.CVcs.AIcs.CLRecentJun 1, 2026

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

Catyana Heyne, Jürgen Frikel, Filippo Riccio

The paper systematically compares multimodal transformer and LLM approaches for document type classification, finding that specialized multimodal Transformers outperform LLM-based models, especially w…

View →

cs.IREmpiricalRecentJun 10, 2026

FAST-MEL: A Fast, Accurate, and Storage Efficient Solution for Multimodal Entity Linking

Derrien Thomas, Laurent Amsaleg, Pascale Sébillot

This paper proposes a lightweight encoder-based MEL solution called FAST-MEL that meets three objectives: high linking accuracy, computational efficiency, and storage efficiency.

View →

cs.CLcs.AIeess.ASRecentMay 31, 2026

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more

PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…

View →

cs.CLcs.AIcs.CVRecentMay 31, 2026

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo +21 more

The paper introduces Dr. DocBench, a difficulty-aware, comprehensive benchmark designed to rigorously test expert-level and challenging document parsing capabilities for VLMs, demonstrating that curre…

View →

cs.AIRecentMay 31, 2026

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

Wanlong Fang, Tianle Zhang, Wen Tao, Alvin Chan

The paper introduces Partial Information Decomposition (PID) to quantitatively separate unique, redundant, and synergistic contributions of different modalities (e.g., vision, language) in multimodal…

View →

cs.CLcs.LGRecentMay 30, 2026

French parsing enhanced with a word clustering method based on a syntactic lexicon

Anthony Sigogne, Matthieu Constant, Eric Laporte

The paper enhances French parsing accuracy by integrating data from a syntactic lexicon and applying word clustering methods to verbs within a Probabilistic Context-Free Grammar framework.

View →

cs.SDcs.AIRecentJun 1, 2026

MOSS-Audio Technical Report

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu +21 more

MOSS-Audio is a unified audio-language model designed for comprehensive understanding of speech, environmental sounds, and music, achieving strong performance across various audio-grounded tasks.

View →

cs.CLRecentMay 31, 2026

PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining

Guanghao Zhu, Zeyu Liu, Zhitian Hou, Pengkai Wang +8 more

The paper introduces PMC-InterCPT, a refined biomedical interleaved corpus that enhances multimodal continued pretraining by integrating figure-referencing body text alongside captions, leading to imp…

View →

cs.CLRecentMay 28, 2026

Your Multimodal Speech Model Says I Have a Face for Radio

Maya K. Nachesa, Vlad Niculae, Vagrant Gautam

This paper evaluates biases in multimodal speech recognition by testing how pairing different faces with the same audio affects transcription accuracy, finding significant quality-of-service drops acr…

View →

cs.CLcs.AIcs.DSRecentMay 29, 2026

Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm

Fabio Massimo Zanzotto, Federico Ranaldi, Giorgio Satta

The paper proposes CYKNN, a novel recurrent neural network architecture that directly encodes the CYK parsing algorithm, demonstrating superior performance over large language models on syntactic pars…

View →

cs.SDcs.AIcs.CLRecentMay 28, 2026

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

The paper introduces COMET, a novel PLS-SVD framework, to analyze the audio-text modality gap in CLAP models, showing that shared concepts are captured by a small subset of axes, and proposes a spectr…

View →

cs.CVRecentJun 4, 2026

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang +1 more

The paper introduces PAR3D, a unified part-aware 3D-MLLM framework, to enhance 3D scene understanding by enabling models to reason about and ground both whole objects and their fine-grained parts.

View →

cs.MMcs.AIcs.CLRecentMay 29, 2026

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana +1 more

This pilot study evaluates curator-guided multilingual art description using a small, on-premise VLM (Qwen2.5-VL-3B-Instruct) for German, Romanian, and Serbian, finding that language-specific adapters…

View →

cs.CVcs.CRRecentMar 17, 2026

KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety

Viraj Panchal, Tanmay Talsaniya, Parag Patel, Meet Patel

KidsNanny is a two-stage multimodal content moderation pipeline that achieves high accuracy and efficiency in detecting child safety threats, particularly excelling in text-embedded content.

View →

cs.IRcs.AIcs.LGRecentMay 28, 2026

Multimodal Music Recommendation System using LLMs

Srikar Prabhas Kandagatla, Sreehitha R. Narayana, Chandana Magapu, Swetha Mohan +5 more

The paper proposes a novel multimodal framework for session-based music recommendation that jointly models audio, lyric, and semantic content signals within a unified LLM-based sequential reasoning sy…

View →

cs.CRcs.AIRecentMar 30, 2026

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

Bhavuk Jain, Sercan Ö. Arık, Hardeo K. Thakur

This survey provides a comprehensive taxonomy and vulnerability-centric analysis of adversarial attacks targeting Multimodal Large Language Models (MLLMs), offering an explanatory framework for enhanc…

View →

cs.CLRecentJun 1, 2026

TalkTag: Fine-Grained Morphosyntactic Error Annotation for Transcribed Speech

Shamira Venturini, Oliver Hennhöfer, Steffen Kinkel, Jannik Strötgen

The paper introduces TalkTag, an LLM-based tool that automates fine-grained morphosyntactic error annotation for spoken-language transcripts, providing a scalable alternative to labor-intensive manual…

View →

cs.IRcs.AIRecentMay 29, 2026

LLMs Need Encoders for Semantic IDs Too

Xiangyi Chen, Zelun Wang, Xinyi Li, Yi-Ping Hsu +2 more

The paper proposes PrefixMem, a dedicated encoder for Semantic IDs (SIDs), demonstrating that structured, prefix-conditioned representations significantly improve the accuracy and recall of generative…

View →