~ similar to 2606.01936· 20 results
Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo +21 more
The paper introduces Dr. DocBench, a difficulty-aware, comprehensive benchmark designed to rigorously test expert-level and challenging document parsing capabilities for VLMs, demonstrating that curre…
The paper systematically compares multimodal transformer and LLM approaches for document type classification, finding that specialized multimodal Transformers outperform LLM-based models, especially w…
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
Sangwon Ryu, Yihong Liu, Mingyang Wang, Yunsu Kim +3 more
The paper introduces a new benchmark for multi-target cross-lingual summarization (MTXLS) and proposes an activation steering method that significantly improves LLM performance by guiding the generati…
This paper introduces a new benchmark dataset and evaluation framework for 'data snapshot extraction,' focusing on identifying and localizing semantically meaningful analytical artifacts within operat…
The paper introduces OpAI-Bench, a novel benchmark designed to study how AI authorship signals evolve and accumulate during the progressive co-editing process between humans and AI.
The paper systematically compares multiple content representations for RAG pipelines and finds that answer retention—the ability of the representation to preserve the original answer-bearing content—i…
SkillPager is a novel two-stage framework that efficiently selects minimal, execution-sufficient context from large procedural skill documents by leveraging typed semantic nodes, significantly reducin…
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
Divya Tadimeti, Shawn Pan, Sameera Lanka, Chenghui Zhou +1 more
This paper demonstrates that targeted adaptation of the small language model Phi Silica, using dataset curation and fine-tuning, significantly improves its performance in short-form text rewriting, na…
The paper introduces TSM-Bench, a new benchmark that demonstrates existing LLM-generated text detectors fail to accurately identify task-specific machine-generated content found in real-world Wikipedi…
This study systematically evaluates a wide range of chunking methods for Retrieval-Augmented Generation (RAG) to assess their effectiveness and highlight the overlooked challenges associated with chun…
Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more
The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.
Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang +7 more
This paper proposes four guidelines and two novel data ordering methods (STR and SAW) to systematically optimize data organization, significantly enhancing the stability and performance of LLM trainin…
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
The paper introduces I-WebGenBench, a framework and benchmark that converts static scientific papers into executable, interactive web systems, allowing users to dynamically explore the paper's mechani…
The paper introduces XLGoBench, a synthetic benchmark of algorithmic tasks designed to detect persistent cross-lingual skill gaps in large language models.
The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…
This paper introduces robustness indicators to systematically analyze how multilingual text embedding model rankings change based on dataset composition and aggregation methods, revealing that only a…