~ similar to 2606.06481· 18 results
The paper introduces TSM-Bench, a new benchmark that demonstrates existing LLM-generated text detectors fail to accurately identify task-specific machine-generated content found in real-world Wikipedi…
The paper introduces TELL, a novel explainable AI-generated text detection architecture that provides detailed, human-understandable explanations for its scores, achieving competitive performance whil…
Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin +5 more
The paper introduces DocFormBench, a new benchmark for content-aware document formatting, and proposes DocFormFlow, a workflow that improves formatting accuracy and efficiency by decoupling target loc…
Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more
The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.
The paper introduces a distribution-free statistical framework that allows existing rewrite-based detectors to achieve finite-sample False Discovery Rate (FDR) guarantees for detecting LLM-generated t…
Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo +21 more
The paper introduces Dr. DocBench, a difficulty-aware, comprehensive benchmark designed to rigorously test expert-level and challenging document parsing capabilities for VLMs, demonstrating that curre…
The paper proposes 'Uncertainty,' a multiscale uncertainty estimator that focuses on low-probability tokens to improve the detection of AI-generated text by addressing boilerplate dominance and score…
The paper introduces UniKE, a benchmark showing that successful knowledge edits in text-only multimodal models do not reliably transfer to image generation, revealing a significant modality gap.
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
The paper introduces HOPM, a hierarchical online prompt mutation framework that significantly improves the performance of language models in high-stakes evidence document generation by integrating dua…
This paper conducts a large-scale audit of human annotation reporting in NLP, finding that while reporting has improved, critical details needed to assess annotation validity, such as training and agr…
The paper introduces a Deep Research pipeline that significantly improves literature search recall and demonstrates that human-curated citation lists are often unreliable and do not serve as a true gr…
This paper introduces a new benchmark dataset and evaluation framework for 'data snapshot extraction,' focusing on identifying and localizing semantically meaningful analytical artifacts within operat…
Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen +11 more
The paper introduces CV-Arena, a large-scale open benchmark for instructional computer vision, demonstrating that professional-grade image editing requires advanced capabilities in physical reasoning…
Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more
The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…
Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu +20 more
The paper introduces TVIR, a new benchmark and multi-agent framework for deep research, to evaluate and improve the generation of factually reliable, text-visual interleaved reports.
The paper introduces I-WebGenBench, a framework and benchmark that converts static scientific papers into executable, interactive web systems, allowing users to dynamically explore the paper's mechani…