Papers similar to 2606.06481

18 results

cs.CL · Recent · May 29, 2026

TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices

Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl

The paper introduces TSM-Bench, a new benchmark that demonstrates existing LLM-generated text detectors fail to accurately identify task-specific machine-generated content found in real-world Wikipedi…

cs.AI, cs.CL, cs.CY · Recent · May 27, 2026

Show, Don't TELL: Explainable AI-Generated Text Detection

Aldan Creo, Suraj Ranganath

The paper introduces TELL, a novel explainable AI-generated text detection architecture that provides detailed, human-understandable explanations for its scores, achieving competitive performance whil…

cs.CL · Recent · Jun 1, 2026

What to Format and How: A Benchmark and Workflow Approach for Document Formatting

Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin +5 more

The paper introduces DocFormBench, a new benchmark for content-aware document formatting, and proposes DocFormFlow, a workflow that improves formatting accuracy and efficiency by decoupling target loc…

cs.LG, cs.CL · Recent · May 28, 2026

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more

The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.

stat.ME, cs.AI, stat.AP · Recent · May 29, 2026

A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering

Yi Liu

The paper introduces a distribution-free statistical framework that allows existing rewrite-based detectors to achieve finite-sample False Discovery Rate (FDR) guarantees for detecting LLM-generated t…

cs.CL, cs.AI, cs.CV · Recent · May 31, 2026

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo +21 more

The paper introduces Dr. DocBench, a difficulty-aware, comprehensive benchmark designed to rigorously test expert-level and challenging document parsing capabilities for VLMs, demonstrating that curre…

cs.CL · Recent · Jun 1, 2026

On the Salience of Low-Probability Tokens for AI-Generated Text Detection: A Multiscale Uncertainty Perspective

Yikai Guo, Bin Wang, Xilai Fan, Wenjun Ke +1 more

The paper proposes 'Uncertainty,' a multiscale uncertainty estimator that focuses on low-probability tokens to improve the detection of AI-generated text by addressing boilerplate dominance and score…

cs.CL, cs.CV · Recent · May 30, 2026

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick

The paper introduces UniKE, a benchmark showing that successful text knowledge edits in UMMs do not reliably transfer to image generation, revealing a significant modality gap.

cs.AI · Recent · May 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…

cs.SE, cs.AI, cs.CL · Recent · May 29, 2026

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more

The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…

cs.DC, cs.AI, cs.LG · Recent · May 31, 2026

Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation: A Production-Evaluation Case Study

Nataraj Agaram Sundar Tejas Morabia

The paper introduces HOPM, a hierarchical online prompt mutation framework that significantly improves the performance of language models in high-stakes evidence document generation by integrating dua…

cs.CL, cs.AI · Recent · Jun 1, 2026

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen +9 more

This paper conducts a large-scale audit of human annotation reporting in NLP, finding that while reporting has improved, critical details needed to assess annotation validity, such as training and agr…

cs.AI, cs.IR · Recent · May 28, 2026

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

Gaurav Sahu, Laurent Charlin, Christopher Pal

The paper introduces a Deep Research pipeline that significantly improves literature search recall and demonstrates that human-curated citation lists are often unreliable and do not serve as a true gr…

cs.CL, cs.AI, cs.CV · Recent · Jun 4, 2026

Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

AJ Carl P. Dy, Aivin V. Solatorio

This paper introduces a new benchmark dataset and evaluation framework for 'data snapshot extraction,' focusing on identifying and localizing semantically meaningful analytical artifacts within operat…

cs.CV, cs.AI · Recent · May 30, 2026

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen +11 more

The paper introduces CV-Arena, a large-scale open benchmark for instructional computer vision, demonstrating that professional-grade image editing requires advanced capabilities in physical reasoning…

cs.AI · Recent · Jun 1, 2026

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more

The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…

cs.CL · Recent · Jun 1, 2026

TVIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation

Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu +20 more

The paper introduces TVIR, a new benchmark and multi-agent framework for deep research, to evaluate and improve the generation of factually reliable, text-visual interleaved reports.

cs.CL · Recent · May 30, 2026

I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications

Dasen Dai, Biao Wu, Meng Fang, Shuoqi Li +1 more

The paper introduces I-WebGenBench, a framework and benchmark that converts static scientific papers into executable, interactive web systems, allowing users to dynamically explore the paper's mechani…
