Papers similar to 2605.28965

~ similar to 2605.28965· 20 results

cs.CLRecentMay 31, 2026

A Registry-Bound LLM Pipeline for Evidence-Grounded Trait Extraction across Tropical Plants, Aquatic Species, and Exotic Pets

The paper introduces a robust, four-mechanism LLM pipeline that generates auditable, evidence-grounded structured trait records for hundreds of thousands of diverse species across multiple taxa.

View →

cs.CLcs.AIRecentJun 1, 2026

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen +9 more

This paper conducts a large-scale audit of human annotation reporting in NLP, finding that while reporting has improved, critical details needed to assess annotation validity, such as training and agr…

View →

cs.CLcs.AIRecentMay 29, 2026

Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage

Shuheng Cao, Ruiqi Chen, Renjie Cao, Zhenhao Zhang +2 more

The paper introduces BioConCal, a supervised scoring mechanism that evaluates biomedical NER candidates surfaced by multiple LLMs, significantly improving the quality of the candidate pool for human c…

View →

cs.IRcs.AIRecentMay 27, 2026

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy

The study compares agentic data retrieval using unstructured web data versus structured, semantically-annotated datasets, concluding that semantic metadata remains essential for high-precision, reliab…

View →

cs.CLRecentMay 31, 2026

Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination

Xiaoqi He, Kaixin Lan, Mu You, Tao Fang +2 more

The paper proposes MACAT, a Multi-Agent Culture-Aware Translation framework, to selectively translate culture-loaded words in ancient Chinese texts, achieving superior performance over existing method…

View →

stat.OTcs.AIEmpiricalRecentJun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

View →

stat.OTcs.AIEmpiricalRecentJun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

View →

cs.CLRecentMay 29, 2026

Extending AI for Research to the Humanities: A Multi-Agent Framework for Evidence-Grounded Scholarship

Yating Pan, Jiajun Zhang, Jun Wang, Qi Su

The paper introduces SPIRE, a multi-agent framework designed to extend LLM research capabilities to the humanities by enabling evidence-grounded interpretive reasoning over primary sources.

View →

cs.SEcs.AIcs.CLRecentMay 29, 2026

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more

The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…

View →

cs.AIEmpiricalRecentJun 11, 2026

Agents-K1: Towards Agent-native Knowledge Orchestration

Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang +21 more

This paper introduces Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs.

View →

cs.CLcs.AIRecentJun 1, 2026

AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis

Massimiliano Pronesti, Angelo Miculescu, Mohsin Kapdi, Paul Flanagan +7 more

AutoForest is an end-to-end system that automatically generates publication-ready forest plots directly from biomedical papers, streamlining the labor-intensive process of meta-analysis.

View →

cs.AIRecentMay 30, 2026

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su +1 more

The paper introduces Ryze, an automated system that synthesizes evidence-enriched Question-Answering (QA) pairs from raw biomedical papers, resulting in a specialized VLM (BioVLM-8B) that significantl…

View →

cs.CLcs.AIcs.IRRecentMay 28, 2026

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu +5 more

The paper introduces SkillBrew, a multi-objective framework that treats skill bank curation as a constrained optimization problem to build efficient and well-curated skill repositories for LLM agents.

View →

cs.CLcs.AIRecentJun 1, 2026

EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

Tianyi Xu, Yaolun Zhang, Xuan Ouyang, Huazheng Wang

EvoPool introduces an evolutionary multi-agent framework that efficiently generates high-quality, specialized supervision labels, significantly outperforming LLM annotation baselines across complex, l…

View →

cs.CLcs.AIcs.IRRecentMay 28, 2026

Exploring Autonomous Agentic Data Engineering for Model Specialization

Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang +9 more

The paper introduces Autonomous Agentic Data Engineering, demonstrating that LLMs can autonomously plan and optimize end-to-end data curation pipelines, leading to substantial performance gains in spe…

View →

cs.CLRecentMay 31, 2026

Agentic Clustering: Controllable Text Taxonomies via Multi-Agent Refinement

Simon Löwe, Emily Silcock

The paper introduces an agentic framework for text clustering that dynamically adapts the taxonomy generation process using specialized LLM agents, achieving state-of-the-art performance on multiple b…

View →

cs.AIRecentMay 28, 2026

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

A. J. Lew, Y. Cao, M. J. Buehler

The paper introduces ProjectionBench, a novel benchmark that progressively discloses information to evaluate LLMs' ability to generate scientific hypotheses, demonstrating that advanced models like GP…

View →

cs.SEcs.CRRecentMay 10, 2026

Evaluating Tool Cloning in Agentic-AI Ecosystems

Taein Kim, David Jiang, Yuepeng Hu, Yuqi Jia +1 more

The paper presents a large-scale study demonstrating that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems, necessitating changes in how tool diversity is m…

View →

cs.HCcs.AIRecentMay 29, 2026

Agentic Authoring of Interactive Multiview Visualizations in Genomics

Astrid van den Brandt, Kiroong Choe, Sehi L'Yi, Devin Lange +1 more

The paper evaluates various LLM-based agentic schemes for authoring complex, interactive, multiview genomics visualizations, finding that agentic iteration significantly improves visualization quality…

View →

cs.AIRecentJun 1, 2026

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

Shannon Serrao, Soumitra Chatterjee, Dorina Strori, Abhishek Sharma +1 more

BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.

View →