ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2606.00994· 20 results

cs.AIRecentMay 27, 2026

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

James P. Balhoff, Hilmar Lapp

Frontier LLM-based agents can effectively overcome the manual bottleneck of phenotype annotation by achieving consistency comparable to human experts, significantly outperforming existing NLP tools.

View →
cs.CLcs.AIRecentMay 29, 2026

Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage

Shuheng Cao, Ruiqi Chen, Renjie Cao, Zhenhao Zhang +2 more

The paper introduces BioConCal, a supervised scoring mechanism that evaluates biomedical NER candidates surfaced by multiple LLMs, significantly improving the quality of the candidate pool for human c…

View →
cs.AIRecentMay 30, 2026

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su +1 more

The paper introduces Ryze, an automated system that synthesizes evidence-enriched Question-Answering (QA) pairs from raw biomedical papers, resulting in a specialized VLM (BioVLM-8B) that significantl…

View →
cs.AIEmpiricalRecentJun 11, 2026

Automated reproducibility assessments in the social and behavioral sciences using large language models

Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger +6 more

This paper shows that large language models can automate reproducibility assessments in the social and behavioral sciences.

View →
cs.CLcs.AIRecentJun 1, 2026

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen +9 more

This paper conducts a large-scale audit of human annotation reporting in NLP, finding that while reporting has improved, critical details needed to assess annotation validity, such as training and agr…

View →
cs.CRcs.CLRecentApr 28, 2026

The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive

Alex Bogdan, Adrian de Valois-Franklin

The paper identifies a universal, statistically predictable distribution (Mandelbrot) governing LLM outputs, enabling a highly efficient, model-agnostic scoring primitive for provenance and quality as…

View →
cs.CLcs.AIcs.LGRecentJun 1, 2026

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

Atoosa Chegini, Soheil Feizi

The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…

View →
cs.AIcs.CRcs.LGRecentMay 28, 2026

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Caleb DeLeeuw

The paper introduces BioRefusalAudit, a method that audits the structural soundness of language model biosecurity refusals, finding that refusal behavior is highly unstable, often collapsing under min…

View →
cs.AIcs.CRcs.LGRecentMay 28, 2026

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Caleb DeLeeuw

The paper audits the structural soundness of LLM biosecurity refusals, finding that refusal behavior is highly unstable, often collapsing under minor prompt changes, and may track legal salience rathe…

View →
cs.CLcs.AIRecentJun 1, 2026

AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis

Massimiliano Pronesti, Angelo Miculescu, Mohsin Kapdi, Paul Flanagan +7 more

AutoForest is an end-to-end system that automatically generates publication-ready forest plots directly from biomedical papers, streamlining the labor-intensive process of meta-analysis.

View →
cs.CRcs.AIRecentMay 28, 2026

SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing

Almene De Meran Meguimtsop, Maria Leonor Pacheco, Daniel E. Acuna

The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, often failing when the mis…

View →
cs.CRcs.AIRecentMay 28, 2026

SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing

Almene De Meran Meguimtsop, Maria Leonor Pacheco, Daniel E. Acuna

The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, failing particularly when…

View →
cs.CLRecentJun 1, 2026

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao +4 more

The paper introduces CultureForest, a new benchmark for evaluating Cultural Norm Grounded Reasoning in LLMs, demonstrating that models struggle to apply their cultural knowledge effectively in realist…

View →
cs.LGcs.CRRecentApr 29, 2026

Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

Guillermo Iglesias, Gema Bello-Orgaz, María Navas-Loro, Cristian Ramirez-Atencia +2 more

This paper evaluates multiple LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for generating privacy-safe, high-quality synthetic mental health reports, demonstrating their effectiveness in expanding…

View →
cs.CLcs.AIRecentMay 28, 2026

ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

Nazarii Shportun

The authors created ImmigrationQA, a large source-grounded QA dataset for U.S. immigration law, and fine-tuned a small language model (Llama 3.2 3B) on it, achieving a significant performance boost ov…

View →
cs.CLcs.AIRecentMay 28, 2026

EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL

Huawei Zheng, Sen Yang, Zhaorui Yang, Yuhui Zhang +11 more

EviLink addresses the ambiguity of schema linking in Text-to-SQL by treating it as an uncertainty-aware inference over multiple plausible SQL paths, significantly improving recall and efficiency.

View →
cs.AIRecentMay 28, 2026

Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering

Yeong-Joon Ju, Seong-Whan Lee

The paper introduces RefWalk, a novel framework designed to improve regulatory compliance question answering by ensuring rigorous citation traceability and explicit per-rule attribution across complex…

View →
cs.CLcs.AIRecentMay 28, 2026

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Volodymyr Ovcharov

The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…

View →
cs.CRcs.LGcs.SERecentApr 30, 2026

REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)

Jun Yeon Won, Xin Jin, Shiqing Ma, Zhiqiang Lin

The paper introduces REBench, a comprehensive, standardized benchmark dataset designed to enable fair and rigorous evaluation of Large Language Models (LLMs) on complex binary reverse engineering task…

View →
cs.AIRecentMay 28, 2026

Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

Yiming Liu, Bin Lu, Meng Jin, Ziyuan Sang +5 more

The paper introduces Compass, an expert-guided LLM agent framework that successfully extracts and integrates thousands of previously inaccessible marine lead records from vast corpora of scientific pa…

View →