~ similar to 2605.30582· 20 results
Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang +1 more
The paper introduces CITETRACE, a large-scale dataset and evaluation framework that systematically measures structural citation failures in search-augmented LLMs, revealing a pattern called Verified M…
The paper introduces a Deep Research pipeline that significantly improves literature search recall and demonstrates that human-curated citation lists are often unreliable and do not serve as a true gr…
Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang +7 more
This paper proposes four guidelines and two novel data ordering methods (STR and SAW) to systematically optimize data organization, significantly enhancing the stability and performance of LLM trainin…
The paper introduces an LLM-based pipeline that tags learning resources with structured competencies, achieving strong performance while providing traceable evidence and leveraging graph constraints.
The paper introduces a typed claim network that models cross-document references by explicitly labeling the stance (e.g., agreement, disagreement) of a citation, significantly improving downstream tas…
The paper proposes a comprehensive benchmark to systematically audit how varying persona prompts and model choices affect the technical quality and social representativeness of scholar recommendations…
The paper introduces PRAIB, a benchmark that demonstrates that LLM-generated peer reviews, while often verbose, systematically diverge from human norms by being less variable, positively biased, and f…
The paper introduces a multilingual corpus and demonstrates that small, fine-tuned language models (SLMs) are highly effective for Citation Needed Detection (CND) in lower-resource languages, often ou…
The paper introduces SPIRE, a multi-agent framework designed to extend LLM research capabilities to the humanities by enabling evidence-grounded interpretive reasoning over primary sources.
The paper introduces TSM-Bench, a new benchmark that demonstrates existing LLM-generated text detectors fail to accurately identify task-specific machine-generated content found in real-world Wikipedi…
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, often failing when the mis…
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, failing particularly when…
The paper introduces RefWalk, a novel framework designed to improve regulatory compliance question answering by ensuring rigorous citation traceability and explicit per-rule attribution across complex…
This paper conducts a large-scale audit of human annotation reporting in NLP, finding that while reporting has improved, critical details needed to assess annotation validity, such as training and agr…
Sunisth Kumar, Xanh Ho, Tim Schopf, Andre Greiner-Petter +2 more
The paper explains the 'table-chart gap' in scientific claim verification by showing that multimodal LLMs successfully encode information from charts but fail to route it to the final prediction layer…
The paper introduces Citation Grounding (CG), a novel metric and framework, to systematically detect and reduce the hallucination of legal citations by verifying LLM outputs against a massive, structu…
The paper introduces FOSSIL, a new multilingual dataset and specialized workflow designed to significantly improve the extraction of citations embedded within complex footnotes common in law and human…
The authors introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.