Similar to 2606.01958 · 20 results
The paper introduces a Deep Research pipeline that significantly improves literature search recall and demonstrates that human-curated citation lists are often unreliable and do not serve as a true gr…
Ziyue Yang, Da Ma, Hanqi Li, Zijian Wang +7 more
DeepSurvey is an agentic system that significantly enhances automated survey generation by extracting deep, structured knowledge from full-text papers and rigorously validating citations, achieving su…
The paper introduces a novel, scalable framework to monitor and classify dataset usage within research literature, addressing the current lack of infrastructure for tracking data citations.
Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more
The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
The paper proposes an Interpretive Audit Pipeline to evaluate LLMs for public comment analysis, arguing that measuring inter-model disagreement is crucial because standard accuracy metrics fail to det…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
The paper proposes a comprehensive benchmark to systematically audit how varying persona prompts and model choices affect the technical quality and social representativeness of scholar recommendations…
The paper introduces the VET Framework, a tool for analyzing polarized public discourse on AI by categorizing narratives based on valence, effectiveness, and trajectory, thereby promoting AI literacy.
ResearchLoop introduces an evidence-gated control plane to manage and audit the state of AI-assisted computational research, mitigating the risk of unverified claims.
This study analyzes global usage patterns of generative AI among early adopters, finding that usage varies significantly by country income, with schooling being the primary use in low-income countries…
The paper introduces a typed claim network that models cross-document references by explicitly labeling the stance (e.g., agreement, disagreement) of a citation, significantly improving downstream tas…
RealityTest introduces a large-scale, multimodal, and multilingual benchmark using real-world human data to test how AI systems disclose their identity, finding that context and phrasing are more crit…
Hao Chen, Xing Tang, Qirui Liu, Weijie Shi +5 more
The paper introduces the Data-centric Reasoning Compiler (DCRC), a novel data-driven framework that enhances financial QA systems by compiling user queries and retrieved documents into verifiable, exe…
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, often failing when the mis…
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, failing particularly when…
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more
Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.
The paper evaluates LLM-generated reactions to Spanish online news, finding that off-the-shelf models fail to accurately reproduce the measurable properties of real audience discourse, and even fine-t…
The paper introduces TELL, a novel explainable AI-generated text detection architecture that provides detailed, human-understandable explanations for its scores, achieving competitive performance whil…