~ similar to 2606.01434 · 20 results
The paper introduces RefWalk, a novel framework designed to improve regulatory compliance question answering by ensuring rigorous citation traceability and explicit per-rule attribution across complex…
This paper introduces a framework to audit source-dependence in multi-source RAG systems, demonstrating that disagreement across institutional sources is a common and critical failure mode that curren…
Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai +2 more
The paper introduces MIRA, a bilingual benchmark showing that LLMs tend to dilute or omit critical medical information when responding to prompts from users with low health literacy, a pattern te…
The paper introduces Factual Density (FD*), a novel retrieval signal that measures the proportion of verified facts, demonstrating that optimizing RAG retrieval based on this density significantly imp…
Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more
The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…
Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su +1 more
The paper introduces Ryze, an automated system that synthesizes evidence-enriched Question-Answering (QA) pairs from raw biomedical papers, resulting in a specialized VLM (BioVLM-8B) that significantl…
Zhaoyang Jiang, Xuanqi Peng, Fei Teng, Zhizhong Fu +4 more
The paper demonstrates that while distilling large language models for medical QA can significantly improve final answer accuracy, this gain often comes at the cost of factual accuracy and detailed re…
HypothesisMed introduces an inference-time pipeline for biomedical question answering that improves model reliability and structured output generation by fusing multiple model outputs and diagnosing t…
Qing Wang, Tianshi Liu, Minghao Zhou, Jialu Liang +4 more
UniD$^3$ is a novel Knowledge Graph-enhanced RAG framework that processes vast biomedical literature to systematically extract, organize, and validate comprehensive drug-disease knowledge, achieving h…
Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more
The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…
Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu +6 more
The paper introduces SafeRx-Agent, a knowledge-grounded multi-agent framework that improves medication recommendation accuracy and safety by incorporating fine-grained ATC codes and rigorous safety ve…
Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead +1 more
The paper introduces CanLegalRAGBench, a new Canadian legal QA benchmark, and evaluates RAG systems, finding that while open-source models are competitive, automatic evaluations struggle with nuanced…
This study benchmarks four local LLMs for natural-language-to-SQL querying in biopharma manufacturing, finding that general-purpose code-tuned models like Llama 3.1 8B and Qwen 2.5 Coder 7B outperform…
The paper introduces CGM-Agent, a privacy-preserving framework that allows users to ask free-form questions about their continuous glucose data using LLMs while ensuring all computation remains local…
HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang +4 more
The paper argues that current search agents often verify existing knowledge rather than genuinely searching, and introduces LiveBrowseComp, a new benchmark to measure true evidence-driven discovery.
The paper introduces a Deep Research pipeline that significantly improves literature search recall and demonstrates that human-curated citation lists are often unreliable and do not serve as a true gr…
LLM-FACETS introduces an open-source, privacy-preserving framework designed to enable non-technical domain experts and compliance officers to audit and evaluate the transparency and accountability of…
The paper introduces SciIntBench, an adversarial benchmark revealing that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, often failing when the mis…
The paper introduces SciIntBench, an adversarial benchmark revealing that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, failing particularly when…
The paper proposes a rigorous, fixed-budget, cluster-aware standard for LLM-as-a-judge evaluation of multi-hop RAG systems, demonstrating that current evaluation methods often overstate performance.