~ similar to 2606.00272· 20 results
Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead +1 more
The paper introduces CanLegalRAGBench, a new Canadian legal QA benchmark, and evaluates RAG systems, finding that while open-source models are competitive, automatic evaluations struggle with nuanced…
The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…
The authors created ImmigrationQA, a large source-grounded QA dataset for U.S. immigration law, and fine-tuned a small language model (Llama 3.2 3B) on it, achieving a significant performance boost ov…
The paper introduces UA-Legal-Bench, a comprehensive Ukrainian legal reasoning benchmark built from a massive judicial corpus, demonstrating that LLM performance is highly task-dependent and that simp…
The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…
Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more
Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.
Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei +4 more
LegalGraphRAG introduces a multi-agent, hierarchical graph retrieval-augmented generation framework to overcome the limitations of traditional RAG in legal domains, achieving state-of-the-art reliable…
This paper introduces a framework to audit source-dependence in multi-source RAG systems, demonstrating that disagreement across institutional sources is a common and critical failure mode that curren…
The paper analyzes federal civil litigation data and finds that the widespread use of generative AI has significantly increased the rate of self-represented plaintiffs, but this AI-assisted drafting d…
Junyu Lu, Qi Wei, Peishuo Zheng, Jie Zhang +5 more
The paper introduces Prosecution Decision Prediction (PDP), a new legal AI task that assesses prosecutorial review decisions, showing that current state-of-the-art LLMs perform significantly worse on…
The paper introduces HOPM, a hierarchical online prompt mutation framework that significantly improves the performance of language models in high-stakes evidence document generation by integrating dua…
This study finds that when users do not specify a jurisdiction, the language used in the prompt strongly biases the LLM's response toward a specific national legal framework (U.S. for English, China f…
Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar +3 more
The paper introduces LLUMI, an open-source framework that improves LLM writing assistance for mental health support using community feedback, demonstrating comparable performance to proprietary models…
The paper introduces a cross-encoder re-ranker trained on attribution scores to improve the retrieval of highly relevant citation passages for legal question answering, outperforming standard semantic…
The paper proposes a rigorous, fixed-budget, cluster-aware standard for LLM-as-a-judge evaluation of multi-hop RAG systems, demonstrating that current evaluation methods often overstate performance.
Qing Wang, Bo Li, Jialu Liang, Daling Shi +2 more
The paper introduces DrugClaw, a multi-agent system, and DrugAudit, a new benchmark, demonstrating that DrugClaw excels at answering drug-related questions by grounding answers in primary regulatory s…
This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts…
Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris +4 more
PReMISE introduces a framework to audit and improve the quality of rubrics used to guide LLM judges, demonstrating that it can significantly increase judge accuracy and reduce the exploitability of re…
Zelin Guan, Shengda Zhuo, Zeyan Li, Jinchun He +3 more
E-MIA introduces a novel, stealthy black-box membership inference attack that converts verifiable hard evidence within a candidate document into an objective, multi-part exam score to determine if the…
Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more
The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.