~ similar to 2605.29170· 20 results
The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…
The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…
Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead +1 more
The paper introduces CanLegalRAGBench, a new Canadian legal QA benchmark, and evaluates RAG systems, finding that while open-source models are competitive, automatic evaluations struggle with nuanced…
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
The paper proposes a novel KAN-enhanced BiGRU architecture to improve legal document classification and summarization in a low-resource, multilingual setting using Bengali and English legal texts.
Junyu Lu, Qi Wei, Peishuo Zheng, Jie Zhang +5 more
The paper introduces Prosecution Decision Prediction (PDP), a new legal AI task that assesses prosecutorial review decisions, showing that current state-of-the-art LLMs perform significantly worse on…
This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts…
The paper introduces a cross-encoder re-ranker trained on attribution scores to improve the retrieval of highly relevant citation passages for legal question answering, outperforming standard semantic…
The paper introduces Citation Grounding (CG), a novel metric and framework, to systematically detect and reduce the hallucination of legal citations by verifying LLM outputs against a massive, structu…
This study systematically analyzes strategies for creating reliable multilingual LLMs-as-a-judge, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…
The authors created ImmigrationQA, a large source-grounded QA dataset for U.S. immigration law, and fine-tuned a small language model (Llama 3.2 3B) on it, achieving a significant performance boost ov…
The paper evaluates an automated legal triage system (FETCH) that uses follow-up questions, demonstrating that while low-cost LLMs are effective for classification, generating high-quality questions r…
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei +4 more
LegalGraphRAG introduces a multi-agent, hierarchical graph retrieval-augmented generation framework to overcome the limitations of traditional RAG in legal domains, achieving state-of-the-art reliable…
The paper introduces 'bundesrecht,' an open-source, end-to-end pipeline for processing complex German statutory references, which parses, normalizes, and resolves raw citation strings into structured,…
The paper analyzes federal civil litigation data and finds that the widespread use of generative AI has significantly increased the rate of self-represented plaintiffs, but this AI-assisted drafting d…
Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang +1 more
The paper introduces HRBench, a unified and comprehensive evaluation framework for systematically benchmarking and comparing various thinking-mode switching strategies in hybrid-reasoning LLMs.
The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…
Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu +3 more
The paper introduces PetroBench, a comprehensive benchmark for evaluating Large Language Models across various domains of petroleum engineering, finding that models perform better on subjective tasks…