Papers similar to 2605.29170

~ similar to 2605.29170· 20 results

cs.CLcs.AIRecentMay 28, 2026

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…

View →

cs.CLcs.AIRecentMay 27, 2026

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak +5 more

The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…

View →

cs.CLRecentMay 28, 2026

CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law

Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead +1 more

The paper introduces CanLegalRAGBench, a new Canadian legal QA benchmark, and evaluates RAG systems, finding that while open-source models are competitive, automatic evaluations struggle with nuanced…

View →

cs.CLRecentJun 1, 2026

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more

The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…

View →

cs.CLcs.AIcs.LGRecentMay 27, 2026

Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization

Ahmed Faizul Haque Dhrubo, Souvik Pramanik, Most. Aysha Siddika Sumona, Shahnewaz Siddique +3 more

The paper proposes a novel KAN-enhanced BiGRU architecture to improve legal document classification and summarization in a low-resource, multilingual setting using Bengali and English legal texts.

View →

cs.CLcs.AIRecentMay 27, 2026

The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment

Junyu Lu, Qi Wei, Peishuo Zheng, Jie Zhang +5 more

The paper introduces Prosecution Decision Prediction (PDP), a new legal AI task that assesses prosecutorial review decisions, showing that current state-of-the-art LLMs perform significantly worse on…

View →

cs.CRcs.AIcs.LGRecentMar 23, 2026

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski, Stephan Kleber

This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts…

View →

cs.CLcs.IRRecentJun 2, 2026

Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA

Mohamed Hesham Elganayni, Selim Saleh

The paper introduces a cross-encoder re-ranker trained on attribution scores to improve the retrieval of highly relevant citation passages for legal question answering, outperforming standard semantic…

View →

cs.CLcs.DLRecentMay 30, 2026

Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs

Volodymyr Ovcharov

The paper introduces Citation Grounding (CG), a novel metric and framework, to systematically detect and reduce the hallucination of legal citations by verifying LLM outputs against a massive, structu…

View →

cs.CLcs.AIRecentMay 27, 2026

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

This study systematically analyzes strategies for creating reliable multilingual LLMs-as-a-judge, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…

View →

cs.CLcs.AIRecentMay 28, 2026

ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

Nazarii Shportun

The authors created ImmigrationQA, a large source-grounded QA dataset for U.S. immigration law, and fine-tuned a small language model (Llama 3.2 3B) on it, achieving a significant performance boost ov…

View →

cs.AIcs.CLcs.CYRecentMay 29, 2026

On Wednesdays, We Ask Questions: Optimizing "Active Listening" in Automated Legal Triage and Referral

Quinten Steenhuis, Jacqueline Harvey

The paper evaluates an automated legal triage system (FETCH) that uses follow-up questions, demonstrating that while low-cost LLMs are effective for classification, generating high-quality questions r…

View →

stat.OTcs.AIEmpiricalRecentJun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

View →

stat.OTcs.AIEmpiricalRecentJun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

View →

cs.CLcs.AIcs.MARecentMay 27, 2026

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei +4 more

LegalGraphRAG introduces a multi-agent, hierarchical graph retrieval-augmented generation framework to overcome the limitations of traditional RAG in legal domains, achieving state-of-the-art reliable…

View →

cs.CLRecentMay 29, 2026

Bundesrecht: An Open Library and Corpus for German Statutory Reference Processing

Harshil Darji, Martin Heckelmann, Christina Kratsch, Gerard de Melo

The paper introduces 'bundesrecht,' an open-source, end-to-end pipeline for processing complex German statutory references, which parses, normalizes, and resolves raw citation strings into structured,…

View →

cs.CYcs.AIRecentMay 28, 2026

The New Pro Se: Generative AI and the Surge in Federal Civil Self-Representation

Or Cohen-Sasson

The paper analyzes federal civil litigation data and finds that the widespread use of generative AI has significantly increased the rate of self-represented plaintiffs, but this AI-assisted drafting d…

View →

cs.AIRecentMay 27, 2026

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang +1 more

The paper introduces HRBench, a unified and comprehensive evaluation framework for systematically benchmarking and comparing various thinking-mode switching strategies in hybrid-reasoning LLMs.

View →

cs.SEcs.AIRecentMay 28, 2026

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

Vedant Padwal

The paper introduces CodeGolf Bench, a novel multi-language benchmark using code golf to measure LLMs' ability to generate highly concise and efficient code, showing that reasoning models significantl…

View →

cs.AIRecentMay 27, 2026

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu +3 more

The paper introduces PetroBench, a comprehensive benchmark for evaluating Large Language Models across various domains of petroleum engineering, finding that models perform better on subjective tasks…

View →