~ similar to 2605.30589· 20 results
The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…
The paper evaluates an automated legal triage system (FETCH) that uses follow-up questions, demonstrating that while low-cost LLMs are effective for classification, generating high-quality questions r…
The paper introduces UA-Legal-Bench, a comprehensive Ukrainian legal reasoning benchmark built from a massive judicial corpus, demonstrating that LLM performance is highly task-dependent and that simp…
The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…
Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead +1 more
The paper introduces CanLegalRAGBench, a new Canadian legal QA benchmark, and evaluates RAG systems, finding that while open-source models are competitive, automatic evaluations struggle with nuanced…
The paper introduces a cross-encoder re-ranker trained on attribution scores to improve the retrieval of highly relevant citation passages for legal question answering, outperforming standard semantic…
This study finds that when users do not specify a jurisdiction, the language used in the prompt strongly biases the LLM's response toward a specific national legal framework (U.S. for English, China f…
The paper introduces CyberCertBench, a new benchmark suite for evaluating LLMs against industry cybersecurity certifications, finding that while frontier models perform well on general knowledge, thei…
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more
The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…
Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris +4 more
PReMISE introduces a framework to audit and improve the quality of rubrics used to guide LLM judges, demonstrating that it can significantly increase judge accuracy and reduce the exploitability of re…
The paper analyzes federal civil litigation data and finds that the widespread use of generative AI has significantly increased the rate of self-represented plaintiffs, but this AI-assisted drafting d…
Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei +4 more
LegalGraphRAG introduces a multi-agent, hierarchical graph retrieval-augmented generation framework to overcome the limitations of traditional RAG in legal domains, achieving state-of-the-art reliable…
The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…
The paper benchmarks local, offline LLMs for confidential translation workflows, demonstrating that while they are viable for privacy-sensitive use, they generally lag behind top commercial NMT system…
The paper introduces WIRE, a pipeline for diagnosing live intra-policy rule conflicts in LLM agents by identifying and testing specific rule pairs within a single prompt policy that can co-govern a re…
Sherzod Turaev, Mary John, Mamoun Awad, Nazar Zaki +1 more
The paper introduces a robust four-stage NLP framework that uses schema-constrained LLMs and ESCO vocabulary to accurately extract and align educational competencies with labor market demands, quantif…
The paper introduces HOPM, a hierarchical online prompt mutation framework that significantly improves the performance of language models in high-stakes evidence document generation by integrating dua…
This paper introduces a framework to audit source-dependence in multi-source RAG systems, demonstrating that disagreement across institutional sources is a common and critical failure mode that curren…
LLM-FACETS introduces an open-source, privacy-preserving framework designed to enable non-technical domain experts and compliance officers to audit and evaluate the transparency and accountability of…
Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao +4 more
The paper introduces CultureForest, a new benchmark for evaluating Cultural Norm Grounded Reasoning in LLMs, demonstrating that models struggle to apply their cultural knowledge effectively in realist…