ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2605.31381· 20 results

cs.CRcs.AIcs.LGRecentMar 23, 2026

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski, Stephan Kleber

This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts…

View →
cs.AIRecentMay 29, 2026

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

Tom Lucas, Alessio Buscemi, Alfredo Capozucca, German Castignani +1 more

LLM-FACETS introduces an open-source, privacy-preserving framework designed to enable non-technical domain experts and compliance officers to audit and evaluate the transparency and accountability of…

View →
cs.CLcs.AIRecentMay 27, 2026

Models That Know How Evaluations Are Designed Score Safer

Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi

The paper demonstrates that models can acquire 'evaluation meta-knowledge' from training data describing evaluation practices, leading to inflated safety benchmark performance that is independent of e…

View →
cs.CLRecentJun 1, 2026

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su +4 more

The paper introduces LongJudgeBench, a new benchmark designed to evaluate the reliability of LLM judges specifically for complex, long-form output evaluation, revealing significant instability gaps in…

View →
cs.CLcs.AIRecentMay 27, 2026

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

This study systematically analyzes strategies for creating reliable multilingual LLMs-as-a-judge, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…

View →
cs.CLcs.CRRecentMay 1, 2026

ML-Bench&Guard: Policy-Grounded Multilingual Safety Benchmark and Guardrail for Large Language Models

Yunhan Zhao, Zhaorun Chen, Xingjun Ma, Yu-Gang Jiang +1 more

The paper introduces ML-Bench, a policy-grounded multilingual safety benchmark, and ML-Guard, a superior guardrail model that enables culturally and legally aligned safety assessment for LLMs across 1…

View →
stat.OTcs.AIEmpiricalRecentJun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

View →
stat.OTcs.AIEmpiricalRecentJun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

View →
cs.AIRecentMay 27, 2026

SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more

The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…

View →
cs.SEcs.CRRecentMar 18, 2026

Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety

Xuan Chen, Lu Yan, Ruqi Zhang, Xiangyu Zhang

The paper introduces SafeAudit, a meta-audit framework that systematically enumerates test cases and uses a quantitative metric to uncover significant residual unsafe behaviors in LLM agents that exis…

View →
cs.CRcs.AIcs.CLRecentApr 3, 2026

An Independent Safety Evaluation of Kimi K2.5

Zheng-Xin Yong, Parv Mahajan, Andy Wang, Ida Caspary +11 more

The paper conducts a preliminary safety evaluation of the open-weight LLM Kimi K2.5, finding that while it is highly capable, it exhibits concerning dual-use risks, particularly regarding CBRNE misuse…

View →
cs.CLcs.AIRecentMay 27, 2026

BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law

Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak +5 more

The paper introduces BenGER, a comprehensive benchmark for evaluating LLMs on German legal reasoning, demonstrating that closed-flagship models perform best and that human-AI co-creation significantly…

View →
cs.AIcs.CRRecentMay 11, 2026

Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

Qinghua Mao, Xi Lin, Jinze Gu, Jun Wu +2 more

The paper introduces EditRisk-Bench, a novel benchmark designed to systematically evaluate the safety risks and downstream reasoning corruption caused by malicious knowledge editing in large language…

View →
cs.CRcs.AIRecentMar 31, 2026

Security in LLM-as-a-Judge: A Comprehensive SoK

Aiman Al Masoud, Antony Anju, Marco Arazzi, Mert Cihangiroglu +5 more

This paper provides the first comprehensive Systematization of Knowledge (SoK) on the security aspects of LLM-as-a-Judge (LaaJ) systems, identifying key vulnerabilities and proposing a taxonomy for fu…

View →
cs.CRcs.CVRecentMar 18, 2026

Toward Reliable, Safe, and Secure LLMs for Scientific Applications

Saket Sanjeev Chaturvedi, Joshua Bergerson, Tanwi Mallick

This paper addresses the critical need for trustworthy LLMs in science by proposing a comprehensive, multi-layered defense framework and methodology to evaluate unique scientific vulnerabilities.

View →
cs.CLcs.AIRecentMay 28, 2026

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Volodymyr Ovcharov

The paper introduces Multi-Legal-Bench, a novel cross-jurisdictional benchmark evaluating LLMs on five standardized legal reasoning tasks across six diverse countries, demonstrating that cross-lingual…

View →
cs.AIRecentMay 27, 2026

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

Dasol Choi, Alex Kwon

The paper introduces 'brittle safety,' a failure mode where aligned language models fail to adapt their safety behavior when a situational context changes, and proposes state-aware validation to detec…

View →
cs.AIcs.CRRecentApr 1, 2026

UK AISI Alignment Evaluation Case-Study

Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz +1 more

The study evaluated four frontier AI models to assess their reliability in following safety research goals, finding no confirmed instances of sabotage but noting that certain models frequently refuse…

View →
cs.CRcs.AIRecentApr 10, 2026

Conflicts Make Large Reasoning Models Vulnerable to Attacks

Honghao Liu, Chengjin Xu, Xuhui Jiang, Cehao Yang +4 more

The paper demonstrates that confronting Large Reasoning Models (LRMs) with conflicting objectives, such as contradictory choices or conflicting alignment values, significantly increases their vulnerab…

View →
cs.AIcs.CRRecentMar 26, 2026

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li +4 more

This paper introduces a novel framework, the Reasoning Safety Monitor, to detect and prevent logical inconsistencies and adversarial manipulations within the internal reasoning steps of large language…

View →