~ similar to 2606.02009· 20 results
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
The paper proposes a TEE-based architecture that enables external, auditable verification of AI-assisted grant evaluations without exposing the proprietary model, scoring logic, or intermediate reason…
Hang Li, Fedor Filippov, Yuling Lin, Pengfei He +5 more
This paper investigates the vulnerability of LLM-based automatic grading systems to prompt injection (PI) attacks, demonstrating that current systems are highly susceptible to manipulation that can le…
This paper introduces Swiss-Bench 003, an expanded evaluation framework assessing LLM reliability and adversarial security across eight dimensions using 808 Swiss-specific items, revealing that self-g…
The paper introduces FormInv, a measurement protocol that reveals significant semantic inconsistencies in existing mathematical reasoning benchmarks, showing that standard accuracy metrics fail to cap…
Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris +4 more
PReMISE introduces a framework to audit and improve the quality of rubrics used to guide LLM judges, demonstrating that it can significantly increase judge accuracy and reduce the exploitability of re…
LLM-FACETS introduces an open-source, privacy-preserving framework designed to enable non-technical domain experts and compliance officers to audit and evaluate the transparency and accountability of…
This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts…
The paper proposes a rigorous, fixed-budget, cluster-aware standard for LLM-as-a-judge evaluation of multi-hop RAG systems, demonstrating that current evaluation methods often overstate performance.
This paper proposes an empirical methodology to automate web application trustworthiness assessment by leveraging Large Language Models (LLMs) to verify adherence to secure coding practices, showing t…
Zelin Guan, Shengda Zhuo, Zeyan Li, Jinchun He +3 more
E-MIA introduces a novel, stealthy black-box membership inference attack that converts verifiable hard evidence within a candidate document into an objective, multi-part exam score to determine if the…
The paper proposes a trust-boundary architecture using Lean 4 to verify the deterministic structured computations surrounding LLM pipelines, providing verifiable certificates for high-stakes deploymen…
Sherzod Turaev, Mary John, Mamoun Awad, Nazar Zaki +1 more
The paper introduces a robust four-stage NLP framework that uses schema-constrained LLMs and ESCO vocabulary to accurately extract and align educational competencies with labor market demands, quantif…
The paper proposes decomposing the assessment of massive multilingual parallel data into separate parallelism and quality estimation components, concluding that no single universal metric is reliable…
The paper introduces PRAIB, a benchmark that demonstrates that LLM-generated peer reviews, while often verbose, systematically diverge from human norms by being less variable, positively biased, and f…
The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…
This paper investigates whether the gains from using a Popperian falsificationist prompt skill in large language models are due to the skill's content or its structure.
This paper investigates whether the gains from using a Popperian falsificationist prompt skill in large language models are due to the skill's content or its structure.
The paper introduces UA-Legal-Bench, a comprehensive Ukrainian legal reasoning benchmark built from a massive judicial corpus, demonstrating that LLM performance is highly task-dependent and that simp…
Zhaoyang Jiang, Xuanqi Peng, Fei Teng, Zhizhong Fu +4 more
The paper demonstrates that while distilling large language models for medical QA can significantly improve final answer accuracy, this gain often comes at the cost of factual accuracy and detailed re…