~ similar to 2605.30273· 20 results
This paper evaluates multiple LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for generating privacy-safe, high-quality synthetic mental health reports, demonstrating their effectiveness in expanding…
The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…
Drishti Goel, Agam Goyal, Veda Duddu, Olivia Pal +7 more
This study demonstrates that an LLM's assigned support role (e.g., Inform, Coach, Relate) significantly alters its safety profile and the types of risks it presents when assisting users in complex car…
Giulia Pucci, Emily Hemendinger, Ruizhe Li, Gavin Abercrombie +2 more
This paper systematically evaluates how LLMs uncritically adapt to potentially dangerous user prompts related to eating disorders, finding that specific linguistic cues significantly increase the like…
The paper finds that while LLMs can detect distress regardless of delusional framing, they significantly fail to intervene safely when distress is intertwined with delusion, suggesting a critical reco…
The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient method using soft prompts to improve LLM reliability by simultaneously suppressing hallucinations, encouraging…
Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more
The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…
This paper shows that large language models can automate reproducibility assessments in the social and behavioral sciences.
Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast +2 more
The paper introduces EUDAIMONIA, a new framework and benchmark for evaluating how well LLMs align with user welfare in social interactions, finding that even state-of-the-art models frequently violate…
The paper introduces Med-HEAL, a comprehensive framework and dataset for systematically identifying and mitigating hallucinations in medical LLMs, demonstrating that a self-critique pipeline significa…
Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more
Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.
This study investigated user reactions to inferred personal information from their own ChatGPT histories, finding that acceptability is governed by context-sensitive norms regarding generation, retent…
Zihang Fu, Fanxiao Li, Jianyang Gu, Haonan Wang +4 more
The paper introduces EvoNote, a self-evolving agentic framework that significantly improves the generation of evidence-grounded health community notes by utilizing an accumulated memory of past misinf…
Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai +2 more
The paper introduces MIRA, a bilingual benchmark that reveals that LLMs tend to dilute or omit critical medical information when responding to prompts from users with low health literacy, a pattern te…
The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
LLM-FACETS introduces an open-source, privacy-preserving framework designed to enable non-technical domain experts and compliance officers to audit and evaluate the transparency and accountability of…
The paper introduces CYBERMASKQA, a novel privacy-aware benchmark designed to evaluate Large Language Models' ability to perform accurate cybersecurity question answering while simultaneously preservi…
The paper introduces PrivacySIM, an evaluation suite that benchmarks how well LLMs can simulate individual user privacy decisions based on persona attributes, finding that while conditioning improves…
The paper proposes an Interpretive Audit Pipeline to evaluate LLMs for public comment analysis, arguing that measuring inter-model disagreement is crucial because standard accuracy metrics fail to det…
This paper evaluates the reliability of using Large Language Models (LLMs) as automated judges to assess the quality of other LLMs, finding a high correlation with human judgment when suitable prompts…