Papers similar to 2605.29711 · 17 results

cs.CL · Recent · May 29, 2026

Preference-Aware Rubric Learning for Personalized Evaluation

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen +6 more

The paper introduces PARL, a framework that learns personalized evaluation rubrics directly from raw user interaction histories to accurately assess how well LLM outputs align with subjective, user-sp…

cs.CL, cs.AI · Recent · May 27, 2026

ChildEval: When large language models meet children's personalities

Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai +4 more

The paper introduces ChildEval, a large-scale benchmark designed to systematically evaluate how well large language models can infer and follow complex, child-specific preferences during long-context…

cs.CL, cs.IR · Recent · May 29, 2026

Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory

Han Zhang, Zihao Tang, Xin Yu, Xiao Liu +7 more

The paper introduces RHELM, a new benchmark designed to test LLMs' long-term memory by simulating realistic, complex, and evolving dialogues that integrate multiple heterogeneous data sources.

cs.CL · Recent · May 30, 2026

From Empathy to Personalized Empathy: Adapting Empathetic Strategies to Individual Users

Wuqiang Zheng, Chengbing Wang, Yilin Yang, Junyi Cheng +5 more

This paper introduces personalized empathy, a capability for LLMs to adapt empathetic strategies based on individual user history, and proposes PereGRM, a reward modeling framework that significantly…

cs.CL · Recent · Jun 1, 2026

Not What, But How: A Communicative Audit of LLM Response Framing

Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more

The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…

cs.CL, cs.AI · Recent · May 31, 2026

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

Jingjie Lin, Bingbing Wang, Zihan Wang, Zhengda Jin +3 more

The paper introduces RefMem-Bench, a new benchmark for measuring reflective memory in long-horizon dialogue, and proposes REMIND, a framework that significantly improves models' ability to synthesize…

cs.AI · Recent · May 28, 2026

Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

The study demonstrates that conditioning AI brand recommendations on a user's persona significantly alters the recommended product set, particularly for mid-market brands, and this effect is largest o…

cs.CL · Recent · May 29, 2026

RealityTest: How People Probe AI Identity and Whether Models Disclose It

Anna Gausen, Sarenne Wallbridge, Bessie O'Dell, Christopher Summerfield +1 more

RealityTest introduces a large-scale, multimodal, and multilingual benchmark using real-world human data to test how AI systems disclose their identity, finding that context and phrasing are more crit…

cs.AI, cs.LG · Recent · May 28, 2026

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more

Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.

cs.AI · Recent · May 30, 2026

NBQ: Next-Best-Question for Dynamic Profiling

Yimin Shi, Clarice Wang, Haixun Wang, Xiaokui Xiao

The paper proposes NBQ, a framework for dynamically selecting the next best question in a conversation to maximize information gain, and introduces QuickMatch to efficiently scale this process for rec…

cs.AI, cs.CL, cs.LG · Recent · May 29, 2026

A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

Atahan Karagoz

The paper proposes a persona-based evaluation framework that replaces monolithic AI benchmarks with structured cognitive profiles to capture diverse human perspectives, while also identifying the chal…

cs.CL, cs.AI · Recent · May 28, 2026

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

Rongsheng Zhang, Jiji Tang, Junnan Ren, Zuyi Bao +5 more

The paper introduces DynSess, a novel session-level framework that evaluates and optimizes role-playing agents by assessing long-horizon conversational quality, significantly outperforming existing tu…

cs.CL, cs.IR · Empirical · Recent · Jun 10, 2026

uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking

Simon Lupart, Kidist Amde Mekonnen, Zahra Abbasiantaeb, Mohammad Aliannejadi

This paper proposes a multi-turn retrieval-augmented generation pipeline for conversational systems across four domains.

cs.AI · Recent · May 28, 2026

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky

The paper introduces a diagnostic benchmark for selective Question Answering over conflicting, multi-source personal memory, demonstrating that specialized fusion resolvers outperform general LLMs, es…

cs.AI · Recent · Jun 1, 2026

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang +8 more

The paper introduces MCP-Persona, a novel benchmark designed to evaluate LLM agents' performance on real-world, personalized applications using the Model Context Protocol (MCP), revealing that current…

cs.CL · Recent · May 28, 2026

Auditing LLM Benchmarks with Item Response Theory

Sander Land, Daniel M. Bikel

The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…

cs.AI · Recent · May 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
