~ similar to 2605.29018· 20 results
Han Zhang, Zihao Tang, Xin Yu, Xiao Liu +7 more
The paper introduces RHELM, a new benchmark designed to test LLMs' long-term memory by simulating realistic, complex, and evolving dialogues that integrate multiple heterogeneous data sources.
Liang Wang, Xinyi Mou, Xiaoyou Liu, Tiannan Wang +2 more
The paper proposes a hierarchical framework, PHF (Practice-Habitus-Field), inspired by Bourdieu's Theory of Practice, to improve LLM personalization by modeling user behaviors at three distinct levels…
This study quantifies the privacy risk of inferring sensitive personality traits from user interactions with LLM-based conversational agents, demonstrating that machine learning models can accurately…
This study investigated user reactions to inferred personal information from their own ChatGPT histories, finding that acceptability is governed by context-sensitive norms regarding generation, retent…
Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast +2 more
The paper introduces EUDAIMONIA, a new framework and benchmark for evaluating how well LLMs align with user welfare in social interactions, finding that even state-of-the-art models frequently violate…
This paper systematically measured web tracking across 20 popular AI chatbots, finding that a majority share both conversational content and user identity information with third parties.
Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai +4 more
The paper introduces ChildEval, a large-scale benchmark designed to systematically evaluate how well large language models can infer and follow complex, child-specific preferences during long-context…
Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more
The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…
This paper characterizes the risk of covert influence—where a sender's hidden behavioral payload transfers to a receiver through undetectable carriers—across three common LLM interfaces, demonstrating…
This paper systematically evaluates LLMs' ability to infer pragmatic meaning from non-verbal responses, finding that their accuracy significantly drops compared to verbal inputs.
This study analyzes global usage patterns of generative AI among early adopters, finding that usage varies significantly by country income, with schooling being the primary use in low-income countries…
The paper introduces a framework to quantitatively measure evolving agent behaviors (traits) by analyzing changes in their configuration text files, achieving high accuracy in classifying behavioral s…
Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more
The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…
Despite having access to web search, users' reliance on conversational AI for information remains high, driven primarily by pre-existing trust and influenced indirectly by the chatbot's conversational…
Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu +3 more
Persona prompting does not universally improve LLM performance; instead, it systematically trades increased expertise depth for reduced clarity, making multi-metric evaluation essential.
The study demonstrates that conditioning AI brand recommendations on a user's persona significantly alters the recommended product set, particularly for mid-market brands, and this effect is largest o…
Riju Marwah, Ritvik Garimella, Vishal Pallagani, Atishay Jain +2 more
The paper formalizes LLM degradation during long generation as 'cognitive fatigue' and introduces the Fatigue Index (FI), a measurable, model-agnostic diagnostic tool for real-time monitoring.
The paper introduces 'layered mutability,' a framework for analyzing how persistent self-modifying AI agents drift away from intended behavior due to the accumulation of locally reasonable, uncoordina…
The paper finds that while LLMs can detect distress regardless of delusional framing, they significantly fail to intervene safely when distress is intertwined with delusion, suggesting a critical reco…
The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…