Papers similar to 2606.01755

~ similar to 2606.01755 · 20 results

cs.LG, cs.CL, cs.GT · Recent · May 31, 2026

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

Hamidreza Hasani Balyani, Seyed Pouyan Mousavi Davoudi, Alireza Amiri-Margavi, Amin Gholami Davodi +1 more

The paper establishes a benchmark based on the cheap-talk model to test LLM honesty when their incentives conflict with the user's, finding that models consistently over-reveal information regardless…

cs.CL · Recent · May 29, 2026

Preference-Aware Rubric Learning for Personalized Evaluation

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen +6 more

The paper introduces PARL, a framework that learns personalized evaluation rubrics directly from raw user interaction histories to accurately assess how well LLM outputs align with subjective, user-sp…

cs.LG, cs.AI, cs.CR · Recent · May 11, 2026

Leveraging RAG for Training-Free Alignment of LLMs

John T. Halloran

The paper introduces RAG-Pref, a novel, training-free Retrieval Augmented Generation (RAG) method for preference alignment that significantly improves LLM refusal guardrails against agentic attacks wi…

cs.LG, cs.AI, cs.DC · Recent · May 29, 2026

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

Jabin Koo, Hoyoung Kim, Minwoo Jang, Jungseul Ok

The paper proposes FedVPA-GP, a federated learning framework that uses a Gumbel-Softmax prior and orthogonal loss to personalize LLM alignment by disentangling conflicting user preferences while maint…

cs.CL · Recent · Jun 1, 2026

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

Danqing Wang, Akshay Sivaraman, Lei Li

The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…

cs.CR, cs.AI, cs.LG · Recent · May 29, 2026

Differentially Private Preference Data Synthesis for Large Language Model Alignment

Fengyu Gao, Jing Yang

The paper introduces DPPrefSyn, a novel algorithm that generates differentially private synthetic preference data, enabling privacy-preserving alignment of large language models.

cs.CL, cs.AI · Recent · May 29, 2026

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning

Xiaoyang Ming, Jose Hernandez, Thomas Stephan Juzek

The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…

cs.CL · Recent · May 28, 2026

Auditing LLM Benchmarks with Item Response Theory

Sander Land, Daniel M. Bikel

The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…

cs.CL · Recent · Jun 1, 2026

Beyond Isolated Behaviors: Hierarchical User Modeling for LLM Personalization

Liang Wang, Xinyi Mou, Xiaoyou Liu, Tiannan Wang +2 more

The paper proposes a hierarchical framework, PHF (Practice-Habitus-Field), inspired by Bourdieu's Theory of Practice, to improve LLM personalization by modeling user behaviors at three distinct levels…

cs.AI, cs.LG · Recent · Jun 1, 2026

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu +12 more

S-SPPO introduces a dual-space semantic calibration framework to stabilize Self-Play Preference Optimization (SPPO), preventing policy degeneration when preference oracles assign overly confident wins…

cs.CL, cs.AI · Recent · Jun 1, 2026

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

Jiaming Qu, Lucheng Fu, Yibo Hu

The study finds that in multi-agent systems, peer agreement makes LLMs more susceptible to adopting misleading answers than to correcting genuinely wrong ones, suggesting a need for verification over…

cs.CL, cs.AI · Recent · May 27, 2026

ChildEval: When large language models meet children's personalities

Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai +4 more

The paper introduces ChildEval, a large-scale benchmark designed to systematically evaluate how well large language models can infer and follow complex, child-specific preferences during long-context…

cs.CR, cs.CL · Recent · Apr 9, 2026

The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

Rui Zhang, Hongwei Li, Yun Shen, Xinyue Shen +5 more

The paper investigates how various fine-tuning methods can be used both to intentionally misalign and subsequently realign large language models (LLMs), revealing distinct strengths for attack and def…

cs.CL, cs.AI · Recent · Jun 2, 2026

Quantifying Faithful Confidence Expression in Large Reasoning Models

Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

The paper introduces a novel framework to quantify faithful confidence expression (FC) in Large Reasoning Models (LRMs), finding that FC remains a significant and challenging reliability target for th…

cs.HC, cs.AI, cs.CL · Recent · May 29, 2026

TUX: Measuring Human-AI Tacit Understanding

Yueshen Li, Hanyi Min, Vedant Das Swain, Koustuv Saha

The paper introduces the Tacit Understanding Index (TUX) to measure non-explicit alignment between humans and LLMs, finding that this alignment is significantly structured by individual person-level t…

cs.IR, cs.AI, cs.CL · Recent · Jun 2, 2026

Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation

Yuecheng Li, Zeyu Song, Jing Yao, Chi Lu +2 more

Taiji is a novel LLM-as-Enhancer framework that optimizes recommender systems by addressing the challenges of generating high-quality reasoning data and balancing semantic and ID-based rewards.

cs.AI, cs.CL · Recent · May 28, 2026

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

Asaf Yehudai, Naama Rozen, Ariel Gera

The paper successfully demonstrates that Large Language Models (LLMs) can be induced to adopt coherent, human-like value structures, showing strong alignment with human psychological patterns.

cs.AI · Recent · May 27, 2026

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho +1 more

The paper proposes Faithful Agentic XAI (FAX), a verification framework that explicitly checks LLM-generated explanations against model behavior, significantly improving explanation faithfulness on a…

cs.LG, cs.AI · Recent · May 28, 2026

In-Context Reward Adaptation for Robust Preference Modeling

Zhenyu Sun, Zheng Xu, Ermin Wei

The paper proposes In-Context Reward Adaptation, a transformer-based framework that uses in-context learning and auxiliary signals (like human response time) to robustly model diverse and unseen human…
