Papers similar to 2605.29256

20 results

cs.AI · Recent · Jun 1, 2026

RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

Huayi Lai, Shichao Song, Simin Niu, Hanyu Wang +4 more

The paper introduces RoleCDE, a novel benchmark that evaluates role-playing agents' ability to resolve conflicts between role-specific values and general alignment constraints, revealing a 'Role Value…

cs.CL, cs.AI · Recent · May 27, 2026

ChildEval: When large language models meet children's personalities

Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai +4 more

The paper introduces ChildEval, a large-scale benchmark designed to systematically evaluate how well large language models can infer and follow complex, child-specific preferences during long-context…

cs.AI, cs.LG · Recent · Jun 1, 2026

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu +12 more

S-SPPO introduces a dual-space semantic calibration framework to stabilize Self-Play Preference Optimization (SPPO), preventing policy degeneration when preference oracles assign overly confident wins…

cs.CL · Recent · Jun 1, 2026

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

Danqing Wang, Akshay Sivaraman, Lei Li

The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…

cs.CL, cs.AI, cs.LG · Recent · Jun 1, 2026

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei +9 more

The paper introduces MMG2Skill, a closed-loop framework that converts noisy, human-oriented web guides into editable, executable skills, significantly improving agent performance across diverse tasks.

cs.AI · Recent · May 28, 2026

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng +49 more

The paper introduces Mindgames, a comprehensive multi-game arena for evaluating LLM agents' sustained social and strategic reasoning, demonstrating that current evaluations are limited by structural s…

cs.AI, cs.CL, cs.LG · Recent · May 29, 2026

A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

Atahan Karagoz

The paper proposes a persona-based evaluation framework that replaces monolithic AI benchmarks with structured cognitive profiles to capture diverse human perspectives, while also identifying the chal…

cs.CL, cs.AI · Recent · May 28, 2026

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

Ruoxi Su, Yuhan Liu, Jingyu Hu

The paper introduces an adaptive interview framework to gather rich persona context, demonstrating that LLMs improve decision alignment in moral dilemmas only when they selectively ground their decisi…

cs.CL · Recent · May 29, 2026

Preference-Aware Rubric Learning for Personalized Evaluation

Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen +6 more

The paper introduces PARL, a framework that learns personalized evaluation rubrics directly from raw user interaction histories to accurately assess how well LLM outputs align with subjective, user-sp…

cs.CL · Recent · May 31, 2026

ExpWeaver: LLM Agents Learn from Experience via Latent RAG

Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua +4 more

ExpWeaver introduces a novel framework for LLM agents to learn from past experiences using latent retrieval-augmented generation, achieving state-of-the-art performance while significantly improving t…

cs.CL, cs.AI · Recent · May 28, 2026

Personalized Turn-Level User Conversation Satisfaction Benchmark

Zhefan Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang +2 more

The paper introduces PersTurnBench, a novel benchmark and evaluator for assessing personalized user conversation satisfaction at specific turns, addressing the limitation of generic response quality m…

cs.CL · Recent · May 29, 2026

SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

Wai-Chung Kwan, Aryo Pradipta Gema, Joshua Ong Jun Leang, Pasquale Minervini

SCOPE introduces a data-free self-play framework that co-evolves a task-generating Challenger and a document-answering Solver, significantly improving open-ended performance on language models without…

cs.CL · Recent · Jun 1, 2026

HERO'S JOURNEY: Testing Complex Rule Induction with Text Games

Anshun Asher Zheng, Kanishka Misra, David I. Beaver, Junyi Jessy Li

The paper introduces HERO'S JOURNEY, a benchmark for testing complex rule induction in text games, finding that while LLMs show limited rule induction ability, procedural tasks remain a significant ch…

cs.CL, cs.AI · Recent · May 28, 2026

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more

The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…

cs.AI · Recent · May 30, 2026

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang +7 more

The paper introduces Latent Reward Steering (LRS), an adaptive inference-time framework that implicitly improves the reasoning ability of LLMs by guiding the model's internal latent states based on a…

cs.CL · Recent · May 30, 2026

Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agentic Conversations

Adril Putra Merin, David Anugraha, Ayu Purwarianti, Genta Indra Winata

The paper introduces Momento, a new benchmark that evaluates agentic AI's ability to maintain state and reason across multiple, disconnected sessions, revealing that current agents struggle with integ…

cs.AI · Recent · May 28, 2026

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang +6 more

The paper introduces MiraBench, a new benchmark that evaluates the action-conditioned reliability of robotic world models, finding that visual fidelity is insufficient and that optimism bias is a perv…

cs.SE, cs.AI, cs.CV · Recent · May 27, 2026

GUI Agents for Continual Game Generation

Yixu Huang, Bo Li, Na Li, Zhe Wang +7 more

The paper proposes using GUI agents, both as objective evaluators and subjective playtesters, to significantly improve the generation of playable games from prompts, demonstrating a 66.8% rubric pass-…

cs.CL, cs.AI · Recent · May 30, 2026

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang +5 more

SelSkill introduces a dual-granularity preference learning framework that treats skill use as a 'skill-or-skip' decision, significantly improving agent performance and execution precision in complex a…

cs.LG, cs.CL · Recent · Jun 2, 2026

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang +9 more

The paper proposes Skill-RM, a unified framework that treats reward modeling as an agentic task to consistently integrate diverse evaluation criteria, achieving superior performance over traditional m…
