Similar to 2605.29256 · 20 results
Huayi Lai, Shichao Song, Simin Niu, Hanyu Wang +4 more
The paper introduces RoleCDE, a novel benchmark that evaluates role-playing agents' ability to resolve conflicts between role-specific values and general alignment constraints, revealing a 'Role Value…
Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai +4 more
The paper introduces ChildEval, a large-scale benchmark designed to systematically evaluate how well large language models can infer and follow complex, child-specific preferences during long-context…
Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu +12 more
S-SPPO introduces a dual-space semantic calibration framework to stabilize Self-Play Preference Optimization (SPPO), preventing policy degeneration when preference oracles assign overly confident wins…
The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…
Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei +9 more
The paper introduces MMG2Skill, a closed-loop framework that converts noisy, human-oriented web guides into editable, executable skills, significantly improving agent performance across diverse tasks.
Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng +49 more
The paper introduces Mindgames, a comprehensive multi-game arena for evaluating LLM agents' sustained social and strategic reasoning, demonstrating that current evaluations are limited by structural s…
The paper proposes a persona-based evaluation framework that replaces monolithic AI benchmarks with structured cognitive profiles to capture diverse human perspectives, while also identifying the chal…
The paper introduces an adaptive interview framework to gather rich persona context, demonstrating that LLMs improve decision alignment in moral dilemmas only when they selectively ground their decisi…
Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen +6 more
The paper introduces PARL, a framework that learns personalized evaluation rubrics directly from raw user interaction histories to accurately assess how well LLM outputs align with subjective, user-sp…
Tao Feng, Tianyang Luo, Jingjun Xu, Zhigang Hua +4 more
ExpWeaver introduces a novel framework for LLM agents to learn from past experiences using latent retrieval-augmented generation, achieving state-of-the-art performance while significantly improving t…
Zhefan Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang +2 more
The paper introduces PersTurnBench, a novel benchmark and evaluator for assessing personalized user conversation satisfaction at specific turns, addressing the limitation of generic response quality m…
SCOPE introduces a data-free self-play framework that co-evolves a task-generating Challenger and a document-answering Solver, significantly improving open-ended performance on language models without…
The paper introduces HERO'S JOURNEY, a benchmark for testing complex rule induction in text games, finding that while LLMs show limited rule induction ability, procedural tasks remain a significant ch…
Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang +5 more
The paper introduces Canonical-Context On-Policy Distillation (CCOPD) to improve multi-turn language model performance by mitigating 'self-anchored drift,' ensuring consistent answers regardless of wh…
Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang +7 more
The paper introduces Latent Reward Steering (LRS), an adaptive inference-time framework that implicitly improves the reasoning ability of LLMs by guiding the model's internal latent states based on a…
The paper introduces Momento, a new benchmark that evaluates agentic AI's ability to maintain state and reason across multiple, disconnected sessions, revealing that current agents struggle with integ…
Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang +6 more
The paper introduces MiraBench, a new benchmark that evaluates the action-conditioned reliability of robotic world models, finding that visual fidelity is insufficient and that optimism bias is a perv…
Yixu Huang, Bo Li, Na Li, Zhe Wang +7 more
The paper proposes using GUI agents, both as objective evaluators and subjective playtesters, to significantly improve the generation of playable games from prompts, demonstrating a 66.8% rubric pass-…
Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang +5 more
SelSkill introduces a dual-granularity preference learning framework that treats skill use as a 'skill-or-skip' decision, significantly improving agent performance and execution precision in complex a…
Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang +9 more
The paper proposes Skill-RM, a unified framework that treats reward modeling as an agentic task to consistently integrate diverse evaluation criteria, achieving superior performance over traditional m…