~ similar to 2605.28707· 20 results
The study finds that institutional experience may leave detectable, yet suppressible, traces in language that shape Large Language Model moral reasoning, particularly when institutional stakes are amb…
The paper argues that current AI systems create an 'illusion of opting,' giving the appearance of meaningful choice while eroding genuine agency, and proposes new ethical frameworks to address this.
The paper proposes a Multi-Phase Inference Mechanism (MIM) to formalize how diverse world models arise, reframing alignment as making heterogeneous representations mutually processable rather than for…
This paper simulates the Argumentative Theory of Reasoning (ATR) using multi-agent debate among LLMs, demonstrating that collective adversarial discourse significantly enhances truth-seeking performan…
Huayi Lai, Shichao Song, Simin Niu, Hanyu Wang +4 more
The paper introduces RoleCDE, a novel benchmark that evaluates role-playing agents' ability to resolve conflicts between role-specific values and general alignment constraints, revealing a 'Role Value…
The paper introduces Rationalize, a role-pair framework that facilitates shared semantic reasoning between humans and AI models to achieve deep alignment of intent and action.
The paper proposes a persona-based evaluation framework that replaces monolithic AI benchmarks with structured cognitive profiles to capture diverse human perspectives, while also identifying the chal…
Yaoming Li, Guangxiang Zhao, Qilong Shi, Lin Sun +2 more
This paper synthesizes over 150 scattered studies and reports to provide the first comprehensive primer on post-training reasoning data, organizing the field around data objects, utility, construction…
Honghao Liu, Chengjin Xu, Xuhui Jiang, Cehao Yang +4 more
The paper demonstrates that confronting Large Reasoning Models (LRMs) with conflicting objectives, such as contradictory choices or conflicting alignment values, significantly increases their vulnerab…
Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank +1 more
This paper proposes a unified evaluation framework for hate speech detection that systematically assesses model performance and explainability across various label and rationale representation spaces,…
The paper introduces an adaptive interview framework to gather rich persona context, demonstrating that LLMs improve decision alignment in moral dilemmas only when they selectively ground their decisi…
The paper introduces the Triangulated Preference Shift score, an automated, curation-free metric to quantify systematic lexical biases introduced into Large Language Models during the preference-learn…
The paper introduces a novel framework to quantify faithful confidence expression (FC) in Large Reasoning Models (LRMs), finding that FC remains a significant and challenging reliability target for th…
The paper introduces an outer-loop AI agent that autonomously redesigns LLM policy-synthesis pipelines for multi-agent social dilemmas, demonstrating that the optimal pipeline structure depends critic…
COFT is a training-free decoding method that significantly reduces societal biases in large language model chain-of-thought reasoning by applying token-level fairness control at decode time.
Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng +49 more
The paper introduces Mindgames, a comprehensive multi-game arena for evaluating LLM agents' sustained social and strategic reasoning, demonstrating that current evaluations are limited by structural s…
The paper argues that traditional identity-based reputation mechanisms are structurally inapplicable to language model agents because their mutable, modular nature makes them ontologically dissociativ…
The paper successfully demonstrates that Large Language Models (LLMs) can be induced to adopt coherent, human-like value structures, showing strong alignment with human psychological patterns.
The paper introduces Calibrated Collective Oversight (CCO), a novel framework that uses aggregated auxiliary scoring functions and Conformal Decision Theory to provide statistically guaranteed, scalab…