~ similar to 2605.30632· 20 results
The paper proposes a framework to model moral reasoning as an ethical distribution (ethical pluralism) rather than a single binary judgment, achieving high classification accuracy by integrating norma…
The paper proposes a Multi-Phase Inference Mechanism (MIM) to formalize how diverse world models arise, reframing alignment as making heterogeneous representations mutually processable rather than for…
The paper introduces the Tacit Understanding Index (TUX) to measure non-explicit alignment between humans and LLMs, finding that this alignment is significantly structured by individual person-level t…
Yaoming Li, Guangxiang Zhao, Qilong Shi, Lin Sun +2 more
This paper synthesizes over 150 scattered studies and reports to provide the first comprehensive primer on post-training reasoning data, organizing the field around data objects, utility, construction…
Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig +3 more
This study investigates human-AI collaboration in question answering, finding that while collaboration is beneficial, humans make suboptimal decisions by both under-relying on correct AI suggestions a…
Huayi Lai, Shichao Song, Simin Niu, Hanyu Wang +4 more
The paper introduces RoleCDE, a novel benchmark that evaluates role-playing agents' ability to resolve conflicts between role-specific values and general alignment constraints, revealing a 'Role Value…
MOOSE-Copilot is a novel web-based framework that unifies scientific hypothesis discovery by formalizing human-AI interaction, significantly improving performance over autonomous LLM baselines.
BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.
This survey provides a comprehensive analysis of Reasoning Language Model (RLM) adoption across 28 scientific disciplines, revealing significant disparities in RLM maturity across different scientific…
Tianyi Zhou, Dongrui Liu, Leitao Yuan, Jing Shao +1 more
COLLEAGUE.SKILL introduces an automated system that distills heterogeneous traces of human expertise and role-specific knowledge into portable, inspectable, and usable AI skill packages.
Yisen Gao, Yixi Cai, Tianshi Zheng, Jiaxin Bai +1 more
HypoAgent is an agentic framework that enables interactive, multi-turn abductive hypothesis generation over knowledge graphs, achieving state-of-the-art performance by integrating specialized agents f…
Yang He, Xiao Ding, Bibo Cai, Yufei Zhang +4 more
DeepTool introduces a novel Process-Supervised Reinforcement Learning framework to enhance Tool-Integrated Reasoning by explicitly supervising and rewarding intermediate, interleaved deliberation step…
Benedetta Muscato, Beiduo Chen, Gizem Gezici, Barbara Plank +1 more
This paper proposes a unified evaluation framework for hate speech detection that systematically assesses model performance and explainability across various label and rationale representation spaces,…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
The paper introduces an ontology-driven framework, From Prompts to Context, to explicitly model and structure the often-opaque context of human-Generative AI collaborations, thereby improving traceabi…
Honghao Liu, Chengjin Xu, Xuhui Jiang, Cehao Yang +4 more
The paper demonstrates that confronting Large Reasoning Models (LRMs) with conflicting objectives, such as contradictory choices or conflicting alignment values, significantly increases their vulnerab…
The paper successfully demonstrates that Large Language Models (LLMs) can be induced to adopt coherent, human-like value structures, showing strong alignment with human psychological patterns.
Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang +1 more
The paper introduces HRBench, a unified and comprehensive evaluation framework for systematically benchmarking and comparing various thinking-mode switching strategies in hybrid-reasoning LLMs.
The paper introduces CRAB-Bench and RUSE, a rigorous evaluation framework that tests LLM agents on complex, interdependent tasks with realistic human user interactions, revealing significant performan…
This paper investigates the 'faithfulness gap' in LLM agents—the discrepancy between stated reasoning and actual action—by decomposing it into two opposing steps: reasoning-to-conclusion and conclusio…