~ similar to 2606.01789· 20 results
The paper introduces novel compatibility and incompatibility scores to evaluate collections of bivariate causal statements, providing a way to assess causal claims when ground truth is unavailable.
Zizhen Deng, Jiaru Zhang, Rui Ding, Huang Bojun +4 more
The paper proposes Test-Time Training for Supervised Causal Learning (TTT-SCL), a novel framework that dynamically generates training data aligned with specific test instances to significantly improve…
This paper evaluates the causal reasoning abilities of large language models and finds that they rely heavily on lexical pattern matching rather than structural reasoning.
The paper introduces ProjectionBench, a novel benchmark that progressively discloses information to evaluate LLMs' ability to generate scientific hypotheses, demonstrating that advanced models like GP…
Giuliano Martinelli, Piriyakorn Piriyatamwong, Abelardo Carlos Martinez Lorenzo, Jasmin Baier +6 more
The paper introduces Query2Effect, a large-scale benchmark, and a two-step framework to predict causal effect sizes from natural language queries, showing that structured representation significantly…
This paper introduces an entropy-based method to generate multiple plausible causal maps (atlases) that accurately reflect the inherent structural ambiguity in complex systems, moving beyond single, o…
Haoxiang Cheng, Yunfei Wang, Chao Chen, Kewei Cheng +4 more
The paper proposes GRiD, a novel framework that uses a two-phase training strategy (supervised pre-training and RL fine-tuning) to discover complex, graph-like rules for knowledge graph reasoning, ove…
Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong +7 more
The paper argues that current embodied planning benchmarks prioritize superficial language prediction over true physical reasoning, introducing new benchmarks and a large-scale dataset to demonstrate…
Zihan Chen, Yiming Zhang, Wenxiang Geng, Zenghui Ding +1 more
The paper theoretically explains that optimizing LLMs solely on outcomes leads to brittle reasoning (Reward-Induced Manifold Collapse) by favoring low-complexity shortcuts, and proposes process-based…
The paper formalizes the concept of a causal pathway for rare events, showing that testable implications can be derived solely from this pathway abstraction, simplifying complex causal modeling.
The paper introduces Influence-Guided Symbolic Regression (IGSR), a novel framework that uses granular influence scores to guide LLMs in efficiently searching for and discovering complex mathematical…
This paper introduces topological-geometrical metrics to estimate structural causal effects that are missed by traditional mean-based methods, proposing a new concept called topological ignorability.
The paper challenges the conclusion that LLMs lack reasoning by demonstrating that reported performance drops on GSM-Symbolic are often statistically weak and partially attributable to dataset biases,…
Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang +1 more
The paper introduces CITETRACE, a large-scale dataset and evaluation framework that systematically measures structural citation failures in search-augmented LLMs, revealing a pattern called Verified M…
This survey provides a comprehensive analysis of Reasoning Language Model (RLM) adoption across 28 scientific disciplines, revealing significant disparities in RLM maturity across different scientific…
The paper introduces the Causal Sensitivity Score (CSS), an interventional metric that reveals that standard coverage-based evaluations fail to detect critical responsiveness deficits in clinical LLMs…
Shashwat Sourav, Tanjin. He, Maria K. Y. Chan, Anubhav Jain +1 more
The paper introduces 'Matter to Mechanism,' a novel benchmark designed to rigorously evaluate AI co-scientists' ability to generate plausible, mechanism-grounded solution hypotheses for complex materi…
Taein Kim, David Jiang, Yuepeng Hu, Yuqi Jia +1 more
The paper presents a large-scale study demonstrating that tool cloning is a pervasive and severe source of hidden duplication in agent-tool ecosystems, necessitating changes in how tool diversity is m…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
The paper demonstrates that models can acquire 'evaluation meta-knowledge' from training data describing evaluation practices, leading to inflated safety benchmark performance that is independent of e…