~ similar to 2606.02258· 20 results
Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more
The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…
Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel +2 more
The paper advocates for the establishment of Model Science, a systematic discipline that moves beyond simple benchmarking to deeply analyze AI models' internal workings and failure modes.
The paper introduces ProjectionBench, a novel benchmark that progressively discloses information to evaluate LLMs' ability to generate scientific hypotheses, demonstrating that advanced models like GP…
Jiawei Chen, Xiaofan Gui, Shikai Fang, Shengyu Tao +3 more
The paper introduces Battery-Sim-Agent, an LLM-based framework that reframes the difficult inverse problem of battery parameter estimation as a reasoning task, significantly outperforming traditional…
The paper introduces ProvMind, a provenance-grounded reasoning framework that significantly improves materials synthesis process optimization by accurately predicting optimal synthesis routes under ch…
Zhe Zhao, Haibin Wen, Yingcheng Wu, Jiaming Ma +9 more
The paper introduces Science Earth, a planet-scale scientific runtime that enables diverse, siloed AI capabilities to connect and collaborate dynamically, demonstrating that scientific discovery can b…
The paper proposes a category-theoretic framework for agentic AI that models scientific discovery not as answer generation, but as a verifiable transition and revision of the underlying representation…
Edward W. Staley, Tom Arbaugh, Michael Pekala, Alexander New +5 more
The paper proposes a novel hybrid framework that couples Large Language Models (LLMs) with simplified physics-based simulations to improve the synthesis planning of novel inorganic crystalline materia…
The paper introduces POIROT, a novel protocol that uses the agents within a multi-agent system itself to diagnose and detect failures, demonstrating superior performance over traditional evaluation me…
This survey provides a comprehensive analysis of Reasoning Language Model (RLM) adoption across 28 scientific disciplines, revealing significant disparities in RLM maturity across different scientific…
This paper addresses the critical need for trustworthy LLMs in science by proposing a comprehensive, multi-layered defense framework and methodology to evaluate unique scientific vulnerabilities.
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
The BEAMS initiative establishes comprehensive benchmarks and evaluates AI tools for modeling and simulation, finding that current AI tools excel at qualitative discussion tasks but struggle with comp…
The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…
This review surveys advanced techniques—including generative models, multimodal learning, and closed-loop workflows—for automated inverse materials design, enabling the targeted discovery of novel cry…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, often failing when the mis…
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, failing particularly when…
The paper demonstrates that for edge-native SLMs used in decentralized governance, simpler, intuitive reasoning (System 1) is significantly more robust and efficient than complex, iterative deliberati…
The paper introduces FormInv, a measurement protocol that reveals significant semantic inconsistencies in existing mathematical reasoning benchmarks, showing that standard accuracy metrics fail to cap…