ArXivCSExplorer
Built by Teycir Ben Soltane

Similar to 2606.02258 · 20 results

cs.AI · Recent · May 28, 2026

OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields

Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more

The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…

cs.AI · Recent · May 31, 2026

The Case for Model Science: Verify, Explore, Steer, Refine

Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel +2 more

The paper advocates for the establishment of Model Science, a systematic discipline that moves beyond simple benchmarking to deeply analyze AI models' internal workings and failure modes.

cs.AI · Recent · May 28, 2026

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

A. J. Lew, Y. Cao, M. J. Buehler

The paper introduces ProjectionBench, a novel benchmark that progressively discloses information to evaluate LLMs' ability to generate scientific hypotheses, demonstrating that advanced models like GP…

cs.AI · Recent · May 28, 2026

Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

Jiawei Chen, Xiaofan Gui, Shikai Fang, Shengyu Tao +3 more

The paper introduces Battery-Sim-Agent, an LLM-based framework that reframes the difficult inverse problem of battery parameter estimation as a reasoning task, significantly outperforming traditional…

cs.AI, cs.LG · Recent · May 27, 2026

ProvMind: Provenance-grounded reasoning for materials synthesis

Yiming Zhang, Ryo Tamura, Koji Tsuda

The paper introduces ProvMind, a provenance-grounded reasoning framework that significantly improves materials synthesis process optimization by accurately predicting optimal synthesis routes under ch…

cs.AI · Recent · May 31, 2026

Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery

Zhe Zhao, Haibin Wen, Yingcheng Wu, Jiaming Ma +9 more

The paper introduces Science Earth, a planet-scale scientific runtime that enables diverse, siloed AI capabilities to connect and collaborate dynamically, demonstrating that scientific discovery can b…

cs.AI, cond-mat.mtrl-sci, cs.CL · Recent · May 31, 2026

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

Fiona Y. Wang, Markus J. Buehler

The paper proposes a category-theoretic framework for agentic AI that models scientific discovery not as answer generation, but as a verifiable transition and revision of the underlying representation…

cs.AI, cond-mat.mtrl-sci · Recent · May 29, 2026

Coupling Language Models with Physics-based Simulation for Synthesis of Inorganic Materials

Edward W. Staley, Tom Arbaugh, Michael Pekala, Alexander New +5 more

The paper proposes a novel hybrid framework that couples Large Language Models (LLMs) with simplified physics-based simulations to improve the synthesis planning of novel inorganic crystalline materia…

cs.AI · Recent · Jun 1, 2026

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García +4 more

The paper introduces POIROT, a novel protocol that uses the agents within a multi-agent system itself to diagnose and detect failures, demonstrating superior performance over traditional evaluation me…

cs.AI · Recent · May 31, 2026

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

Teddy Ferdinan, Bartłomiej Koptyra, Mikołaj Langner, Tomasz Adamczyk +41 more

This survey provides a comprehensive analysis of Reasoning Language Model (RLM) adoption across 28 scientific disciplines, revealing significant disparities in RLM maturity across different scientific…

cs.CR, cs.CV · Recent · Mar 18, 2026

Toward Reliable, Safe, and Secure LLMs for Scientific Applications

Saket Sanjeev Chaturvedi, Joshua Bergerson, Tanwi Mallick

This paper addresses the critical need for trustworthy LLMs in science by proposing a comprehensive, multi-layered defense framework and methodology to evaluate unique scientific vulnerabilities.

cs.AI · Recent · May 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…

cs.AI · Recent · May 27, 2026

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

Sara Metcalf, William Schoenberg

The BEAMS initiative establishes comprehensive benchmarks and evaluates AI tools for modeling and simulation, finding that current AI tools excel at qualitative discussion tasks but struggle with comp…

cs.AI, cs.LG, cs.SE · Recent · May 27, 2026

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt +2 more

The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…

cond-mat.mtrl-sci, cs.ET, cs.LG · Recent · Jun 1, 2026

Towards Automated Discovery: A Review of Generative Models, Multimodal Learning and Closed-Loop Workflows in Inverse Materials Design

Anand Babu, Rogério Almeida Gouvêa, Gian-Marco Rignanese

This review surveys advanced techniques—including generative models, multimodal learning, and closed-loop workflows—for automated inverse materials design, enabling the targeted discovery of novel cry…

cs.AI · Recent · May 30, 2026

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more

The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…

cs.CR, cs.AI · Recent · May 28, 2026

SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing

Almene De Meran Meguimtsop, Maria Leonor Pacheco, Daniel E. Acuna

The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, often failing when the mis…

cs.CR, cs.AI · Recent · May 28, 2026

SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing

Almene De Meran Meguimtsop, Maria Leonor Pacheco, Daniel E. Acuna

The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, failing particularly when…

cs.AI, cs.CL, cs.CR · Recent · Apr 18, 2026

The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

Syed Muhammad Aqdas Rizvi

The paper demonstrates that for edge-native SLMs used in decentralized governance, simpler, intuitive reasoning (System 1) is significantly more robust and efficient than complex, iterative deliberati…

cs.LG, cs.AI · Recent · May 27, 2026

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

Nishal Thomas, Noel Thomas

The paper introduces FormInv, a measurement protocol that reveals significant semantic inconsistencies in existing mathematical reasoning benchmarks, showing that standard accuracy metrics fail to cap…
