~ similar to 2605.29446 · 20 results
Wanhao Liu, Jiaqing Xie, Qian Tan, Weida Wang +9 more
The paper introduces OmniMatBench, a comprehensive, human-calibrated multimodal reasoning benchmark covering 19 materials science subfields, revealing that current multimodal language models (MLLMs) h…
The paper introduces a novel padding method that leverages crystal symmetry to enhance the encoding of complex inorganic structures, significantly improving the generation of stable, novel materials.
This review surveys advanced techniques, including generative models, multimodal learning, and closed-loop workflows, for automated inverse materials design, enabling the targeted discovery of novel cry…
Edward W. Staley, Tom Arbaugh, Michael Pekala, Alexander New +5 more
The paper proposes a novel hybrid framework that couples Large Language Models (LLMs) with simplified physics-based simulations to improve the synthesis planning of novel inorganic crystalline materia…
The paper introduces ProjectionBench, a novel benchmark that progressively discloses information to evaluate LLMs' ability to generate scientific hypotheses, demonstrating that advanced models like GP…
Aravind Mandiga, Guoming Li, Jin Lu, Ismailcem Budak Arpinar +2 more
The paper introduces ProtStructQA, an executable benchmark that tests protein structural reasoning by requiring language models to generate measurable 3D coordinates, revealing a capability-dependent…
Shashwat Sourav, Tanjin He, Maria K. Y. Chan, Anubhav Jain +1 more
The paper introduces 'Matter to Mechanism,' a novel benchmark designed to rigorously evaluate AI co-scientists' ability to generate plausible, mechanism-grounded solution hypotheses for complex materi…
Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo +21 more
The paper introduces Dr. DocBench, a difficulty-aware, comprehensive benchmark designed to rigorously test expert-level and challenging document parsing capabilities for VLMs, demonstrating that curre…
The paper introduces NumLeak, a framework demonstrating that top-tier LLMs often exhibit high fidelity recall of specific public numeric benchmarks (like financial factors) due to memorization, which…
The paper introduces NumLeak, a framework demonstrating that top-tier LLMs often exhibit high fidelity recall of specific public numeric benchmarks, suggesting that their apparent skill may be due to…
The paper identifies a universal, statistically predictable distribution (Mandelbrot) governing LLM outputs, enabling a highly efficient, model-agnostic scoring primitive for provenance and quality as…
The paper introduces the Vector Network (VN), a novel recurrent architecture that replaces fixed weight matrices with reusable weight atoms, enabling superior compositional generalization by making st…
Sunisth Kumar, Xanh Ho, Tim Schopf, Andre Greiner-Petter +2 more
The paper explains the 'table-chart gap' in scientific claim verification by showing that multimodal LLMs successfully encode information from charts but fail to route it to the final prediction layer…
The paper introduces CalArena, a large-scale, standardized benchmark covering nearly 2000 experiments to comprehensively evaluate post-hoc calibration methods, finding that smooth calibration function…
The paper demonstrates that clinical vision-language models (VLMs) pose a significant privacy risk by allowing de-identified images to be re-linked to original reports, and proposes a targeted differe…
This study empirically benchmarks classical and quantum machine learning models for image recognition, finding that while quantum models offer superior accuracy and resource efficiency at high dimensi…
The paper introduces Chunk-Level Guided Generation, a training-free method that uses an off-the-shelf large language model (LLM) as a process scorer to guide small model generation, achieving performa…
Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su +1 more
The paper introduces Ryze, an automated system that synthesizes evidence-enriched Question-Answering (QA) pairs from raw biomedical papers, resulting in a specialized VLM (BioVLM-8B) that significantl…
The paper identifies a fundamental mismatch between standard pairwise ranking metrics (like AP and FPR-95) and the true assignment objective in multi-view object association, proposing a Sinkhorn-base…