20 results for “Reproducibility”
CS papers onlyHybrid search: Keyword + semantic, ranked by combined score.ⓘ
Want pure semantic search? Try claim verification →
This paper shows that large language models can automate reproducibility assessments in the social and behavioral sciences.
The paper introduces Croissant Tasks, a declarative metadata format designed to achieve conceptual reproducibility in machine learning by abstracting problem specifications from brittle implementation…
The paper introduces the Artificial Intelligence Bill of Materials (AIBOM) schema to provide verifiable provenance and lifecycle assurance for complex AI systems, achieving high fidelity in reproducib…
The paper introduces NICE, a declarative framework that uses NixOS to build and automatically validate reproducible environments for demonstrating software vulnerabilities (CVEs), thereby improving th…
This paper analyzes a large corpus of research artifacts, finding that many contain insecure code patterns, and proposes SAFE, a novel framework for context-aware security assessment of these artifact…
The paper argues that computer science conferences must mandate nonrepudiable, tamper-evident attestations of experimental results to ensure reported numbers accurately reflect executed computations.
The paper proposes using Trusted-Execution Environments (TEEs) to create a scalable, privacy-preserving system where authors can submit cryptographic proofs of correct research replication, thereby ad…
The paper introduces FormInv, a measurement protocol that reveals significant semantic inconsistencies in existing mathematical reasoning benchmarks, showing that standard accuracy metrics fail to cap…
Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su +1 more
The paper introduces Ryze, an automated system that synthesizes evidence-enriched Question-Answering (QA) pairs from raw biomedical papers, resulting in a specialized VLM (BioVLM-8B) that significantl…
The paper proposes Self-Trained Verification (STV), a novel method that trains verifiers to catch self-generated errors by leveraging reference solutions, significantly boosting performance in both te…
The paper introduces SpatialBench-Long, a comprehensive benchmark designed to test AI agents' ability to perform end-to-end scientific reasoning and derive biological claims from complex, raw spatial…
Andreas Müller, Denis Lukovnikov, Shingo Kodama, Minh Pham +4 more
This paper analyzes existing watermarking schemes for autoregressive image generators and demonstrates that they are vulnerable to various removal and forgery attacks, suggesting they are unreliable f…
The paper introduces ProjectionBench, a novel benchmark that progressively discloses information to evaluate LLMs' ability to generate scientific hypotheses, demonstrating that advanced models like GP…
Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel +2 more
The paper advocates for the establishment of Model Science, a systematic discipline that moves beyond simple benchmarking to deeply analyze AI models' internal workings and failure modes.
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, often failing when the mis…
The paper introduces SciIntBench, an adversarial benchmark that reveals that LLMs' adherence to research integrity norms is highly sensitive to how the misconduct is framed, failing particularly when…
The paper introduces a robust, four-mechanism LLM pipeline that generates auditable, evidence-grounded structured trait records for hundreds of thousands of diverse species across multiple taxa.
The paper demonstrates that off-the-shelf image diffusion models, like Stable Diffusion, can be repurposed to generate synthetic structured data, posing a threat of ground truth drift in closed eviden…
The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…