cs.CL

HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

May 31, 2026

AI Summary (gemma4:e4b)

HypothesisMed introduces an inference-time pipeline for biomedical question answering that improves reliability and structured output generation by fusing answers from multiple prompting strategies and diagnosing the answer space.

Abstract

Biomedical question answering with large language models is commonly evaluated by answer accuracy, but accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct prompting, chain-of-thought prompting, HypothesisMed-v3 prompting, and answer fusion. Fusion selects the final answer, while HypothesisMed-v3 supplies SPACE labels and confidence information. A SPACE label marks the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale the evaluation of Qwen2.5-7B and Phi-4-mini to 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192. For Qwen2.5-7B, chain-of-thought remains slightly more accurate than fusion, but fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows that answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.
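
To make the separation between accuracy, coverage, and false commitment concrete, the following is a minimal Python sketch of how such a pipeline could be scored. The abstract does not specify the fusion rule or the exact metric definitions, so everything here is an assumption for illustration: `fuse` uses a simple majority vote with a fixed tie-break order, and `summarize` counts a "false commitment" as a wrong answer reported with a VALID label and confidence at or above a hypothetical threshold. Only the three SPACE label names come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# The three label names are taken from the abstract.
SPACE_LABELS = {"VALID", "INCOMPLETE", "CONTRADICTED"}


@dataclass
class Fused:
    answer: Optional[str]        # fused answer option, or None if nothing parsed
    space: Optional[str]         # SPACE label, or None if missing or invalid
    confidence: Optional[float]  # confidence reported by the structured prompt


def fuse(direct, cot, hyp_answer, hyp_space, hyp_conf) -> Fused:
    """Hypothetical fusion rule (not the paper's): majority vote over the
    parsed answers; ties go to the structured prompt, then CoT, then direct."""
    votes = [a for a in (hyp_answer, cot, direct) if a is not None]
    if votes:
        counts = Counter(votes)
        best = max(counts.values())
        answer = next(a for a in votes if counts[a] == best)
    else:
        answer = None
    space = hyp_space if hyp_space in SPACE_LABELS else None
    return Fused(answer, space, hyp_conf)


def summarize(results, gold, threshold=0.8):
    """Report each capability separately. The false-commitment definition
    and the 0.8 threshold are assumptions made for this sketch."""
    n = len(results)
    pairs = list(zip(results, gold))
    false_commit = sum(
        r.answer is not None and r.answer != g
        and r.space == "VALID"
        and r.confidence is not None and r.confidence >= threshold
        for r, g in pairs
    )
    return {
        "accuracy": sum(r.answer == g for r, g in pairs) / n,
        "parse_coverage": sum(r.answer is not None for r in results) / n,
        "space_coverage": sum(r.space is not None for r in results) / n,
        "false_commitment": false_commit / n,
    }


if __name__ == "__main__":
    results = [
        fuse("A", "B", "B", "VALID", 0.9),       # majority picks B
        fuse(None, None, None, None, None),      # nothing parsed
        fuse("A", "A", "C", "VALID", 0.95),      # majority picks A
    ]
    print(summarize(results, gold=["B", "A", "C"]))
```

Reporting the four numbers side by side, rather than folding them into one score, mirrors the abstract's point that a configuration can lose slightly on accuracy while gaining on coverage and false commitment.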
