Do Clinical Models Change Treatment Decisions?
The paper introduces ClinPivot, a benchmark that tests whether clinical models correctly adjust treatment decisions when new patient-context constraints are introduced. It finds that strong medical QA performance does not reliably predict decision-making performance.
Abstract
Clinical foundation models are evaluated with factual or exam-style medical QA, but treatment decisions must change when patient context changes. We introduce ClinPivot, an auditable treatment-decision benchmark built from biomedical relations and pivoted patient contexts. ClinPivot asks whether models change treatment choices when new clinical constraints shift the action space. We find that strong medical QA performance does not reliably predict decision-making performance: frontier models and task-adapted Qwen variants often fail to change decisions correctly, and model rankings shift across evaluation regimes. Decision-structured supervision improves pivot-sensitive decision-making and medical QA under matched knowledge budgets, while lightweight replay reduces losses in general assistant ability.
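To make the idea of a pivot-sensitive evaluation concrete, the following is a minimal sketch of how paired base and pivoted cases could be scored. It is not ClinPivot's actual scoring code: the `PivotItem` fields, the `pivot_scores` function, the metric names, and the placeholder treatment labels are all assumptions introduced for illustration, and exact string match stands in for whatever answer-matching the benchmark really uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PivotItem:
    """Hypothetical paired item: one base case and its pivoted variant."""
    base_gold: str   # reference treatment before the new constraint
    pivot_gold: str  # reference treatment after the new constraint
    base_pred: str   # model's choice on the base case
    pivot_pred: str  # model's choice on the pivoted case


def pivot_scores(items):
    """Score paired predictions (illustrative metrics, not the paper's).

    base_acc   : fraction correct on base cases
    pivot_acc  : fraction correct on pivoted cases
    joint_acc  : fraction correct on both cases of a pair
    stuck_rate : fraction where the reference changed but the model's
                 choice did not
    """
    n = len(items)
    if n == 0:
        raise ValueError("need at least one item")
    base = sum(i.base_pred == i.base_gold for i in items)
    pivot = sum(i.pivot_pred == i.pivot_gold for i in items)
    joint = sum(
        i.base_pred == i.base_gold and i.pivot_pred == i.pivot_gold
        for i in items
    )
    stuck = sum(
        i.pivot_gold != i.base_gold and i.pivot_pred == i.base_pred
        for i in items
    )
    return {
        "base_acc": base / n,
        "pivot_acc": pivot / n,
        "joint_acc": joint / n,
        "stuck_rate": stuck / n,
    }


# Placeholder labels only; these are not real clinical recommendations.
EXAMPLE_ITEMS = [
    PivotItem("treatment_A", "treatment_B", "treatment_A", "treatment_B"),
    PivotItem("treatment_A", "treatment_B", "treatment_A", "treatment_A"),
    PivotItem("treatment_C", "treatment_D", "treatment_C", "treatment_D"),
    PivotItem("treatment_C", "treatment_D", "treatment_D", "treatment_D"),
]
```

Under these assumptions, a model can score well on base cases alone while its joint accuracy is much lower, which is one simple way the gap between QA-style accuracy and decision-making performance described above could show up.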