Date of Presentation

4-28-2026 9:30 AM

College

College of Science & Mathematics

Faculty Sponsor(s)

Dr. Sun

Poster Abstract

Medical large language models (LLMs) are increasingly used for diagnostic assistance, demonstrating strong performance in generating plausible diagnoses across a variety of clinical scenarios. However, in healthcare settings, predictions must be accompanied by explanations that are transparent, reliable, and grounded in clinical evidence. In this work, we present an empirical evaluation of seven open medical LLMs across three diseases with varying levels of diagnostic ambiguity (malaria, COVID-19, and jaundice), using both symptom-based and laboratory-based inputs.

Our results show that while models achieve high diagnostic agreement for low-ambiguity conditions, performance becomes inconsistent for more complex cases, particularly when relying on structured laboratory data. More importantly, analysis of model-generated explanations reveals consistent limitations: explanations are often descriptive, generic, and weakly tied to the provided inputs. They fail to identify which clinical factors influence the prediction, do not quantify the contribution of individual features, and lack reasoning about interactions between variables.

These findings highlight a critical gap between diagnostic capability and explainability in current medical LLMs. To address this limitation, we outline a future direction based on model-agnostic explainability methods, combining counterfactual analysis and SHAP-based feature attribution to generate structured, evidence-based explanations. Our goal is to move from fluent but ungrounded explanations toward measurable and clinically meaningful reasoning in medical AI systems.
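
As a rough sketch of what such model-agnostic attribution could look like, the snippet below applies SHAP's KernelExplainer to a stand-in classifier over synthetic laboratory features. The feature names, synthetic data, and random-forest surrogate are illustrative assumptions only, not the evaluated LLMs or the poster's actual pipeline; the point is that the same black-box recipe can wrap any predictor that maps structured inputs to a diagnosis probability.

```python
# Minimal, hypothetical sketch of model-agnostic SHAP attribution over
# structured lab features; not the poster's evaluated models or pipeline.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for laboratory inputs (feature names are assumptions).
feature_names = ["bilirubin", "alt", "wbc"]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic diagnosis label

# Surrogate black-box predictor standing in for a diagnostic model.
model = RandomForestClassifier(random_state=0).fit(X, y)

# KernelExplainer treats the predictor as a black box, so the same recipe
# could wrap any model that returns a diagnosis probability from lab values.
explainer = shap.KernelExplainer(lambda z: model.predict_proba(z)[:, 1], X[:50])
shap_values = explainer.shap_values(X[:1])

# Per-feature contributions for one case: the kind of quantified,
# input-grounded explanation the abstract argues current LLM outputs lack.
for name, value in zip(feature_names, shap_values[0]):
    print(f"{name}: contribution {value:+.3f}")
```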

Disciplines

Computer Sciences

Document Type

Poster

Available for download on Thursday, April 29, 2027

Title

When Medical LLMs Diagnose but Cannot Explain: An Empirical Analysis
