Date of Presentation
4-28-2026 9:30 AM
College
College of Science & Mathematics
Faculty Sponsor(s)
Dr. Sun
Poster Abstract
Medical large language models (LLMs) are increasingly used for diagnostic assistance, demonstrating strong performance in generating plausible diagnoses across a variety of clinical scenarios. However, in healthcare settings, predictions must be accompanied by explanations that are transparent, reliable, and grounded in clinical evidence. In this work, we present an empirical evaluation of seven open medical LLMs across three diseases with varying levels of diagnostic ambiguity (malaria, COVID-19, and jaundice), using both symptom-based and laboratory-based inputs.
Our results show that while models achieve high diagnostic agreement for low-ambiguity conditions, performance becomes inconsistent for more complex cases, particularly when relying on structured laboratory data. More importantly, analysis of model-generated explanations reveals consistent limitations: explanations are often descriptive, generic, and weakly tied to the provided inputs. They fail to identify which clinical factors influence the prediction, do not quantify the contribution of individual features, and lack reasoning about interactions between variables.
These findings highlight a critical gap between diagnostic capability and explainability in current medical LLMs. To address this limitation, we outline a future direction based on model-agnostic explainability methods, combining counterfactual analysis and SHAP-based feature attribution to generate structured, evidence-based explanations. Our goal is to move from fluent but ungrounded explanations toward measurable and clinically meaningful reasoning in medical AI systems.
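The proposed direction can be illustrated with a minimal sketch. The snippet below computes exact Shapley values (the quantity SHAP approximates) for a toy symptom-scoring function, then runs a simple counterfactual probe asking which single finding, if removed, would flip the diagnosis. The feature names, scoring rule, and decision threshold are illustrative assumptions, not values from the study; a real pipeline would attribute over an LLM's prediction rather than a hand-written score.

```python
from itertools import permutations

# Illustrative findings; NOT the actual features used in the study.
FEATURES = ["fever", "chills", "jaundice_sign"]

def diagnose_score(present):
    """Toy diagnostic score over a set of present findings.
    Includes an interaction term (fever + chills) to show that
    Shapley values split interactions across the features involved."""
    score = 0.0
    if "fever" in present:
        score += 0.4
    if "chills" in present:
        score += 0.2
    if "fever" in present and "chills" in present:
        score += 0.3  # interaction bonus
    if "jaundice_sign" in present:
        score += 0.1
    return score

def shapley_values(features, score_fn):
    """Exact Shapley values: each feature's marginal contribution,
    averaged over every possible ordering of feature inclusion."""
    values = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = score_fn(present)
            present.add(f)
            values[f] += score_fn(present) - before
    return {f: v / len(orderings) for f, v in values.items()}

attributions = shapley_values(FEATURES, diagnose_score)
for f, v in sorted(attributions.items(), key=lambda kv: -kv[1]):
    print(f"{f}: {v:+.3f}")

# Minimal counterfactual probe: which single finding, if absent,
# flips the call at an (assumed) decision threshold of 0.5?
full = set(FEATURES)
threshold = 0.5
for f in FEATURES:
    flips = diagnose_score(full - {f}) < threshold <= diagnose_score(full)
    print(f"removing {f} flips the diagnosis: {flips}")
```

Note that the Shapley values sum exactly to the full-input score, which is the "quantified contribution of individual features" the abstract calls for, and the interaction bonus is shared between fever and chills rather than assigned to either alone.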
Disciplines
Computer Sciences
Document Type
Poster
When Medical LLMs Diagnose but Cannot Explain: An Empirical Analysis