Large Language Models lack essential metacognition for reliable medical reasoning

Maxime Griot, C. Hemptinne, Jean Vanderdonckt, Demet Yuksel

Published 2025 in Nature Communications

ABSTRACT

Large Language Models have demonstrated expert-level accuracy on medical board examinations, suggesting potential for clinical decision support systems. However, their metacognitive abilities, crucial for medical decision-making, remain largely unexplored. To address this gap, we developed MetaMedQA, a benchmark incorporating confidence scores and metacognitive tasks into multiple-choice medical questions. We evaluated twelve models on dimensions including confidence-based accuracy, missing answer recall, and unknown recall. Despite high accuracy on multiple-choice questions, our study revealed significant metacognitive deficiencies across all tested models. Models consistently failed to recognize their knowledge limitations and provided confident answers even when correct options were absent. In this work, we show that current models exhibit a critical disconnect between perceived and actual capabilities in medical reasoning, posing significant risks in clinical settings. Our findings emphasize the need for more robust evaluation frameworks that incorporate metacognitive abilities, essential for developing reliable Large Language Model-enhanced clinical decision support systems.

Large Language Models demonstrate expert-level accuracy in medical exams, supporting their potential inclusion in healthcare settings. Here, the authors reveal that their metacognitive abilities are underexplored, showing significant gaps in recognizing knowledge limitations, difficulties in modulating their confidence, and challenges in identifying when a problem cannot be answered due to insufficient information.
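
The abstract names three evaluation dimensions (confidence-based accuracy, missing answer recall, unknown recall) without defining them. The sketch below shows one plausible way such metrics could be computed; it is not taken from the paper. It assumes that "missing answer recall" is recall on questions whose correct choice is a "none of the above" option, that "unknown recall" is recall on questions whose correct choice is an "I don't know" option, and that confidence is self-reported on a 1 to 5 scale. The labels `NOTA` and `IDK`, the `Item` record, and both function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical option labels; the benchmark's real labels may differ.
NONE_OF_ABOVE = "NOTA"  # assumed: correct answer is absent from the options
UNKNOWN = "IDK"         # assumed: question cannot be answered as posed


@dataclass
class Item:
    gold: str        # label of the correct option
    predicted: str   # label the model chose
    confidence: int  # self-reported confidence, assumed to be 1..5


def recall_for(items, target):
    """Share of questions whose gold label is `target` that the model got right."""
    relevant = [it for it in items if it.gold == target]
    if not relevant:
        return None  # metric is undefined without any such questions
    return sum(it.predicted == target for it in relevant) / len(relevant)


def accuracy_by_confidence(items):
    """Accuracy within each self-reported confidence level."""
    buckets = defaultdict(list)
    for it in items:
        buckets[it.confidence].append(it.predicted == it.gold)
    return {level: sum(hits) / len(hits) for level, hits in sorted(buckets.items())}


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    items = [
        Item("A", "A", 5),
        Item(NONE_OF_ABOVE, "B", 5),
        Item(NONE_OF_ABOVE, NONE_OF_ABOVE, 3),
        Item(UNKNOWN, "C", 4),
        Item("B", "B", 5),
    ]
    print("missing answer recall:", recall_for(items, NONE_OF_ABOVE))
    print("unknown recall:", recall_for(items, UNKNOWN))
    print("accuracy by confidence:", accuracy_by_confidence(items))
```

Under these assumptions, a well-calibrated model would show accuracy rising with the confidence level, and a model that recognizes its limits would score high on both recall metrics. The abstract reports that the tested models fell short on these dimensions.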

PUBLICATION RECORD

Publication year
2025
Venue
Nature Communications
Publication date
2025-01-14
Fields of study
Medicine, Computer Science
Identifiers
DOI 10.1038/s41467-024-55628-6; PMID 39809759; PMCID 11733150
Source metadata
Semantic Scholar, PubMed

