Beyond Correctness: Evaluating and Improving LLM Feedback in Statistical Education

Niklas Ippisch, Markus Herklotz, Anna Haensch, Carsten Schwemmer

Published 2025 in Unknown venue

ABSTRACT

Large language models (LLMs) have been proposed as scalable tools to address the gap between the importance of individualized written feedback and the practical challenges of providing it at scale. However, concerns persist regarding the accuracy, depth, and pedagogical value of their feedback responses. The present study investigates the extent to which LLMs can generate feedback that aligns with educational theory and compares techniques to improve their performance. Using mock in-class exam data from two consecutive years of an introductory statistics course at LMU Munich, we evaluated GPT-generated feedback against an established but expanded pedagogical framework. Four enhancement methods were compared in a highly standardized setting that allows meaningful comparisons: using a state-of-the-art model, zero-shot prompting, few-shot prompting, and supervised fine-tuning with Low-Rank Adaptation (LoRA). Results show that while all LLM setups reliably provided correctness judgments and explanations, their ability to deliver contextual feedback and suggestions on how students can monitor and regulate their own learning remained limited. Among the tested methods, zero-shot prompting achieved the strongest balance between quality and cost, while fine-tuning required substantially more resources without yielding clear advantages. For educators, this suggests that carefully designed prompts can substantially improve the usefulness of LLM feedback, making it a promising tool, particularly in large introductory courses where students would otherwise receive little or no written feedback.
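
To make the distinction between two of the compared methods concrete, the following is a minimal sketch of how a zero-shot and a few-shot feedback prompt might be assembled. It is an illustration only and not the prompts used in the study: the instruction text, the `Example` class, and the `build_prompt` function are all hypothetical names introduced here. The only assumption carried over from the abstract is that few-shot prompting adds worked examples to the prompt while zero-shot prompting relies on the instruction alone.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Example:
    """One worked example (hypothetical): a student answer plus reference feedback."""
    answer: str
    feedback: str


# Hypothetical instruction text; the study's actual prompts are not reproduced here.
INSTRUCTION = (
    "You are a statistics tutor. Judge whether the answer is correct, "
    "explain why, and suggest how the student can check their own work."
)


def build_prompt(task: str, answer: str, examples: Sequence[Example] = ()) -> str:
    """Return a zero-shot prompt when no examples are given, else a few-shot prompt."""
    parts = [INSTRUCTION, f"Task: {task}"]
    # Few-shot: prepend worked examples so the model can imitate their style.
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i} answer: {ex.answer}")
        parts.append(f"Example {i} feedback: {ex.feedback}")
    parts.append(f"Student answer: {answer}")
    parts.append("Feedback:")
    return "\n".join(parts)


task = "Compute the mean of 2, 4, 6."
zero_shot = build_prompt(task, "The mean is 4.")
few_shot = build_prompt(
    task,
    "The mean is 5.",
    [Example("The mean is 12.", "Incorrect: 12 is the sum, not the mean.")],
)
```

The two prompts differ only in the example lines, which mirrors the standardized comparison the abstract describes: everything except the enhancement method is held fixed.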

PUBLICATION RECORD

Publication year: 2025
Venue: Unknown venue
Publication date: 2025-11-10
Fields of study: Mathematics, Computer Science, Education
Identifiers: arXiv 2511.07628
Source metadata: Semantic Scholar

REFERENCES

Beyond the “Wow” Factor: Using Generative AI for Increasing Generative Sense-Making (2025, cited by this paper)
Fine-Tuning GPT-3.5-Turbo for Automatic Feedback Generation (2025, cited by this paper)
Can we trust LLMs as a tutor for our students? Evaluating the Quality of LLM-generated Feedback in Statistics Exams (2025, cited by this paper)
Generative AI for scalable feedback to multimodal exercises (2024, influential reference)
Can AI provide useful holistic essay scoring? (2024, cited by this paper)
From the Automated Assessment of Student Essay Content to Highly Informative Feedback: a Case Study (2024, cited by this paper)
Comparing the quality of human and ChatGPT feedback of students’ writing (2024, cited by this paper)
Empirische Arbeit: Comparing Generative AI and Expert Feedback to Students’ Writing: Insights from Student Teachers (2024, cited by this paper)
Evaluation of LLM Tools for Feedback Generation in a Course on Concurrent Programming (2024, cited by this paper)
On Assessing the Faithfulness of LLM-generated Feedback on Student Assignments (2024, cited by this paper)
Cracking the Code: Evaluating Zero-Shot Prompting Methods for Providing Programming Feedback (2024, cited by this paper)
Developing a Tutoring Dialog Dataset to Optimize LLMs for Educational Use (2024, influential reference)
The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement But May Increase Adopters' Exam Performances (2024, influential reference)
Next-Step Hint Generation for Introductory Programming Using Large Language Models (2023, influential reference)
Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT (2023, influential reference)
Exploring the Responses of Large Language Models to Beginner Programmers’ Help Requests (2023, cited by this paper)
Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors (2023, cited by this paper)
Am I Wrong, or Is the Autograder Wrong? Effects of AI Grading Mistakes on Learning (2023, influential reference)
Can Large Language Models Provide Feedback to Students? A Case Study on ChatGPT (2023, cited by this paper)
Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation (2023, cited by this paper)
Exploring generative AI assisted feedback writing for students’ written responses to a physics conceptual question with prompt engineering and few-shot learning (2023, cited by this paper)
Investigation of student experiences with ChatGPT-supported online learning applications in higher education (2023, cited by this paper)
Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students' text revision, motivation, and positive emotions (2023, cited by this paper)
The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues (2022, influential reference)
A case study of the use of the Hattie and Timperley feedback model on written feedback in thesis examination in higher education (2022, cited by this paper)
Large Language Models are Zero-Shot Reasoners (2022, cited by this paper)
Chain of Thought Prompting Elicits Reasoning in Large Language Models (2022, cited by this paper)
LoRA: Low-Rank Adaptation of Large Language Models (2021, cited by this paper)
A Review of Feedback Models and Theories: Descriptions, Definitions, and Conclusions (2021, cited by this paper)
Beyond right or wrong: More effective feedback for formative multiple-choice tests (2020, influential reference)
Visible learning: a synthesis of over 800 meta‐analyses relating to achievement (2009, cited by this paper)
The power of feedback (2002, influential reference)
