Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange
Ankit Satpute, Noah Gießing, André Greiner-Petter, M. Schubotz, O. Teschke, Akiko Aizawa, Bela Gipp
Published 2024 in Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
ABSTRACT
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performance that surpasses that of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this work, we follow a two-step approach to investigate the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) among existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1, considering P@10. Our case analysis indicates that while GPT-4 can generate relevant answers, it is not consistently accurate. This paper explores the current limitations of LLMs in navigating complex mathematical question answering. We make our code and findings publicly available for research: https://github.com/gipplab/LLM-Investig-MathStackExchange
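The abstract reports results in terms of P@10 and nDCG, the standard ranking metrics used in ArqMATH-style evaluations. As a rough illustration only, the following sketch computes both metrics for a single query under the simplifying assumption of binary relevance judgments (the actual evaluation may use graded relevance); the `judgments` list is hypothetical.

```python
import math

def precision_at_k(rels, k=10):
    """Fraction of the top-k ranked results judged relevant (binary labels)."""
    return sum(rels[:k]) / k

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    """DCG normalized by the DCG of the ideal (descending-sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance judgments for one query's ranked answers (1 = relevant).
judgments = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_k(judgments))  # 0.4
print(ndcg_at_k(judgments))
```

In a full evaluation these per-query scores would be averaged over all topics; a ranking that places every relevant answer first yields an nDCG of 1.0.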
PUBLICATION RECORD
- Publication year
2024
- Venue
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
- Publication date
2024-03-30
- Fields of study
Mathematics, Computer Science
- Source metadata
Semantic Scholar