Can LLMs Master Math? Investigating Large Language Models on Math Stack Exchange
Ankit Satpute, Noah Gießing, André Greiner-Petter, M. Schubotz, O. Teschke, Akiko Aizawa, Bela Gipp
Published 2024 in Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
ABSTRACT
Large Language Models (LLMs) have demonstrated exceptional capabilities in various natural language tasks, often achieving performance that surpasses that of humans. Despite these advancements, the domain of mathematics presents a distinctive challenge, primarily due to its specialized structure and the precision it demands. In this work, we follow a two-step approach to investigate the proficiency of LLMs in answering mathematical questions. First, we employ the most effective LLMs, as identified by their performance on math question-answer benchmarks, to generate answers to 78 questions from the Math Stack Exchange (MSE). Second, a case analysis is conducted on the LLM that showed the highest performance, focusing on the quality and accuracy of its answers through manual evaluation. We found that GPT-4 performs best (nDCG of 0.48 and P@10 of 0.37) among existing LLMs fine-tuned for answering mathematics questions and outperforms the current best approach on ArqMATH3 Task1, considering P@10. Our case analysis indicates that while GPT-4 can generate relevant answers, it is not consistently accurate. This paper explores the current limitations of LLMs in navigating complex mathematical question answering. We make our code and findings publicly available for research: https://github.com/gipplab/LLM-Investig-MathStackExchange
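The abstract reports results in terms of P@10 and nDCG, the standard ranking metrics used in ArqMATH-style evaluations. As a rough illustration only, the following sketch computes both metrics for a single query under the simplifying assumption of binary relevance judgments (the actual evaluation may use graded relevance); the `judgments` list is hypothetical.

```python
import math

def precision_at_k(rels, k=10):
    """Fraction of the top-k ranked results judged relevant (binary labels)."""
    return sum(rels[:k]) / k

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    """DCG normalized by the DCG of the ideal (descending-sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance judgments for one query's ranked answers (1 = relevant).
judgments = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(precision_at_k(judgments))  # 0.4
print(ndcg_at_k(judgments))
```

In a full evaluation these per-query scores would be averaged over all topics; a ranking that places every relevant answer first yields an nDCG of 1.0.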
PUBLICATION RECORD
- Publication year
2024
- Venue
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
- Publication date
2024-03-30
- Fields of study
Mathematics, Computer Science
- Source metadata
Semantic Scholar