CMMLU: Measuring massive multitask language understanding in Chinese

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Tim Baldwin

Published 2023 in Annual Meeting of the Association for Computational Linguistics

ABSTRACT

As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper addresses that challenge by introducing CMMLU, a comprehensive Chinese benchmark that covers a range of subjects, including the natural sciences, social sciences, engineering, and humanities. We conduct a thorough evaluation of 18 advanced multilingual and Chinese-oriented LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to reach an average accuracy of 50%, even when provided with in-context examples and chain-of-thought prompts, while the random baseline stands at 25%. This leaves significant room for improvement in LLMs. Additionally, we conduct extensive experiments to identify factors that affect the models' performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.

PUBLICATION RECORD

Publication year
2023
Venue
Annual Meeting of the Association for Computational Linguistics
Publication date
2023-06-15
Fields of study
Linguistics, Computer Science
Identifiers
DOI: 10.48550/arXiv.2306.09212; arXiv: 2306.09212
Source metadata
Semantic Scholar

REFERENCES

Crosslingual Generalization through Multitask Finetuning
2023, cited by this paper
Can Large Langauge Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE
2023, cited by this paper
SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark
2023, cited by this paper
Llama 2: Open Foundation and Fine-Tuned Chat Models
2023, cited by this paper
BatGPT: A Bidirectional Autoregessive Talker from Generative Pre-trained Transformer
2023, cited by this paper
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
2023, cited by this paper
Bactrian-X: A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation
2023, cited by this paper
M3KE: A Massive Multi-Level Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models
2023, cited by this paper
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
2023, cited by this paper
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
2023, cited by this paper
Measuring Massive Multitask Chinese Understanding
2023, cited by this paper
Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
2023, cited by this paper
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
2023, cited by this paper
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
2023, cited by this paper
GPT-4 Technical Report
2023, cited by this paper
LLaMA: Open and Efficient Foundation Language Models
2023, cited by this paper
Holistic Evaluation of Language Models
2023, cited by this paper
MultiSpanQA: A Dataset for Multi-Span Question Answering
2022, cited by this paper
OPT: Open Pre-trained Transformer Language Models
2022, cited by this paper
News Summarization and Evaluation in the Era of GPT-3
2022, cited by this paper
GLM-130B: An Open Bilingual Pre-trained Model
2022, cited by this paper
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
2022, cited by this paper
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
2022, cited by this paper
GLM: General Language Model Pretraining with Autoregressive Blank Infilling
2021, cited by this paper
TruthfulQA: Measuring How Models Mimic Human Falsehoods
2021, cited by this paper
Program Synthesis with Large Language Models
2021, cited by this paper
Evaluating Large Language Models Trained on Code
2021, cited by this paper
Understanding by Understanding Not: Modeling Negation in Language Models
2021, cited by this paper
Measuring Mathematical Problem Solving With the MATH Dataset
2021, cited by this paper
A Practical Guide to Updating Beliefs From Contradictory Evidence
2021, cited by this paper
Training Verifiers to Solve Math Word Problems
2021, cited by this paper
CLUE: A Chinese Language Understanding Evaluation Benchmark
2020, cited by this paper
Measuring Massive Multitask Language Understanding
2020, influential reference
Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly
2019, cited by this paper
An Adversarial Winograd Schema Challenge at Scale
2019, cited by this paper
HellaSwag: Can a Machine Really Finish Your Sentence?
2019, cited by this paper
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
2019, cited by this paper
Natural Questions: A Benchmark for Question Answering Research
2019, cited by this paper
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
2019, cited by this paper
Know What You Don’t Know: Unanswerable Questions for SQuAD
2018, cited by this paper
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
2018, cited by this paper
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
2018, cited by this paper
Teaching Machines to Read and Comprehend
2015, cited by this paper

CITED BY

Multimodal language models in agriculture: A tutorial and survey
2026, cites this paper
Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity
2026, cites this paper
Expert Divergence Learning for MoE-based Language Models
2026, cites this paper
MiMo-V2-Flash Technical Report
2026, cites this paper
Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs
2026, cites this paper
On Temperature-Constrained Non-Deterministic Machine Translation: Potential and Evaluation
2026, cites this paper
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep
2026, cites this paper
Large language models in judicial assistance: Empirical insights and domain-specific fine-tuning
2026, cites this paper
Scaling Embeddings Outperforms Scaling Experts in Language Models
2026, cites this paper
How to Set the Batch Size for Large-Scale Pre-training?
2026, cites this paper
MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine
2026, cites this paper
GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek
2026, cites this paper
SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models
2026, influential citation
MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging
2026, influential citation
Global Context Compression with Interleaved Vision-Text Transformation
2026, influential citation
ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs
2026, cites this paper
ERNIE 5.0 Technical Report
2026, cites this paper
AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression
2026, cites this paper
BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks
2026, cites this paper
Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond
2026, cites this paper
Parallelism and Generation Order in Masked Diffusion Language Models: Limits Today, Potential Tomorrow
2026, cites this paper
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
2026, influential citation
Dr. Assistant: Enhancing Clinical Diagnostic Inquiry via Structured Diagnostic Reasoning Data and Reinforcement Learning
2026, influential citation
Translation as a Scalable Proxy for Multilingual Evaluation
2026, cites this paper
FBS: Modeling Native Parallel Reading inside a Transformer
2026, cites this paper
JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation
2026, cites this paper
HiFloat4 Format for Language Model Inference
2026, cites this paper
GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation
2026, influential citation
HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing
2026, cites this paper
TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models
2026, cites this paper
How to Set the Learning Rate for Large-Scale Pre-training?
2026, cites this paper
MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages
2025, cites this paper
Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
2025, cites this paper
Enterprise Large Language Model Evaluation Benchmark
2025, cites this paper
OneEval: Benchmarking LLM Knowledge-intensive Reasoning over Diverse Knowledge Bases
2025, cites this paper
TeleEval-OS: Performance evaluations of large language models for operations scheduling
2025, cites this paper
Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?
2025, cites this paper
Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use
2025, cites this paper
MultiHoax: A Dataset of Multi-hop False-Premise Questions
2025, cites this paper
Exploring the Impact of Occupational Personas on Domain-Specific QA
2025, cites this paper
EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving
2025, cites this paper
Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer
2025, cites this paper
Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought
2025, influential citation
BnMMLU: Measuring Massive Multitask Language Understanding in Bengali
2025, cites this paper
dots.llm1 Technical Report
2025, cites this paper
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
2025, cites this paper
IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation
2025, cites this paper
VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models
2025, cites this paper
Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark
2025, cites this paper
WiNGPT-3.0 Technical Report
2025, cites this paper
Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity
2025, cites this paper
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese
2025, cites this paper
When Language Shapes Thought: Cross-Lingual Transfer of Factual Knowledge in Question Answering
2025, cites this paper
Scaling Physical Reasoning with the PHYSICS Dataset
2025, cites this paper
MiniCPM4: Ultra-Efficient LLMs on End Devices
2025, cites this paper
CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation
2025, cites this paper
AI Flow: perspectives, scenarios, and approaches
2025, cites this paper
TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks
2025, cites this paper
Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law
2025, cites this paper
Flexible Realignment of Language Models
2025, cites this paper
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
2025, cites this paper
PodGPT: an audio-augmented large language model for research and education
2025, cites this paper
ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations
2025, cites this paper
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
2025, cites this paper
SOAEsV2-7B/72B: Full-Pipeline Optimization for State-Owned Enterprise LLMs via Continual Pre-Training, Domain-Progressive SFT and Distillation-Enhanced Speculative Decoding
2025, cites this paper
MultiLoKo: a multilingual local knowledge benchmark for LLMs spanning 31 languages
2025, cites this paper
Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study
2025, cites this paper
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
2025, cites this paper
MathEval: A Comprehensive Benchmark for Evaluating Large Language Models on Mathematical Reasoning Capabilities
2025, cites this paper
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
2025, cites this paper
Every Sample Matters: Leveraging Mixture-of-Experts and High-Quality Data for Efficient and Accurate Code LLM
2025, cites this paper
Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training
2025, cites this paper
SciHorizon: Benchmarking AI-for-Science Readiness from Scientific Data to Large Language Models
2025, cites this paper
HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs
2025, influential citation
Cognitive Activation and Chaotic Dynamics in Large Language Models: A Quasi-Lyapunov Analysis of Reasoning Mechanisms
2025, cites this paper
AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs
2025, influential citation
MiMo: Unlocking the Reasoning Potential of Language Model - From Pretraining to Posttraining
2025, cites this paper
Extrapolation Merging: Keep Improving With Extrapolation and Merging
2025, cites this paper
A Weighted Cross-entropy Loss for Mitigating LLM Hallucinations in Cross-lingual Continual Pretraining
2025, cites this paper
Qualitative analysis of the hypnolearning model in mandarin subjects through smartphones
2025, cites this paper
TLUE: A Tibetan Language Understanding Evaluation Benchmark
2025, influential citation
LAG-MMLU: Benchmarking Frontier LLM Understanding in Latvian and Giriama
2025, cites this paper
An astronomical question answering dataset for evaluating large language models
2025, cites this paper
Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation
2025, cites this paper
TFD: A Comprehensive Structured Tibetan Foundation Dataset for Low-Resource Language Processing and Large-Scale Modeling
2025, cites this paper
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
2025, cites this paper
CARE: Multilingual Human Preference Learning for Cultural Awareness
2025, cites this paper
Entropy-Based Block Pruning for Efficient Large Language Models
2025, influential citation
Enhancing LLMs via High-Knowledge Data Selection
2025, cites this paper
Efficient Evaluation of Large Language Models via Collaborative Filtering
2025, cites this paper
Enhancing Contrastive Demonstration Selection with Semantic Diversity for Robust In-Context Machine Translation
2025, cites this paper
Can the capability of Large Language Models be described by human ability? A Meta Study
2025, influential citation
Measuring Hong Kong Massive Multi-Task Language Understanding
2025, influential citation
Quantitative Analysis of Performance Drop in DeepSeek Model Quantization
2025, cites this paper
Large-Scale Corpus Construction and Retrieval-Augmented Generation for Ancient Chinese Poetry: New Method and Data Insights
2025, cites this paper
Learnware of Language Models: Specialized Small Language Models Can Do Big
2025, cites this paper
S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models
2025, cites this paper
KaFT: Knowledge-aware Fine-tuning for Boosting LLMs' Domain-specific Question-Answering Performance
2025, cites this paper
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering
2025, cites this paper
Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization
2025, influential citation