Enterprise Large Language Model Evaluation Benchmark
Liya Wang, David Yi, Damien Jose, John Passarelli, James Gao, Jordan Leventis, Kang Li
Published 2025 in Machine Learning Techniques and NLP
ABSTRACT
Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools, yet existing benchmarks like Massive Multitask Language Understanding (MMLU) inadequately assess enterprise-specific task complexities. We propose a 14-task framework grounded in Bloom's Taxonomy to holistically evaluate LLM capabilities in enterprise contexts. To address the challenges of noisy data and costly annotation, we develop a scalable pipeline combining LLM-as-a-Labeler, LLM-as-a-Judge, and corrective retrieval-augmented generation (CRAG), curating a robust 9,700-sample benchmark. Evaluation of six leading models shows that open-source contenders like DeepSeek R1 rival proprietary models in reasoning tasks but lag in judgment-based scenarios, likely due to overthinking. Our benchmark reveals critical enterprise performance gaps and offers actionable insights for model optimization. This work provides enterprises with a blueprint for tailored evaluations and advances practical LLM deployment.
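The abstract describes a curation pipeline in which an LLM drafts labels, a second LLM pass judges them, and low-quality items are corrected with retrieval-augmented generation. The sketch below is a minimal, hypothetical illustration of that labeler-judge-CRAG loop; all function names, the Verdict structure, and the quality threshold are placeholders assumed for illustration and are not the authors' actual implementation.

    # curation_sketch.py
    # Hypothetical sketch of the LLM-as-a-Labeler -> LLM-as-a-Judge -> CRAG
    # curation loop described in the abstract. Function bodies are stubs;
    # the paper's prompts, models, and thresholds are not specified here.

    from dataclasses import dataclass
    from typing import List, Tuple

    QUALITY_THRESHOLD = 0.8  # assumed cutoff, not taken from the paper


    @dataclass
    class Verdict:
        score: float      # judge's quality score in [0, 1]
        rationale: str    # judge's explanation


    def label_with_llm(sample: str) -> str:
        """LLM-as-a-Labeler: draft a label for a raw enterprise sample (stub)."""
        return f"draft label for: {sample}"


    def judge_with_llm(sample: str, label: str) -> Verdict:
        """LLM-as-a-Judge: score the drafted label against the sample (stub)."""
        return Verdict(score=0.9, rationale="placeholder rationale")


    def corrective_rag(sample: str) -> str:
        """Corrective RAG: retrieve supporting context and regenerate (stub)."""
        return f"corrected label for: {sample}"


    def curate(raw_samples: List[str]) -> List[Tuple[str, str]]:
        """Keep samples whose labels pass the judge, correcting once via CRAG."""
        curated = []
        for sample in raw_samples:
            label = label_with_llm(sample)
            verdict = judge_with_llm(sample, label)
            if verdict.score < QUALITY_THRESHOLD:
                label = corrective_rag(sample)
                verdict = judge_with_llm(sample, label)
            if verdict.score >= QUALITY_THRESHOLD:
                curated.append((sample, label))
        return curated


    if __name__ == "__main__":
        print(curate(["example noisy enterprise query"]))

In practice the single corrective pass shown here could be iterated, and the judge's rationale could be fed back into the retrieval step; the paper's specific design choices are not reproduced in this sketch.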
PUBLICATION RECORD
- Publication year: 2025
- Publication date: 2025-06-25
- Venue: Machine Learning Techniques and NLP
- Fields of study: Linguistics, Computer Science
- Source metadata: Semantic Scholar