A Cost-Effective Framework to Evaluate LLM-Generated Relevance Judgements

Simone Merlo, Stefano Marchesin, G. Faggioli, Nicola Ferro

Published 2025 in International Conference on Information and Knowledge Management

ABSTRACT

Large Language Models (LLMs) have hugely impacted many research fields, including Information Retrieval (IR), where they are used for many sub-tasks, such as query rewriting and retrieval-augmented generation. At the same time, the research community is investigating whether and how to use LLMs to support, or even replace, humans in generating relevance judgements. Indeed, generating relevance judgements automatically, or integrating an LLM into the annotation process, would allow us to increase the number of evaluation collections, including for scenarios where the annotation process is particularly challenging. To validate the relevance judgements produced by an LLM, they are compared with human-made relevance judgements by measuring the inter-assessor agreement between the human and the LLM. Our work introduces an innovative framework for estimating the quality of LLM-generated relevance judgements that provides statistical guarantees while minimizing human involvement. The proposed framework makes it possible to: i) estimate the quality of LLM-generated relevance judgements with a defined confidence while minimizing human involvement; and ii) estimate the quality of LLM-generated relevance judgements with a fixed budget while providing bounds on the estimate. Our experimental results on three well-known IR collections, using multiple LLMs as assessors, show that it is sufficient to assess 16% of the LLM-generated relevance judgements to estimate the LLM's performance with 95% confidence.
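
The two usage modes described in the abstract (stop once a target confidence is reached, or stop when a fixed annotation budget runs out) can be illustrated with a small sketch. This is not the authors' framework: it assumes simple random sampling without replacement, Cohen's kappa as the agreement measure, and a simple large-sample (Wald-style) approximation of its standard error. The names `cohen_kappa`, `estimate_agreement`, `ask_human`, `max_moe`, `min_n`, and `budget` are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def cohen_kappa(human, llm):
    """Return (kappa, approximate standard error) for two label lists."""
    n = len(human)
    # Observed agreement: fraction of items where both assessors agree.
    p_o = sum(h == l for h, l in zip(human, llm)) / n
    # Chance agreement from the two marginal label distributions.
    h_counts, l_counts = Counter(human), Counter(llm)
    labels = set(h_counts) | set(l_counts)
    p_e = sum(h_counts[c] * l_counts[c] for c in labels) / (n * n)
    if p_e == 1.0:
        # Both assessors used a single identical label: kappa is undefined.
        return float("nan"), float("inf")
    kappa = (p_o - p_e) / (1 - p_e)
    # Simple large-sample approximation (an assumption of this sketch).
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, se


def estimate_agreement(llm_labels, ask_human, max_moe=0.05, min_n=100,
                       batch=50, budget=None, z=1.96, seed=0):
    """Sample LLM judgements in batches and ask a human to label them.

    Stops when the margin of error z * SE drops to max_moe (after at
    least min_n annotations) or when the annotation budget is spent.
    Returns (kappa, margin_of_error, number_of_human_annotations).
    """
    rng = random.Random(seed)
    order = list(range(len(llm_labels)))
    rng.shuffle(order)  # simple random sampling without replacement
    limit = len(order) if budget is None else min(budget, len(order))
    human, llm = [], []
    kappa, moe = float("nan"), float("inf")
    for start in range(0, limit, batch):
        for i in order[start:min(start + batch, limit)]:
            human.append(ask_human(i))  # the costly human judgement
            llm.append(llm_labels[i])
        kappa, se = cohen_kappa(human, llm)
        moe = z * se
        if len(human) >= min_n and moe <= max_moe:
            break
    return kappa, moe, len(human)
```

With `budget=None` the loop corresponds loosely to the first mode (annotate until the interval is tight enough); passing a `budget` corresponds loosely to the second (annotate a fixed number of items and report the resulting bound). The `min_n` guard is there because a Wald-style interval can collapse to zero width on a small sample with perfect agreement; an exact or otherwise more careful interval would be preferable in practice.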

PUBLICATION RECORD

Publication year: 2025
Venue: International Conference on Information and Knowledge Management
Publication date: 2025-11-10
Fields of study: Computer Science
DOI: 10.1145/3746252.3761200

CITED BY

Re-Rankers as Relevance Judges (2026)
Hybrid Pooling with LLMs via Relevance Context Learning (2026)
Report on the 34th ACM Conference on Information and Knowledge Management (CIKM 2025) (2025)