Can LLM Annotations Replace User Clicks for Learning to Rank?

Lulu Yu, Keping Bi, Jiafeng Guo, Shihao Liu, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng

Published 2025 in arXiv.org

ABSTRACT

Large-scale supervised data is essential for training modern ranking models, but obtaining high-quality human annotations is costly. Click data has been widely used as a low-cost alternative, and with recent advances in large language models (LLMs), LLM-based relevance annotation has emerged as another promising source of supervision. This paper investigates whether LLM annotations can replace click data for learning to rank (LTR) by conducting a comprehensive comparison across multiple dimensions. Experiments on both a public dataset, TianGong-ST, and an industrial dataset, Baidu-Click, show that click-supervised models perform better on high-frequency queries, while LLM annotation-supervised models are more effective on medium- and low-frequency queries. Further analysis shows that click-supervised models are better at capturing document-level signals such as authority or quality, while LLM annotation-supervised models are more effective at modeling semantic matching between queries and documents and at distinguishing relevant from non-relevant documents. Motivated by these observations, we explore two training strategies that integrate both supervision signals: data scheduling and frequency-aware multi-objective learning. Both approaches enhance ranking performance across queries at all frequency levels, with the latter being more effective. Our code is available at https://github.com/Trustworthy-Information-Access/LLMAnn_Click.

PUBLICATION RECORD

Publication year
2025
Venue
arXiv.org
Publication date
2025-11-10
Fields of study
Computer Science
Identifiers
DOI 10.48550/arXiv.2511.06635
arXiv 2511.06635
Source metadata
Semantic Scholar

REFERENCES

LLM-Driven Usefulness Labeling for IR Evaluation
2025, cited by this paper
Judging the Judges: A Collection of LLM-Generated Relevance Judgements
2025, cited by this paper
Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG
2025, cited by this paper
Unbiased Learning to Rank with Query-Level Click Propensity Estimation: Beyond Pointwise Observation and Relevance
2025, cited by this paper
Are Large Language Models Good at Utility Judgments?
2024, cited by this paper
Unconfounded Propensity Estimation for Unbiased Ranking
2023, cited by this paper
A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts
2023, cited by this paper
A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models
2023, cited by this paper
Large Language Models can Accurately Predict Searcher Preferences
2023, cited by this paper
ChatGPT outperforms crowd workers for text-annotation tasks
2023, cited by this paper
ChatGPT: Beginning of an End of Manual Linguistic Data Annotation? Use Case of Automatic Genre Identification
2023, cited by this paper
A Large Scale Search Dataset for Unbiased Learning to Rank
2022, influential reference
Towards Disentangling Relevance and Bias in Unbiased Learning to Rank
2022, cited by this paper
Promptagator: Few-shot Dense Retrieval From 8 Examples
2022, cited by this paper
Adapting Interactional Observation Embedding for Counterfactual Learning to Rank
2021, cited by this paper
Mixture-Based Correction for Position and Trust Bias in Counterfactual Learning to Rank
2021, cited by this paper
ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
2021, cited by this paper
ULTRA: An Unbiased Learning To Rank Algorithm Toolbox
2021, influential reference
Bidirectional Encoder Representations from Transformers (BERT): A sentiment analysis odyssey
2020, cited by this paper
Correcting for Selection Bias in Learning-to-rank Systems
2020, cited by this paper
Cascade Model-based Propensity Estimation for Counterfactual Learning to Rank
2020, cited by this paper
Language Models are Few-Shot Learners
2020, cited by this paper
Pretrained Transformers for Text Ranking: BERT and Beyond
2020, cited by this paper
Addressing Trust Bias for Unbiased Learning-to-Rank
2019, cited by this paper
TianGong-ST: A New Dataset with Large-scale Refined Real-world Web Search Sessions
2019, cited by this paper
Position Bias Estimation for Unbiased Learning to Rank in Personal Search
2018, cited by this paper
Unbiased Learning to Rank with Unbiased Propensity Estimation
2018, influential reference
Learning a Deep Listwise Context Model for Ranking Refinement
2018, cited by this paper
Attention is All you Need
2017, cited by this paper
Unbiased Learning-to-Rank with Biased Feedback
2016, cited by this paper
Learning to Rank with Selection Bias in Personal Search
2016, cited by this paper
Comparison of Values of Pearson's and Spearman's Correlation Coefficients on the Same Sets of Data
2011, cited by this paper
From RankNet to LambdaRank to LambdaMART: An Overview
2010, cited by this paper
An experimental comparison of click position-bias models
2008, cited by this paper
Predicting clicks: estimating the click-through rate for new ads
2007, cited by this paper
The Kendall Rank Correlation Coefficient
2007, cited by this paper
A study of smoothing methods for language models applied to information retrieval
2004, cited by this paper
Cumulated gain-based evaluation of IR techniques
2002, cited by this paper
A study of smoothing methods for language models applied to Ad Hoc information retrieval
2001, cited by this paper
The PageRank Citation Ranking: Bringing Order to the Web
1999, cited by this paper
Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval
1994, influential reference
When Inverse Propensity Scoring does not Work: Affine Corrections for Unbiased Learning to Rank
year unknown, cited by this paper

CITED BY

Training Dense Retrievers with Multiple Positive Passages
2026, cites this paper