Resource Selection for Federated Search on the Web

Dong Nguyen, T. Demeester, D. Trieschnigg, D. Hiemstra

Published 2016 in arXiv.org

ABSTRACT

A publicly available dataset for federated search reflecting a real web environment has long been absent, making it difficult for researchers to test the validity of their federated search algorithms for the web setting. We present several experiments and analyses on resource selection on the web using a recently released test collection containing the results from more than a hundred real search engines, ranging from large general web search engines such as Google, Bing and Yahoo to small domain-specific engines. First, we experiment with estimating the size of uncooperative search engines on the web using query-based sampling and propose a new method using the ClueWeb09 dataset. We find the size estimates to be highly effective in resource selection. Second, we show that an optimized federated search system based on smaller web search engines can be an alternative to a system using large web search engines. Third, we provide an empirical comparison of several popular resource selection methods and find that these methods are not readily suitable for resource selection on the web. Challenges include the sparse resource descriptions and the extremely skewed sizes of collections.
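
To make the role of size estimates concrete, the following is a minimal sketch of a ReDDE-style scoring step, a well-known size-based approach from the resource selection literature. It is an illustration under stated assumptions, not the configuration evaluated in the paper: the function names, the `top_n` cutoff, and the toy data are all hypothetical. Each sampled document that ranks highly in a central sample index is assumed to stand in for `estimated_size / sample_size` documents of its resource.

```python
from collections import Counter


def redde_scores(ranked_sample_docs, estimated_sizes, sample_sizes, top_n=100):
    """Score resources from a ranking over a central index of sampled documents.

    ranked_sample_docs: resource ids, one per sampled document, in rank order
                        (hypothetical output of a query against the sample index).
    estimated_sizes:    estimated number of documents per resource.
    sample_sizes:       number of documents sampled from each resource.
    """
    # Count how many of the top-ranked sampled documents come from each resource.
    hits = Counter(ranked_sample_docs[:top_n])
    scores = {}
    for resource, count in hits.items():
        # Each sampled document represents this many documents in the full resource.
        scale = estimated_sizes[resource] / sample_sizes[resource]
        scores[resource] = count * scale
    return scores


def select_resources(scores, k):
    """Return the k highest-scoring resource ids."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (assumed, not taken from the paper): one very large and one small engine.
ranking = ["small", "small", "big", "small", "big"]
estimated_sizes = {"big": 1_000_000, "small": 5_000}
sample_sizes = {"big": 300, "small": 300}

scores = redde_scores(ranking, estimated_sizes, sample_sizes)
selected = select_resources(scores, k=1)
```

In this toy setting the small engine contributes more of the top-ranked samples, yet the large engine is selected because its size estimate dominates the score. This is one way the extremely skewed collection sizes mentioned in the abstract can affect size-based selection.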

PUBLICATION RECORD

Publication year: 2016
Venue: arXiv.org
Publication date: 2016-09-01
Fields of study: Computer Science
Identifiers: arXiv 1609.04556
Source metadata: Semantic Scholar

CITED BY

ReSLLM: Large Language Models are Strong Resource Selectors for Federated Search (2024)
Snippet-based result merging in federated search (2023)
Federated search techniques: an overview of the trends and state of the art (2023)
Real-time Web Search Framework for Performing Efficient Retrieval of Data (2021)
Embedding based learning for collection selection in federated search (2020)
SAMA: a real-time Web search architecture (2020)