Missing web pages (pages that return the 404 "Page Not Found" error) are part of the browsing experience. The manual use of search engines to rediscover missing pages can be frustrating and unsuccessful. We compare four automated methods for rediscovering web pages. We extract the page's title, generate the page's lexical signature (LS), obtain the page's tags from the bookmarking website delicious.com, and generate an LS from the page's link neighborhood. We use the output of all methods to query Internet search engines and analyze their retrieval performance. Our results show that both LSs and titles perform fairly well, with over 60% of URIs returned top ranked from Yahoo!. However, the combination of methods improves the retrieval performance. Considering the complexity of the LS generation, querying the title first and, in case of insufficient results, querying the LSs second is the preferable setup. This combination accounts for more than 75% of URIs returned top ranked.
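The title-first, LS-second setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes an earlier copy of the page's title and text is available, it approximates a lexical signature as the top-k terms by a simple TF-IDF score (the abstract does not specify the weighting), and `search`, `doc_freq`, `corpus_size`, and `max_rank` are hypothetical names introduced here. In particular, `search` stands in for any function that maps a query string to a ranked list of URIs.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(page_text: str, doc_freq: Dict[str, int],
                      corpus_size: int, k: int = 5) -> List[str]:
    """Return the k terms with the highest TF-IDF score (assumed LS definition)."""
    tf = Counter(tokenize(page_text))

    def score(term: str) -> float:
        # Smoothed IDF; doc_freq is a hypothetical term -> document count table.
        return tf[term] * math.log(corpus_size / (1 + doc_freq.get(term, 0)))

    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]


def rank_of(uri: str, results: List[str]) -> Optional[int]:
    """1-based rank of uri in the result list, or None if it is absent."""
    return results.index(uri) + 1 if uri in results else None


def rediscover(uri: str, title: str, page_text: str,
               search: Callable[[str], List[str]],
               doc_freq: Dict[str, int], corpus_size: int,
               max_rank: int = 1) -> Optional[Tuple[str, int]]:
    """Query with the title first; fall back to the LS if the title result is insufficient."""
    ls_query = " ".join(lexical_signature(page_text, doc_freq, corpus_size))
    for method, query in (("title", title), ("ls", ls_query)):
        rank = rank_of(uri, search(query))
        if rank is not None and rank <= max_rank:
            return method, rank
    return None
```

Here `max_rank=1` treats only a top-ranked hit as sufficient, mirroring the "top ranked" criterion in the abstract; a looser threshold such as the top 10 is an equally plausible choice for what counts as "insufficient results".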
Evaluating methods to rediscover missing web pages from the web infrastructure
Martin Klein, Michael L. Nelson
Published 2009 in ACM/IEEE Joint Conference on Digital Libraries
ABSTRACT
PUBLICATION RECORD
- Publication year
2009
- Venue
ACM/IEEE Joint Conference on Digital Libraries
- Publication date
2009-07-14
- Fields of study
Computer Science
- Source metadata
Semantic Scholar
REFERENCES
38 references
CITED BY
20 citing papers