Backup or preservation of websites is often not considered until after a catastrophic event has occurred. In the face of complete website loss, “lazy” webmasters or concerned third parties may be able to recover some of their website from the Internet Archive. Other pages may also be salvaged from commercial search engine caches. We introduce the concept of “lazy preservation”: digital preservation performed as a result of the normal operations of the Web infrastructure (search engines and caches). We present Warrick, a tool that automates website reconstruction from the Internet Archive, Google, MSN, and Yahoo. Using Warrick, we have reconstructed 24 websites of varying sizes and composition to demonstrate the feasibility and limitations of website reconstruction from the public Web infrastructure. To measure Warrick’s window of opportunity, we have profiled the time required for new Web resources to enter and leave search engine caches.
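As an illustration of the lookup step such a reconstruction depends on, the sketch below builds candidate URLs where publicly cached copies of a lost page might be found. This is a minimal illustration, not Warrick's actual code: the endpoint patterns are assumptions (the Wayback Machine's `web/*/URL` listing form and the search-engine `cache:` query operator, which Google supported at the time), and MSN's and Yahoo's cache endpoints are not shown.

```python
# Illustrative sketch only: construct lookup URLs for publicly cached
# copies of a lost page. Endpoint formats are assumptions for
# illustration, not Warrick's implementation.
from urllib.parse import quote

def cache_lookup_urls(page_url: str) -> dict:
    """Return candidate URLs where a cached copy of page_url may live."""
    encoded = quote(page_url, safe="")
    return {
        # Internet Archive Wayback Machine snapshot listing for the URL.
        "internet_archive": f"https://web.archive.org/web/*/{page_url}",
        # Search-engine cache via the "cache:" query operator
        # (supported by Google circa 2005; since retired).
        "google_cache": f"https://www.google.com/search?q=cache:{encoded}",
    }

urls = cache_lookup_urls("http://example.com/lost-page.html")
```

A real reconstructor would fetch each candidate, keep the freshest copy of every resource, and crawl outward from recovered pages to rebuild the site graph.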
Reconstructing Websites for the Lazy Webmaster
F. McCown, Joan A. Smith, Michael L. Nelson, J. Bollen
Published 2005 in arXiv.org
PUBLICATION RECORD
- Publication year: 2005
- Venue: arXiv.org
- Publication date: 2005-12-16
- Fields of study: Computer Science
- Source metadata: Semantic Scholar
REFERENCES
- 33 references
CITED BY
- 8 citing papers