

Data Sets

We use two real web data sets: a high-quality data set and a random data set.


Within each data set we assign uniform importance weights ($W_P = 1$ for every page).

The high-quality data set is of significantly more interest than the random data set, because crawlers typically avoid recrawling low-quality pages frequently; the interesting question is how frequently to recrawl each high-quality page. Due to space constraints, for most of our experiments we report results only on the high-quality data set. In general the two data sets yield similar results.

Unfortunately, a few page snapshots are missing from the data, because in some cases the server hosting a given page was unreachable, even after several retries. In these rare cases we substituted the content from the previous snapshot.
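
A minimal sketch of this carry-forward substitution is given below (Python, for illustration only; the names Snapshot and fill_missing are hypothetical and not part of our measurement code):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Snapshot:
    url: str
    content: Optional[str]  # None when every fetch attempt failed

def fill_missing(snapshots):
    """Substitute a missing snapshot's content with the previous snapshot's."""
    filled = []
    last_content = None
    for snap in snapshots:
        if snap.content is None and last_content is not None:
            snap = Snapshot(snap.url, last_content)  # carry the old content forward
        if snap.content is not None:
            last_content = snap.content
        filled.append(snap)
    return filled

# Example: the middle snapshot failed, so it inherits the first snapshot's content.
history = [Snapshot("p", "v1"), Snapshot("p", None), Snapshot("p", "v2")]
print([s.content for s in fill_missing(history)])  # -> ['v1', 'v1', 'v2']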

To evaluate page revisitation policies (Sections 3.4 and 5), we need a notion of ``ground truth.'' Since we are not the originators of the web pages in our data sets, we do not have access to complete update histories. The only information available to us comes from our bi-daily snapshots, and we need to interpolate. Our method of interpolation is described in Appendix B.
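
For illustration only, one simple interpolation scheme is to assume that each observed change occurred at the midpoint of the snapshot interval in which it was detected; our actual method (Appendix B) differs, and the names below are hypothetical:

def midpoint_change_times(snapshot_times, contents):
    """Estimate change times from snapshot timestamps and page contents."""
    changes = []
    for i in range(1, len(snapshot_times)):
        if contents[i] != contents[i - 1]:  # a change was observed in this interval
            changes.append((snapshot_times[i - 1] + snapshot_times[i]) / 2)
    return changes

# Example: snapshots at times 0, 2, 4; a change is detected between times 2 and 4.
print(midpoint_change_times([0, 2, 4], ["a", "a", "b"]))  # -> [3.0]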

