

Data Sets

We use two real web data sets: a high-quality data set and a random data set.


Within each data set we assign uniform importance weights ($W_P = 1$ for every page).

The high-quality data set is of significantly more interest than the random data set, because crawlers typically avoid recrawling low-quality pages frequently; the interesting question is how frequently to recrawl each high-quality page. Due to space constraints, for most of our experiments we report results only on the high-quality data set. In general the two data sets yield similar results.

Unfortunately, a few page snapshots are missing from the data, because in some cases the server hosting a given page was unreachable, even after several retries. In these rare cases we substituted the content from the previous snapshot.
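
A minimal sketch of this carry-forward substitution is given below (Python, for illustration only; the names Snapshot and fill_missing are hypothetical and not part of our measurement code):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Snapshot:
    url: str
    content: Optional[str]  # None when every fetch attempt failed

def fill_missing(snapshots):
    """Substitute a missing snapshot's content with the previous snapshot's."""
    filled = []
    last_content = None
    for snap in snapshots:
        if snap.content is None and last_content is not None:
            snap = Snapshot(snap.url, last_content)  # carry the old content forward
        if snap.content is not None:
            last_content = snap.content
        filled.append(snap)
    return filled

# Example: the middle snapshot failed, so it inherits the first snapshot's content.
history = [Snapshot("p", "v1"), Snapshot("p", None), Snapshot("p", "v2")]
print([s.content for s in fill_missing(history)])  # -> ['v1', 'v1', 'v2']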

To evaluate page revisitation policies (Sections 3.4 and 5), we need a notion of ``ground truth.'' Since we are not the originators of the web pages in our data sets, we do not have access to complete update histories. The only information available to us comes from our bi-daily snapshots, and we need to interpolate. Our method of interpolation is described in Appendix B.
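
For illustration only, one simple interpolation scheme is to assume that each observed change occurred at the midpoint of the snapshot interval in which it was detected; our actual method (Appendix B) differs, and the names below are hypothetical:

def midpoint_change_times(snapshot_times, contents):
    """Estimate change times from snapshot timestamps and page contents."""
    changes = []
    for i in range(1, len(snapshot_times)):
        if contents[i] != contents[i - 1]:  # a change was observed in this interval
            changes.append((snapshot_times[i - 1] + snapshot_times[i]) / 2)
    return changes

# Example: snapshots at times 0, 2, 4; a change is detected between times 2 and 4.
print(midpoint_change_times([0, 2, 4], ["a", "a", "b"]))  # -> [3.0]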

