Space Footprint

Next: Comparison with Previous Work Up: Online Experiments Previous: Curve-Fitting vs. Bound-Based Policy

Space Footprint

**Figure 10:** Impact of space-saving techniques on performance.
figure

Our online revisitation policies maintain per-page state. In the presence of billions of pages, it is important to consider the total space footprint of this state.

For each page our policies maintain change profiles, each of which consists of a short sequence of numeric (time, divergence) pairs. Furthermore, for each change profile a summary of the base version must be kept, to enable computation of divergence values relative to the base version.

A page summary consists of a list of page fragments. To save space, each fragment is encoded as a hash value of the shingle string, using a collision-resistant hash function. Furthermore, following [2], rather than storing the complete list of fragments, we retain only the fragments of minimal hash value, for some constant . A simple unbiased estimator of the Jaccard distance based on minimal hash value sets is given in [2].

The experiments reported in Section 5.1 used $m = \infty$ (all fragments) and , where is the number of historical change profiles and base version summaries maintained. Figure 10 shows the performance of our curve-fitting and bound-based policies under different values of and . As we can see, there is little degradation in performance due to truncating history and using compact fragment summaries to estimate divergence.

With and , we store just one base version summary containing just 8 shingle hashes, which requires 32 bytes. In our experiments, change profile sizes ranged from 4 to 12 observations, for a space footprint of between 32 and 96 bytes. The overall space footprint is between 64 and 128 bytes per page. If there are 10 billion pages, in the worst case the total space footprint amounts to a little over a terabyte. Modern crawlers spread the per-page state across more than a thousand crawler nodes, for about a gigabyte of state per node, which easily fits in memory on each node.

Next: Comparison with Previous Work Up: Online Experiments Previous: Curve-Fitting vs. Bound-Based Policy

Chris Olston 2008-02-15