

Comparison with Previous Work

Figure 11: Comparison with previous work, on high-quality data set.

Figure 12: Comparison with previous work, on random data set.

We compare our online curve-fitting policy (Section 4.2) against the policy proposed by Cho and Garcia-Molina for optimizing holistic staleness [4], coupled with a method of estimating change frequency proposed by the same authors [5]; we refer to this combination as the CGM policy. Since that work does not propose a way of scheduling exploratory refreshes for learning purposes, we use a simple linearly growing warm-up schedule for both CGM and our own policy. After the warm-up period we allow each policy to set its own refresh schedule, and we measure post-warm-up performance.
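The warm-up schedule can be sketched as follows. This is a minimal illustration of one plausible "linearly growing" schedule, in which the gap between successive exploratory refreshes grows linearly; the function name and parameters are hypothetical, not from the paper.

```python
# Hypothetical sketch of a linearly growing warm-up schedule: the gap
# between successive exploratory refreshes of a page grows linearly,
# so early refreshes are frequent and later ones progressively sparser.
# All names and parameter values are illustrative assumptions.

def warmup_refresh_times(num_refreshes, initial_gap=1.0, gap_increment=1.0):
    """Return the times (in days) of exploratory refreshes during warm-up."""
    times = []
    t = 0.0
    gap = initial_gap
    for _ in range(num_refreshes):
        t += gap
        times.append(t)
        gap += gap_increment  # gaps grow linearly: 1, 2, 3, ...
    return times

print(warmup_refresh_times(5))  # gaps of 1,2,3,4,5 days -> [1.0, 3.0, 6.0, 10.0, 15.0]
```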

Figure 11 shows the result of this comparison, on the high-quality data set. Almost 90% of the content fragments are static, so it is trivial to achieve staleness below $0.1$. At the other extreme, 3% of the content is so dynamic that even under twice-daily refreshing it does not remain synchronized for long. Hence the ``playing field'' on which the policies compete spans the $0.03$--$0.1$ staleness range.

Figure 12 shows the same comparison, but on the random data set. Observe that the minimum staleness level achievable on the random data set is much higher than on the high-quality data set. This difference arises because random pages tend to have more dynamic content than high-quality ones, perhaps aimed at attracting the attention of search engines and users.

As expected, our policy, which is optimized for the fragment staleness metric, outperforms CGM in terms of fragment staleness per refresh cost on both data sets. Although the difference in the performance of the two policies is small in the low-cost/high-staleness part of the graphs (upper-left portion), the gap becomes substantial as we move to the moderate-cost/moderate-staleness part (middle portion) and beyond.

For example, on the high-quality data set our policy incurs refresh cost $0.2$ instead of $0.4$ to achieve the same moderate staleness level, which translates to refreshing pages once every 10 days instead of every 5 days, on average. If there are 1 billion high-quality pages, this difference amounts to 100 million saved page downloads per day.
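The savings figure above follows from simple arithmetic on the stated refresh intervals, which a short calculation makes explicit (the interval values are taken from the text; the variable names are our own):

```python
# Back-of-the-envelope check of the savings claimed above, taking the
# refresh intervals stated in the text (every 5 days for CGM, every 10
# days for our policy) as given.
pages = 1_000_000_000

downloads_per_day_cgm = pages / 5    # refresh each page once every 5 days
downloads_per_day_ours = pages / 10  # refresh each page once every 10 days

saved = downloads_per_day_cgm - downloads_per_day_ours
print(int(saved))  # 100,000,000 saved page downloads per day
```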

CGM assumes that every time a page changes, its old content is wiped out (i.e., churn behavior), and therefore ignores all frequently changing pages. In contrast, our policy differentiates between churn and scroll behavior, and revisits scroll content often to acquire its long-lived content. The only way to get CGM to visit that content is to have it visit all fast-changing content, thus incurring unnecessarily high refresh cost.
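The distinction between the two behaviors can be captured in a toy decision rule. This is a hedged sketch, not the paper's actual policy: the function name, the behavior labels, and the rate threshold are all hypothetical, chosen only to contrast the two cases.

```python
# Illustrative contrast between the two change behaviors discussed above.
# Under "churn," new content replaces old, so frequently revisiting a
# fast-changing page yields little durable content. Under "scroll," new
# content is appended and persists, so frequent revisits pay off.
# All names and thresholds here are assumptions for illustration.

def worth_frequent_revisit(change_rate, behavior, churn_rate_threshold=1.0):
    """Decide whether a page merits frequent revisits.

    change_rate: average content changes per day.
    behavior: "churn" (old content overwritten) or "scroll" (content appended).
    """
    if behavior == "scroll":
        # Appended content is long-lived; revisit often to capture it.
        return True
    # Churn: fast-changing pages are never synchronized for long, so a
    # CGM-style policy reasonably skips them once the rate is too high.
    return change_rate < churn_rate_threshold

print(worth_frequent_revisit(5.0, "scroll"))  # True
print(worth_frequent_revisit(5.0, "churn"))   # False
```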


Chris Olston 2008-02-15