next up previous
Next: Analysis of Web Data Up: Theoretical Framework Previous: Optimal Recrawling


Discussion

Figure 2: Two divergence graphs, with utility.
Image fig-div

Figure 2 shows divergence curves for the two web pages illustrated in Figure 1. The horizontal axis of each graph plots time, assuming the crawler refreshes both pages at time $0$. The vertical axis plots divergence, under the fragment staleness metric introduced in Section 2.1.1.[*]

If we expect the divergence curve of a page to behave similarly after the next refresh as it did after the $t=0$ refresh, then we should favor refreshing Page B over refreshing Page A at the present time. The reason is that refreshing Page B gives us more ``bang for our buck,'' so to speak, than Page A. Utility at time $t$ is given by the area of the shaded region. Page B has higher utility than Page A, which matches our intuition of ``bang for buck.''

In addition to giving the intuition behind the utility formula derived in Section 2.2, this example also illustrates the appropriateness of fragment staleness as a divergence metric. Although our generic theory of optimal revisitation of Section 2.2 can be applied to holistic staleness as well (see Appendix A; our result matches that of [4]), the nature of the holistic staleness metric can lead to undesirable behavior: Under a revisitation policy that optimizes holistic staleness, Pages A and B, which undergo equally frequent changes, are treated identically--either both ignored or both refreshed with the same frequency.

One may wonder whether a simpler utility function such as $U_p(t) = D^*_P(t)$ (i.e., prioritize refreshing based on current divergence) may work nearly as well as the theoretically optimal formula given in Equation 4. We have verified empirically via simulation over real web data (see Section 3.1) that refreshing according to Equation 4 yields significantly improved staleness for the same refresh cost, compared with using the naive form $U_p(t) = D_P(t)$. In fact, the naive form even performs worse than uniform refreshing, which is to be expected in light of the findings of [4].


next up previous
Next: Analysis of Web Data Up: Theoretical Framework Previous: Optimal Recrawling
Chris Olston 2008-02-15