Next: Online Revisitation Policies Up: Analysis of Web Data Previous: Implications


Offline Page Revisitation Policies

Figure 6: Performance of offline revisitation policies.

Now that we have established information longevity as a characteristic of web evolution distinct from change frequency, we study whether there is any advantage in adopting a longevity-aware web crawling policy. For now we consider offline policies, i.e., ones that rely on a priori measurements of the data set to set the revisitation schedule (we consider online policies in Sections 4 and 5).

The offline policies we consider are:

Figure 6 shows how these policies perform on the high-quality data set. The x-axis plots normalized refresh cost ($1$ corresponds to refreshing every snapshot of every page). The y-axis plots average fragment staleness as per Equations 2 and 3 (Section 2.1). On both axes, lower is better.

Roughly half of the pages are completely static, so the largest interesting value of refresh cost is $0.5$. Even if we do refresh every non-static page at every opportunity (every two days, in our data set), staleness still does not go to zero due to divergence during the two-day period between refreshes. (Recall from Section 3.1 that we interpolate between snapshots.)
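The arithmetic behind the $0.5$ bound can be illustrated with a minimal sketch. The function name `normalized_refresh_cost` and the toy collection below are assumptions made for illustration, not part of the measured data set; the sketch only encodes the definition above, in which a cost of $1$ corresponds to refreshing every snapshot of every page.

```python
from typing import Sequence


def normalized_refresh_cost(refreshes_per_page: Sequence[int],
                            snapshots_per_page: int) -> float:
    """Fraction of all (page, snapshot) pairs that are refreshed.

    Hypothetical helper: 1.0 means every snapshot of every page is refreshed.
    """
    if snapshots_per_page <= 0:
        raise ValueError("snapshots_per_page must be positive")
    if not refreshes_per_page:
        raise ValueError("need at least one page")
    if any(r < 0 or r > snapshots_per_page for r in refreshes_per_page):
        raise ValueError("refresh counts must lie in [0, snapshots_per_page]")
    total_possible = len(refreshes_per_page) * snapshots_per_page
    return sum(refreshes_per_page) / total_possible


# Toy collection (assumed, not real data): 4 pages with 10 snapshots each.
# Two pages are completely static and are never refreshed; the other two
# are refreshed at every snapshot.
snapshots = 10
refreshes = [0, 0, snapshots, snapshots]
cost = normalized_refresh_cost(refreshes, snapshots)
print(cost)
```

With half of the toy pages static and every other page refreshed at every opportunity, the cost is 20 refreshes out of 40 possible, i.e. $0.5$. Spending more than that would only add refreshes of pages that never change.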

The FS-S policy is roughly the analogue of HS-S. Both assume stationary page change behavior (encoded as $D^*_P(\cdot)$ and $\lambda_P$, respectively). Comparing these two, we see that the fragment-based policy (FS-S) performs significantly better, especially if we consider uniform refreshing as the baseline. The fragment-based policy is geared toward refreshing content of high longevity, and avoids wasting resources on ephemeral content.

Turning to the dynamic policies FS-D and HS-D, we again see the fragment-based policy (FS-D) performing better. Also, both dynamic policies vastly outperform their static counterparts, which points to adaptiveness as a major advantage.


Chris Olston 2008-02-15