Setting the Utility Threshold

Overall, crawling resources must be shared between the discovery and retrieval of new content and the refreshing of old content [7]. Hence there is an intrinsic tradeoff between freshness and coverage. In view of this tradeoff, the following overall crawling strategy seems appropriate: when there is an opportunity to boost freshness significantly by refreshing old content, do so; dedicate all other resources to acquiring new content.

From Section 2.2 we know that basing refresh decisions on a fixed threshold of utility, measured according to Equation 4, is optimal in terms of freshness achieved per unit cost. We leave the utility threshold $T$ as a parameter to be set by a human administrator who can judge the relative importance of freshness and coverage in the appropriate business context. $T$ is set properly iff (1) it would be preferable to receive a freshness boost of magnitude $T$ (in units of divergence $\times$ time) rather than download a new page, and (2) it would be preferable to download a new page rather than receive a freshness boost of $T - \epsilon$.
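
To make the threshold rule concrete, the following Python sketch applies a fixed utility threshold to a set of refresh candidates. It assumes each page's utility has already been computed via Equation 4 (not reproduced here); the names Page, utility, and select_refreshes are illustrative, not from the paper.

from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Page:
    url: str
    utility: float  # expected freshness boost per unit cost (Equation 4, assumed precomputed)

def select_refreshes(pages: Iterable[Page], T: float) -> List[Page]:
    """Refresh exactly those pages whose utility meets the threshold T.

    Pages below T are skipped so that the remaining crawl budget can be
    spent on discovering and downloading new content (coverage).
    """
    return [p for p in pages if p.utility >= T]

if __name__ == "__main__":
    candidates = [Page("example.org/a", 0.8), Page("example.org/b", 0.3)]
    # With T = 0.5, only the first page is scheduled for refresh; the second
    # is deferred in favor of acquiring new pages.
    print([p.url for p in select_refreshes(candidates, T=0.5)])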

In a parallel crawler, the value of $T$ may be broadcast to all nodes at the outset of crawling (and during occasional global tuning). Subsequently, all refresh scheduling decisions are local, because they depend only on $T$ and a given page's change profiles.
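
As an illustration of this locality, the sketch below (hypothetical names throughout) has a coordinator broadcast $T$ once to all crawler nodes; thereafter each node schedules refreshes using only $T$ and the utilities it derives locally from its own pages' change profiles, with no further coordination.

from typing import Dict, List

class CrawlerNode:
    def __init__(self, page_utilities: Dict[str, float]):
        # URL -> utility estimated locally from each page's change profile
        # (via Equation 4; the formula itself is assumed, not shown here).
        self.page_utilities = page_utilities
        self.T = None

    def receive_threshold(self, T: float) -> None:
        # Called when the coordinator broadcasts T (at the outset of crawling
        # and during occasional global tuning).
        self.T = T

    def local_refresh_schedule(self) -> List[str]:
        # Purely local decision: depends only on T and locally held utilities.
        return [url for url, u in self.page_utilities.items() if u >= self.T]

def broadcast_threshold(nodes: List[CrawlerNode], T: float) -> None:
    for node in nodes:
        node.receive_threshold(T)

if __name__ == "__main__":
    nodes = [CrawlerNode({"site-x/1": 0.9, "site-x/2": 0.2}),
             CrawlerNode({"site-y/1": 0.6})]
    broadcast_threshold(nodes, T=0.5)
    print([n.local_refresh_schedule() for n in nodes])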

