Next: Interpolation of Divergence Up: Recrawl Scheduling Based on Previous: Bibliography

Optimal Recrawling for
Holistic Staleness

We specialize our model of optimal recrawling from Section 2.2 to the case where the divergence metric of interest is holistic staleness.

Following the model of [4], suppose each page experiences updates according to a Poisson process with rate parameter $\lambda_P$ , where each update is substantial enough to cause the crawled version to become stale. Consider a particular page that was most recently refreshed at time . The probability that has undergone at least one update during the interval is $1 - e^{- \lambda_P \cdot t}$ . Hence the expected divergence is given by $D^*_P(t) = 1 - e^{- \lambda_P \cdot t}$ . The expected utility of refreshing at time , according to Equation 4, is found to be:

$\begin{displaymath} U_P(t) = \frac{1}{\lambda_P} - \left( t + \frac{1}{\lambda_P} \right) e^{- \lambda_P \cdot t} \end{displaymath}$

The policy of refreshing a page whenever $U_P(t) \ge T$ (for some constant ) results in identical schedules to those resulting from the optimal policy derived in [4] for the same holistic staleness objective.

Chris Olston 2008-02-15

Optimal Recrawling for Holistic Staleness

Optimal Recrawling for
Holistic Staleness