next up previous
Next: Interpolation of Divergence Up: Recrawl Scheduling Based on Previous: Bibliography


Optimal Recrawling for
Holistic Staleness

We specialize our model of optimal recrawling from Section 2.2 to the case where the divergence metric of interest is holistic staleness.

Following the model of [4], suppose each page $P$ experiences updates according to a Poisson process with rate parameter $\lambda_P$, where each update is substantial enough to cause the crawled version to become stale. Consider a particular page $P$ that was most recently refreshed at time $0$. The probability that $P$ has undergone at least one update during the interval $[0,t]$ is $1 - e^{- \lambda_P \cdot t}$. Hence the expected divergence $D(P, P')$ is given by $D^*_P(t) = 1 - e^{- \lambda_P \cdot t}$. The expected utility of refreshing $P$ at time $t$, according to Equation 4, is found to be:



\begin{displaymath}
U_P(t) = \frac{1}{\lambda_P} - \left( t + \frac{1}{\lambda_P} \right) e^{- \lambda_P \cdot t}
\end{displaymath}

The policy of refreshing a page whenever $U_P(t) \ge T$ (for some constant $T$) results in identical schedules to those resulting from the optimal policy derived in [4] for the same holistic staleness objective.



Chris Olston 2008-02-15