next up previous
Next: Discussion Up: Theoretical Framework Previous: Page Divergence Metric


Optimal Recrawling

A theory of optimal recrawling[*] can be developed in a manner that is independent of any particular choice of per-page divergence metric. The goal is a page revisitation policy that achieves minimal time-averaged divergence within a given refresh cost budget. For the sake of the theory let us assume that for each page $P$, divergence depends only on the time $t_P$ since the last refresh of $P$. Assume furthermore that, other than in the case of a refresh event, divergence does not decrease. Under these assumptions we can re-express divergence as $D(P(t), P'(t)) = D^*_P(t - t_P)$, for some monotonic function $D^*_P(\cdot)$.

Let $T$ be any nonnegative constant. It can be shown using the method of Lagrange Multipliers that the following revisitation policy achieves minimal divergence compared with all policies that perform an equal number of refreshes:



At each point in time $t$, refresh exactly those pages that have $U_P(t - t_P) \ge T$, where

\begin{displaymath}
U_P(t) = t \cdot D^*_P(t) - \int_0^t D^*_P(x) \, dx
\end{displaymath} (4)


Intuitively speaking, $U_P(t)$ gives the utility of refreshing page $P$ at time $t$ (relative to the last refresh time of $P$). The constant $T$ represents a utility threshold; we are to refresh only those pages for which the utility of refreshing it is at least $T$. The unit of utility is divergence $\times$ time.


next up previous
Next: Discussion Up: Theoretical Framework Previous: Page Divergence Metric
Chris Olston 2008-02-15