Optimal Recrawling

Next: Discussion Up: Theoretical Framework Previous: Page Divergence Metric

Optimal Recrawling

A theory of optimal recrawling can be developed in a manner that is independent of any particular choice of per-page divergence metric. The goal is a page revisitation policy that achieves minimal time-averaged divergence within a given refresh cost budget. For the sake of the theory let us assume that for each page , divergence depends only on the time since the last refresh of . Assume furthermore that, other than in the case of a refresh event, divergence does not decrease. Under these assumptions we can re-express divergence as , for some monotonic function $D^*_P(\cdot)$ .

Let be any nonnegative constant. It can be shown using the method of Lagrange Multipliers that the following revisitation policy achieves minimal divergence compared with all policies that perform an equal number of refreshes:

At each point in time , refresh exactly those pages that have $U_P(t - t_P) \ge T$ , where

$\begin{displaymath} U_P(t) = t \cdot D^*_P(t) - \int_0^t D^*_P(x) \, dx \end{displaymath}$

(4)

Intuitively speaking, gives the utility of refreshing page at time (relative to the last refresh time of ). The constant represents a utility threshold; we are to refresh only those pages for which the utility of refreshing it is at least . The unit of utility is divergence $\times$ time.

Next: Discussion Up: Theoretical Framework Previous: Page Divergence Metric

Chris Olston 2008-02-15