next up previous
Next: About this document ... Up: Recrawl Scheduling Based on Previous: Optimal Recrawling for Holistic


Interpolation of Divergence

Our interpolation method starts by assigning to each page an eagerness score, based on whether the divergence curve tends to be concave (high eagerness) or convex (low eagerness). A page with a high eagerness score contains dynamic content that has rapid turnover, such as advertisements.

Given three consecutive snapshots $P(t_0)$, $P(t_1)$ and $P(t_2)$ of a page $P$, eagerness is measured as:



\begin{displaymath}E(P, t_0) = \frac{D(P(t_0), P(t_1))}{D(P(t_0), P(t_2))} \end{displaymath}

The overall eagerness score $E(P) \in [0,1]$ is found by averaging over all $(t_0, t_1, t_2)$ triples for which all three snapshots were downloaded successfully.

Now, suppose we have two consecutive snapshots $P(t_x)$ and $P(t_{x+1})$ and wish to estimate divergence inside the interval $(P(t_x), P(t_{x+1}))$, relative to some base version $P(t_B)$. We do so by taking the average of the two endpoint divergence values, weighted by eagerness. Mathematically, for any $t \in (t_x, t_{x+1})$ we assume $D(P(t_B), P(t)) = (1-E(P)) \cdot D(P(t_B), P(t_x)) + E(P) \cdot D(P(t_B), P(t_{x+1}))$.



Chris Olston 2008-02-15