Page Divergence Metric


Modern web pages tend to consist of multiple content regions stitched together, e.g., static logos and navigation bars, dynamic advertisements, and a central region containing the main content of the page, which is also dynamic but behaves quite differently from advertisements. Consequently, page divergence metrics that model a page as an atomic unit of content are no longer a good fit.

The page divergence metric we propose is called fragment staleness. It measures the fraction of content fragments that differ between two versions of a page. Mathematically, we treat each page version as a set of fragments and compare two versions $P$ and $P'$ using the well-known Jaccard distance between sets:

$$ D(P, P') = 1 - \frac{\vert F(P) \cap F(P')\vert}{\vert F(P) \cup F(P')\vert} \tag{3} $$

where $F(P)$ denotes the set of fragments that comprise page $P$.
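As a minimal sketch of the formula above, the following Python function computes the Jaccard distance between two fragment sets. The function name `jaccard_distance` and the convention of returning 0 when both sets are empty are assumptions made for illustration; the text does not specify the empty case.

```python
def jaccard_distance(fragments_a, fragments_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of hashable fragments."""
    a, b = set(fragments_a), set(fragments_b)
    union = a | b
    if not union:
        # Assumption: two empty fragment sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two versions sharing 2 of 4 distinct fragments diverge by one half.
old_version = {"a", "b", "c"}
new_version = {"b", "c", "d"}
print(jaccard_distance(old_version, new_version))
```

Identical sets yield 0 and disjoint non-empty sets yield 1, so the value can be read as the fraction of distinct fragments that the two versions do not share.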

To divide a page into fragments, we require a method that is amenable to efficient automated computation, because we will use it in our crawling algorithms. Hence we rule out sophisticated semantic approaches to page segmentation, which are likely to be too heavyweight and noisy for this purpose. A simple and robust alternative, well aligned with the way search engines tend to treat documents, is to treat sequences of consecutive words as coherent fragments. We adopt the well-known shingling method [2], which emits a set of content fragments, one for each word-level $k$-gram ($k \ge 1$ is a parameter).
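The following Python sketch illustrates word-level shingling combined with the fragment staleness formula. It is not the authors' implementation: the whitespace tokenization, the function names `shingles` and `fragment_staleness`, the default `k=2`, and the handling of empty pages are all assumptions made for illustration.

```python
def shingles(text, k=2):
    """Return the set of word-level k-grams (as tuples) found in text."""
    if k < 1:
        raise ValueError("k must be at least 1")
    words = text.split()  # assumption: words are whitespace-delimited
    # A text with fewer than k words yields an empty set.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def fragment_staleness(cached_text, live_text, k=2):
    """Jaccard distance between the shingle sets of two page versions."""
    cached, live = shingles(cached_text, k), shingles(live_text, k)
    union = cached | live
    if not union:
        # Assumption: two pages with no fragments are treated as identical.
        return 0.0
    return 1.0 - len(cached & live) / len(union)


cached = "the quick brown fox"
live = "the quick red fox"
print(fragment_staleness(cached, live, k=1))
print(fragment_staleness(cached, live, k=2))
```

In this example a single changed word alters one of four 1-grams but two of three 2-grams, so larger values of `k` make the metric more sensitive to small edits.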

Our rationale for adopting the fragment staleness metric is as follows. Consider what would happen if the cached copy of a page contains some fragments that do not appear in the web-resident version: the search engine might display the page's URL in response to a user's search query string that matches the fragment; yet if the user clicks through she may find no relevant content. Conversely, if a query matches some fragment in the web-resident copy that does not appear in the cached copy, the search engine would overlook a potentially relevant result. Fragment staleness thus reflects the likelihood of introducing false positives and false negatives into search results.
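One way to make this connection explicit, offered here as an illustrative rewriting rather than a step taken in the original text: take $P$ to be the cached copy and $P'$ the web-resident copy. Since $\vert A \cup B\vert - \vert A \cap B\vert = \vert A \setminus B\vert + \vert B \setminus A\vert$ for any two sets, Equation (3) can be written as

$$ D(P, P') = \frac{\vert F(P) \setminus F(P')\vert + \vert F(P') \setminus F(P)\vert}{\vert F(P) \cup F(P')\vert} $$

The first term in the numerator counts fragments present only in the cached copy (potential false positives), and the second counts fragments present only in the web-resident copy (potential false negatives).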

Fragment staleness generalizes the "freshness" metric of [4,6], which we refer to in this paper as holistic staleness (staleness is the inverse of freshness). Holistic staleness implicitly assumes one content fragment per page and yields a Boolean response: each page is either "fresh" or "stale."
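To see why the one-fragment case is Boolean, suppose (as an illustrative reading of the sentence above) that each page is its own single fragment, so $F(P) = \{P\}$. Substituting into Equation (3) gives

$$ D(P, P') = 1 - \frac{\vert \{P\} \cap \{P'\}\vert}{\vert \{P\} \cup \{P'\}\vert} = \begin{cases} 0 & \text{if } P = P' \\ 1 & \text{if } P \neq P' \end{cases} $$

because the intersection has one element and the union has one element when the versions are identical, while the intersection is empty when they differ.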


Chris Olston 2008-02-15