This paper makes the following contributions:
- Identification of information longevity as a key factor in crawler performance.
- Longevity measurements of real web content, and a generative model that accounts for the observed characteristics (Section 3).
- New page revisitation policies that take into account information longevity in addition to the usual factors, and avoid wastefully downloading ephemeral content (Section 4).
- Empirical study of the online revisitation problem, in which policies must sample page update behavior on the fly to learn how pages change, then use what they have learned to schedule future revisitations (Section 5).
The revisitation policies we propose are highly practical. They incur very little per-page space and time overhead. Furthermore, unlike some previously proposed policies, ours do not rely on global optimization methods, making them suitable for use in a large-scale parallel crawler. Lastly, our policies automatically adapt to shifts in page change behavior.
Our revisitation policies are based on an underlying theory of optimal page revisitation, presented next.
Chris Olston
2008-02-15