
Summary

We identified information longevity as a distinct characteristic of web evolution and a key factor in crawler effectiveness. Previous web evolution models do not account for the information longevity characteristics we found on the web, so we proposed a new evolution model that closely fits actual observations.

We brought our findings to bear on the recrawl scheduling problem. We began by formulating a general theory of optimal recrawling in which the optimization objective is to maximize correct information $\times$ time. The theory led us to two simple online recrawl algorithms that target longevous content and outperform previous approaches on real web data.

Chris Olston 2008-02-15