We identified information longevity as a distinct characteristic of web evolution and a key factor in crawler effectiveness. Previous web evolution models do not account for the information longevity characteristics we found on the web, so we proposed a new evolution model that closely fits actual observations.
We brought our findings to bear on the recrawl scheduling problem. We began by formulating a general theory of optimal recrawling in which the objective is to maximize correct information time. The theory led us to two simple online recrawl algorithms that target longevous content and outperform previous approaches on real web data.