Haifa 32000, Israel

Haifa 32000, Israel

Copyright is held by the World Wide Web Conference Committee (IW3C2).
Distribution of these papers is limited to classroom use, and personal use by others.

WWW 2006, May 23.26, 2006, Edinburgh, Scotland.

ACM 1-59593-323-9/06/0005.

We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.

The technique of Bharat and Broder suffers from two well
recorded biases: it favors long documents and highly ranked
documents. In this paper we introduce two novel sampling
techniques: a lexicon-based technique and a random walk
technique. Our methods produce *biased* sample
documents, but each sample is accompanied by a corresponding
``weight'', which represents the probability of this document
to be selected in the sample. The samples, in conjunction with
the weights, are then used to *simulate* near-uniform
samples. To this end, we resort to three well known Monte Carlo
simulation methods: *rejection sampling*, *importance
sampling* and the *Metropolis-Hastings
algorithm*.

We analyze our methods rigorously and prove that under
plausible assumptions, our techniques are *guaranteed*
to produce near-uniform samples from the search engine's index.
Experiments on a corpus of 2.4 million documents substantiate
our analytical findings and show that our algorithms do not
have significant bias towards long or highly ranked documents.
We use our algorithms to collect fresh data about the relative
sizes of Google, MSN Search, and Yahoo!.

H.3.3: Information Search and Retrieval.

Measurement, Algorithms.

search engines, benchmarks, sampling, size estimation.

1 Introduction

The latest round in the search engine size wars (cf. [22]) erupted last August after Yahoo! claimed [20] to index more than 20 billion documents. At the same time Google reported only 8 billion pages in its index, but simultaneously announced [3] that its index is three times larger than its competition's. This surreal debate underscores the lack of widely acceptable benchmarks for search engines.

Current evaluation methods for search engines [11,14,6] are labor-intensive and are based on anecdotal sets of queries or on fixed TREC data sets. Such methods do not provide statistical guarantees about their results. Furthermore, when the query test set is known in advance, search engines can manually adapt their results to guarantee success in the benchmark.

In an attempt to come up with reliable automatic benchmarks for search engines, Bharat and Broder [4] proposed the following problem: can we sample random documents from a search engine's index using only the engine's public interface? Unlike the manual methods, random sampling offers statistical guarantees about its test results. It is important that the sampling is done only via the public interface, and not by asking the search engine itself to collect the sample documents, because we would like the tests to be objective and not to rely on the goodwill of search engines. Furthermore, search engines seem reluctant to allow random sampling from their index, because they do not want third parties to dig into their data.

Random sampling can be used to test the quality of search
engines under a multitude of criteria: (1) *Overlap and
relative sizes*: we can find out, e.g., what fraction of the
documents indexed by Yahoo! are also indexed by Google and vice
versa. Such size comparisons can indicate which search engines
have better recall for narrow-topic queries. (2) *Topical
bias*: we can identify themes or topics that are
overrepresented or underrepresented in the index. (3)
*Freshness evaluation*: we can evaluate the freshness of
the index, by estimating the fraction of ``dead links'' it
contains. (4) *Spam evaluation*: using a spam classifier,
we can find the fraction of spam pages in the index. (5)
*Security evaluation*: using an anti-virus software, we
can estimate the fraction of indexed documents that are
contaminated by viruses.

**The Bharat-Broder approach. ** Bharat and Broder proposed
the following simple algorithm for uniformly sampling documents
from a search engine's index. The algorithm successively
formulates ``random'' queries, submits the queries to the search
engine, and picks uniformly chosen documents from the result sets
returned. In order to construct the random queries, uses a
*lexicon* of terms that appear in web documents. Each term
in the lexicon should be accompanied by an estimate of its
frequency on the web. Random queries are then formulated as
conjunctions or disjunctions of terms that are randomly selected
from the lexicon, based on their estimated frequency. The lexicon
is constructed at a pre-processing step by crawling a large
corpus of documents (Bharat and Broder crawled the Yahoo!
directory).

As Bharat and Broder noted in the original article [4] and was later observed by
subsequent studies [7],
the method suffers from severe biases. The first bias is towards
long, ``content-rich'', documents, simply because these documents
match many more queries than short documents. An extreme example
is online dictionaries and word lists (such as the ispell
dictionaries), which will be returned as the result of almost any
query. Worse than that, when queries are formulated as
conjunctions or disjunctions of unrelated terms, typically
*only* dictionaries and word lists match the queries.
Another major problem is that search engines do not allow access
to the full list of results, but rather only to the top
ones (where is usually 1,000). Since
the Bharat-Broder technique samples documents only from the top
, it is biased towards
documents that have high static rank. Finally, the accuracy of
the Bharat-Broder method is directly related to the quality of
the term frequency estimates it uses. In particular, if these
estimates are biased, then the same bias will be reflected in the
samples. Collecting accurate term statistics is a major problem
by itself, and it is not clear that using a web directory like
Yahoo! is sufficient.

**Our contributions. ** We propose two novel methods for
sampling pages from a search engine's index. Our first technique
uses a lexicon to formulate random queries, but unlike the
Bharat-Broder approach, does not need to know term frequencies.
Our second technique is based on a random walk on a virtual graph
defined over the documents in the index. This technique does not
need a lexicon at all.

Both our techniques, like the Bharat-Broder method, produce
*biased* samples. That is, some documents are more likely
to be sampled than others. Yet, our algorithms have one crucial
advantage: they produce together with each sample document
a corresponding
``weight''
, which
represents the probability
of the
document to be sampled. This seemingly minor difference turns out
to be extremely significant. The weights allow us to apply
*stochastic simulation* methods on the samples and
consequently obtain *uniform*, unbiased, samples from the
search engine's index!

A simulation method accepts samples taken from a *trial
distribution* and
simulates sampling from a *target distribution*
. In order for the
simulation to be feasible, the simulator needs to be able to
compute
and
, at least
up to normalization, given any instance
. The simulation
has some overhead, which depends on how far and are from each other. In
our case is the uniform
distribution over the search engine's index, and is the distribution of
samples generated by our samplers. We employ three Monte Carlo
simulation methods: *rejection sampling*, *importance
sampling*, and the *Metropolis-Hastings*
algorithm.

One technical difficulty in applying simulation methods in our
setting is that the weights produced by our samplers are only
*approximate*. To the best of our knowledge, stochastic
simulation with approximate weights has not been addressed
before. We are able to show that the rejection sampling method
still works even when provided with approximate weights. The
distribution of the samples it generates is no longer identical
to the target distribution
, but is rather only close to
. Similar analysis for importance sampling and for the
Metropolis-Hastings algorithm is deferred to future work, but our
empirical results suggest that they too are affected only
marginally by the approximate weights.

**Pool-based sampler. ** A *query pool* is a
collection of queries. Our pool-based sampler assumes knowledge
of some query pool .
The terms constituting queries in the pool can be collected by
crawling a large corpus, like the Yahoo! or ODP [9] directories. We stress again
that knowledge of the frequencies of these terms is not
needed.

The *volume* of a query
is the number of
documents indexed by the search engine and that match
. We first show
that if the sampler could somehow sample queries from the pool
proportionally to their volume, then we could attach to each
sample a weight. Then, by applying any of the simulation methods
(say, rejection sampling), we would obtain truly uniform samples
from the search engine's index. Sampling queries according to
their volume is tricky, though, because we do not know a priori
the volume of queries. What we do instead is sample queries from
the pool according to some other, arbitrary, distribution (e.g.,
the uniform one) and then *simulate* sampling from the
volume distribution. To this end, we use stochastic simulation
again. Hence, stochastic simulation is used twice: first to
generate the random queries and then to generate the uniform
documents.

We rigorously analyze the pool-based sampler and identify the properties of the query pool that make this technique accurate and efficient. We find that using a pool of phrase queries is much more preferable to using conjunctive or disjunctive queries, like the ones used by Bharat and Broder.

**Random walk sampler. ** We propose a completely different
sampler, which does not use a term lexicon at all. This sampler
performs a random walk on a virtual graph defined over the
documents in the index. The limit equilibrium distribution of
this random walk is the uniform distribution over the documents,
and thus if we run the random walk for sufficiently many steps,
we are guaranteed to obtain near-uniform samples from the
index.

The graph is defined as follows: two documents are connected by an edge iff they share a term or a phrase. This means that both documents are guaranteed to belong to the result set of a query consisting of the shared term/phrase. Running a random walk on this graph is simple: we start from an arbitrary document, at each step choose a random term/phrase from the current document, submit a corresponding query to the search engine, and move to a randomly chosen document from the query's result set.

The random walk as defined does not converge to the uniform distribution. In order to make it uniform, we apply the Metropolis-Hastings algorithm. We analyze the random walk experimentally, and show that a relatively small number of steps is needed to approach the limit distribution.

**Experimental results. ** To validate our techniques, we
crawled 2.4 million English pages from the ODP hierarchy
[9], and built a search
engine over these pages. We used a subset of these pages to
create the query pool needed for our pool-based sampler and for
the Bharat-Broder sampler.

We ran our two samplers as well as the Bharat-Broder sampler on this search engine, and calculated bias towards long documents and towards highly ranked documents. As expected, the Bharat-Broder sampler was found to have significant bias. On the other hand, our pool-based sampler had no bias at all, while the random walk sampler only had a small negative bias towards short documents.

We then used our pool-based sampler to collect samples from Google, MSN Search, and Yahoo!. As a query pool, we used 5-term phrases extracted from English pages at the ODP hierarchy. We used the samples from the search engines to produce up-to-date estimates of their relative sizes.

For lack of space, all proofs are omitted. They can be found in the full version of the paper.

2 Related work

Apart from Bharat and Broder, several other studies used queries to search engines to collect random samples from their indices. Queries were either manually crafted [5], collected from user query logs [17], or selected randomly using the technique of Bharat and Broder [12,7]. Assuming search engine indices are independent and uniformly chosen subsets of the web, estimates of the sizes of search engines and the indexable web have been derived. Due to the bias in the samples, though, these estimates lack any statistical guarantees. Dobra and Fienberg [10] showed how to avoid the unrealistic independence and uniformity assumptions, but did not address the sampling bias. We believe that their methods could be combined with ours to obtain accurate size estimates.

Several studies [18,15,16,2,23] developed methods for
sampling pages from the indexable web. Such methods can be used
to also sample pages from a search engine's index. Yet, since
these methods try to solve a harder problem, they also suffer
from various biases, which our method does not have. It is
interesting to note that the random walk approaches of Henzinger
*et al.* [16] and
Bar-Yossef *et al.* [2] implicitly use importance
sampling and rejection sampling to make their samples
near-uniform. Yet, the bias they suffer towards pages with high
in-degree is significant.

Last year, Anagnostopoulos, Broder, and Carmel [1] proposed an enhancement to index architecture that could support random sampling from the result sets of broad queries. This is very different from what we do in this paper: our techniques do not propose any changes to current search engine architecture and do not rely on internal data of the search engine; moreover, our goal is to sample from the whole index and not from the result set of a particular query.

3 Formal setup

**Notation. ** All probability spaces in this paper are
discrete and finite. Given a distribution on a domain ,
the *support of * is
defined as:
. For an event
, we define
to be the
probability of this event under :
.

**Search engines. ** A *search engine* is a tuple
.
is the collection of
documents indexed. Documents are assumed to have been
pre-processed (e.g., they may be truncated to some maximum size
limit). is the space
of queries supported by the search engine.
is an
*evaluation function*, which maps every query
to
an ordered sequence of documents, called *candidate
results*. The *volume* of
is the number of
candidate results:
. is the *result
limit*. Only the top
candidate results are actually returned. These top results are called the
*result set* of
and are denoted
.
A query
*overflows*, if
, and it
*underflows*, if
. Note that if
does not
overflow, then
.

A document
*matches*
a query
, if
. The set of
queries that a document
matches is
denoted
.

**Search engine samplers. ** Let be a *target distribution* over the document
collection . Typically,
is uniform, i.e.,
, for all
. A
*search engine sampler* with target is a randomized
procedure, which generates a random document
from the domain
. The distribution of
the sample
is called the
*sampling distribution* and is denoted by . Successive
invocations of the sampler produce independent samples from
. Ideally,
, in which case
the sampler is called *perfect*. Otherwise, the sampler is
*biased*. The quality of a search engine sampler is
measured by two parameters: the *sampling recall*,
measuring ``coverage'', and the *sampling bias*, measuring
``accuracy''.

Search engine samplers have only ``black box'' access to the search engine through its public interface. That is, the sampler can produce queries, submit them to the search engine, and get back their results. It cannot access internal data of the search engine. In particular, if a query overflows, the sampler does not have access to results beyond the top .

Not all documents in are practically reachable via the public interface of
the search engine. Some pages have no text content and others
have very low static rank, and thus formulating a query that
returns them as one of the top
results may be impossible. Thus, search engine samplers usually
generate samples only from large subsets of and not from the whole
collection . The *sampling
recall* of a sampler with sampling distribution is defined as
.
For instance, when is the
uniform distribution, the sampling recall is
, i.e., the
fraction of documents which the sampler can actually return as
samples. Ideally, we would like the recall to be as close to 1 as
possible. Note that even if the recall is lower than 1, but
is
sufficiently representative of , then estimators that use samples from
can
produce accurate estimates.

Since samplers sample only from large subsets of and not from
in its entirety, it is
unfair to measure the bias of a sampler directly w.r.t. the
target distribution .
Rather, we measure the bias w.r.t. the distribution *conditioned* on
selecting a sample in
.
Formally, let
be the following distribution on
:
for all
. The *sampling
bias* of the sampler is defined as the total variation
distance between and
:

Since the running time of samplers is dominated by the number
of search engine queries they make, we define the *query
cost* of a sampler to be the expected number of search engine
queries it performs per sample generated.

4 Monte Carlo methods

The basic question addressed in stochastic simulation is the
following. There is a *target distribution* on a space , which is hard
to sample from directly. On the other hand, we have an
easy-to-sample *trial distribution* available. Simulation
methods enable using samples from to simulate sampling from , or at least to estimate statistical parameters
under the measure .
Simulators need to know the distributions and in
*unnormalized* form. That is, given any instance
,
they should be able to compute ``weights''
and
that
represent the probabilities
and
,
respectively. By that we mean that there should exist
*normalization constants*
and
s.t. for all
,
and
. The
constants
and
themselves
need not be known to the simulator. For example, when is a uniform
distribution on ,
a possible unnormalized form is
. In this case the normalization constant is
.

**Rejection sampling. ** *Rejection sampling* [24] assumes
and assumes
knowledge of an *envelope constant* , which is at least
. The procedure, described in Figure 1, repeatedly calls a sampler
that generates samples
from the trial
distribution , until a sample
is ``accepted''. To decide whether
is accepted,
the procedure tosses a coin whose heads probability is
(note that
this expression is always at most 1, due to the property of the
envelope constant).

A simple calculation shows that the distribution of the
accepted samples is *exactly* the target distribution
(hence rejection
sampling yields a *perfect* sampler). The expected number
of samples from needed in
order to generate each sample of is
. Hence, the efficiency of the procedure depends on two factors:
(1) the similarity between the target distribution and the trial
distribution: the more similar they are the smaller is the
maximum
; and (2) the gap between the envelope constant and
.

**The Metropolis-Hastings algorithm. ** The
Metropolis-Hastings (MH) algorithm [21,13] is a Markov Chain Monte Carlo
(MCMC) approach to simulation. Unlike rejection sampling, there
is no single trial distribution, but rather different trial
distributions are used at different steps of the algorithm. The
big advantage over rejection sampling is that no knowledge of an
``envelope constant'' is needed.

The MH algorithm, described in Figure 2, runs a random walk on , starting from a state . After steps (called the ``burn-in period''), the reached state is returned as a sample. The transition probability from state to state is determined by a ``proposal function'' , which is a stochastic matrix (i.e., every row of specifies a probability distribution over ). When the random walk reaches a state , it uses the distribution specified by the -th row of to choose a ``proposed'' next state . The algorithm then applies an acceptance-rejection procedure to determine whether to move to or to stay at . The probability of acceptance is:

It can be shown that for any proposal function satisfying some standard restrictions, the resulting random walk converges to . This means that if the burn-in period is sufficiently long, the sampling distribution is guaranteed to be close to . The convergence rate, i.e., the length of the burn-in period needed, however, does depend on the proposal function. There are various algebraic and geometric methods for figuring out how long should the burn-in period be as a function of parameters of the proposal function. See a survey by Diaconis and Saloff-Coste [8] for a detailed review. If the parameters of the proposal function are easy to analyze, one can use these methods to set the parameter . Otherwise, empirical methods are used to set .

**Monte Carlo methods with approximate weights. ** All Monte
Carlo methods assume that the trial distribution is known, up to
normalization. This assumption turns out to be very problematic
in our setting, since the trial distributions we construct depend
on unknown internal data of the search engine. The best we can do
is come up with an unnormalized *weight distribution*
, whose normalized
form is statistically close
to . To the best of our
knowledge, no previous study addressed this scenario before.

The following theorem proves that as long as and are close, the sampling distribution of rejection sampling is close to the target distribution :

We believe that similar analysis is possible also for importance sampling and for the Metropolis-Hastings algorithm, but leave it to future work. Nevertheless, our empirical results indicate that these two methods are not significantly affected by the approximate weights.

5 Pool-based sampler

In this section we describe our pool-based (PB) sampler. A
*query pool* is a fragment of the query space (e.g., all single term queries, all -term conjunctions, or
all -term exact phrases). A
query pool can be constructed by crawling a large corpus of web
documents, such as the ODP directory, and collecting terms or
phrases that occur in its pages. We can run the PB sampler with
any query pool, yet the choice of the pool may affect the
sampling bias, the sampling recall, and the query cost. We
provide quantitative analysis of the impact of various pool
parameters on the PB sampler.

The PB sampler uses as its principal subroutine another sampler, which generates random documents from according to the ``match distribution'', which we define below. The match distribution is non-uniform, yet its unnormalized weights can be computed efficiently, without submitting queries to the search engine. Thus, by applying any of the Monte Carlo methods on the samples from the match distribution, the PB sampler obtains near-uniform samples. In the analysis below we focus on application of rejection sampling, due to its simplicity. We note that the unnormalized weights produced by the PB sampler are only approximate, hence Theorem 4.1 becomes essential for the analysis.

**Preliminaries. ** is the
target distribution of the PB sampler. For simplicity of
exposition, we assume is the
uniform distribution on , although the sampler can be adapted to work for
more general distributions. The unnormalized form of we use is:
, for all
.

is any query
pool. The set of queries
that overflow (i.e.,
) is denoted
, while the
set of queries
that underflow (i.e.,
) is denoted
. A query that
neither overflows nor underflows is called *valid*. The
set of valid queries
is
denoted
and the set of
invalid queries is denoted
.

Let be any distribution
over . The
*overflow probability* of ,
, is
. Similarly, the
*underflow probability* of ,
, is
.

denotes the set of queries in that a document matches. That is, . For example, if consists of all the 3-term exact phrases, then consists of all the 3-term exact phrases that occur in the document .

We say that
*covers* a document
, if
. Let
be the
collection of documents covered by . Note that a sampler that uses only queries
from can never reach
documents outside
. We thus
define
to be the
uniform distribution on
. The PB
sampler will generate samples from
and hence
in order to determine its sampling bias, its sampling
distribution will be compared to
.

Let be any distribution
over . The *recall of
w.r.t.
* , denoted
, is the probability that
a random document selected from is covered by . That is,
. The recall of
w.r.t. the uniform distribution is the ratio
. This recall
will determine the sampling recall of the PB sampler.

**Match distribution. ** The *match distribution*
induced by a query pool , denoted
, is a
distribution over the document collection defined as follows:
. That is, the probability of choosing a document is proportional
to the number of queries from it matches. Note that the support of
is exactly
.

The unnormalized form of
we use is the
following:
. We next argue that for ``natural'' query pools,
can be computed
based on the content of
alone and
*without* submitting queries to the search engine. To this
end, we make the following plausible assumption: for any given
document
and any
given query
,
we know, without querying the search engine, whether
matches
or not. For
example, if is the set of
single-term queries, then checking whether
matches
boils down to
finding the term
in the text of
.

Some factors that may impact our ability to determine whether the search engine matches with are: (1) How the search engine pre-processes the document (e.g., whether it truncates it, if the document is too long). (2) How the search engine tokenizes the document (e.g., whether it ignores HTML tags). (3) How the search engine indexes the document (e.g., whether it filters out stopwords or whether it indexes also by anchor text terms). (4) The semantics of the query . Some of the above are not publicly available. Yet, most search engines follow standard IR methodologies and reverse engineering work can be used to learn answers to the above questions.

Since queries typically correspond to the occurrence of terms or phrases in the document, we can find , by enumerating the terms/phrases in and deducing the queries in that matches. For example, if is the set of single-term queries, finding all the queries in that matches simply requires finding all the distinct non-stopword terms occurring in .

**Query volume distribution. ** A special distribution over a
query pool is the
*query volume distribution*, which we denote by
. The
*volume* of a query pool , denoted
,
is the sum
.
is defined as:
.

The query volume distribution always has an underflow probability of 0. Yet, the overflow probability may be high, depending on the pool . Typically, there is a tradeoff between this overflow probability and the the recall of : the higher the recall, the higher also is the overflow probability.

Note that sampling queries from according seems hard to do, since volumes of queries are not known a priori.

**The PB sampler. ** Assume, for the moment, we have a sampler
that generates
sample documents from the match distribution
. Figure
3 shows our PB sampler, which uses
as a
subroutine.

The PB sampler applies rejection sampling with trial distribution and target distribution . The unnormalized weights used for document are and . Since is a least for any , then an envelope constant of is sufficient. As argued above, we assume the weight can be computed exactly from the content of alone. The following shows that if indeed generates samples from , then the sampling distribution of the PB sampler is uniform on . The proof follows directly from the analysis of the rejection sampling procedure:

**Match distribution sampler. ** In Figure 4 we describe the match distribution
sampler. For the time being, we make two unrealistic assumptions:
(1) there is a sampler
that samples
queries from the volume distribution
; and (2) the
overflow probability of
is 0, although
the pool's recall is high (as argued above, this situation is
unlikely). We later show how to remove these assumptions.

The sampler is very similar to the Bharat-Broder sampler, except that it samples random queries proportionally to their volume. Since no query overflows, all documents that match a query are included in its result set. It follows that the probability of a document to be sampled is proportional to the number of queries in that it matches:

We next address the unrealistic assumption that the overflow probability of is 0. Rather than using , which is likely to have a non-zero overflow probability, we use a different but related distribution, which is guaranteed to have an overflow probability of 0. Recall that is the set of valid (i.e., non-overflowing and non-underflowing) queries in . We can view as a query pool itself (after all, it is just a set of queries). The volume distribution of this pool has, by definition, an overflow probability of 0.

Suppose we could somehow efficiently sample from and use these samples instead of samples from . In that case, by Proposition 5.2, the sampling distribution of the match distribution sampler equals the match distribution induced by the query pool .

Let us return now to the PB sampler. That sampler assumed generates samples from . What happens if instead it generates samples from ? Note that now there is a mismatch between the trial distribution used by the PB sampler (i.e., ) and the unnormalized weights it uses (i.e., ).

One solution could be to try to compute the unnormalized weights of , i.e., . However, this seems hard to do efficiently, because is no longer a ``natural'' query pool. In particular, given a document , how do we know which of the queries that it matches are valid? To this end, we need to send all the queries in to the search engine and filter out the overflowing queries. This solution incurs a prohibitively high query cost. Instead, we opt for a different solution: we leave the PB sampler as is; that is, the trial distribution will be but the unnormalized weights will remain those of (i.e., ). This means that the samples generated by the PB sampler are no longer truly uniform. The following theorem uses Theorem 4.1 to bound the distance of these samples from uniformity. It turns out that this distance depends on two factors: (1) the overflow probability ; and (2) , which is the probability that a random document chosen from matches at least one valid query in .

**Volume distribution sampler. ** We are left to show how to
sample queries from the volume distribution
efficiently. Our most crucial observation is the following:
can be
*easily computed*, up to normalization. Given any query
, we can submit
to the search
engine and determine whether
or
. In the former case we also obtain
. This gives the following unnormalized form of
:
for
and
for
. Since we know
in
unnormalized form, we can apply rejection sampling to queries
sampled from some other, arbitrary, distribution on , and obtain
samples from
.

The volume distribution sampler is depicted in Figure 5. The sampler uses another query sampler that generates samples from an easy-to-sample-from distribution on . must satisfy . (We usually choose to be the uniform distribution on .) We assume is known in some unnormalized form and that a corresponding envelope constant is given. The target distribution of the rejection sampling procedure is . The unnormalized form used is the one described above. Figuring out the unnormalized weight of a query requires only a single query to the search engine. The following now follows directly from the correctness of rejection sampling:

**Analysis. ** The sampling recall and the sampling bias of
the PB sampler are analyzed in Theorem 5.3.
The query cost is bounded as follows:

**Choosing the query pool. ** We next review the parameters of
the query pool that impact the PB sampler.

(1) *Pool's recall.* The sampler's recall approximately
equals the pool's recall, so we would like pools with high
recall. In order to guarantee high recall, a pool must consist of
queries that include at least one term/phrase from (almost) every
document in . Since
overflowing queries are not taken into account, we need these
terms not to be too popular. We can obtain such a collection of
terms/phrases by crawling a large corpus of web documents, such
as the ODP directory.

(2) *Overflow probability.* The bias of the PB sampler
is approximately proportional to the invalidity probability of
the query volume distribution. We thus need the pool and the
corresponding query volume distribution to have a low overflow
probability. Minimizing this probability usually interferes with
the desire to have high recall, so finding a pool and a
distribution that achieve a good tradeoff between the two is
tricky. -term conjunctions
or disjunctions are problematic because they suffer from a high
overflow probability. We thus opted for -term exact phrases. Our experiments hint that for
, the overflow
probability is tiny. If the phrases are collected from a real
corpus, like ODP, then the underflow probability is also small.
The recall of exact phrases is only a bit lower than that of
conjunctions or disjunctions.

(3) *Average number of matches.* The query cost of the
PB sampler is proportional to
. Hence, we would like to find pools for which the number of
matches grows moderately with the document length. -term exact phrases are
a good example, because the number of matches w.r.t. them grows
linearly with the document length. -term conjunctions or disjunctions are poor choices,
because there the growth is exponential in .

6 Random walk based sampler

The *random walk (RW) sampler*, described in Figure
6, uses the Metropolis-Hastings algorithm
to perform a random walk on the documents indexed by the search
engine. The graph on these documents is defined by an
*implicit* query pool . All we need to know about is how to compute
the match set
, given a document
. We do not need
to know the constituent queries in , and thus do not have to crawl a corpus in
order to construct
.

As in the previous section, denotes the set of valid queries among the queries in . The graph on induced by , which we denote by , is an undirected weighted graph. The vertex set of is the whole document collection . Two documents are connected by an edge if and only if . That is, and are connected iff there is at least one valid query in , which they both match. The weight of the edge is set to be:

**Sampling neighbors. ** We start with the problem of sampling
neighbors according to the proposal function
. To this end, we first come up with following characterization
of vertex degrees:

**Calculating acceptance probability. ** Next, we address the
issue of how to calculate the acceptance probability. By
Proposition 6.1, if the current
document is
, and the
proposed neighbor is
, then the
acceptance probability should be
. Yet, our sampler cannot know
and
without submitting queries to the search engine. Instead, it uses
the following perturbed acceptance probability:

The *validity density* of a document
is defined
as:

Why should the variance of the validity density be small? Typically, the validity density of a document is related to the fraction of popular terms occurring in the document. The fraction of popular terms in documents written in the same language (or even in documents written in languages with similar statistical profiles) should be about the same.

**Burn-in period. ** The length of the burn-in period of the
random walk depends on the *spectral gap* of the Markov
Chain's transition matrix. The difference between the largest and
second largest eigenvalues determines how quickly the random walk
approaches its limit distribution. The bigger the gap, the faster
the convergence. Quantitatively, if we want the distribution of
samples to be at most
away from the limit distribution , then we need a burn-in period of length
, where is the
spectral gap.

When is unknown, one can resort to empirical tests to figure out a sufficient burn-in period length. One approach is to run the MH algorithm several times, and empirically find the minimum burn-in period length needed to accurately estimate statistical parameters of the data whose real value we somehow know in advance.

The query cost of the sampler depends on two factors: (1) the burn-in period ; and (2) the validity densities of the documents encountered during the random walk. When visiting a document , the expected number of queries selected until hitting a valid query is , i.e., the inverse validity density of .

7 Experimental results

We conducted 3 sets of experiments: (1) *pool
measurements*: estimation of parameters of selected query
pools; (2) *evaluation experiments*: evaluation of the
bias of our samplers and the Bharat-Broder (BB) sampler; and (3)
*exploration experiments*: measurements of real search
engines.

**Experimental setup. ** For the pool measurements and the
evaluation experiments, we crawled the 3 million pages on the ODP
directory. Of these we kept 2.4 million pages that we could
successfully fetch and parse, that were in text, HTML, or pdf
format, and that were written in English. Each page was given a
serial id, stored locally, and indexed by single terms and
phrases. Only the first 10,000 terms in each page were
considered. Exact phrases were not allowed to cross boundaries,
such as paragraph boundaries.

In our exploration experiments, conducted in January-February 2006, we submitted 220,000 queries to Google, 580,000 queries to MSN Search, and 380,000 queries to Yahoo!. Due to legal limitations on automatic queries, we used the Google, MSN, and Yahoo! web search APIs, which are, reportedly, served from older and smaller indices than the indices used to serve human users.

Our pool measurements and evaluation experiments were performed on a dual Intel Xeon 2.8GHz processor workstation with 2GB RAM and two 160GB disks. Our exploration experiments were conducted on five workstations.

**Pool measurements. ** We selected four candidate query
pools, which we thought could be useful in our samplers: single
terms and exact phrases of lengths 3, 5, and 7. (We measured only
four pools, because each measurement required substantial disk
space and running time.) In order to construct the pools, we
split the ODP data set into two parts: a *training set*,
consisting of every fifth page (when ordered by id), and a
*test set*, consisting of the rest of the pages. The pools
were built only from the training data, but the measurements were
done only on the test data. In order to determine whether a query
is overflowing, we set a result limit of .

All our measurements were made w.r.t. the uniform distribution over the pool (including the overflow probability). Analysis of the overflow probability of the volume distribution (which is an important factor in our analysis) will be done in future work. The results of our measurements are tabulated in Table 1. The normalized deviation of the validity density is the ratio , where is a uniformly chosen document.

Parameter | Single terms | Phrases (3) | Phrases (5) | Phrases (7) |
---|---|---|---|---|

Size | 2.6M | 97M | 155M | 151M |

Overflow prob. | 11.4% | 3% | 0.4% | 0.1% |

Underflow prob. | 40.3% | 56% | 76.2% | 82.1% |

Recall | 57.6% | 96.5% | 86.7% | 63.9% |

Avg # of matches | 5.6 | 78.8 | 43.2 | 29 |

Avg validity density | 2.4% | 29.2% | 55.7% | 67.5% |

Normalized deviation of validity density | 1.72 | 0.493 | 0.416 | 0.486 |

The measurements show that overflow probability, average number of matches, and average validity density improve as phrase length increases, while recall and underflow probability get worse. The results indicate that a phrase length of 5 achieves the best tradeoff among the parameters. It has a very small overflow probability (less than 0.5%), while maintaining a recall of 86%. The overflow probability of 3-term phrases is a bit too high (3%), while the recall of the 7-term phrases is way too low (about 64%).

Single terms have a relatively low overflow probability, only because we measure the probability w.r.t. the uniform distribution over the terms, and many of the terms are misspellings, technical terms, or digit strings. Note, on the other hand, the minuscule average validity density.

Since the ODP data set is presumably representative of the web, we expect most of these measurements to represent the situation on real search engines. The only exceptions are the overflow and underflow probabilities, which are distorted due to the small size of the ODP data set. We thus measured these parameters on Yahoo!. The results are given in Table 2. It is encouraging to see that for 5-term phrases, the overflow probability remains relatively low, while the underflow probability goes down dramatically. The elevated overflow probability of 3-term phrases is more evident here.

Parameter | Single terms | Phrases (3) | Phrases (5) | Phrases (7) |
---|---|---|---|---|

Overflow prob. | 40.2% | 15.7% | 2.1% | 0.4% |

Underflow prob. | 4.5% | 3.4% | 7.2% | 12.9% |

**Evaluation experiments. ** To evaluate our samplers and
the BB sampler, we ran them on a home-made search engine built
over the ODP data set. In this controlled environment we could
compare the sampling results against the real data.

The index of our ODP search engine consisted of the test set only. It used static ranking by document id to rank query results. A result limit of was used in order to have an overflow probability comparable to the one on Yahoo!.

We ran four samplers: (1) the PB sampler with rejection sampling; (2) the PB sampler with importance sampling; (3) the RW sampler; and (4) the BB sampler. All the samplers used a query pool of 5-term phrases extracted from the ODP training set. The random walk sampler used a burn-in period of 1,000 steps. Each sampler was allowed to submit exactly 5 million queries to the search engine.

In Figure 7, we show the distribution of samples by document size. We ordered the documents in the index by size, from largest to smallest, and split them into 10 equally sized deciles. Truly uniform samples should distribute evenly among the deciles. The results show the overwhelming difference between our samplers and the BB sampler. The BB sampler generated a huge number of samples at the top decile (more than 50%!). Our samplers had no or little bias.

Figure 8 addresses bias towards highly ranked documents. We ordered the documents in the index by their static rank, from highest to lowest, and split into 10 equally sized deciles. The first decile corresponds to the most highly ranked documents. The results indicate that none of our samplers had any significant bias under the ranking criterion. Surprisingly, the BB sampler had bias towards the 4th, 5th, and 6th deciles. When digging into the data, we found that documents whose rank (i.e., id) belonged to these deciles had a higher average size than documents with lower or higher rank. Thus, the bias we see here is in fact an artifact of the bias towards long documents. A good explanation is that our 5-term exact phrases pool had a low overflow probability in the first place, so very few queries overflowed.

We have several conclusions from the above experiments: (1) the 5-term phrases pool, which has small overflow probability, made an across-the-board improvement to all the samplers (including BB). This was evidenced by the lack of bias towards highly ranked documents. (2) The BB sampler suffers from a severe bias towards long documents, regardless of the query pool used. (3) Our pool-based samplers seem to give the best results, showing no bias in any of the experiments. (4) The random walk has a small negative bias towards short documents. Possibly by running the random walk more steps, this negative bias could be alleviated.

**Exploration experiments. ** We used our most successful
sampler, the PB sampler, to generate uniform samples from Google,
MSN Search, and Yahoo!. For complete details of the experimental
setup, see the full version of the paper.

Table 3 tabulates the measured relative sizes of the Google, MSN Search, and Yahoo! indices. Since our query pool consisted mostly of English language phrases, our results refer mainly to the English portion of these indices. We also remind the reader that the indices we experimented with are somewhat outdated. In order to test whether a URL belongs to the index, we used a standard procedure [4,12].

Figure 9 shows the domain name distributions in the three indices. Note that there are some minor differences among the search engines. For example the .info domain is covered much better by MSN Search than by Google and Yahoo!.

8 Conclusions

We presented two novel search engine samplers. The first, the
pool-based sampler, provides *weights* together with the
samples, and can thus be used in conjunction with stochastic
simulation techniques to produce near-uniform samples. We showed
how to apply these simulations even when the weights are
approximate. We fully analyzed this sampler and identified the
query pool parameters that impact its bias and performance. We
then estimated these parameters on real data, consisting of the
English pages from the ODP hierarchy, and showed that a pool of
5-term phrases induces nearly unbiased samples. Our second
sampler runs a random walk on a graph defined over the index
documents. Its primary advantage is that it does not need a query
pool. Empirical evidence suggests that the random walk converges
quickly to its limit equilibrium distribution. It is left to
future work to analyze the spectral gap of this random walk in
order to estimate the convergence rate mathematically.

We note that the bias towards English language documents is a limitation of our experimental setup and not of the techniques themselves. The bias could be eliminated by using a more comprehensive dictionary or by starting random walks from pages in different languages.

Finally, we speculate that our methods may be applicable in a more general setting. For example, for sampling from databases, deep web sites, library records, etc.

[1]
A. Anagnostopoulos, A. Broder, and D. Carmel, Sampling search-engine results, In *Proc. 14th WWW*, pages 245-256, 2005.

[2]
Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D.
Weitz, Approximating aggregate queries about Web pages via random
walks, In *Proc. 26th VLDB*, pages 535-544, 2000.

[3]
J. Battelle, John Battelle's searchblog, `battellemedia.com/archives/001889.php`,
2005.

[4]
K. Bharat and A. Broder, A technique for measuring the relative size and overlap of
public Web search engines, In *Proc. 7th WWW*, pages 379-388, 1998.

[5]
E. Bradlow and D. Schmittlein, The little engines that could: Modelling the performance of
World Wide Web search engines, *Marketing Science*, 19:43-62, 2000.

[6]
F. Can, R. Nuray, and A. B. Sevdik, Automatic performance evaluation of Web search engines, *Infor. Processing and Management*, 40:495-514,
2004.

[7]
M. Cheney and M. Perry, A comparison of the size of the Yahoo! and Google
indices, `vburton.ncsa.uiuc.edu/indexsize.html`,
2005.

[8]
P. Diaconis and L. Saloff-Coste, What do we know about the Metropolis algorithm? *J. of Computer and System Sciences*, 57:20-36,
1998.

[9]
dmoz, The open directory project, `dmoz.org`.

[10]
A. Dobra and S. Fienberg, How large is the World Wide Web? *Web Dynamics*, pages 23-44, 2004.

[11]
M. Gordon and P. Pathak, Finding information on the World Wide Web: the retrieval
effectiveness of search engines, *Information Processing and Management*, 35(2):141-180,
1999.

[12]
A. Gulli and A. Signorini, The indexable Web is more than 11.5 billion pages, In *Proc. 14th WWW (Posters)*, pages 902-903, 2005.

[13]
W. Hastings, Monte Carlo sampling methods using Markov chains and their
applications, *Biometrika*, 57(1):97-109, 1970.

[14]
D. Hawking, N. Craswel, P. Bailey, and K. Griffiths, Measuring search engine quality, *Information Retrieval*, 4(1):33-59, 2001.

[15]
M. Henzinger, A. Heydon, M. Mitzenmacher, and M.
Najork, Measuring index quality using random walks on the Web, In *Proc. 8th WWW*, pages 213-225, 1999.

[16]
M. Henzinger, A. Heydon, M. Mitzenmacher, and M.
Najork, On near-uniform URL sampling, In *Proc. 9th WWW*, pages 295-308, 2000.

[17]
S. Lawrence and C. Giles, Searching the World Wide Web, *Science*, 5360(280):98, 1998.

[18]
S. Lawrence and C. Giles, Accessibility of information on the Web, *Nature*, 400:107-109, 1999.

[19] J. S. Liu, Monte Carlo Strategies in Scientific Computing, Springer, 2001.

[20]
T. Mayer, Our blog is growing up and so has our index, `www.ysearchblog.com/archives/000172.html`.

[21]
N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and
E. Teller, Equations of state calculations by fast computing
machines, *J. of Chemical Physics*, 21:1087-1091, 1953.

[22]
G. Price, More on the total database size battle and Googlewhacking with
Yahoo, `blog.searchenginewatch.com/blog/050811-231448`,
2005.

[23]
P. Rusmevichientong, D. Pennock, S. Lawrence, and C. Lee
Giles, Methods for sampling pages uniformly from the World Wide
Web, In *AAAI Fall Symp.*, 2001.

[24]
J. von Neumann, Various techniques used in connection with random digits, In *John von Neumann, Collected Works*, volume
V. Oxford, 1963.