A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets
Mehran Sahami
Google Inc.
Mountain View, CA, USA
Timothy D. Heilman
Google Inc.
Mountain View, CA, USA
Copyright is held by the World Wide Web Conference Committee (IW3C2).
Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2006, May 23-26, 2006, Edinburgh, Scotland.
ACM 1-59593-323-9/06/0005.
ABSTRACT
Determining the similarity of short text snippets, such as search
queries, works poorly with traditional document similarity measures
(e.g., cosine), since there are often few, if any, terms in common
between two short text snippets. We address this problem by
introducing a novel method for measuring the similarity between short
text snippets (even those without any overlapping terms) by leveraging
web search results to provide greater context for the short texts. In
this paper, we define such a similarity kernel function,
mathematically analyze some of its properties, and provide examples of
its efficacy. We also show the use of this kernel function in a
large-scale system for suggesting related queries to search engine
users.
Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval;
I.2.6 [Artificial Intelligence]: Learning;
I.2.7 [Artificial Intelligence]: Natural Language Processing [Text analysis]
General Terms
Algorithms, Experimentation
Keywords
Text similarity measures, Web search, Information retrieval,
Kernel functions, Query suggestion
1 Introduction
In analyzing text, there are many situations
in which we wish to determine how similar two short text snippets
are. For example, there may be different ways to describe some
concept or individual, such as ``United Nations
Secretary-General'' and ``Kofi Annan'', and we would like to
determine that there is a high degree of
semantic
similarity between these two text snippets. Similarly, the
snippets ``AI'' and ``Artificial Intelligence'' are very similar
with regard to their meaning, even though they may not share any
actual terms in common. Directly applying traditional document
similarity measures, such as the widely used cosine coefficient
[17, 16], to such short text
snippets often produces inadequate results, however. Indeed, in
both the examples given previously, applying the cosine would
yield a similarity of 0 since each given text pair contains no
common terms. Even in cases where two snippets may share terms,
they may be using the term in different contexts. Consider the
snippets ``graphical models'' and ``graphical interface''. The
first uses
graphical in reference to graph structures
whereas the second uses the term to refer to graphic displays.
Thus, while the cosine score between these two snippets would be
0.5 due to the shared lexical term ``graphical'', at a semantic
level the use of this shared term is not truly an indication of
similarity between the snippets. To address this problem, we
would like to have a method for measuring the similarity between
such short text snippets that captures more of the semantic
context of the snippets rather than simply measuring their
term-wise similarity. To help us achieve this goal, we can
leverage the large volume of documents on the web to determine
greater context for a short text snippet. By examining documents
that contain the text snippet terms we can discover other
contextual terms that help to provide a greater context for the
original snippet and potentially resolve ambiguity in the use of
terms with multiple meanings. Our approach to this problem is
relatively simple, but surprisingly quite powerful. We simply
treat each snippet as a query to a web search engine in order to
find a number of documents that contain the terms in the original
snippets. We then use these returned documents to create a
context vector for the original snippet, where such a
context vector contains many words that tend to occur in context
with the original snippet (i.e., query) terms. Such context
vectors can now be much more robustly compared with a measure
such as the cosine to determine the similarity between the
original text snippets. Furthermore, since the cosine is a valid
kernel, using this function in conjunction with the generated
context vectors makes this similarity function applicable in any
kernel-based machine learning algorithm [4], where (short) text
data is being processed. While there are many cases where getting
a robust measure of similarity between short texts is important,
one particularly useful application in the context of search is
to suggest related queries to a user. In such an application, a
user who issues a query to a search engine may find it helpful to
be provided with a list of semantically related queries that he
or she may consider to further explore the related information
space. By employing our short text similarity kernel, we could
match the user's initial query against a large repository of
existing user queries to determine other similar queries to
suggest to the user. Thus, the results of the similarity function
can be directly employed in an end-user application. The approach
we take in constructing our similarity function has relations to
previous work in both the Information Retrieval and Machine
Learning communities. We explore these relations and put our work
in the context of previous research in Section
2. We then formally define
our similarity function in Section
3 and present initial examples of
its use in Section
4. This is followed by a
mathematical analysis of the similarity function in Section
5. Section
6 presents a
system for related query suggestion using our similarity
function, and an empirical evaluation of this system is given in
Section
7.
Finally, in Section
8
we provide some conclusions and directions for future work.
2 Related Work
The similarity function we present here is
based on
query expansion techniques [3, 13], which have long
been used in the Information Retrieval community. Such methods
automatically augment a user query with additional terms based on
documents that are retrieved in response to the initial user
query or by using an available thesaurus. Our motivation for and
usage of query expansion greatly differs from this previous work,
however. First, the traditional goal of query expansion has been
to improve recall (potentially at the expense of precision) in a
retrieval task. Our focus, however, is on using such expansions
to provide a richer representation for a short text in order to
potentially compare it robustly with other short texts. Moreover,
traditional expansion is focused on creating a new query for
retrieval rather than doing pairwise comparisons between short
texts. Thus, the approach we take is quite different than the use
of query expansion in a standard Information Retrieval context.
Alternatively, information retrieval researchers have previously
proposed other means of determining query similarity. One early
method proposed by Raghavan and Sever [14] attempts to measure
the relatedness of two queries by determining differences in the
ordering of documents retrieved in response to the two queries.
This method requires a total ordering (ranking) of documents over
the whole collection for each query. Thus, comparing the pairwise
differences in rankings requires O(N) time, where N is the number
of documents in the collection. In the context of the web, where
N is measured in billions of documents, this algorithm quickly
becomes intractable. Later work by Fitzpatrick and Dent [9]
measures query similarity using the normalized set overlap
(intersection) of the top 200 documents retrieved for each query.
While this algorithm's runtime complexity easily scales to the
web, it will likely not lead to very meaningful similarity
results, as the sheer number of documents in the web collection
will often make the set overlap for returned results extremely
small (or empty) for many related queries that are not nearly
identical. We show that this is indeed the case in our
experimental and theoretical results later in the paper. In the
context of Machine Learning,
there has been a great deal of work in using kernel methods, such
as Support Vector Machines, for text classification [11, 8]. Such work has recently
extended to building specialized kernels aimed at measuring
semantic similarity between documents. We outline some of these
approaches below, and show how they differ from the work
presented here. One of the early approaches in this vein is
Latent Semantic Kernels (LSK) [5], which is a kernel-based
extension to the well-known Latent Semantic Indexing (LSI) [6]
proposed in the Information Retrieval community. In LSK, a kernel
matrix is computed over text documents, and the
eigendecomposition of this matrix is used to compute a new
(lower rank approximation) basis for the space. The dimensions of
the new basis can intuitively be thought of as capturing
``semantic concepts'' (i.e., roughly corresponding to covarying
subsets of the dimensions in the original space). While there may
be some superficial similarities, this approach differs in
fundamental respects from our work. First, our method is aimed at
constructing a new kernel function, not using an existing kernel
matrix to infer ``semantic dimensions''. Also, our method takes a
lazy approach in the sense that we need not compute an
expansion for a given text snippet until we want to evaluate the
kernel function. We never need to explicitly compute a full
kernel matrix over some set of existing text snippets nor its
eigendecomposition. Indeed, the kernel we present here is
entirely complementary to work on LSK, as our kernel could be used to
construct the kernel matrix on which the eigendecomposition is
performed. An approach more akin to that taken here is the work
of Kandola et al. [12], who define a
kernel for determining the similarity of individual terms based
on the collection of documents that these terms appear in. In
their work, they learn a
Semantic Proximity Matrix that
captures the relatedness of individual terms by essentially
measuring the correlation in the documents that contain these
terms. In our work, the kernel we consider is not attempting to
just determine similarity between single terms, but entire text
snippets. Moreover, our approach does not require performing an
optimization over an entire collection of documents (as is
required in the previous work), but rather the kernel between
snippets can be computed online selectively, as needed. Previous
research has also tried to address learning a semantic
representation for a document by using cross-lingual techniques [18].
Here, one starts with a corpus of document pairs, where each pair
is the same document written in two different languages. A
correlation analysis is then performed between the corpora in
each language to determine combinations of related words in one
language that correlate well with combinations of words in the
other language, and thereby learn word relations within a given
language. Obviously, the approach we take does not require such
paired corpora. And, again, we seek to not just learn
relationships between single terms but between entire arbitrary
short texts. Thus, while there has been a good deal of work in
determining semantic similarities between texts (which highlights
the general importance of this problem), much of it using kernel
methods, the approach we present has significant differences from
that work. Moreover, our approach provides the compelling
advantage that semantic similarity can be measured between
multi-term short texts, where the entire text can be considered
as a whole, rather than just determining similarity between
individual terms. Furthermore, no expensive preprocessing of a
corpus is required (e.g., eigendecomposition), and the kernel
can easily be computed for a given snippet pair as needed. We
simply require access to a search engine (i.e., text index) over
a corpus, which can be quite efficiently (linearly) constructed
or can be obviated entirely by accessing a public search engine
on the Web, such as the Google API (http://www.google.com/apis).
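For concreteness, the normalized set overlap measure of Fitzpatrick and Dent discussed above can be sketched as follows. This is an illustrative sketch, not their implementation; in particular, the normalization shown (dividing by the smaller of the top-k set sizes) is one plausible choice, and the result lists are assumed to be ranked document identifiers.

```python
def normalized_set_overlap(results_a, results_b, k=200):
    """Query similarity as the normalized intersection of the top-k
    retrieved documents for each query (after Fitzpatrick and Dent).
    `results_a` and `results_b` are ranked lists of document ids."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    # Normalize by the smaller set size so identical result lists score 1.0.
    return len(top_a & top_b) / min(k, len(top_a), len(top_b))
```

For example, two queries returning identical top results score 1.0, while queries with disjoint result sets score 0.0.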
3 A New Similarity Function
Presently, we formalize our kernel function for semantic
similarity. Let x represent a short text snippet. Now, we compute
the query expansion of x, denoted QE(x), as follows:

1. Issue x as a query to a search engine S.
2. Let R(x) be the set of (at most) n retrieved documents d_1, d_2, ..., d_n.
3. Compute the TF-IDF term vector v_i for each document d_i in R(x).
4. Truncate each vector v_i to include its m highest-weighted terms.
5. Let C(x) be the centroid of the L2-normalized vectors v_i:
   C(x) = (1/n) * sum_{i=1}^{n} v_i / ||v_i||
6. Let QE(x) be the L2 normalization of the centroid C(x):
   QE(x) = C(x) / ||C(x)||
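The expansion procedure above can be sketched in Python as follows. This is a minimal illustrative sketch, not the authors' implementation: the retrieved document texts `docs` and the corpus statistics `df` (document frequencies) and `n_docs` (corpus size) are assumed to be supplied by the search-engine side, and all document frequencies are assumed smaller than the corpus size so that weights are nonzero.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, df, n_docs):
    """Step 3: TF-IDF weights w = tf * log(N / df) as a sparse dict."""
    tf = Counter(doc_tokens)
    return {t: tf[t] * math.log(n_docs / df.get(t, 1)) for t in tf}

def query_expansion(docs, df, n_docs, m=50):
    """Steps 3-6: per-document TF-IDF vectors, truncated to the m
    highest-weighted terms, L2-normalized, summed, and the resulting
    centroid re-normalized to unit length. (The 1/n factor in Step 5
    cancels under the final normalization, so it is omitted.)"""
    centroid = Counter()
    for doc in docs:
        v = tfidf_vector(doc.split(), df, n_docs)
        # Step 4: keep only the m highest-weighted terms.
        top = dict(sorted(v.items(), key=lambda kv: -kv[1])[:m])
        norm = math.sqrt(sum(w * w for w in top.values()))
        for t, w in top.items():  # Step 5: accumulate normalized vectors.
            centroid[t] += w / norm
    norm = math.sqrt(sum(w * w for w in centroid.values()))
    return {t: w / norm for t, w in centroid.items()}  # Step 6
```

The returned sparse vector has unit L2 norm, so inner products between expansions lie in [0, 1] for nonnegative weights.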
We note that, to be precise, the computation of QE(x) really
should be parameterized by both the query x and the search engine
S used. Since we assume that S remains constant in all
computations, we omit this parameter for brevity. There are
several modifications that can be made to the above procedure, as
appropriate for different document collections. Foremost among
these is the term weighting scheme used in Step 3. Here, we
consider a TF-IDF vector weighting scheme [15], where the weight
w_ij associated with term t_j in document d_i is defined to be:

   w_ij = tf_ij * log(N / df_j),

where tf_ij is the frequency of t_j in d_i, N is the total number
of documents in the corpus, and df_j is the total number of
documents that contain t_j. We compute N and df_j using a large
sample of documents from the web. Clearly, other weighting
schemes are possible, but we choose TF-IDF here since it is
commonly used in the IR community and we have found it to
empirically give good results in building representative query
expansions. Also, in Step 4, we set the maximum number of terms
in each vector to m = 50, as we have found this value to give a
good trade-off between representational robustness and
efficiency. Also, in Step 2, we need not choose to
use the entirety of retrieved documents in order to produce
vectors. We may choose to limit ourselves to create vectors using
just the
contextually descriptive text snippet for each
document that is commonly generated by Web search engines. This
would make our algorithm more efficient in terms of the amount of
data processed, and allows us to make ready use of the results
from public web search engines without having to even retrieve
the full actual underlying documents. Of course, there remains
the question of how large such descriptive texts provided by
search engines need to be in order to be particularly useful.
Empirically, we have found that using 1000 characters (in a
token-delimited window centered on the original query terms in
the original text) is sufficient to get accurate results, and
increasing this number does not seem to provide much additional
benefit. Evaluating a variety of term weighting or text windowing
schemes, however, is not the aim of this work and we do not
explore it further here. Rather we simply seek to outline some of
the issues that may be of interest to practitioners and provide
some guidance on reasonable values to use that we have found work
well empirically. Finally, given that we have a means for
computing the query expansion for a short text, it is a simple
matter to define the semantic kernel function K as the inner
product of the query expansions for two text snippets. More
formally, given two short text snippets x and y, we define the
semantic similarity kernel between them as:

   K(x, y) = QE(x) . QE(y)

Observation 1. K(x, y) is a valid kernel function.

This readily follows from the fact that K(x, y) is defined as an
inner product with a bounded norm (given that each query
expansion vector has norm 1.0). For more background on the
properties of kernel functions and some of their potential
applications, we refer the interested reader to the text by
Cristianini and Shawe-Taylor [4].
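As a minimal illustration, the inner product of two sparse unit-length expansion vectors (represented as term-to-weight dicts, an assumption of this sketch) can be computed as:

```python
def kernel(qe_x, qe_y):
    """K(x, y) = QE(x) . QE(y): the inner product of the unit-length
    query expansion vectors, stored as sparse term -> weight dicts."""
    # Iterate over the smaller vector for efficiency.
    if len(qe_x) > len(qe_y):
        qe_x, qe_y = qe_y, qe_x
    return sum(w * qe_y.get(t, 0.0) for t, w in qe_x.items())
```

Because each expansion vector has unit norm, the kernel of a vector with itself is 1.0, and vectors with no shared terms score 0.0.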
4 Initial Results With Kernel
To get a cursory evaluation of how well our semantic similarity
kernel performs, we show results
with the kernel on a number of text pairs, using the Google
search engine as the underlying document retrieval mechanism. We
attempt to highlight both the strengths and potential weaknesses
of this kernel function.
Table 1: Examples of web-based kernel applied to short text snippet pairs.

Text 1                                      | Text 2              | Kernel | Cosine | Set Overlap
--------------------------------------------|---------------------|--------|--------|------------
Acronyms
support vector machine                      | SVM                 | 0.812  | 0.0    | 0.110
portable document format                    | PDF                 | 0.732  | 0.0    | 0.060
artificial intelligence                     | AI                  | 0.831  | 0.0    | 0.255
artificial insemination                     | AI                  | 0.391  | 0.0    | 0.000
term frequency inverse document frequency   | tf idf              | 0.831  | 0.0    | 0.125
term frequency inverse document frequency   | tf-idf              | 0.507  | 0.0    | 0.060
Individuals and their positions
UN Secretary-General                        | Kofi Annan          | 0.825  | 0.0    | 0.065
UN Secretary-General                        | George W. Bush      | 0.110  | 0.0    | 0.000
US President                                | George W. Bush      | 0.688  | 0.0    | 0.045
Microsoft CEO                               | Steve Ballmer       | 0.838  | 0.0    | 0.090
Microsoft CEO                               | Bill Gates          | 0.317  | 0.0    | 0.000
Microsoft Founder                           | Bill Gates          | 0.677  | 0.0    | 0.010
Google CEO                                  | Eric Schmidt        | 0.845  | 0.0    | 0.105
Google CEO                                  | Larry Page          | 0.450  | 0.0    | 0.040
Google Founder                              | Larry Page          | 0.770  | 0.0    | 0.050
Microsoft Founder                           | Larry Page          | 0.189  | 0.0    | 0.000
Google Founder                              | Bill Gates          | 0.096  | 0.0    | 0.000
web page                                    | Larry Page          | 0.123  | 0.5    | 0.000
Multi-faceted terms
space exploration                           | NASA                | 0.691  | 0.0    | 0.070
space exploration                           | space travel        | 0.592  | 0.5    | 0.005
vacation travel                             | space travel        | 0.321  | 0.5    | 0.000
machine learning                            | ICML                | 0.586  | 0.0    | 0.065
machine learning                            | machine tooling     | 0.197  | 0.5    | 0.000
graphical UI                                | graphical models    | 0.275  | 0.5    | 0.000
graphical UI                                | graphical interface | 0.643  | 0.5    | 0.000
java island                                 | Indonesia           | 0.454  | 0.0    | 0.000
java programming                            | Indonesia           | 0.020  | 0.0    | 0.000
java programming                            | applet development  | 0.563  | 0.0    | 0.010
java island                                 | java programming    | 0.280  | 0.5    | 0.000
We examined several text snippet pairs to determine the
similarity score given by our new webbased kernel, the
traditional cosine measure, and the set overlap measure proposed
by Fitzpatrick and Dent. We specifically look at three genres of
text snippet matching: (i) acronyms, (ii) individuals and their
positions, and (iii) multi-faceted terms. Examples of applying
the kernel are shown in Table 1, which is segmented by the
genre of matching examined. The first section of the table deals
with the identification of acronyms. In this genre, we find two
notable effects using our kernel. First, from the relatively high
similarity scores found between acronyms and their full name, it
appears that our kernel is generally effective at capturing the
semantic similarity between an acronym and its full name. Note
that the kernel scores are not 1.0 since acronyms can often have
multiple meanings. Related to this point, our second observation
is that our kernel function (being based on contextual text usage
on the web) tends to prefer more common usages of an acronym in
determining semantic similarity. For example, the text ``AI'' is
determined to be much more similar to ``artificial intelligence''
than ``artificial insemination'' (even though it is a valid
acronym for both), since contextual usage of ``AI'' on the web
tends to favor the former meaning. We see a similar effect when
comparing ``term frequency inverse document frequency'' to ``tf
idf'' and ``tf-idf''. While the former acronym tends to be more
commonly used (especially since the sub-acronyms ``tf'' and
``idf'' are separated), the still reasonable score over 0.5 for
the acronym ``tf-idf'' shows that the kernel function is still
able to determine a solid level of semantic similarity. It is not
surprising that the use of cosine similarity is entirely
inappropriate for such a task (since the full name of an acronym
virtually never contains the acronym itself). Moreover, we find,
as expected, that the set overlap measure leads to very low (and
not very robust) similarity values. Next, we examined the use of
our kernel in identifying different characterizations of
individuals. Specifically, we considered determining the
similarity of the name of a notable individual with his prominent
role description. The results of these examples are shown in the
second section of Table
1. In order to assess the
strengths and weaknesses of the kernel function we intentionally
applied the kernel to both correct pairs of descriptions and
individuals, as well as to pairs involving an individual and a
close, but incorrect, description. For example, while
Kofi Annan and George W. Bush are both prominent world political
figures, the kernel is effective at determining the correct role
matches and assigning them appropriately high scores. In the
realm of business figures, we find that the kernel is able to
distinguish Steve Ballmer as the current CEO of Microsoft (and
not Bill Gates). Bill Gates still gets a nontrivial semantic
similarity with the role ``Microsoft CEO'' since he was indeed
the
former CEO, but he is much more strongly (by over a factor of 2)
correctly associated with the text ``Microsoft
founder''. Similarly, the kernel is successful at correctly
identifying the current Google CEO (Eric Schmidt) from Larry Page
(Google's founder and
former CEO). We also attempted to
test how readily the kernel function assigned high scores for
inappropriate matches by trying to pair Bill Gates as the founder
of Google and Larry Page as the founder of Microsoft. The low
similarity scores given by the kernel show that it does indeed
find little semantic similarity between these inappropriate
pairs. Once again, the kernel value is nonzero since each of the
individuals is indeed the founder of
some company, so the
texts compared are not entirely devoid of some semantic
similarity. Finally, we show that even though Larry Page has a
very common surname, the kernel does a good job of not confusing
him with a ``web page'' (although the cosine gives an
inappropriately high similarity due to the match on the term
``page''). Lastly, we examined the efficacy of the kernel when
applied to texts with multi-faceted terms, a case where we
expect the raw cosine and set overlap to once again do quite
poorly. As expected, the kernel does a reasonable job of
determining the different facets of terms, such as identifying
``space exploration'' with ``NASA'' (even though they share no
tokens), but finding that the similarity between ``vacation
travel'' and ``space travel'' is indeed less than the cosine
might otherwise lead us to believe. Similar effects are seen in
looking at terms used in context, such as ``machine'',
``graphical'', and ``java''. We note that in many cases, the
similarity values here are not as extreme as in the previous
instances. This has to do with the fact that we are trying to
measure the rather fuzzy notion of
aboutness between
semantic concepts rather than trying to identify an acronym or
individual (which tend to be much more specific matches). Still,
the kernel does a respectable job (in most cases) of providing a
score above 0.5 when two concepts are very related and less than
0.3 when the concepts are generally thought of as distinct. Once
again, the low similarity scores given by the set overlap method
show that in the context of a large document collection such as
the web, this measure is not very robust. As a side note, we also
measured the set overlap using the top 500 and top 1000 documents
retrieved for each query (in addition to the results reported
here, which looked at the top 200 documents as suggested in the
original paper), and found qualitatively very similar results,
thus indicating that the method itself, and not merely the
parameter settings, led to the poor results in the context of the
web.
5 Theoretical Analysis of Kernel and Set Overlap Measures
In
light of the anecdotal results in comparing our kernel function
with the set overlap measure, it is useful to mathematically
analyze the behavior of each measure in the context of large (and
continuously growing) document collections such as the web. We
begin by introducing some relevant concepts for this
analysis.

Definition 1. Let score(d, q) denote the relevance score that
search engine S assigns to document d for query q. Two documents
d_i and d_j are indistinguishable to S with respect to q if
|score(d_i, q) - score(d_j, q)| <= ε, for some toleration ε >= 0.

Intuitively, this definition captures the notion that since a
search engine generates a ranking of documents by scoring them
according to various criteria, the scores used for ranking may
only accurately resolve document relevance to within some
toleration ε. This toleration factor reflects the inherent
resolving limitation of a given relevance scoring function, and
thus within this toleration factor, the ranking of documents can
be seen as arbitrary. As we are
interested in analyzing very large corpora and the behavior of
the various similarity measures in the limit as the collections
being searched grow infinitely large, we consider the situation
in which so many relevant documents are available to a search
engine for any given query q that the set of top-ranked documents
are all indistinguishable. To formalize this concept, let D(q) be
the set of all (maximally ranked) documents which are all
indistinguishable to search engine S for query q. Now we note
that as the size of the collection grows to infinity (i.e., N →
∞), then |D(q)| → ∞, since there will be infinitely many
documents that are equally relevant to a given query. Moreover,
since the documents in D(q) are indistinguishably relevant to q,
we assume that the top n results retrieved for query q will be a
uniformly random sampled subset of D(q) (with replacement, just
to simplify the analysis in the limit as |D(q)| grows large). The
use of a uniform distribution for sampling documents from D(q)
can be justified by the fact that since all documents in D(q) are
within the tolerance ε of the ranking function, their ranking is
arbitrary. Since in this context there is no reason to prefer one
particular distribution of rankings over any other, a maximally
entropic distribution (i.e., uniform) is a reasonable model to
use. In the sequel, assume that we are given two different
queries q1 and q2, which are so highly related to each other that
(again, for simplicity) we assume D(q1) = D(q2). While in reality
it is unlikely that two queries would share exactly the same set
of maximally relevant documents, we make this assumption (which
intuitively should lead to a very high similarity score between
q1 and q2) to show that even under conditions of extreme
similarity, there are shortcomings with the set overlap
similarity measure. We show that the kernel function does not
suffer from similar problems. Since we assume D(q1) = D(q2) and
always use the same search engine S in our analysis, we will
simply refer to D(q1) (and thus D(q2)) as D for brevity when
there is no possibility of ambiguity.
Theorem 1. Let R(q) be the set of n top-ranked documents with
respect to query q. Then, in the set overlap measure, the
expected normalized set overlap for queries q1 and q2 is

   E[overlap(R(q1), R(q2))] = 1 - (1 - 1/|D|)^n.

A desirable (and straightforward) corollary of this theorem is
that as we increase the result set size to capture all the
relevant documents (i.e., n → ∞), the expected overlap measure
approaches 1. Interestingly, however, for any fixed result set
size n, as |D| → ∞, the expected normalized set overlap goes to 0.
This result suggests that even if two queries are so similar as
to have the same set of highly relevant documents, in the limit
as the collection size increases (and thus the number of relevant
documents increases), the similarity as given by the set overlap
measure will go to 0. Note that this problem would be even worse
if we had not made the simplifying assumption that D(q1) = D(q2),
as the set overlap measure would approach 0 even more quickly.
While this result is not surprising, it does show the problem
that arises with using such a measure in the context of a large
collection such as the web. This is also borne out in the
anecdotal results seen in Section 4.
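The corollary above can be checked with a small simulation (ours, not from the paper): two "queries" each sample n results uniformly with replacement from the same pool of indistinguishable documents, and the mean normalized overlap is measured as the pool grows. A Jaccard-style normalization is used here for simplicity.

```python
import random

def expected_overlap(pool_size, n, trials=300, seed=0):
    """Simulate two queries sampling n results uniformly (with
    replacement) from the same pool of `pool_size` indistinguishable
    documents; return the mean Jaccard-normalized set overlap."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        r1 = {rng.randrange(pool_size) for _ in range(n)}
        r2 = {rng.randrange(pool_size) for _ in range(n)}
        total += len(r1 & r2) / len(r1 | r2)
    return total / trials
```

For a fixed result set size n, the simulated overlap shrinks toward 0 as the pool of equally relevant documents grows, mirroring the theorem's limiting behavior.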
Analyzing our kernel function under the same conditions as above,
we find that the measure is much more robust to growth in
collection size, making the measure much more amenable for use in
broad contexts such as the web. Since the kernel function
computes vectors based on the documents retrieved from the
relevant set D, we examine properties of the document vectors
from this set. Namely, we assume that the document vectors
generated from the documents in D are distributed according to
some arbitrary distribution F with mean direction vector μ and a
standard deviation σ, where σ measures the angular difference
from μ. Such distributions, which are defined based on direction
or angle, fall into the general class of circular distributions,
and a full discussion of them is beyond the scope of this paper
(we refer the interested reader to work on document analysis
using circular distributions, such as the von Mises distribution
[7, 2]). In this context, we note that QE(q) for a given query q
is simply a sample mean from the distribution F of document
vectors in D. This follows from the fact that the set of relevant
documents retrieved in response to q are simply samples from D,
and thus their corresponding document vectors are just samples
from F. The centroid C(q) of these vectors is defined to be the
mean (direction) of the vectors, and QE(q) is just the
unit-length normalized centroid (with the same direction as
C(q)), thus making it a sample mean of the vector directions in F.
Observation 2. As n → ∞, then QE(q) → μ.

This observation follows directly from the fact that as n → ∞,
the sample on which QE(q) is based becomes the whole population,
so QE(q) becomes the true population mean μ.

Observation 3. If queries q1 and q2 share the same
indistinguishable relevant set D, then as n → ∞ it follows that
K(q1, q2) → 1.

To show this observation, we note that if q1 and q2 share the
same indistinguishable relevant set D, then as n → ∞, QE(q1) → μ
and QE(q2) → μ. Thus, K(q1, q2) = QE(q1) . QE(q2) → μ . μ = 1,
since the query expansion vectors (and hence their limit μ) have
unit norm. This gives us
the intuitively desirable behavior (which is also shared with the
set overlap measure) that as the size of the results set used to
generate the query expansion vectors grows to encompass all
relevant documents, the similarity for two queries with the same
results set goes to 1. In contrast to the set overlap measure, we
find that the kernel function does not go to 0 as the number of
documents in the relevant results set D increases without bound.
Indeed, we can prove a stronger theorem with respect to the
property of our kernel function in the limit.
Theorem 2. Let σ be the standard deviation of the distribution F
of vectors corresponding to documents in the indistinguishable
set of query results D for queries q1 and q2. Then, with high
probability (at least 1 - δ), it holds that

   K(q1, q2) >= cos(2σ / sqrt(δn)).
Note that the bound on K(q1, q2) in Theorem 2 is independent of
|D|, even though it depends on σ. This follows from the fact that
since the vectors that correspond to documents in D are just
samples from some true underlying stationary distribution F (with
mean μ and standard deviation σ), the true standard deviation σ
does not change as |D| → ∞. Since σ is independent of |D|, then
so is the bound. This implies that the kernel function is robust
for use in large collections, as its value does not depend on the
number of relevant documents, but simply on the directional
dispersion (measured by the standard deviation over angles) of
the vectors of the relevant documents. This property makes the
kernel well-suited for use with large collections such as the
web. Furthermore, we can consider the more general (and
realistic) case where the sets of indistinguishable results for
queries q1 and q2 need not be the same (i.e., D(q1) ≠ D(q2)), and
now prove a more general result that subsumes Theorem 2 as a
special case.
Theorem 3. Let μ1 and μ2 be the respective means of the
distributions F1 and F2 of vectors corresponding to documents
from D(q1) and D(q2). Let σ1 and σ2 be the standard deviations of
F1 and F2, respectively. And let θ be the angle between μ1 and
μ2. Then, with high probability (at least 1 - δ), it holds that

   K(q1, q2) >= cos(θ + (σ1 + σ2) / sqrt(δn)).

We note that Theorem 2 is simply a special case of Theorem 3,
where F1 = F2, μ1 = μ2 and σ1 = σ2 = σ, and thus θ = 0.
Theorem
2 was derived
separately just to provide a direct contrast with the normalized
set overlap measure under the same conditions. Intuitively,
Theorem
3 tells us
that that the kernel function is essentially trying to measure
the cosine of the angle
between
the mean vectors of the documents relevant to each query, with a
``noise'' term (proportional to
) that depends on the
natural dispersion (standard deviation) of the documents relevant
to each query and the size
of
the sample used to generate the query expansion. Thus, if we were
to think of the set of documents that are relevant to a given
query
as the ``topic'' of
, then the kernel is
attempting to measure the mean ``topical'' difference between the
queries, independent of the number of documents that make up each
topic. This sort of behavior (and its independence from the
overall collection size) is an intuitively desirable property for
a similarity function.
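The centroid view of the kernel suggested by Theorem 3 can be illustrated directly in code. The sketch below is a minimal, simplified rendering (not the production system): `fetch_results` is a hypothetical stand-in for a web search API, and raw term frequencies are used in place of the full system's TF-IDF weighting and term truncation.

```python
import math
from collections import Counter

def expansion_vector(snippet, fetch_results, n=50):
    """Build a query expansion vector for a short text snippet.

    `fetch_results(snippet, n)` is a hypothetical search API stub that
    returns the text of the top-n documents retrieved for the snippet.
    """
    centroid = Counter()
    for doc in fetch_results(snippet, n):
        tf = Counter(doc.lower().split())
        # L2-normalize each document vector before summing, so long
        # documents do not dominate the centroid.
        norm = math.sqrt(sum(v * v for v in tf.values()))
        for term, count in tf.items():
            centroid[term] += count / norm
    # L2-normalize the centroid itself: the kernel then becomes an inner
    # product of unit vectors, i.e. a cosine similarity in [0, 1].
    norm = math.sqrt(sum(v * v for v in centroid.values()))
    return {t: v / norm for t, v in centroid.items()}

def kernel(x, y, fetch_results, n=50):
    """K(x, y): inner product of the normalized expansion centroids."""
    qe_x = expansion_vector(x, fetch_results, n)
    qe_y = expansion_vector(y, fetch_results, n)
    return sum(w * qe_y.get(t, 0.0) for t, w in qe_x.items())
```

With indistinguishable result sets the two centroids coincide and the kernel returns 1, matching the θ = 0 special case of the theorem.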
6 Related Query Suggestion
Armed with promising anecdotal evidence as well as theoretical results that argue in favor of using this kernel when comparing short texts, we turn our attention to the task of developing a simple application based on this kernel. The application we choose is query suggestion: that is, suggesting potentially related queries to the users of a search engine to give them additional options for information finding. We note that there is a long history of work in query refinement, including the previously mentioned work in query expansion [3, 13], harnessing relevance feedback for query modification [10], using precomputed term similarities for suggestions [19], linguistically mining documents retrieved in response to a search for related terms and phrases [20, 1], and even simply finding related queries in a thesaurus. While this is certainly an active area of work in information retrieval, we note that improving query suggestion is not the primary focus of this work. Thus, we intentionally do not compare our system with others. Rather, we use query suggestion as a means of showing the potential utility of our kernel function in just one, of potentially many, real-world applications. We provide a user evaluation of the results in this application to get a more objective measure of the efficacy of our kernel.

At a high level, our query suggestion system starts with an initial repository Q of previously issued user queries (for example, culled from search engine logs). Now, for any newly issued user query q, we can compute our kernel function K(q, q_i) for all q_i in Q and suggest the related queries q_i that have the highest kernel score with q (subject to some post-filtering to eliminate related queries that are too linguistically similar to each other). More specifically, we begin by precomputing the query expansions for a repository Q of approximately 116 million popular user queries issued in 2003, determined by sampling anonymized web search logs from the Google search engine. After generating these query expansions, we index the resulting vectors for fast retrieval in a retrieval system R. Now, for any newly observed user query q, we can generate its query expansion QE(q) and use this entire expansion as a disjunctive query to R, finding all existing query expansions QE(q_i) in the repository that potentially match QE(q). Note that if a query expansion QE(q_i) indexed in R does not match QE(q) in at least one term (i.e., it is not retrieved), then we know K(q, q_i) = 0 since there are no common terms in QE(q) and QE(q_i). For each retrieved query expansion QE(q_i), we can then compute the inner product K(q, q_i).
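The repository matching step just described can be sketched with a toy inverted index over sparse expansion vectors (assumed already L2-normalized). This is an illustrative data structure only, not the actual retrieval system R; the class and method names are our own.

```python
from collections import defaultdict

class ExpansionIndex:
    """Toy inverted index over precomputed query-expansion vectors.

    Each repository query's sparse expansion vector is stored term by
    term, so a new query only scores repository queries that share at
    least one expansion term; every other kernel value is exactly zero
    and is never computed.
    """
    def __init__(self):
        self.postings = defaultdict(list)  # term -> [(query, weight)]

    def add(self, query, expansion):
        for term, weight in expansion.items():
            self.postings[term].append((query, weight))

    def kernel_scores(self, expansion):
        """Inner products K(q, q_i) for all q_i matching >= 1 term."""
        scores = defaultdict(float)
        for term, weight in expansion.items():
            for query, stored in self.postings.get(term, []):
                scores[query] += weight * stored
        return dict(scores)
```

Accumulating partial products term by term is the standard way to take inner products against a disjunctive query without touching the vast majority of the repository.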
Table 2: Examples of suggested queries, along with corresponding kernel scores and human rater evaluations.

Original Query       Suggested Query                        Kernel Score   Human Rating
california lottery   california lotto home                  0.812          3
                     winning lotto numbers in california    0.792          5
                     california lottery super lotto plus    0.778          3
valentines day       2003 valentine's day                   0.832          3
                     valentine day card                     0.822          4
                     valentines day greeting cards          0.758          4
                     I love you valentine                   0.736          2
                     new valentine one                      0.671          1
To actually determine which of the matched queries from the repository to suggest to the user, we use the following algorithm, where the constant MAX is set to the maximum number of suggestions that we would like to obtain:

Given: user query q, and list of matched queries q_1, ..., q_m from the repository
Output: list of queries to suggest, S
1. Initialize suggestion list S = {}
2. Sort the kernel scores K(q, q_i) in descending order to produce an ordered list q_(1), ..., q_(m) of corresponding queries.
3. i = 1
4. While i <= m and |S| < MAX do
   4.1 If q_(i) differs in more than |p|/2 terms from every query p in S ∪ {q} then
       4.1.1 S = S ∪ {q_(i)}
   4.2 i = i + 1
5. Return suggestion list S

Here |p| denotes the number of terms in query p. Thus, the test in Step 4.1 above is our post-filter: we only add another suggested query q_(i) if it differs by more than half as many terms from any other query already in the suggestion list (as well as the original user query q). This helps promote linguistic diversity in the set of suggested queries. The resulting list S of query suggestions can be presented to the search engine user to guide them in conducting follow-up searches.
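The selection loop above is short enough to write out directly. The sketch below is a minimal implementation under one assumed reading of the Step 4.1 post-filter (a candidate must contribute more new terms than half the length of each query it is compared against); the function and variable names are ours.

```python
def suggest(query, scored, max_suggestions=5):
    """Select diverse query suggestions from kernel-scored candidates.

    `scored` maps candidate queries to kernel scores.  A candidate is
    kept only if it passes the diversity post-filter against the user
    query and every suggestion already kept.
    """
    def diverse(candidate, other):
        cand_terms = set(candidate.split())
        other_terms = set(other.split())
        # Assumed post-filter: the candidate must differ from `other`
        # in more than half as many terms as `other` contains.
        return len(cand_terms - other_terms) > len(other_terms) / 2

    suggestions = []
    # Visit candidates in descending order of kernel score.
    for cand, _ in sorted(scored.items(), key=lambda kv: -kv[1]):
        if len(suggestions) >= max_suggestions:
            break
        if all(diverse(cand, q) for q in [query] + suggestions):
            suggestions.append(cand)
    return suggestions
```

Note how a near-duplicate such as ``california lottery numbers'' would be filtered out for the query ``california lottery'' even with a high kernel score, while a more diverse suggestion survives.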
7 Evaluation of Query Suggestion System
In order to evaluate our kernel within the context of this query suggestion system, we enlisted nine human raters who are computer scientists familiar with information retrieval technologies. Each rater was asked to issue queries from the Google Zeitgeist^{5} in a different month of 2003 (since our initial repository of queries to suggest was culled near the start of 2003). The Google Zeitgeist tracks popular queries on the web monthly. We chose to use such common queries for evaluation because if useful suggestions were found, they could potentially be applicable for a large number of search engine users who had the same information needs. Each rater evaluated the suggested queries provided by the system on a 5-point Likert scale, defined as:
1: suggestion is totally off topic.
2: suggestion is not as good as original query.
3: suggestion is basically same as original query.
4: suggestion is potentially better than original query.
5: suggestion is fantastic: should suggest this query, since it might help a user find what they're looking for if they issued it instead of the original query.
In our experiment we set the maximum number of suggestions for each query (MAX) to 5, although some queries yielded fewer than this number of suggestions due to having fewer suggestions pass the post-filtering process. A total of 118 user queries, which yielded 379 suggested queries (an average of 3.2 suggestions per query), were rated. Note that some raters evaluated a different number of queries than other raters. In Table 2 we provide examples of two user queries, the query suggestions made using our system, the corresponding kernel scores, and the human evaluation ratings for the suggested queries. As can be seen in the first example, it is not surprising that users interested in the ``california lottery'' would prefer to find winning numbers rather than simply trying to get more information on the lottery in general. In the second example, we find that users querying for ``valentines day'' may be looking to actually send greeting cards. The suggestion ``new valentine one'' actually refers to a radar detector named Valentine One and thus is clearly off-topic with regard to the original user query.

Since each query suggestion has a kernel score associated with it, we can determine how suggestion quality is correlated with the kernel score by looking at the average rating over all suggestions that had a kernel score above a given threshold. If the kernel is effective, we would generally expect higher kernel scores to lead to more useful queries suggested to the user (as they would tend to be more on-topic even given the post-filtering mechanism that attempts to promote diversity among the query suggestions). Moreover, we would expect that overall the suggestions would often be rated close to 3 (or higher) if the kernel were effective at identifying query suggestions semantically similar to the original query.
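This thresholded analysis amounts to a one-line aggregation over (kernel score, rating) pairs; a minimal sketch, with a function name of our choosing:

```python
def avg_rating_above(pairs, threshold):
    """Average human rating over suggestions whose kernel score is at
    least `threshold`; `pairs` is a list of (kernel_score, rating)
    tuples.  Returns None when no suggestion clears the threshold."""
    kept = [rating for score, rating in pairs if score >= threshold]
    return sum(kept) / len(kept) if kept else None
```

Sweeping the threshold over this function produces exactly the curves reported below.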
Figure 1: Average ratings at various
kernel thresholds.

The results of this experiment are shown in Figure 1, which plots the average user rating for query suggestions, where we use a kernel score threshold to only consider suggestions that scored at that threshold or higher with the original query. Indeed, we see that the query suggestions are generally rated close to 3 (same as the original query), but that the rating tends to increase with the kernel score. This indicates that queries deemed by the kernel to be very related to the original query are quite useful to users in honing their information need, especially when we allow for some diversity in the results using the post-filtering mechanism. In fact, we found that without the use of the post-filtering mechanism, the results suggested by the system were often too similar to the original query to provide much additional utility for query suggestion (although this was indicative of the kernel being effective at finding related queries).
Figure 2: Average ratings versus average
number of query suggestions made for each query as kernel
threshold is varied from 0.85 down to 0.3.

Figure 2 shows a graph analogous to a Precision-Recall curve, where we plot the average user rating for query suggestions versus the average number of suggestions given per query as we vary the kernel score threshold from 0.85 down to 0.3. We see a clear trade-off between the quality of the suggestions presented to the user and the number of suggestions given. Indeed, it is possible, on average, to give two query suggestions per query whose quality is (slightly) higher than that of the original query.
8 Conclusions and Future Work
We have presented a new kernel function for measuring the semantic similarity between pairs of short text snippets. We have shown, both anecdotally and in a human-evaluated query suggestion system, that this kernel is an effective measure of similarity for short texts, and works well even when the short texts being considered have no common terms. Moreover, we have also provided a theoretical analysis of the kernel function that shows it is well-suited for use with the web.

There are several lines of future work for which this kernel lays the foundation. The first is improvement in the generation of query expansions with the goal of improving the match score for the kernel function. The second is the incorporation of this kernel into other kernel-based machine learning methods to determine its ability to provide improvement in tasks such as classification and clustering of text. Also, there are certainly other potential web-based applications, besides query suggestion, that could be considered as well. One such application is in a question answering system, where the question could be matched against a list of candidate answers to determine which is the most semantically similar. For example, using our kernel we find that K(``Who shot Abraham Lincoln'', ``John Wilkes Booth'') = 0.730. Thus, the kernel does well in giving a high score to the correct answer to the question, even though it shares no terms in common with the question. Alternatively, K(``Who shot Abraham Lincoln'', ``Abraham Lincoln'') = 0.597, indicating that while the question is certainly semantically related to ``Abraham Lincoln'', the true answer to the question is in fact more semantically related to the question. Finally, we note that this kernel is not limited to use on the web, and can also be computed using query expansions generated over domain-specific corpora in order to better capture contextual semantics in particular domains. We hope to explore such research avenues in the future.
We thank
Amit Singhal for many invaluable discussions related to this
research. Additionally, we appreciate the feedback provided on
this work by the members of the Google Research group, especially
Vibhu Mittal, Jay Ponte, and Yoram Singer. We are also indebted
to the nine human raters who took part in the query suggestion
evaluation.
References

[1] P. Anick and S. Tipirneni. The paraphrase search assistant: Terminological feedback for iterative information seeking. In Proceedings of the 22nd Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 153-159, 1999.
[2] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6:1345-1382, 2005.
[3] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In The Third Text REtrieval Conference, pages 69-80, 1994.
[4] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[5] N. Cristianini, J. Shawe-Taylor, and H. Lodhi. Latent semantic kernels. Journal of Intelligent Information Systems, 18(2):127-152, 2002.
[6] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[7] I. S. Dhillon and S. Sra. Modeling data using directional distributions, 2003.
[8] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In CIKM '98: Proceedings of the Seventh International Conference on Information and Knowledge Management, 1998.
[9] L. Fitzpatrick and M. Dent. Automatic feedback using past queries: Social searching? In Proceedings of the 20th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 306-313, 1997.
[10] D. Harman. Relevance feedback and other query modification techniques. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 241-263. Prentice Hall, 1992.
[11] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137-142, 1998.
[12] J. S. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. In Advances in Neural Information Processing Systems (NIPS) 15, pages 657-664, 2002.
[13] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In Proceedings of the 21st Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 206-214, 1998.
[14] V. V. Raghavan and H. Sever. On the reuse of past optimal queries. In Proceedings of the 18th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 344-350, 1995.
[15] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.
[16] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
[17] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18:613-620, 1975.
[18] A. Vinokourov, J. Shawe-Taylor, and N. Cristianini. Inferring a semantic representation of text via cross-language correlation analysis. In Advances in Neural Information Processing Systems (NIPS) 15, pages 1473-1480, 2002.
[19] B. Vélez, R. Wiess, M. A. Sheldon, and D. K. Gifford. Fast and effective query refinement. In Proceedings of the 20th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 6-15, 1997.
[20] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 4-11, 1996.
Footnotes

^{1} Leading search engines claim index sizes of at least 20 billion documents at the time of this writing.

^{2} While the real focus of our work is geared toward short text snippets, there is no technical reason why the text must have limited length, and in fact it can be arbitrary text.

^{3} We prefer the term multi-faceted over ambiguous, since multi-faceted terms may have the same definition in two contexts, but the accepted semantics of that definition may vary in context. For example, the term ``travel'' has the same definition in both the phrases ``space travel'' and ``vacation travel'', so it is (strictly speaking) not ambiguous here, but the semantics of what is meant by traveling in those two cases is different.

^{4} The distribution is arbitrary up to the fact that its first two moments, mean and variance, exist (which is a fairly standard and non-restrictive assumption).

^{5} http://www.google.com/intl/en/press/zeitgeist.html