Copyright is held by the World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2006, May 23–26, 2006, Edinburgh, Scotland.
ACM 1-59593-323-9/06/0005.
We introduce the notion of query substitution, that is, generating a new query to replace a user's original search query. Our technique uses modifications based on typical substitutions web searchers make to their queries. In this way the new query is strongly related to the original query, containing terms closely related to all of the original terms. This contrasts with query expansion through pseudo-relevance feedback, which is costly and can lead to query drift. This also contrasts with query relaxation through boolean or TFIDF retrieval, which reduces the specificity of the query. We define a scale for evaluating query substitution, and show that our method performs well at generating new queries related to the original queries. We build a model for selecting between candidates, using a number of features of the query-candidate pair, and fitting the model to human judgments of relevance of query suggestions. This further improves the quality of the candidates generated. Experiments show that our techniques significantly increase coverage and effectiveness in the setting of sponsored search.
H.3.3 [Information Search and Retrieval]: Query formulation, Search process, Retrieval models
Algorithms, Experimentation
Query rewriting, query substitution, paraphrasing, sponsored search
Existing solutions to this type of problem include relevance feedback and pseudo-relevance feedback, query term deletion [jonesFainSIGIR2003], substituting query terms with related terms from retrieved documents [terra04:Scoring], and Latent Semantic Indexing (LSI) [deerwester90indexing]. Pseudo-relevance feedback involves submitting a query for an initial retrieval, processing the resulting documents, modifying the query by expanding it with additional terms from the documents retrieved, and then performing a second retrieval with the modified query. Pseudo-relevance feedback has limitations in effectiveness [ruthven2003]. It may lead to query drift, as unrelated terms are added to the query. It is also computationally expensive. Substituting query terms with related terms from retrieved documents also relies on an initial retrieval. Query relaxation or deleting query terms leads to a loss of specificity from the original query. LSI is also computationally expensive. As an alternative to automatic query modification or expansion, one could allow the user to select the appropriate related terms [AnickSIGIR2003]. This allows for user-aided disambiguation, though also at the cost of an extra iteration and cognitive load.
We propose query modification based on pre-computed query and phrase similarity, combined with a ranking of the proposed queries. Our similar queries and phrases are derived from user query sessions, while our learned models used for reranking are based on the similarity of the new query to the original query as well as other indicators. We are able to generate suggestions for over half of all web search queries, even after eliminating sensitive queries such as those containing adult terms. A random sample of query pairs is labeled for training the models, and we find that the proportion of good suggestions in the sample is substantial (even without any training): 55% of the suggestions in the sample are labeled as highly relevant. When we apply machine learning, we improve performance over the baseline significantly, raising the detection accuracy for highly relevant pairs to 79%. We describe the application of our methods to sponsored search, discuss coverage, and examine the types of useful suggestions and mistakes that our techniques make.
Others using user query sessions as a source of information to improve retrieval include Furnas [furnasCHI1985], who explicitly asked users whether their reformulations should be used for indexing, and Radlinski and Joachims [radlinskiJoachims], who use query chains followed by clicks as a source of relevance-ordering information. Cucerzan and Brill [cucerzan_brill04:spelling] use query logs for spelling correction, but do not take advantage of user query reformulations.
Our contributions are: (1) identification of a new source of data for identifying similar queries and phrases, which is specific to user search data; (2) the definition of a scheme for scoring query suggestions, which can be used for other evaluations; (3) an algorithm for combining query and phrase suggestions which finds highly relevant phrases and queries 55% of the time, and broadly relevant phrases and queries 87.5% of the time; and (4) identification of features which are predictive of highly relevant query suggestions, which allow us to trade off between precision and recall.
In Section we define
the problem, describe our four types of suggestion quality, and
discuss possible query rewriting tasks and the sponsored search
application. In Section
we
describe our technique for identifying related phrases
(substitutables) based on user query-rewrite sessions,
and the basic process for generation of candidate query
substitutions. In Section
we
describe a basic rewriting system, combining whole-query and
phrase-substitutions based on user-session analysis. We then
describe experiments using machine learning to improve the
selection of query rewriting suggestions. In Section
we show
that using machine-learned models improves the quality of the
top-ranked suggestion, leading to high quality query
substitutions, as measured through human evaluation. In Section
we show that we are able to assign a reliable confidence score to
the query-suggestion pairs, which allows us to set thresholds and
predict the performance of query suggestions when deployed.
There are many ways in which the new query could be related to the original query. Some may be improvements to the original query, others may be neutral, while others may result in a loss of the original meaning.
To accommodate and allow quantification of these and other
changes, we define four types of query substitutions. We place
these on a scale of 1 to 4, with the most related suggestions
assigned a score of 1, and the least related assigned a score of
4. Examples of these are shown in Table , and we
define these classes below.
While in general web search there are often many documents containing all of the user's search terms, this is not always true for sponsored search. The set of advertiser results is much smaller than the set of all possible web pages. For this reason, we can think of sponsored search as information retrieval over a small corpus, where in many cases a conjunctive match on all query terms will not give a result. In this setting, the ability to modify a user's query (which may have no sponsored results) to a closely related query (which does have results) has extremely high utility.
Sponsored search has a few interesting constraints that we briefly touch on. The primary constraint is that we have a finite set of possible rewritings: the set of phrases for which we can return an advertisement. On one hand, this increases the difficulty of the task in terms of coverage, because even if we find a good rewriting for the initial query, it won't necessarily be found in an advertiser's listing. On the other hand, the constraint acts as a basic (language model) filter. If the rewritten form of the query is available for sponsored search, it is more likely to be meaningful. Thus this constraint can help filter out nonsensical rewritings (e.g., ``air jordan iv retro'' → ``jordan intravenous 50s''). Another constraint relates to the classes of queries we can modify. For example, we avoid generating suggestions for sensitive queries such as those containing adult terms.
We expect that the work we present here is relevant to other applications of query rewriting, such as for general web retrieval.
Thus we would like to have a source of similar queries and similar phrases. While we could use static sources such as WordNet [fellbaum98wordnet] to identify similar phrases, in general this will not allow us to generate suggestions for new concepts such as products, movies and current affairs that arise in query streams. Another possibility is to use within-document co-occurrence statistics to find related phrases, as these have been found useful for finding query-relevant terms [terra04:Scoring]. One could also use anchor text, which has been found useful for generating query refinements [kraftAnchorText]. To use data as close as possible to our target task, however, we chose to work with user sessions from search query logs. Previous work has shown these sessions to contain around 50% reformulations [jonesFainSIGIR2003,spink00:use]. In these query-session reformulations, a user modifies a query to another closely related query, through word insertions, deletions and substitutions, as well as re-phrasing of the original query.
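To make the session-based extraction concrete, here is a minimal sketch of harvesting candidate (query, rewrite) pairs from a query log. The record layout (user id, timestamp, query) and the 30-minute session gap are illustrative assumptions, not the exact procedure used to build the substitutables.

```python
from collections import Counter
from itertools import groupby

SESSION_GAP_SECONDS = 30 * 60  # assumed session timeout, not the paper's definition

def reformulation_pairs(log_records):
    """Yield (query, rewrite) pairs for successive queries within one session.

    log_records: iterable of (user_id, unix_timestamp, query), assumed to be
    sorted by user and then by time.
    """
    for _, records in groupby(log_records, key=lambda r: r[0]):
        prev_time, prev_query = None, None
        for _, ts, query in records:
            if prev_query is not None and ts - prev_time <= SESSION_GAP_SECONDS:
                if query != prev_query:           # ignore exact repeats
                    yield prev_query, query       # candidate reformulation
            prev_time, prev_query = ts, query

def pair_counts(log_records):
    """Count how often each (query, rewrite) pair occurs across all sessions."""
    return Counter(reformulation_pairs(log_records))
```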
Note that the point-wise mutual information measure we use to identify phrases looks for adjacent terms whose mutual information is above a threshold:

MI(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1) P(w_2)}

where we set the threshold to be 8. We also experimented with the connexity approach to identifying phrases [fastwww03], and found that our methods are not sensitive to the approach used to identify phrases.
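As an illustration, the following sketch applies the point-wise mutual information test to adjacent query terms, merging a pair into a phrase when its PMI exceeds the threshold of 8. The maximum-likelihood probability estimates from raw corpus counts, the greedy left-to-right merging, and the choice of log base 2 are assumptions made for the example.

```python
import math

PMI_THRESHOLD = 8.0  # threshold from the text; the log base is an assumption

def pmi(count_xy, count_x, count_y, total_tokens):
    """log2 of P(x, y) / (P(x) * P(y)) under maximum-likelihood estimates."""
    p_xy = count_xy / total_tokens
    p_x = count_x / total_tokens
    p_y = count_y / total_tokens
    return math.log2(p_xy / (p_x * p_y))

def segment(query_terms, unigram_counts, bigram_counts, total_tokens):
    """Greedily merge adjacent terms whose PMI clears the threshold."""
    segments, i = [], 0
    while i < len(query_terms):
        if i + 1 < len(query_terms):
            pair = (query_terms[i], query_terms[i + 1])
            if bigram_counts.get(pair, 0) > 0 and pmi(
                    bigram_counts[pair], unigram_counts[pair[0]],
                    unigram_counts[pair[1]], total_tokens) > PMI_THRESHOLD:
                segments.append(" ".join(pair))
                i += 2
                continue
        segments.append(query_terms[i])
        i += 1
    return segments
```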
Because the log-likelihood ratio (LLR) statistic is asymptotically χ²-distributed with one degree of freedom, an LLR score of 3.84 gives us 95% confidence that we can reject the null hypothesis and conclude that two phrases are statistically significantly related. However, this will give us 1 in 20 spurious relationships. As we are dealing with millions of phrases, we set the threshold on LLR much higher, generally to above 100, based on observation of the substitutable pairs.
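The log-likelihood ratio itself can be computed with Dunning's standard formulation over a 2x2 contingency table of rewrite counts. The particular table layout below (how often phrase p1 is rewritten to p2 versus to and from other phrases) is an assumed framing for illustration; the 3.84 and 100 thresholds come from the text.

```python
import math

def _sum_x_log_x(*counts):
    """Sum of k * ln(k / total) over the given counts (0 * ln 0 treated as 0)."""
    total = sum(counts)
    return sum(k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio statistic for a 2x2 contingency table.

    k11: rewrites of p1 to p2            k12: rewrites of p1 to other phrases
    k21: rewrites of other phrases to p2 k22: all remaining rewrites
    """
    rows = _sum_x_log_x(k11 + k12, k21 + k22)
    cols = _sum_x_log_x(k11 + k21, k12 + k22)
    cells = _sum_x_log_x(k11, k12, k21, k22)
    return 2.0 * (cells - rows - cols)

def is_substitutable(k11, k12, k21, k22, threshold=100.0):
    """Keep only pairs whose LLR comfortably exceeds the 95% critical value (3.84)."""
    return llr(k11, k12, k21, k22) >= threshold
```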
We seek to generate statistically significant related queries for arbitrary input queries. For frequent queries, we have dozens of such related queries, but for less frequent queries, we may not have seen sufficiently many instances of them to have any statistically significant related queries. We cannot ignore infrequent queries: the power-law (Zipf) distribution of query terms leads to a large proportion of rare queries.
In the same way as we generated the phrase-substitutables, we can break up the input query into segments, and replace one or several segments by statistically significant related segments. This will help cover the infrequent queries. Figure shows a schematic of the approach we take. This gives us a set of query-substitution candidates, which we will denote Q'.
As an example, consider the query ``catholic baby names''. We
can find whole-query substitutions from our query substitutables,
such as ``catholic names''. We can also segment the query into
the two phrases ``catholic'' and ``baby names'', and find
substitutions for these phrases (see Figure ).
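A minimal sketch of how the candidate pool Q' could be assembled for this example, assuming precomputed lookup tables of whole-query and phrase substitutables (the toy tables below are illustrative, not actual substitutables data):

```python
from itertools import product

def candidates(query, segments, query_subs, phrase_subs):
    """Return the candidate set Q' for `query`.

    query_subs:  dict mapping a whole query -> list of substitutable queries
    phrase_subs: dict mapping a phrase -> list of substitutable phrases
    segments:    the query split into phrases, e.g. ["catholic", "baby names"]
    """
    out = set(query_subs.get(query, []))              # whole-query substitutions

    # Phrase-level substitutions: each segment may stay as-is or be replaced.
    options = [[seg] + phrase_subs.get(seg, []) for seg in segments]
    for combo in product(*options):
        candidate = " ".join(combo)
        if candidate != query:
            out.add(candidate)
    return out

# Toy example in the spirit of the "catholic baby names" discussion:
query_subs = {"catholic baby names": ["catholic names"]}
phrase_subs = {"catholic": ["christian"], "baby names": ["baby name meanings"]}
print(candidates("catholic baby names", ["catholic", "baby names"],
                 query_subs, phrase_subs))
```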
For each evaluation, we use a random sample of 1000 initial queries sampled from query logs from a time period disjoint from our substitutables query data, and generate a single suggestion for each. We will evaluate the accuracy of approaches to choosing the suggestion to generate by reporting the proportion of suggestions falling into the classes precise (classes 1+2) and broad (classes 1+2+3).
To assess whether there are properties of the suggestions which are predictive of good quality, we will train a machine-learned classifier on the labeled (query, suggestion) pairs. We will then evaluate our ability to produce suggestions of higher quality for fewer queries by plotting precision-recall curves.
We first describe two simple methods for ranking candidates.
In order to develop a more sophisticated ranking scheme, we take
the top suggestion from one of these ranking schemes and use
machine learning and human judgements to learn a model of high
quality suggestions. We describe how we use the learned
classifier to learn a more accurate re-ranking of candidate
suggestions in Section .
For randomRank we require all whole query and phrase suggestions to have an LLR score of at least 50. We set the threshold lower than the value of 100 we have empirically observed to be a useful setting, in order to assess the performance of suggestions over a range of LLR scores.
We sample one query uniformly at random from this sample. Note that we are much more likely to select a phrase substitution than a whole-query substitution: for a query with two phrases, phrase substitution generates many more new queries than the at most 10 generated by whole-query substitution.
We designed our next method, LLRNumSubst, using intuitions based on our experience with samples of the generated candidates: as its name suggests, it favors candidates with high LLR scores and candidates generated with fewer substitutions.
Rather than making assumptions about what makes a good
substitution, we can treat our problem as a machine learning
problem. The target to learn is the quality of the substitution,
and we provide features that we expect to be reasonably
efficiently computable. We tried both linear regression and
binary classification approaches. For classification, we learn a
binary classifier over query pairs:
f : (q, s) \mapsto \{\text{relevant}, \text{not relevant}\}    (1)
We evaluate on two binary classification tasks, as specified in Section : (1) precise rewriting, distinguishing highly relevant suggestions (classes 1+2) from the rest (classes 3+4); and (2) broad rewriting, distinguishing broadly relevant suggestions (classes 1+2+3) from irrelevant ones (class 4).
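The sketch below illustrates this setup for the precise-rewriting task, turning each labeled (query, suggestion) pair into a small feature vector and fitting a binary classifier. The specific learner (scikit-learn logistic regression) and the three features used here are illustrative assumptions; the paper's experiments draw on a larger feature pool and other model families.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def edit_distance(a, b):
    """Levenshtein distance over characters (strings) or over words (lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def features(query, suggestion, num_segments_substituted):
    return [
        edit_distance(query, suggestion),                  # character edit distance
        edit_distance(query.split(), suggestion.split()),  # word edit distance
        num_segments_substituted,                          # 0 for whole-query substitution
    ]

def train_precise_classifier(labeled_pairs):
    """labeled_pairs: iterable of (query, suggestion, n_subst, label) with label in {1,2,3,4}."""
    labeled_pairs = list(labeled_pairs)
    X = np.array([features(q, s, n) for q, s, n, _ in labeled_pairs], dtype=float)
    y = np.array([1 if label <= 2 else 0 for _, _, _, label in labeled_pairs])
    return LogisticRegression().fit(X, y)
```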
It is interesting to note that none of the features relating
to the substitution statistics appeared in the best models. This
may be due to the fact that the training query suggestions were
selected using the LLRNumSubst method which takes LLR score into
account. The ranking function we learn is shown in Equation
.
We can interpret this model as saying that, given that a suggestion is among the top ranked suggestions according to the LLRNumSubst ranking, we should prefer query substitutions with small edit distance (perhaps spelling and morphological changes) and with small word edit distances (perhaps word insertions or deletions). We should prefer whole-query suggestions, and if we substitute at the phrase segment level, the fewer the substitutions the better.
If (number of tokens in common is nonZero) then {1+2}
else if (prefix overlap is nonZero) then {1+2}
else {3+4}
end
end

We can interpret this model as saying that a suggestion should ideally contain some words from the initial query; otherwise, the modifications should not be made at the beginning of the query.
We would like to apply one of these models for selecting the suggestion for an initial query from the pool of possible candidates, Q'. For our subsequent experiments we chose the linear regression model for its execution-time efficiency, as it only requires three easy-to-compute features.
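As a concrete (and deliberately simplified) illustration, the sketch below ranks candidates with a linear score over those three features: character edit distance, word-level edit distance, and the number of substituted segments. The weights are placeholders, not the learned regression coefficients.

```python
def edit_distance(a, b):
    """Levenshtein distance over characters (strings) or over words (lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

# Placeholder weights: smaller edits and fewer substitutions score higher.
WEIGHTS = (-1.0, -2.0, -1.5)

def linear_score(query, candidate, num_segments_substituted):
    w_char, w_word, w_subst = WEIGHTS
    return (w_char * edit_distance(query, candidate)
            + w_word * edit_distance(query.split(), candidate.split())
            + w_subst * num_segments_substituted)

def best_suggestion(query, pool):
    """pool: iterable of (candidate, number of substituted segments) pairs."""
    return max(pool, key=lambda c: linear_score(query, c[0], c[1]))[0]
```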
We can now compare the precision of the top-ranked suggestion
of these three ranking schemes. Note that each was used to
generate suggestions for a distinct sample of 1,000 queries. In
Table we see
that the substitutables are an excellent source of related terms:
the baseline random ranking achieved precision of 55% on the
specific rewriting task of identifying highly relevant
suggestions for queries. The LLRNumSubst algorithm
improved the performance to 66%, while NumSubstEdit's
re-ranking of a separate sample of LLRNumSubst candidates
increased precision of the top suggestion to 74%.
For generating a suggestion for broad rewriting, which allows suggestions from each of classes {1,2,3}, LLRNumSubst generates an appropriate top-ranked suggestion for 87.5% of a sample of 1000 queries. Our source of data, the substitutables, is by construction a source of terms that web searchers substitute for one another, leading to a large proportion of broadly related suggestions. Re-ranking LLRNumSubst's candidate suggestions with NumSubstEdit does not result in increased precision of the top-ranked candidate for broad rewriting.
Under these constraints, our methods (e.g., Random
Selection of Section ) yield at
least one suggestion for about 50% of queries.
When we break down the coverage by query decile we see that
the coverage varies by decile. Decile 1 contains the queries
constituting the top 10% of query volume, while decile 10
contains the most infrequent queries, most of which have a search
frequency of only 1 in the sampled time-period. The histogram
shown in Figure was
generated by using the NumSubstEdit method to select a
candidate. We see that our method allows generation of
suggestions for both frequent and infrequent queries.
We generate a suggestion for more than 10% of queries from decile 10, many of which would typically have no sponsored search result. This incremental coverage is primarily due to phrase substitution: for the frequent queries, we were able to find whole query substitutions, but for the less frequent ones, only breaking the query into phrase segments allowed us to generate a result.
Examples of queries with a low monthly query frequency for which we changed 1 or 2 units:
new testament passages → bible quotes
cheap motels manhattan, ny → cheap hotels manhattan, ny
We evaluate our confidence models using root-mean-squared error (RMSE):

\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2}

where p_i is the predicted probability that suggestion i is correct and y_i \in \{0, 1\} is its label.
Isotonic regression [lutz2000] (Iso) performs well, but tends to overfit. Fitting an asymmetric distribution [bennett2003] (ADist) is a promising technique in general, but for our task no true distribution could model the class well, and thus the performance was not as good as with the other techniques. The best results were obtained with a sigmoid [platt1999] (SS). The sigmoid also has the advantage of being a simple model to implement.
Consequently, we used the sigmoid to model the probability function:

p(s) = \frac{1}{1 + \exp(A s + B)}

where s is the raw score assigned to a query-suggestion pair and A and B are fitted parameters.
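A minimal sketch of fitting this sigmoid to held-out (score, correctness) pairs, in the spirit of Platt scaling. Fitting A and B by minimizing the negative log-likelihood with a general-purpose optimizer is an implementation assumption.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(scores, labels):
    """Fit p(s) = 1 / (1 + exp(A * s + B)) to binary correctness labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)

    def negative_log_likelihood(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(a * scores + b))
        p = np.clip(p, 1e-12, 1.0 - 1e-12)   # numerical safety
        return -np.sum(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

    a, b = minimize(negative_log_likelihood, x0=[-1.0, 0.0], method="Nelder-Mead").x
    return lambda s: 1.0 / (1.0 + np.exp(a * np.asarray(s, dtype=float) + b))

# Usage: confidence = fit_sigmoid(validation_scores, validation_labels)
#        p_correct = confidence(new_score)
```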
We could directly rely on the observed data, and for each
value of the threshold, count how many times suggestions with a
score higher than this threshold were correct. A drawback of this
approach is that for high thresholds, we will have very little
data with which to estimate the precision. This can be observed
on the graph of the estimated precision for given thresholds in
Figure . The
confidence interval increases with the threshold, and with a high
threshold, we are not able to give a reliable estimate.
We have a probability model, and we can take advantage of it using the following integration. For a set of query transformations Q, if we accept only those with a probability of correctness p(q) greater than a threshold \theta, the average precision will be:

\text{precision}(\theta) = \frac{\sum_{q \in Q,\ p(q) > \theta} p(q)}{\left|\{q \in Q : p(q) > \theta\}\right|}
Still it is odd that we have an under-confident estimate. This is due to the fact that the true model of the probability is not sigmoid-shaped for high scores. When we replace the flattening part of the sigmoid with a line ramping up faster than the sigmoid, we have better results as measured with RMSE. Nevertheless, since we have few datapoints with high scores, it is safer to assume that the probability for high scores is flattening. We can integrate to get the associated confidence threshold. For instance, if we want to ensure an average precision of 75%, we need to reject any rewriting with a confidence lower than 17%. If we want to ensure an average precision of 85%, we need to refuse any rewriting with a confidence lower than 76%.
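The computation behind these numbers can be sketched as follows: given calibrated probabilities for a pool of rewrites, the expected precision at a confidence threshold is simply the mean probability of the accepted rewrites, and the required threshold for a target precision can be read off a grid. The grid resolution is an arbitrary choice for illustration.

```python
import numpy as np

def expected_precision(probs, theta):
    """Mean calibrated probability over rewrites whose confidence exceeds theta."""
    probs = np.asarray(probs, dtype=float)
    accepted = probs[probs > theta]
    return accepted.mean() if accepted.size else float("nan")

def threshold_for_precision(probs, target, grid=101):
    """Smallest threshold on a grid whose expected precision reaches `target`."""
    for theta in np.linspace(0.0, 1.0, grid):
        if expected_precision(probs, theta) >= target:
            return theta
    return None
```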
In Section we looked
at precision-recall curves. We now take into account both the
incomplete coverage, the continuous score of the linear model and
the threshold model transformation. This allows us to tune for
more or less coverage and predict the expected average precision.
As we can see in Figure
, we
can have, for example:
- precision 90%, coverage 7%
- precision 80%, coverage 42%
- precision 75%, coverage 49%
One of the causes of the broader suggestions is our approach to identifying possible phrase substitutions. Very general contexts can lead to broad types of substitutions: the rewrite (britney spears)(mp3s) → (christina aguilera)(lyrics) gives us the phrase pair britney spears → christina aguilera. We are investigating methods for detecting very broad changes, such as considering the entropy of the rewrites found in a given context.
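One simple way to operationalize this idea is sketched below: measure the Shannon entropy of the rewrites observed in a given context, and flag high-entropy contexts as likely to produce overly broad substitutions. The entropy threshold is an illustrative assumption.

```python
import math
from collections import Counter

def rewrite_entropy(rewrites):
    """Shannon entropy (in bits) of the empirical distribution of rewrites."""
    counts = Counter(rewrites)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_broad_context(rewrites, max_entropy_bits=4.0):
    """Flag a context whose rewrites are too scattered to trust for substitution."""
    return rewrite_entropy(rewrites) > max_entropy_bits
```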
While we showed in Section that
coverage varies across deciles, the precision of suggestions is
generally constant across deciles. However, there are still
differences between the suggestions we generate for frequent and
infrequent queries. We tend to generate more spelling variants
for the infrequent queries: we see 0% spelling change for initial
queries in decile 1, and 14% for decile 10. The reason may simply
be that infrequent queries are much more likely to contain
misspellings, and spelling corrections are the easiest
suggestions to generate.
Even with an excellent source of related terms such as the substitutables, we occasionally generate poor suggestions. We have observed that these errors are frequently produced because of polysemy: some terms are good substitutes for each other in one context, but not in another. For instance, while ``software'' is related to ``system'' in the context of computer science, it is not in the context of the term ``aerobic'', and we generated a poor suggestion by making exactly this substitution. Both the original and the suggested query are popular queries, certainly likely issued often in the same sessions, but it is unlikely that a user's specific search need when searching for one is satisfied by serving results for the other.
Overall our query substitution technique is extremely efficient, especially in comparison to retrieval-based techniques such as pseudo-relevance feedback, and matrix multiplication methods such as LSI. For whole-query suggestions, we are able to precompute the query substitutions and their scores offline, and so at run-time we require just a look-up. For phrase substitutions, we precompute edit distance between phrases offline, so when we look-up substitutions for each phrase at run-time, we require linear normalization of the edit-distance score, as well as computing the linear score with multiplications and additions.
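The run-time flow described above can be sketched as a lookup with a phrase-level fallback. The table formats, the single-segment fallback, and the scorer interface are illustrative assumptions rather than the deployed system.

```python
def serve_suggestion(query, whole_query_table, phrase_table, segmenter, scorer):
    """Return the best suggestion for `query`, or None if nothing is covered.

    whole_query_table: query -> list of (candidate, precomputed score)
    phrase_table:      phrase -> list of substitutable phrases
    segmenter:         query -> list of phrase segments
    scorer:            (query, candidate, n_substituted) -> score
    """
    # Fast path: whole-query substitutions precomputed and scored offline.
    if query in whole_query_table:
        return max(whole_query_table[query], key=lambda cs: cs[1])[0]

    # Fallback: substitute one phrase segment at a time, scoring at run time.
    segments = segmenter(query)
    best, best_score = None, float("-inf")
    for i, seg in enumerate(segments):
        for substitute in phrase_table.get(seg, []):
            candidate = " ".join(segments[:i] + [substitute] + segments[i + 1:])
            score = scorer(query, candidate, 1)   # one segment substituted
            if score > best_score:
                best, best_score = candidate, score
    return best
```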
To improve our algorithm, we can also take inspiration from machine translation techniques. Query rewriting can be viewed as a machine translation problem, where the source language is the language of user search queries, and the target language is the language of the application (for instance advertiser language in the case of sponsored search).
In order to generalize our work to any application, we also need to work on introducing a language model, so that in the absence of filtering with the list of sponsored queries, we avoid producing nonsensical queries. In addition, with the algorithm in operation we could learn a new ranking function using click information for labels.