RACE: Finding and Ranking Compact Connected Trees for Keyword Proximity Search over XML Documents

Guoliang Li, Jianhua Feng, Jianyong Wang, Bei Yu, Yukai He

Department of Computer Science and Technology, Tsinghua University, Beijing, China; School of Computing, National University of Singapore, Singapore

Abstract:

In this paper, we study the problem of keyword proximity search over XML documents, aiming at both efficiency and effectiveness. We take the disjunctive semantics among input keywords into consideration and identify meaningful compact connected trees as the answers to keyword proximity queries. We introduce the notions of Compact Lowest Common Ancestor (CLCA) and Maximal CLCA (MCLCA) and propose Compact Connected Trees (CCTrees) and Maximal CCTrees (MCCTrees) to efficiently and effectively answer keyword queries. We propose a novel ranking mechanism, RACE, to Rank compAct Connected trEes, by taking into consideration both the structural similarity and the textual similarity. Our extensive experimental study shows that our method achieves both high search efficiency and effectiveness, and outperforms existing approaches significantly.

1 Introduction

Keyword search is a proven and widely accepted mechanism for querying textual document systems and the World Wide Web. The research community has recently recognized the benefits of keyword search and has been introducing keyword search capability into XML documents [2,4,5,6,7].

In this paper, we study the problem of keyword proximity search over XML documents by considering the disjunctive semantics (i.e., the OR predicate) among the input keywords, and provide a novel ranking mechanism for effective keyword search, by taking into account both the structural similarity from the DB point of view and the textual similarity from the IR viewpoint. We introduce the notions of Compact LCA (CLCA) and Maximal CLCA (MCLCA) to capture the focuses of keyword queries, and propose Compact Connected Trees (CCTrees) and Maximal CCTrees (MCCTrees) to efficiently and effectively answer keyword proximity queries. Moreover, we devise a novel ranking mechanism, RACE, to Rank compAct Connected trEes. RACE not only considers the textual similarity like document relevancy in IR literature, but also incorporates the structural similarity into the ranking function from the DB point of view.

2 Compact Connected Trees

Traditional methods usually compute the LCAs of content nodes to answer keyword queries. However, computing all the LCAs is inefficient: given a keyword query $\{k_1,k_2,\cdots,k_m\}$, there are $\prod_{i=1}^{m}\vert\mathcal{I}_i\vert$ combinations of LCA candidates, where $\mathcal{I}_i$ denotes the set of content nodes that directly contain keyword $k_i$. To address this problem, we introduce the concepts of Compact LCA (CLCA) and Compact Connected Trees (CCTrees).
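To make the blow-up concrete, the following minimal sketch counts the LCA candidate combinations for a query. The list sizes are hypothetical numbers chosen for illustration, and `lca_candidate_count` is a helper name introduced here, not something defined in the paper.

```python
from math import prod

def lca_candidate_count(posting_sizes):
    """Number of ways to pick one content node per keyword, i.e. prod |I_i|.

    Each such combination yields one LCA candidate.
    """
    return prod(posting_sizes)

# Hypothetical sizes |I_1|, |I_2|, |I_3| for a three-keyword query.
sizes = [1000, 2000, 500]
print(lca_candidate_count(sizes))  # product of the sizes: 1000000000
print(sum(sizes))                  # total number of content nodes: 3500
```

Even for modest keyword lists, the number of combinations is orders of magnitude larger than the number of content nodes themselves, which is why enumerating every LCA candidate is impractical.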

Definition .1 (CLCA and CCTree) Given $q$ content nodes $v_1 \in \mathcal{I}_1$, $v_2 \in \mathcal{I}_2$, $\cdots$, $v_q \in \mathcal{I}_q$, let $w$ = LCA($v_1$, $v_2$, $\cdots$, $v_q$). $w$ is said to dominate $v_i$ w.r.t. $\{k_1, k_2, \cdots, k_q\}$ if $w \succeq$ LCA($v_1'$, $\cdots$, $v_{i-1}'$, $v_i$, $v_{i+1}'$, $\cdots$, $v_q'$), $\forall$ $v_1' \in \mathcal{I}_1$, $v_2' \in \mathcal{I}_2$, $\cdots$, $v_{i-1}' \in \mathcal{I}_{i-1}$, $v_{i+1}' \in \mathcal{I}_{i+1}$, $\cdots$, $v_q' \in \mathcal{I}_q$. $w$ is a CLCA w.r.t. $\{k_1, k_2, \cdots, k_q\}$ if $w$ dominates each $v_i$ for $1 \leq i \leq q$. The tree rooted at a CLCA and containing the paths from the root to the nodes dominated by the root is called a CCTree.
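The definition can be illustrated with a brute-force sketch. It rests on several assumptions that are not stated in the text: nodes are identified by Dewey labels stored as tuples, so that the LCA is the longest common prefix; the relation $\succeq$ is read as "is equal to or a descendant of"; and all function names are hypothetical. The sketch enumerates every combination, which is exactly the cost the CLCA notion is meant to avoid, so it only serves to check the definition on tiny inputs.

```python
from itertools import product

def lca(labels):
    """LCA of Dewey labels (tuples): their longest common prefix."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)

def is_ancestor_or_self(a, b):
    """True if label a is a prefix of label b."""
    return b[:len(a)] == a

def dominates(w, i, v_i, postings):
    """Assumed reading of dominance: w is equal to or below every LCA
    obtained by fixing v_i and varying the nodes of all other keywords."""
    pools = [[v_i] if j == i else p for j, p in enumerate(postings)]
    return all(is_ancestor_or_self(lca(c), w) for c in product(*pools))

def is_clca(nodes, postings):
    """nodes[i] is taken from postings[i]; test whether LCA(nodes) is a CLCA."""
    w = lca(nodes)
    return all(dominates(w, i, v, postings) for i, v in enumerate(nodes))

# Toy document: root (0,) with two subtrees (0,0) and (0,1); each subtree
# has one node containing keyword 1 and one node containing keyword 2.
postings = [
    [(0, 0, 0), (0, 1, 0)],  # nodes containing keyword 1
    [(0, 0, 1), (0, 1, 1)],  # nodes containing keyword 2
]
print(is_clca([(0, 0, 0), (0, 0, 1)], postings))  # nodes from the same subtree
print(is_clca([(0, 0, 0), (0, 1, 1)], postings))  # nodes from different subtrees
```

In this toy document, the two nodes under `(0, 0)` share a CLCA, whereas the pair drawn from different subtrees does not, because each of those nodes has a deeper LCA with a closer partner.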

**Figure:** Maximal Compact Connected Trees

A CLCA is the LCA of some relevant nodes, and irrelevant nodes cannot share a CLCA. For example, in Figure

is the CLCA of

and

w.r.t. {

}, however,

is not the CLCA of

and $n_{17}$ , although

is their LCA. Because

dominates

, and $n_{15}$ dominates $n_{17}$ , but there is no node which dominates both

and $n_{17}$ . We observe that

and $n_{15}$ are more relevant to {

} than

. The subtree rooted at

is a CCTree. CLCA is orthogonal to SLCA [7] and avoids false negatives introduced by SLCA. For example, in Figure

and

are both CLCAs w.r.t. {

}, and they dominate { $n_{20}$ , $n_{21}$ , $n_{23}$ , $n_{24}$ } and {

, $n_{11}$ , $n_{12}$ }, respectively.

is a false negative for SLCA as

has an LCA descendant

. CLCA can avoid those false negatives and is thus a more meaningful basis for answering keyword queries. The following lemma gives the least upper bound of the number of CLCAs, which is much smaller than the number of LCAs.

Lemma .1
There are at most 2 $\sum_{i=1}^{m}\vert\mathcal{I}_i\vert$ - CLCAs w.r.t. a query $\mathcal{K}$ = ($k_1$, $k_2$, $\cdots$, $k_m$) and an XML document $\mathcal{D}$ in terms of the disjunctive semantics (i.e., the OR predicate).

Definition .2 (MCLCA and MCCTree) Given a keyword query $\mathcal{K} = \{k_1, k_2, \cdots, k_m\}$ and $\mathcal{K}_i = \{k_{i_1}, k_{i_2}, \cdots, k_{i_q}\} \subseteq \mathcal{K}$, suppose $w$ = CLCA($v_{i_1}$, $v_{i_2}$, $\cdots$, $v_{i_q}$), where $v_{i_1} \in \mathcal{I}_{i_1}$, $\cdots$, $v_{i_q} \in \mathcal{I}_{i_q}$. $w$ is a Maximal CLCA (MCLCA) if, $\forall$ $k' \in (\mathcal{K} - \mathcal{K}_i)$ and $\forall$ $v' \in \mathcal{I}_{k'}$, $\not\exists$ $w'$ that dominates both $v'$ and every $v_{i_j}$ for $1 \leq j \leq q$. The CCTree rooted at an MCLCA is called an MCCTree.
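One possible reading of the maximality condition is sketched below: a CLCA over a subset of the query keywords is treated as maximal if no content node of any remaining keyword can be added such that the enlarged node set still has a CLCA. This interpretation, the Dewey-label (tuple) encoding, the reading of $\succeq$ as "equal to or a descendant of", and all names such as `is_mclca` are assumptions made for illustration; the check is brute force and not an efficient algorithm.

```python
from itertools import product

def lca(labels):
    """LCA of Dewey labels (tuples): their longest common prefix."""
    prefix = []
    for parts in zip(*labels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return tuple(prefix)

def is_clca(nodes, postings):
    """Brute-force CLCA test: for each node kept fixed, every LCA obtained by
    varying the other nodes must be an ancestor-or-self of LCA(nodes)."""
    w = lca(nodes)
    for i, v in enumerate(nodes):
        pools = [[v] if j == i else p for j, p in enumerate(postings)]
        for combo in product(*pools):
            u = lca(combo)
            if w[:len(u)] != u:
                return False
    return True

def is_mclca(chosen, postings):
    """chosen: keyword -> selected node; postings: keyword -> all nodes
    containing that keyword, for every keyword of the query."""
    keys = list(chosen)
    nodes = [chosen[k] for k in keys]
    if not is_clca(nodes, [postings[k] for k in keys]):
        return False
    for extra in postings:
        if extra in chosen:
            continue
        lists = [postings[k] for k in keys] + [postings[extra]]
        # Not maximal if some node of a remaining keyword can join the set.
        if any(is_clca(nodes + [v], lists) for v in postings[extra]):
            return False
    return True

# Toy document: subtree (0,0) contains all three keywords,
# subtree (0,1) contains only "xml" and "li".
postings = {
    "xml": [(0, 0, 0), (0, 1, 0)],
    "li": [(0, 0, 1), (0, 1, 1)],
    "feng": [(0, 0, 2)],
}
print(is_mclca({"xml": (0, 0, 0), "li": (0, 0, 1)}, postings))
print(is_mclca({"xml": (0, 1, 0), "li": (0, 1, 1)}, postings))
```

Under this reading, the two-keyword set inside subtree `(0, 0)` is not maximal because the `"feng"` node can still be added, while the two-keyword set inside subtree `(0, 1)` is maximal even though it covers only part of the query, which reflects the disjunctive semantics.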

To effectively answer keyword queries, we propose the concepts of Maximal CLCA and Maximal CCTree. An MCLCA is a CLCA that has no ancestor which still dominates other content nodes in addition to the content nodes dominated by the MCLCA. Therefore, an MCLCA dominates a maximal set of content nodes and is more meaningful than a CLCA. An MCCTree is the CCTree rooted at an MCLCA and contains more keywords than CCTrees. For example, in the figure captioned "Maximal Compact Connected Trees", the four circled trees are MCCTrees.

3 RACE

TF$\cdot$IDF-based methods for ranking relevant documents have proved effective for keyword proximity search in text documents. However, traditional ranking techniques from the IR literature may not be effective for ranking MCCTrees because, besides the term frequency ($tf$) and inverse document frequency ($idf$), MCCTrees also contain rich structural information. We therefore take into account both the structural similarity and traditional IR metrics to rank MCCTrees.

Three parameters affect the score assigned to an MCCTree $\mathcal{T}$: the number of content nodes in $\mathcal{T}$, the number of distinct input keywords contained in $\mathcal{T}$, and the number of all nodes in $\mathcal{T}$. We employ these three parameters to rank MCCTrees. Intuitively, the more content nodes and the more distinct input keywords an MCCTree contains, the higher its score should be and the more likely it is relevant to $\mathcal{K}$. Conversely, the score should be inversely related to the total number of nodes. In addition, the succinctness of the MCCTree should be reflected in the structural similarity function: the more succinct the MCCTree, the higher its structural similarity score should be. Based on these observations, we can compute the structural similarity.

Accordingly, we combine the textual similarity and structural similarity to effectively rank the MCCTrees.
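The text does not give the scoring formulas, so the sketch below is purely illustrative and is not the RACE ranking function. It only mirrors the stated intuitions: the structural score grows with the number of content nodes and distinct keywords and shrinks with the total number of nodes, and it is combined with a generic tf·idf style textual score. The class `MCCTree`, the functions, the particular formulas, and the multiplicative combination are all assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class MCCTree:
    content_nodes: int      # number of content nodes in the tree
    distinct_keywords: int  # number of distinct query keywords it contains
    total_nodes: int        # number of all nodes in the tree
    term_freq: dict         # keyword -> occurrences inside the tree

def structural_score(tree, query_size):
    """Hypothetical structural similarity: keyword coverage times compactness."""
    coverage = tree.distinct_keywords / query_size
    compactness = tree.content_nodes / tree.total_nodes
    return coverage * compactness

def textual_score(tree, doc_freq, num_trees):
    """Generic tf-idf style sum over the keywords present in the tree."""
    return sum(tf * math.log(1 + num_trees / doc_freq[k])
               for k, tf in tree.term_freq.items())

def rank(trees, query, doc_freq):
    """Sort trees by the (assumed) product of structural and textual scores."""
    n = len(trees)
    return sorted(
        trees,
        key=lambda t: structural_score(t, len(query)) * textual_score(t, doc_freq, n),
        reverse=True,
    )

# Two hypothetical trees with identical text statistics but different sizes.
compact = MCCTree(2, 2, 3, {"xml": 1, "li": 1})
sprawling = MCCTree(2, 2, 9, {"xml": 1, "li": 1})
ranked = rank([sprawling, compact], ["xml", "li"], {"xml": 2, "li": 2})
print(ranked[0] is compact)
```

With equal textual scores, the more compact tree is ranked first, which matches the observation that succinct trees should receive higher structural similarity.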

**Figure:** Efficiency of various algorithms

4 Experimental Study

We have conducted a set of experiments to evaluate the performance of our approach. We used the real DBLP dataset; the raw file was about 420MB. The experiments were conducted on an Intel(R) Pentium(R) 2.4GHz computer with 1GB of RAM, and the algorithms were implemented in Java. We compared RACE with the state-of-the-art methods XSEarch [1], XRank [2], GDMCT [3], and MSLCA [7]. We selected six groups of queries. Each group has ten queries, and the queries in the same group have the same number of keywords; for example, each query in one of the groups has 3 keywords. The figure captioned "Efficiency of various algorithms" illustrates the experimental results on search efficiency, and the figure captioned "Top-$k$ answer relevancy" gives the experimental results on search quality.

5 Conclusion

In this paper, we have investigated the problem of keyword proximity search over XML documents. We proposed the notions of CLCA and MCLCA to capture the focuses of keyword queries and adopted CCTrees and MCCTrees to effectively and efficiently answer keyword proximity queries. We demonstrated a novel ranking mechanism, RACE, to Rank the compAct Connected trEes, by taking into account both structural similarity from the DB viewpoint and textual similarity from the IR point of view. The experimental results show that our approach achieves high search efficiency and quality, and outperforms existing methods significantly.

**Figure:** Top-$k$ answer relevancy

Acknowledgement

This work is partly supported by the National Natural Science Foundation of China under Grant No. 60573094, the National High Technology Development 863 Program of China under Grant No. 2007AA01Z152 and 2006AA01A101, and the National Grand Fundamental Research 973 Program of China under Grant No. 2006CB303103.