Using Proportional Transportation Similarity with Learned Element Semantics for XML Document Clustering

Xiaojun Wan

Institute of Computer Science and Technology, Peking University
Beijing 100871, China

Jianwu Yang

Institute of Computer Science and Technology, Peking University
Beijing 100871, China

ABSTRACT

This paper proposes a novel approach to measuring XML document similarity by taking into account the semantics between XML elements. The motivation of the proposed approach is to overcome the "under-contribution" and "over-contribution" problems in previous work. The element semantics are learned in an unsupervised way, and the Proportional Transportation Similarity is proposed to evaluate XML document similarity by modeling the similarity calculation as a transportation problem. Clustering experiments are performed on three ACM SIGMOD data sets, and the results show the favorable performance of the proposed approach.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – clustering

General Terms

Theory, Experimentation

Keywords

XML document clustering, Proportional Transportation Similarity

1. INTRODUCTION

XML document clustering aims to group XML documents into clusters according to document similarity. Much previous work has explored clustering methods for grouping structurally similar XML documents so as to find common structures or DTDs for various purposes. In this study, however, we focus on grouping topically similar XML documents, i.e., the text contents of the XML documents within a cluster express a similar topic or subject.

Under the traditional Vector Space Model (VSM), XML documents are processed as ordinary unstructured documents by removing all element tags, so the structural information is totally lost. Another intuitive method, called C-VSM, takes the structural information into account by computing the similarity of the texts within each element respectively and then combining these similarities linearly. Much work [1, 4] explores this problem by extending VSM to incorporate the element tag information. In the SLVM proposed in [4], each document $doc_x$ is represented as an $n \times m$ matrix $d_x = \langle d_x(1), d_x(2), \ldots, d_x(n)\rangle^T$, where $n$ is the number of terms and $m$ is the number of XML elements. Each row $d_x(i) = \langle d_x(i,1), d_x(i,2), \ldots, d_x(i,m)\rangle$ is the feature vector of the term $t_i$ over all the elements, and each entry $d_x(i,j)$ is a feature related to the term $t_i$ and specific to the element $e_j$, given as $d_x(i,j) = TF(t_i, doc_x.e_j) \cdot IDF(t_i)$, where $TF(t_i, doc_x.e_j)$ is the frequency of the term $t_i$ in the element $e_j$ of the document $doc_x$; each $d_x(i,j)$ is then normalized by the length of the document feature vector. The similarity between two documents $doc_x$ and $doc_y$ is then defined via an element semantic matrix $M_e$, given as

$\mathrm{sim}(doc_x, doc_y) = \sum_{i=1}^{n} d_x(i)\, M_e\, d_y(i)^T$    (1)

where $M_e$ is an $m \times m$ element semantic matrix which captures both the similarity between a pair of XML elements and the contribution of the pair to the overall document similarity.
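
For concreteness, Eq. (1) is simply a bilinear form in the element space summed over all terms; the following minimal numpy sketch illustrates it (the function name and the einsum formulation are ours, for illustration only):

import numpy as np

def slvm_similarity(d_x, d_y, M_e):
    """SLVM document similarity of Eq. (1).

    d_x, d_y: (n, m) arrays whose i-th rows are the term feature
              vectors d_x(i) and d_y(i) over the m elements.
    M_e:      (m, m) element semantic matrix.
    """
    # sum_i d_x(i) . M_e . d_y(i)^T, written as a single contraction
    return float(np.einsum('iu,uv,iv->', d_x, M_e, d_y))

# Toy usage: with M_e = I the measure degenerates to C-VSM-style
# matching, where only identical elements are compared.
d_x = np.array([[0.5, 0.0], [0.2, 0.3]])   # 2 terms, 2 elements
d_y = np.array([[0.0, 0.4], [0.1, 0.1]])
print(slvm_similarity(d_x, d_y, np.eye(2)))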

A popular way to determine the value of $M_e$ is based on the tree edit distance [5]. In order to acquire a more appropriate element semantic matrix, with the notion that element similarity should affect document similarity and vice versa, we propose an iterative algorithm for learning the element semantic matrix $M_e$, given as

$S_d = \sum_{i=1}^{n} D(i)^T\, M_e\, D(i)$    (2)

$M_e = \sum_{i=1}^{n} D(i)\, S_d\, D(i)^T$    (3)

$\{D(i)\}_{i=1}^{n}$ represents the set of XML documents, where $D(i)$ is an $m \times p$ matrix whose $k$-th column is $d_k(i)$ of the $k$-th document and $p$ is the number of documents; $S_d$ is the resulting $p \times p$ document similarity matrix. All the entries of $S_d$ are normalized between zero and one: two totally different documents have a similarity value of zero and two identical documents have a similarity value of one. To rule out the trivial solution in which both matrices are all zeros, an additional constraint forces the diagonal elements of $S_d$ (i.e., the similarities of identical documents) to take the value of one.
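
The iterative scheme of Eqs. (2) and (3) can be sketched as follows. This is a minimal numpy sketch under our own assumptions: entries are rescaled into [0, 1] by dividing by the maximum, the iteration starts from the identity matrix, and a simple convergence test is used; the paper's exact normalization and stopping rule may differ.

import numpy as np

def learn_element_semantics(D, iters=50, tol=1e-6):
    """Iterate Eqs. (2) and (3) to learn the element semantic matrix.

    D: list of n arrays of shape (m, p); column k of D[i] is d_k(i),
       the element feature vector of term t_i in the k-th document.
    Returns (M_e, S_d), both with entries rescaled into [0, 1].
    """
    m, p = D[0].shape
    M_e = np.eye(m)                          # start from identity semantics
    for _ in range(iters):
        # Eq. (2): document similarity induced by current element semantics
        S_d = sum(Di.T @ M_e @ Di for Di in D)
        S_d /= max(S_d.max(), 1e-12)         # rescale entries into [0, 1]
        np.fill_diagonal(S_d, 1.0)           # identical documents -> 1
        # Eq. (3): element semantics induced by current document similarity
        M_e_new = sum(Di @ S_d @ Di.T for Di in D)
        M_e_new /= max(M_e_new.max(), 1e-12)
        if np.max(np.abs(M_e_new - M_e)) < tol:
            return M_e_new, S_d
        M_e = M_e_new
    return M_e, S_d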

VSM does not consider structural information at all, and C-VSM only allows the text within an element of one document to correspond to the text within the same element of the other document. Both cases ignore the semantic relationship between different elements and thus suffer from the "under-contribution" problem. SLVM overcomes this "under-contribution" problem by allowing the text within an element of one document to correspond to the text within any element of the other document. However, the corresponding relation between elements is loose: the text within an element of one document can always use its total weight to correspond to the text within any element of the other document, so the same weight may contribute multiple times, which leads to the so-called "over-contribution" problem.

In order to address the above "under-contribution" and "over-contribution" problems, inspired by the Proportional Transportation Distance [2], we propose the Proportional Transportation Similarity (PTS) to measure document similarity. PTS allows the text within an element of one document to correspond to the text within any element of the other document, but under a few strict constraints, by modeling the calculation as a transportation problem from linear programming. The document similarity over one single term is computed as follows:

Given two feature vectors $d_x(i) = \langle d_x(i,1), d_x(i,2), \ldots, d_x(i,m)\rangle$ and $d_y(i) = \langle d_y(i,1), d_y(i,2), \ldots, d_y(i,m)\rangle$, related to the particular term $t_i$ over all the elements in documents $doc_x$ and $doc_y$ respectively, where $m$ is the number of elements, a weighted graph $G$ is constructed as follows:

Let $d_x(i) = \langle d_x(i,1), d_x(i,2), \ldots, d_x(i,m)\rangle$ be the weighted point set for the term $t_i$ in document $doc_x$, where $d_x(i,j)$ is a feature related to the term $t_i$ and specific to the element $e_j$, given as the $TF(t_i, doc_x.e_j) \cdot IDF(t_i)$ value.

Let $d_y(i) = \langle d_y(i,1), d_y(i,2), \ldots, d_y(i,m)\rangle$ be the weighted point set for the term $t_i$ in document $doc_y$, where $d_y(i,j)$ is a feature related to the term $t_i$ and specific to the element $e_j$, given as the $TF(t_i, doc_y.e_j) \cdot IDF(t_i)$ value.

Let $G = \{d_x(i), d_y(i), M_e\}$ be the weighted graph constructed from $d_x(i)$, $d_y(i)$, and $M_e$. $V = d_x(i) \cup d_y(i)$ is the vertex set, while $M_e$ is the edge matrix, either learned or obtained from the edit distance. $W_x = \sum_{j=1}^{m} d_x(i,j)$ and $W_y = \sum_{j=1}^{m} d_y(i,j)$ are the total weights of $d_x(i)$ and $d_y(i)$, i.e., the sums of all the weights of the points within $d_x(i)$ and $d_y(i)$, respectively.

Based on the weighted graph $G$, the possible flows $F = [f_{uv}]$, with $f_{uv}$ the flow from $d_x(i,u)$ to $d_y(i,v)$, are defined by the following constraints:

$f_{uv} \ge 0, \quad 1 \le u \le m, \; 1 \le v \le m$    (4)

$\sum_{v=1}^{m} f_{uv} = d_x(i,u), \quad 1 \le u \le m$    (5)

$\sum_{u=1}^{m} f_{uv} = \frac{W_x}{W_y}\, d_y(i,v), \quad 1 \le v \le m$    (6)

$\sum_{u=1}^{m} \sum_{v=1}^{m} f_{uv} = W_x$    (7)

Constraint (4) allows moving weight from $d_x(i)$ to $d_y(i)$ and not vice versa. Constraints (5) and (7) force all of $d_x(i)$'s weight to move to the positions of the points in $d_y(i)$. Constraint (6) ensures that this is done in a way that preserves the old percentages of the weights in $d_y(i)$.

PTS is then given by

$PTS(d_x(i), d_y(i)) = \frac{\sum_{u=1}^{m} \sum_{v=1}^{m} f^*_{uv}\, M_e(u,v)}{\sum_{u=1}^{m} \sum_{v=1}^{m} f^*_{uv}}$    (8)

where $F^* = [f^*_{uv}]$ is the feasible flow that maximizes the total transported similarity $\sum_{u,v} f_{uv}\, M_e(u,v)$.

The overall document similarity based on PTS is obtained as

$\mathrm{sim}(doc_x, doc_y) = \sum_{i=1}^{n} PTS(d_x(i), d_y(i))$    (9)

The above transportation problem can be solved efficiently by interior-point algorithms [3], which have polynomial time complexity.
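
To illustrate how the per-term optimization can be carried out in practice, the sketch below encodes constraints (4) through (7) for a generic LP solver, here scipy.optimize.linprog. This is our own illustrative formulation rather than the paper's implementation; note that constraint (7) is already implied by (5), so only (5) and (6) are passed as explicit equality constraints.

import numpy as np
from scipy.optimize import linprog

def pts(dx_i, dy_i, M_e):
    """Proportional Transportation Similarity for one term, Eq. (8).

    dx_i, dy_i: length-m weight vectors over the elements for term t_i.
    M_e:        (m, m) element semantic matrix.
    """
    m = len(dx_i)
    W_x, W_y = dx_i.sum(), dy_i.sum()
    if W_x == 0 or W_y == 0:           # term absent from a document
        return 0.0
    # Flows f_uv are flattened row-major: f[u * m + v].
    A_eq = np.zeros((2 * m, m * m))
    b_eq = np.zeros(2 * m)
    for u in range(m):                 # constraint (5): move all of d_x(i)
        A_eq[u, u * m:(u + 1) * m] = 1.0
        b_eq[u] = dx_i[u]
    for v in range(m):                 # constraint (6): keep d_y(i)'s shape
        A_eq[m + v, v::m] = 1.0
        b_eq[m + v] = W_x / W_y * dy_i[v]
    # linprog minimizes, so negate M_e to maximize transported similarity;
    # constraint (4) is the nonnegativity bound on every flow variable.
    res = linprog(-M_e.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method='highs')
    return -res.fun / W_x              # Eq. (8): total flow equals W_x by (7)

def pts_doc_similarity(d_x, d_y, M_e):
    """Overall document similarity of Eq. (9): sum PTS over all n terms."""
    return sum(pts(d_x[i], d_y[i], M_e) for i in range(d_x.shape[0]))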

2. EXPERIMENTAL RESULTS

In the experiments, we use the agglomerative hierarchical clustering (AHC) algorithm as the clustering engine (a sketch of this pipeline is given after Table 1). Three benchmark data sets of different sizes extracted from ACM SIGMOD Record 1999¹, which is composed of hundreds of documents from past issues of SIGMOD Record, are used for evaluation. The class each record belongs to is given by the "category" tag². The weighted F-measure is used as the evaluation metric. Table 1 gives the results. For SLVM and PTS, either the edit distance approach or the learning method can be employed to acquire the element semantics; the corresponding results are given in separate columns.

Table 1. F-measure (%) comparison results

Data Set   VSM    C-VSM   SLVM (Edit Distance)   SLVM (Learned Semantics)   PTS (Edit Distance)   PTS (Learned Semantics)
ACM-8      40.0   36.9    46.0                   53.0                       43.3                  58.9
ACM-12     21.2   22.4    24.2                   43.0                       22.9                  46.6
ACM-16     42.4   47.7    38.8                   58.8                       35.1                  69.9
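
For completeness, the clustering step could be realized as follows, assuming a precomputed pairwise document similarity matrix (e.g., filled with the PTS values of Eq. (9) after rescaling into [0, 1]). The conversion to distances as 1 - sim and the average-linkage criterion are our assumptions; the paper does not specify these details.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def ahc_cluster(S, n_clusters):
    """Cut an AHC dendrogram built from a document similarity matrix.

    S: (p, p) symmetric similarity matrix with ones on the diagonal.
    Returns cluster labels in 1..n_clusters for the p documents.
    """
    D = 1.0 - S                             # similarity -> dissimilarity
    np.fill_diagonal(D, 0.0)
    condensed = squareform(D, checks=False) # condensed form for linkage
    Z = linkage(condensed, method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')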

As seen from Table 1, the proposed PTS with learned element semantics performs best on all three data sets. The approaches with learned element semantics, i.e., SLVM and PTS, both significantly outperform the traditional approaches that do not consider element semantics, i.e., VSM and C-VSM, which shows the importance of incorporating element semantics for XML document clustering. For both SLVM and PTS, the performance based on the learning method for acquiring the element semantics is significantly better than that based on the edit distance approach, which demonstrates that the learned element semantics better reflect the true underlying semantics between XML elements.

The experimental results demonstrate that, with appropriate element semantics, PTS can improve the clustering performance by circumventing both the "over-contribution" problem and the "under-contribution" problem.

REFERENCES

[1] A. Doucet and H. Ahonen-Myka. Naive clustering of a large XML document collection. In Proceedings of the 1st INEX Workshop, Germany, 2002.

[2] P. Giannopoulos and R. C. Veltkamp. A pseudo-metric for weighted point sets. In Proceedings of the 7th European Conference on Computer Vision (ECCV), pages 715-730, 2002.

[3] N. Karmarkar. A new polynomial-time algorithm for linear programming. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 302-311, 1984.

[4] J. W. Yang and X. O. Chen. A semi-structured document model for text mining. Journal of Computer Science and Technology, 17(5):603-610, 2002.

[5] K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput., 18(6):1245-1262, 1989.



¹ http://www.acm.org/sigs/sigmod/record/xml/XMLSigmodRecordMar1999.zip

² All the "category" tags as well as the inner texts and the corresponding descriptions in the records are removed to make the answer class blind to the clustering algorithm.