<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Efficient Mining of Frequent Sequence Generators</title>
<link rel="stylesheet" href="iw3c2.css" />
<style type="text/css"> .underlineStyle { text-decoration: underline; } </style>
</head>
<body>
<div class="meta">
<h1 class="title"> Efficient Mining of Frequent Sequence Generators </h1>
<div class="authors">
<div class="author"> <h2 class="author"> Chuancong Gao </h2> <h3 class="affiliation"> Tsinghua University<br /> Beijing, 100084, P.R.China </h3> <h2 class="email"> <a href="mailto:gaocc07@mails.tsinghua.edu.cn">gaocc07@mails.tsinghua.edu.cn</a> </h2> </div>
<div class="author"> <h2 class="author"> Jianyong Wang</h2> <h3 class="affiliation"> Tsinghua University<br /> Beijing, 100084, P.R.China</h3> <h2 class="email"> <a href="mailto:jianyong@tsinghua.edu.cn">jianyong@tsinghua.edu.cn</a></h2> </div>
<div class="author"> <h2 class="author"> Yukai He</h2> <h3 class="affiliation"> Tsinghua University<br /> Beijing, 100084, P.R.China</h3> <h2 class="email"> <a href="mailto:heyk05@mails.tsinghua.edu.cn">heyk05@mails.tsinghua.edu.cn</a></h2> </div>
<div class="author"> <h2 class="author"> Lizhu Zhou</h2> <h3 class="affiliation"> Tsinghua University<br /> Beijing, 100084, P.R.China</h3> <h2 class="email"> <a href="mailto:dcszlz@tsinghua.edu.cn">dcszlz@tsinghua.edu.cn</a></h2> </div>
</div>
<div class="copyright"> <p class="copyright"> Copyright is held by the World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.<br /> WWW 2008, April 21-25, 2008, Beijing, China.<br /> ACM 978-1-60558-085-2/08/04. </p> </div>
<div class="abstract"> <h1 class="abstract"> ABSTRACT</h1> <p class="abstract"> Sequential pattern mining has raised great interest in the data mining research field in recent years. However, to the best of our knowledge, no existing work studies the problem of frequent sequence generator mining. In this paper we present a novel algorithm, FEAT (abbr. <span class="underlineStyle">F</span>requent s<span class="underlineStyle">E</span>quence gener<span class="underlineStyle">AT</span>or miner), to perform this task. Experimental results show that FEAT is more efficient than traditional sequential pattern mining algorithms while generating a more concise result set, and is very effective for classifying Web product reviews. </p> </div>
<div class="categories"> <h2 class="categories"> Categories &amp; Subject Descriptors</h2> <p class="categories"> H.2.8 [Database Management]: Database applications - Data Mining</p> </div>
<div class="terms"> <h2 class="terms"> General Terms</h2> <p class="terms"> Algorithms, Performance</p> </div>
<div class="keywords"> <h2 class="keywords"> Keywords</h2> <p class="keywords"> Sequence Generators, Sequence, Web Mining</p> </div>
</div>
<h2> <a id="tth_sEc1">1</a> Introduction</h2>
<div class="par"> Sequential pattern mining has raised great interest in the data mining research field in recent years. Various mining methods have been proposed, including sequential pattern mining [<a href="#oldest">1</a>][<a href="#PrefixSpan">5</a>] and closed sequential pattern mining [<a href="#clospan">7</a>][<a href="#BIDE">6</a>].
Sequential pattern mining has also shown its utility for Web data analysis, such as mining Web log data [<a href="#weblog">2</a>] and identifying comparative sentences from Web forum postings and product reviews [<a href="#comparative">3</a>]. However, there is no existing work on mining frequent sequence generators, where a sequence generator is informally defined as one of the minimal subsequences in an equivalence class. Thus, generators have the same ability to describe an equivalence class as the other subsequences of the same equivalence class, and according to the MDL principle [<a href="#whyMDL">4</a>], generators are preferable to all sequential patterns in terms of Web page and product review classification. </div>
<div class="par"> In the rest of this paper, we first give a formal problem formulation and present our solution in Section <a href="#algorithm">2</a>, then report the performance study in Section <a href="#experiment">3</a>. We conclude the study in Section <a href="#conclusion">4</a>. </div>
<h2> <a id="tth_sEc2">2</a> Mining Sequential Generators</h2>
<h3> <a id="tth_sEc2.1">2.1</a> Problem Formulation</h3>
<div class="par"> An <b>input sequence database</b> SDB contains a set of input sequences, where an <b>input sequence</b> is an ordered list of items (each item can appear multiple times in a sequence) and can be denoted by S=e<sub>1</sub>e<sub>2</sub>... e<sub>n</sub>. Given a <b>prefix</b> of sequence S, S<sub>pre</sub>=e<sub>1</sub>e<sub>2</sub>... e<sub>i</sub>, we define the <b>projected sequence</b> of S<sub>pre</sub> w.r.t. S as e<sub>i+1</sub>e<sub>i+2</sub>... e<sub>n</sub>. The complete set of projected sequences of S<sub>pre</sub> w.r.t. each sequence in SDB is called the <b>projected database</b> of S<sub>pre</sub> w.r.t. SDB, denoted by SDB<sub>S<sub>pre</sub></sub>. Given a subsequence S<sub>p</sub>=e<sub>p<sub>1</sub></sub>e<sub>p<sub>2</sub></sub> ... e<sub>p<sub>m</sub></sub>, its <b>support</b> sup(S<sub>p</sub>) is defined as the number of sequences in SDB each of which contains S<sub>p</sub>, which equals |SDB<sub>S<sub>p</sub></sub>|. Given a user-specified minimum support threshold, min_sup, S<sub>p</sub> is said to be <b>frequent</b> if sup(S<sub>p</sub>) &ge; min_sup holds. S<sub>p</sub> is called a <b>sequence generator</b> iff there exists no S<sub>p</sub>' such that S<sub>p</sub>' &sub; S<sub>p</sub> (i.e., S<sub>p</sub>' is a proper subsequence of S<sub>p</sub>) and sup(S<sub>p</sub>)=sup(S<sub>p</sub>'). In addition, given a sequence S=e<sub>1</sub>e<sub>2</sub>... e<sub>n</sub> and an item e', we denote e<sub>1</sub>e<sub>2</sub>... e<sub>i-1</sub>e<sub>i+1</sub>... e<sub>n</sub> by S<sup>(i)</sup>, e<sub>i</sub>e<sub>i+1</sub>... e<sub>j</sub> by S<sub>(i, j)</sub>, and e<sub>1</sub>e<sub>2</sub>...e<sub>n</sub>e' by &lt; S, e' &gt; . </div>
<div class="par"> Given a minimum support threshold min_sup and an input sequence database SDB, the task of <i>frequent sequence generator mining</i> is to mine the complete set of sequence generators that are frequent in database SDB. </div>
<h3> <a id="tth_sEc2.2">2.2</a> Pruning Strategy</h3>
<div class="par"> A naïve approach to mining the set of frequent sequence generators is to first apply a sequential pattern mining algorithm to find the set of frequent subsequences, and then check whether each frequent subsequence is a generator. However, it is inefficient, as it cannot prune the unpromising parts of the search space. </div>
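<div class="par"> To make this baseline concrete, the following minimal Python sketch illustrates the naïve two-step approach just described. The helper names (is_subsequence, support, mine_frequent_subsequences, naive_generators) are ours for illustration only and are not part of any published implementation; sequences are assumed to be tuples of items and min_sup is assumed to be at least 1. </div>
<pre>
# Naive baseline sketch: enumerate all frequent subsequences by brute force,
# then keep only the generators.  Illustrative only; intentionally inefficient.

def is_subsequence(sub, seq):
    """True if `sub` occurs in `seq` as a (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(item in it for item in sub)

def support(sub, sdb):
    """sup(sub): the number of input sequences in the database containing `sub`."""
    return sum(1 for seq in sdb if is_subsequence(sub, seq))

def mine_frequent_subsequences(sdb, min_sup):
    """Level-wise brute-force enumeration of all frequent subsequences."""
    items = sorted({e for seq in sdb for e in seq})
    frequent, frontier = [], [()]
    while frontier:
        next_frontier = []
        for prefix in frontier:
            for e in items:
                cand = prefix + (e,)
                if support(cand, sdb) >= min_sup:   # assumes min_sup >= 1
                    frequent.append(cand)
                    next_frontier.append(cand)
        frontier = next_frontier
    return frequent

def naive_generators(sdb, min_sup):
    """Keep a frequent subsequence only if no one-item-deleted subsequence has
    the same support; this check suffices because deleting items can never
    decrease support."""
    result = []
    for s in mine_frequent_subsequences(sdb, min_sup):
        sup_s = support(s, sdb)
        if all(support(s[:i] + s[i + 1:], sdb) != sup_s for i in range(len(s))):
            result.append(s)
    return result

# Example: with sdb = [('a','b','c'), ('a','c'), ('a','b','c')] and min_sup = 2,
# naive_generators returns ('b',) but not ('a','b'), since sup(ab) = sup(b) = 2.
</pre>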
<div class="par"> In this subsection we propose two novel pruning methods, <i>Forward Prune</i> and <i>Backward Prune</i>, which can be integrated with the pattern-growth enumeration framework [<a href="#PrefixSpan">5</a>] to speed up the mining process. We first introduce Theorems 1 and 2, which form the basis of the pruning methods; due to limited space we omit their proofs here. </div>
<div class="theorem"> <a id="equivalence1"></a><b>Theorem 1</b> <em>Given two sequences S<sub>p1</sub> and S<sub>p2</sub>, if S<sub>p1</sub> &sub; S<sub>p2</sub> (i.e., S<sub>p1</sub> is a proper subsequence of S<sub>p2</sub>) and SDB<sub>S<sub>p1</sub></sub> = SDB<sub>S<sub>p2</sub></sub>, then any extension to S<sub>p2</sub> cannot be a generator. <a href="#tthFtNtAAB" id="tthFrefAAB"><sup>1</sup></a></em> </div>
<div class="theorem"> <b>Theorem 2</b> <em><a id="equivalence2"></a>Given a subsequence S<sub>p</sub> = e<sub>1</sub>e<sub>2</sub>... e<sub>n</sub> and an item e', if SDB<sub>S<sub>p</sub></sub> = SDB<sub>S<sub>p</sub><sup>(i)</sup></sub> for some index i (1 &le; i &le; n), then we have that SDB<sub>&lt; S<sub>p</sub>, e' &gt;</sub> = SDB <sub>&lt; S<sub>p</sub><sup>(i)</sup>, e' &gt;</sub> .</em> </div>
<div class="lemma"> <b>Lemma 1</b> <em>(Forward Prune). Given a subsequence S<sub>p</sub>, let S<sub>p</sub><sup>*</sup> = &lt; S<sub>p</sub>, e' &gt; . If sup(S<sub>p</sub>) = sup(S<sub>p</sub><sup>*</sup>) and for every locally frequent item u of S<sub>p</sub><sup>*</sup> we have SDB <sub>&lt; S<sub>p</sub>, u &gt;</sub> = SDB <sub>&lt; S<sub>p</sub><sup>*</sup>, u &gt;</sub> , then S<sub>p</sub><sup>*</sup> can be safely pruned.</em> </div>
<div class="proof"> Easily derived from Theorem 1. </div>
<div class="lemma"> <b>Lemma 2</b> <em>(Backward Prune). Given S<sub>p</sub> = e<sub>1</sub> e<sub>2</sub> ... e<sub>n</sub>, if there exists an index i (i = 1, 2, ..., n-1) and a corresponding index j (j = i + 1, i + 2, ..., n) such that SDB<sub>(S<sub>p</sub>)<sub>(1, j)</sub></sub> = SDB<sub>((S<sub>p</sub>)<sub>(1, j)</sub>)<sup>(i)</sup></sub>, then S<sub>p</sub> can be safely pruned.</em> </div>
<div class="proof"> Easily derived from Theorem 2 and Theorem 1.</div>
<h3> <a id="tth_sEc2.3">2.3</a> Generator Checking Scheme</h3>
<div class="par"> The preceding pruning techniques can be used to prune the unpromising parts of the search space, but they cannot ensure that each mined frequent subsequence S=e<sub>1</sub> e<sub>2</sub> ... e<sub>n</sub> is a generator. We devise a generator checking scheme, shown in Theorem <a href="#checking">3</a>, to perform this task; the check can be done efficiently during the pruning process by testing whether there exists an index i (i=1,2,...,n) such that |SDB<sub>S</sub>| = |SDB<sub>S<sup>(i)</sup></sub>|, since sup(S) = |SDB<sub>S</sub>| holds.</div>
<div class="theorem"> <b>Theorem 3</b> <em><a id="checking"></a>A sequence S = e<sub>1</sub> e<sub>2</sub> ... e<sub>n</sub> is a generator if and only if there exists no index i with 1 &le; i &le; n such that sup(S) = sup(S<sup>(i)</sup>).</em> </div>
<div class="proof"> Easily derived from the definition of a generator and the well-known downward closure property of sequence support.</div>
<h3> <a id="tth_sEc2.4">2.4</a> Algorithm</h3>
<div class="par"> By integrating the preceding pruning methods and generator checking scheme with a traditional pattern-growth framework [<a href="#PrefixSpan">5</a>], we can easily derive the FEAT algorithm as shown in Algorithm 1. </div>
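<div class="par"> Both the pruning conditions above and Algorithm 1 operate on projected databases. The following minimal Python sketch (with hypothetical names; it is an illustration under our own assumptions, not the authors' implementation) shows how the projected database of a prefix can be built from leftmost occurrences, how the equality tests of Theorems 1-2 and Lemmas 1-2 can be expressed, and how the generator check of Theorem <a href="#checking">3</a> reduces to comparing |SDB<sub>S</sub>| with |SDB<sub>S<sup>(i)</sup></sub>|. </div>
<pre>
# Sketch of the primitives behind the pruning lemmas and Theorem 3.
# Sequences are represented as tuples of items; names are illustrative only.

def projected_db(prefix, sdb):
    """SDB_prefix: for every input sequence containing `prefix`, keep the suffix
    that follows the leftmost occurrence of `prefix`, tagged with the sequence id
    so that projected databases can be compared for equality."""
    result = []
    for sid, seq in enumerate(sdb):
        pos = 0
        for item in prefix:
            try:
                pos = seq.index(item, pos) + 1   # earliest match of each item
            except ValueError:
                pos = None                       # prefix does not occur in seq
                break
        if pos is not None:
            result.append((sid, tuple(seq[pos:])))
    return result

def same_projection(p1, p2, sdb):
    """Equality test SDB_p1 = SDB_p2, as used in Theorems 1-2 and Lemmas 1-2."""
    return projected_db(p1, sdb) == projected_db(p2, sdb)

def is_generator(seq, sdb):
    """Theorem 3: S is a generator iff no one-item deletion S^(i) satisfies
    |SDB_S| = |SDB_{S^(i)}|, since sup(S) = |SDB_S|."""
    sup_s = len(projected_db(seq, sdb))
    return all(len(projected_db(seq[:i] + seq[i + 1:], sdb)) != sup_s
               for i in range(len(seq)))
</pre>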
<div class="par"> Given a prefix sequence S<sub>P</sub>, FEAT first finds all its locally frequent items, uses each locally frequent item to grow S<sub>P</sub>, and builds the projected database for the new prefix (lines 2,3,4). It adopts both the <i>forward</i> and <i>backward</i> pruning techniques to prune the unpromising parts of the search space (lines 8,11), and uses the generator checking scheme to decide whether the new prefix is a generator (lines 7,9,11,12). Finally, if the new prefix cannot be pruned, FEAT recursively calls itself with the new prefix as its input (lines 14,15).</div>
<div class="image"> <a id="algorithm"></a> <img src="feat/algorithm.bmp" alt="FEAT Algorithm" /> </div>
<h2> <a id="tth_sEc3">3</a> Performance Evaluation</h2>
<div class="par"> We conducted an extensive performance study to evaluate the FEAT algorithm on a computer with an Intel Core 2 Duo E6550 CPU and 2GB of memory. Due to limited space, we only report the results for some real datasets. The first dataset, <i>Gazelle</i>, is a Web click-stream dataset containing 29,369 sequences of Web page views. The second dataset, <i>ProgramTrace</i>, is a program trace dataset. The third dataset, <i>Office07Review</i>, contains 320 consumer reviews for Office 2007 collected from Amazon.com, of which 240 and 80 reviews are labeled as positive and negative, respectively.</div>
<div class="figure"> <a id="fig:cmp"></a> <div class="image"> <img src="feat/gazelle_time.bmp" alt="Gazelle Figure" /> <p class="caption"> a) Gazelle</p> </div> <div class="image"> <img src="feat/programtrace_time.bmp" alt="ProgramTrace Figure" /> <p class="caption"> b) ProgramTrace</p> </div> <div class="imageText"> Figure 1: Runtime Efficiency Comparison </div> </div>
<div class="par"> Figure <a href="#fig:cmp">1</a> shows the runtime efficiency comparison between FEAT and PrefixSpan, a state-of-the-art algorithm for mining all sequential patterns. Figure <a href="#fig:cmp">1</a> a) demonstrates that FEAT is slightly slower than PrefixSpan when the minimum support threshold is high for the sparse dataset <i>Gazelle</i>; however, with a minimum support threshold of less than 0.026%, FEAT is significantly faster than PrefixSpan. This also validates that our pruning techniques are very effective: without pruning, FEAT would need to generate the same set of sequential patterns as PrefixSpan and perform generator checking to remove the non-generators, and thus would be no faster than PrefixSpan. Figure <a href="#fig:cmp">1</a> b) shows that for the dense dataset <i>ProgramTrace</i>, FEAT is significantly faster than PrefixSpan at any minimum support. For example, PrefixSpan took nearly 200,000 seconds to finish even at a minimum support of 100%, while FEAT took less than 0.02 seconds. </div>
<div class="par"> We used generators and sequential patterns as features to build SVM and Naïve Bayesian classifiers. The results on the <i>Office07Review</i> dataset show that the generator-based and sequential pattern-based models achieve almost the same accuracy. For example, with a minimum support of 2% and a minimum confidence of 75%, both the generator-based and the sequential pattern-based Naïve Bayesian classifiers achieve the same best accuracy of 80.6%. As mining generators is more efficient, the generator-based approach has an edge over the sequential pattern-based approach. </div>
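<div class="par"> As an illustration of how mined generators can serve as classification features, the following minimal Python sketch (our own assumed setup, not the authors' pipeline; the scikit-learn Bernoulli Naïve Bayes model and all variable names are ours, and the support and confidence thresholds used in the paper are not modeled) turns each review, represented as a token sequence, into a binary feature vector with one dimension per generator. </div>
<pre>
# Sketch: generators as binary features for review classification (illustrative).
# Assumes `generators` is a list of item tuples and each review is a token sequence.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

def contains(sub, seq):
    """True if `sub` occurs in `seq` as a (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(item in it for item in sub)

def to_features(reviews, generators):
    """One binary column per generator: 1 if the review contains it, else 0."""
    return np.array([[1 if contains(g, review) else 0 for g in generators]
                     for review in reviews])

# Hypothetical usage with pre-split labeled data:
#   X_train = to_features(train_reviews, generators)
#   model = BernoulliNB().fit(X_train, train_labels)
#   predictions = model.predict(to_features(test_reviews, generators))
</pre>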
</div> <h2> <a id="tth_sEc4">4</a><a id="conclusion"></a> Conclusions</h2> <div class="par"> In this paper we study the problem of mining sequence generators, which has not been explored previously to our best knowledge. We proposed two novel pruning methods and an efficient generator checking scheme, and devised a frequent generator mining algorithm, FEAT. An extensive performance study shows that FEAT is more efficient than the state-of-the-art sequential pattern mining algorithm, PrefixSpan, and is very effective for classifying Web product reviews. In future we will further explore its applications in Web page classification and click stream data analysis. </div> <h2> <a id="tth_sEc5">5</a><a id="ack"></a> Acknowledgements</h2> <div class="par"> This work was partly supported by 973 Program under Grant No. 2006CB303103, and Program for New Century Excellent Talents in University under Grant No. NCET-07-0491, State Education Ministry of China. </div> <div class="p"> <!----> </div> <div class="references"> <h1> REFERENCES</h1> <p> <a href="#CITEoldest" id="oldest"></a>[1] R. Agrawal, R. Srikant. Mining Sequential Patterns. ICDE'95.</p> <p> <a href="#CITEweblog" id="weblog"></a>[2] J. Chen, T. Cook. Mining Contiguous Sequential Patterns from Web Logs. WWW'07 (Posters track).</p> <p> <a href="#CITEcomparative" id="comparative"></a>[3] N. Jindal, B. Liu. Identifying Comparative Sentences in Text Documents. SIGIR'06.</p> <p> <a href="#CITEwhyMDL" id="whyMDL"></a>[4] J. Li, et al. Minimum description length principle: Generators are preferable to closed patterns. AAAI'06.</p> <p> <a href="#CITEPrefixSpan" id="PrefixSpan"></a>[5] J. Pei, J. Han, et al. Prefixspan: mining sequential patterns efficciently by prefix-projected pattern growth. ICDE'01.</p> <p> <a href="#CITEBIDE" id="BIDE"></a>[6] J. Wang,J. Han. BIDE:efficient mining of frequent closed sequences. ICDE'04.</p> <p> <a href="#CITEclospan" id="clospan"></a>[7] X. Yan, J. Han, R. Afshar. CloSpan: Mining closed sequential patterns in large datasets. SDM'03.</p> </div> <hr /> <h3> Footnotes:</h3> <div class="par"> <a id="tthFtNtAAB"></a><a href="#tthFrefAAB"><sup>1</sup></a>Note that a similar checking has been adopted in a closed sequential pattern mining algorithm, CloSpan [<a href="#clospan">7</a>]. Here we adapted the technique to the setting of sequence generator mining. </div> </body> </html>