PageSim: A Novel Link-based Measure of Web Page Similarity

Zhenjiang Lin

Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong

Michael R. Lyu

Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong

Irwin King

Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong

ABSTRACT

To find similar web pages to a query page on the Web, this paper introduces a novel link-based similarity measure, called PageSim. Contrast to SimRank, a recursive refinement of cocitation, PageSim can measure similarity between any two web pages, whereas SimRank cannot in some cases. We give some intuitions to the PageSim model, and outline the model with mathematical definitions. Finally, we give an example to illustrate its effectiveness.

Categories & Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval clustering, search process; G2.2 [Discrete Mathematics]: Graph Theory graph algorithms

General Terms

Algorithms, Measurement

Keywords

similarity measure, link analysis, search engine, PageRank, SimRank

1. INTRODUCTION

In this paper, we propose a novel link-based similarity measure, called PageSim. Contrast to SimRank, our method can measure similarity between any two web pages, whereas SimRank cannot in some cases.

SimRank is a fixed point of the recursive definition: two pages are similar if they are linked to by similar pages. Numerically, for any web page

and

, this is specified by defining

and

Unfortunately, the result of SimRank is not convincing in some cases. In one case, if one of two web pages has no inlink, then the SimRank score of them is zero by definition, which means they are not similar. However, this is not always true. For example, in Figure 1 of section 4,

has no inlink, but it is clear that both

and

have some similarity with it for they are linked to by

. In another case (also in Figure 1), SimRank concludes that

and

are not similar. In fact, obviously

and

indeed have some similarity, for they link to each other.

2. PAGESIM

PageSim can be considered as an extension of cocitation algorithm, in which the similarity score between two web pages is defined by the number of inlink neighbors that they have in common. Actually, on the Web, not all links are equally important. For example, if the only common neighbor of page

and

is the Yahoo home page [1], whereas page

and

have several common neighbors from obscure places, then which page is more similar to page

, page

or page

? As we know, hyperlink from web page u to v can be considered as a recommendation of page v by page u [2], and the more important a web page is, the more important its recommendation is. Evidently, the reasonable answer should be page

, since the Yahoo home page is much more ``important". In another perspective, the action of recommendation can be considered that page u propagates some kind of similarity to page v, and the more pages it links to, the less similarity it should propagate to each of these pages. Therefore, it is also reasonable to think that the Yahoo home page has some kind of similarity with both page

and page

Since PageRank [5] is one of the most prominent ranking algorithm which assigns global ranking scores to all pages on the Web, we take the PageRank score of a web page as the importance (weight or similarity score) of it in the PageSim method. The intuitions to PageSim model is described as follows, and the mathematical definitions will be given later.

At the beginning, each web page only contains its own similarity score, and then each web page propagates its own similarity score to its outlink neighbors, receiving and propagating the similarity scores of others at the same time. After the propagation, each page contains its own similarity score as well as the similarity scores of others. These scores are stored in a vector called the similarity vector of this page. Then we can calculate the PageSim score of each pair of pages by summing their common similarity scores up.

3. DEFINITIONS

We model the Web as a directed graph

with vertices

representing web pages $v_i (i = 1, 2, \cdots, n = \vert E\vert)$ and directed edges

representing hyperlinks between web pages. Let

and

denote the set of inlink pages and outlink pages of

respectively, for any $v \in V$ . Let

denotes a sequence of vertices ${u_1, u_2, \ldots, u_s}$ such that $(u_i, u_{i+1}) \in E$ $(i = 1, \cdots, s-1)$ and

are distinct, it is called a path from

. Let

denotes the length of path

, define $length(p) = \vert p\vert - 1$ . Let

denotes the set of all possible paths from page

4. PAGESIM VS SIMRANK

A good evaluation of PageSim is difficult without performing extensive user studies or having a reliable external measure of similarity to compare against. In this section, we give a simple example in which PageSim is compared with SimRank to illustrate the performance of PageSim.

For a given graph

, where $V=\{v_i\} (i = 1, \cdots, 6)$ (see Figure 1). Let $\overrightarrow{PR}(V) = (PR(v_i))^{T}, i=1,\cdots, 6$ . We have $\overrightarrow{PR}(V) = (0.08, 0.23, 0.18, 0.14, 0.14, 0.23)^{T}.$

We can see that the results of PageSim and SimRank are different. First, SimRank shows that there's no page similar to

. While PageSim shows that

is most similar to

, which is more reasonable. Because the fact that

links to

implies

``considers"

has some level of similarity with it. Secondly, SimRank shows

is not similar to

, while PageSim shows that is not true. Obviously,

and

are similar, for they link to each other. Moreover, PageSim considers that

is most similar to

. SimRank shows

is most similar to

, for they have a common inlink page

. We believe PageSim is the winner in this situation because the ``link to each other" relationship really implies stronger similarity than that of the ``common inlink" relationship.

5. CONCLUSION AND FUTURE WORK

This paper introduces PageSim, a novel link-based similarity measure. Based on the strategy of PageRank score propagation, PageSim is capable of measuring similarity between any two web pages.

There are numbers of avenues for future work. Foremost, we must address the efficiency issue. For example, the computing time of PageSim is expected to be greatly reduced by limiting the radius of propagation, i.e., the path length of propagation. Especially, when the radius is reduced to 1, PageSim becomes a ``weighted cocitation algorithm". Finally, extensive evaluations of PageSim are needed.

6. ACKNOWLEDGMENT

REFERENCES

[1] http://www.yahoo.com.

[2] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the Web. ACM Transactions on Internet Technology, 1(1):2-43, 2001.

[3] J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. Computer Networks (Amsterdam, Netherlands: 1999), 31(11-16):1467-1479, 1999.

[4] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In KDD '02: Proceedings of the eighth ACM SIGKDD, pages 538-543, New York, NY, USA, 2002. ACM Press.

[5] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.

PageSim: A Novel Link-based Measure of Web Page Similarity

Zhenjiang Lin

Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong

zjlin@cse.cuhk.edu.hk

Michael R. Lyu

Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong

lyu@cse.cuhk.edu.hk

Irwin King

Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong

king@cse.cuhk.edu.hk

ABSTRACT

Categories & Subject Descriptors

General Terms

Keywords

1. INTRODUCTION

2. PAGESIM

3. DEFINITIONS

4. PAGESIM VS SIMRANK

5. CONCLUSION AND FUTURE WORK

6. ACKNOWLEDGMENT

REFERENCES

PageSim: A Novel Link-based Measure of Web Page Similarity

Zhenjiang Lin

Dept. of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

zjlin@cse.cuhk.edu.hk

Michael R. Lyu

Dept. of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

lyu@cse.cuhk.edu.hk

Irwin King

Dept. of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong

king@cse.cuhk.edu.hk

ABSTRACT

Categories & Subject Descriptors

General Terms

Keywords

1. INTRODUCTION

2. PAGESIM

3. DEFINITIONS

4. PAGESIM VS SIMRANK

5. CONCLUSION AND FUTURE WORK

6. ACKNOWLEDGMENT

REFERENCES

Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong

Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong

Dept. of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, N.T., Hong Kong