Analyzing Global Behavior of Web Community Evolution

Masashi Toyoda and Masaru Kitsuregawa
Institute of Industrial Science, University of Tokyo
4-6-1 Komaba Meguro-ku, Tokyo, JAPAN
+81-3-5452-6098 (ext. 57623)
toyoda@tkl.iis.u-tokyo.ac.jp, kitsure@tkl.iis.u-tokyo.ac.jp

ABSTRACT

Recent advances in storage technology make it possible to store a series of large Web archives. It is now exciting challenge for us to observe evolution of the Web. We analyze evolution of web communities by comparing four Japanese web archives crawled from 1999 to 2002. A web community is a set of web pages created by individuals or associations with a common interest on a topic. So far, various link analysis techniques have been developed to extract web communities. Statistics of community evolution are examined, and the global behavior of evolution is described.

Keywords

Link analysis, web community, evolution

1. INTRODUCTION

In this paper, we analyze the evolution of web communities, and show some statistics of evolution. A web community is a collection of web pages created by individuals or associations with a common interest on a topic, such as fan pages of a baseball team, and official pages of computer vendors. Recent research on link analysis[2,3,5] shows that we can identify a web community on a topic by extracting densely connected structure in the web graph, in which nodes are web pages and edges are hyperlinks. Although these techniques can automatically identify communities, evolution of communities still have not examined.

Since a web community represents a certain topic, we can understand when and how new topics emerged and evolved in the Web. Such information is important in the following situations: (1) Answering historical queries about topics on the Web. For example, how many and what kinds of web pages have been created related to the September 11, 2001 attacks; (2) Tracking and analyzing user communities of consumer products for marketing; (3) Observing social and cultural trends over time would be interesting topics for sociological research; (4) Observing the emergence of quality web communities on a topic. Finding emergent communities according to user's interest is challenging task for new search services.

Statistics of web community evolution is examined using four Japanese web archives crawled from 1999 to 2002 with 119M pages in total. We first extracts whole communities and their relevance from each web archive. It is based on our previous work of web community chart [6] that is a graph of communities, in which related communities are connected by weighted edges. Then compare these charts over time. We show that the structure of communities changes dynamically, though the size distribution of communities is rather stable. For the page limitation, we omit more detailed results. For example, we have found that most of changes in communities follow the power-law. In the following, we first briefly describe our technique for building the web community chart. Then show statistics of web archives and community evolution.

2. Web Community Chart

The web community chart is a graph that consists of web communities as nodes, and weighted edges between related communities. The weight of each edge represents the relevance of communities at both ends. The main idea of our technique is applying a related page algorithm, Companion-[6], to a number of seed URLs, then investigate how each seed derives other seeds as related pages. Companion- takes a seed URL as an input, and calculates related pages to the seed based on the concept of hubs and authorities [4]. It returns authorities near the seed as related pages.

To identify communities and to find their relationships, we focus on symmetric derivation relationship (SDR) between s and t, that is s and t derive each other by the related page algorithm. This often means that the both pages s and t are pointed to by similar sets of hubs, and have strong similarity. Using this relationship, we define that a community is a set of pages that are connected by the symmetric relationships, and that two communities are related, if a member of one community derives a member of an another community.

The community chart is built from a given seed set. We first apply Companion- to each seed in the seed set. Then, we put seeds into a community, when they are connected by symmetric relationships. Every two communities are connected by a edge, when there exists a derivation from a seed in one community to an another seed in the other community. A weight of a edge is the number of such derivations. Refer to [6], for more detailed descriptions.

3. Evolution of Web Communities

In this section, we first explain details of changes in web communities, and then show statistics of community evolution.

3.1 Types of Changes

We observe the evolution of communities from a series of web archives, W(t1), ... , W(tn). (tk represents the time when each archive is crawled.) We first build web community charts (C(t1), C(t2), ..., C(tn)) from all web archives, and then investigate differences between neighboring charts.

There are two ways to see the evolution of a community, backward and forward. For simplicity, here we explain backward examination. That is, first we select a web community chart C(tk), and see how communities had been evolved until time tk by comparing C(tk) and C(tk). We can do the same thing for forward examination of evolution by comparing C(tk) and C(tk+1). Here we show how communities change from C(tk-1) to C(tk).

3.2 Web Archives and Graphs

For experiments, we used four web archives of Japanese web pages (in jp domain) crawled in 1999, 2000, 2001, and 2002 (See Table 1). We used the same web crawler in 1999 and 2000, and collected about 17 million pages in each year. In 2001, the size of the archive became more than twice of the 2000 archive, since we improved the crawling rate significantly. Our crawlers collected pages in the breadth-first order.


Table 1: Details of our web archives
Year Period #Pages #URLs #Links #Seeds #Communities
1999 Jul. to Aug. 17M 34M 120M 657K 79K
2000 Jun. to Aug. 17M 32M 112M 737K 88K
2001 Early Oct. 40M 76M 331M 1404K 156K
2002 Early Feb. 45M 84M 375M 1511K 170K


From each archive, we built a web graph with URLs and links by extracting anchors from all pages in the archive. Our graph included not only URLs inside the archive, but also URLs outside pointed to by inside URLs. As a result, the graph included URLs outside jp domain, such as com and edu. Table 1 also shows the number of links and the total URLs.

By comparing these graphs, we found that the Web was extremely dynamic. About 60% of URLs disappeared from both 1999 and 2000 graphs. From 2001 to 2002, about 30% of URLs disappeared in four months. The number of URLs surviving through four archives was only about 5 million. We also examined the indegree distributions of our web graph. Indegree of a URL stands for the number of URLs pointing to the URL. Previous work, such as [1,5], shows indegree distributions fit to a power-law, in which the probability of the positive integer value i is proportional to 1/i^k for a small positive number k. The reported number of k is 2.1. Indegree distributions of our graphs also exhibit a power-law. The exponent of the power-law is around 2.2 for all graphs.

3.3 Evolution of Web Community Charts

In this section, we describe the global behavior of community evolution. From the above four web graphs, we built four community charts using the technique described in Section 2. As a seed set for each graph, we selected popular URLs (with 3 or more inlinks from different servers). Table 1 also shows the number of seeds and communities in the chart created for each year.

The size distribution of communities also follows the power-law. The exponent is around 2.95, and did not change so much over time. This result is similar to distributions of connected components in the Web graph reported in [1].



Figure 1:Transition of communities

Although the size distribution of communities is stable, the structure of communities changes dynamically. Figure 1 shows how many communities are involved in each type of changes from tk-1 to tk. Each bar represents the number of communities in charts at the time. Bars at 2000 and 2001 are split vertically, since they have the previous and the next charts to be compared. Each colored block represents the number of communities involved in a particular change. Dotted blocks represent dissolved communities from tk-1, and white blocks represent emerged communities. Gray blocks represent communities that are involved in split or merge. Finally, black blocks represent single communities that are not involved in these changes, but may grow or shrink. We can see that the structure of our chart changes mainly by split and merge, in which more than half of communities are involved. The number of single communities are small (only about 10% from 1999 and 2001). About 25% of communities are dissolved from each time.

4. REFERENCES

  1. A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph Structure in the Web. In Proceedings of the 9th World-Wide Web Conference, 2000.
  2. G. W. Flake, S. Lawrence, and C. L. Giles. Efficient Identification of Web Communities. In Proceedings of KDD 2000, 2000.
  3. D. Gibson, J. Kleinberg, and P. Raghavan. Inferring Web Communities from Link Topology. In Proceedings of HyperText98, 1998.
  4. J. M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998.
  5. S. R. Ravi Kumar, Prabhakar Raghavan and A. Tomkins. Trawling the Web for emerging cyber-communities. In Proceedings of the 8th World-Wide Web Conference, 1999.
  6. M. Toyoda and M. Kitsuregawa. Creating a Web Community Chart for Navigating Related Communities. In Conference Proceedings of Hypertext 2001, pages 103--112, 2001.