Measuring Extremal Dependencies in Web Graphs

Yana Volkovich

University of Twente
Enschede, The Netherlands

Nelly Litvak

University of Twente
Enschede, The Netherlands

Bert Zwart

Georgia Tech.
Atlanta, Georgia, USA

Copyright is held by the World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2008, April 21-25, 2008, Beijing, China.
ACM ???.

ABSTRACT

We analyze dependencies in power law graph data (Web sample, Wikipedia sample and a preferential attachment graph) using statistical inference for multivariate regular variation. The well developed theory of regular variation is widely applied in extreme value theory, telecommunications and mathematical finance, and it provides a natural mathematical formalism for analyzing dependencies between variables with power laws. However, most of the proposed methods have never been used in the Web graph data mining. The present work fills this gap. The new insights this yields are striking: the three above-mentioned data sets are shown to have a totally different dependence structure between different graph parameters, such as in-degree and PageRank.

Categories & Subject Descriptors

{E.1}[Data structures] Graphs and networks; {G.3}[Probability and Statistics] Multivariate statistics

General Terms

Algorithms, Experimentation, Measurement

Keywords

Regular variation, PageRank, Web, Wikipedia, Preferential attachment

1 Introduction

The question of measuring correlations in the Web data has led to many controversial results. Most notably, there is no agreement in the literature on the dependence between in-degree and PageRank of a Web page [4], [5]. One of the main points that we make in this work is that the commonly used correlation coefficient is an uninformative dependence measure in heavy-tailed data [3], [7] typical for the Web and Wikipedia graphs.

A natural mathematical formalism for analyzing power laws is provided by the theory of regular variation. By definition, the distribution

has a regularly varying tail with index $\alpha$ , if ${\mathbb{P}}(X>x)=x^{-\alpha}L(x),\;x>0,$ where

is a slowly varying function, that is, for

, $L(tx)/L(t)\to 1$ as $t\to\infty$ . Below we analyze the dependence structure in the power law data by means of contemporary statistical techniques specially designed for multivariate regular variation [7].

2 Data Sets

The log-log plots for in-degree and PageRank in Figure 1 resemble the signature straight line indicating power laws. However, several techniques should be combined in order to establish the presence of heavy tails and to evaluate the power law exponent. Using QQ plots, Hill and altHill plots as well as Pickands plots [7] we confirm that the in-degree and PageRank follow power laws with similar exponents for all three data sets. We also conclude that the out-degree can be modeled reasonably well as a power law with exponent around 2.5-3, see [8] for details.

Although all plots in Figure 1 look alike, it does not imply that the three networks have identical structure. The main goal of the present work is to rigorously examine the dependencies between the network parameters.

3 Angular Measure

Now we need to consider the points $\{\Theta_{i,k}:R_{i,k}>1\}$ and make a plot for cumulative distribution function of $\Theta$ . In other words, we are interested in the angular measure, i.e. in the empirical distribution of $\Theta$ for

largest values of

. Thus, unlike the correlation coefficient, the angular measure provides a subtle characterization of the dependencies in the tails of X and Y, or, extremal dependencies. If such measure is concentrated around $\pi/4$ then we observe a tendency toward complete dependence, when a large value of

appears simultaneously with a large value of

. In the opposite case, when such large values almost never appear together, we have either large value of

or large value of

, hence, $\Theta$ should be around 0 or $\pi/2.$ The middle case plots can be seen as a tendency to dependency or independency.

In the case of bi-variate data, a suitable value of

can be determined by making a Starica plot [7]. We consider radii $R_{1,k},\ldots,R_{n,k}$ from (1) and write $R_{(i)}$ for the

th largest value. To get Starica plot we graph

4 Dependence measurements

The results are striking. For the Wikipedia data set we observe the independence of the tails of in-degree and PageRank. That is, an extremely high in-degree almost never implies an extremely high ranking. The picture is completely opposite for Growing Networks, where the angular measure is indicating a complete dependence. Thus, in highly centralized preferential attachment graphs, most connected nodes are also most highly ranked. Finally, the Web graph exhibits a subtle dependence structure with an angular measure close to uniform on $[0,\pi/2]$ . This suggests that the PageRank popularity measure can not be replaced by in-degree without significant disturbance in the ranking.

The picture is different in Figure 2(c). In the Web, the in- and out-degree tend to be independent which justifies the distinction between hubs and authorities. In Wikipedia the degrees are dependent but this dependence is not absolute. Finally, the dependence between out-degree and PageRank in the Web and Wikipedia resembles the patterns observed for in-degree and PageRank.

These results are useful in several ways. First, they reveal some important structural features thus extending our knowledge on real-life networks. Second, by comparing the dependencies for experimental and synthetic data we can considerably improve existing graph models.

REFERENCES

[1] http://law.dsi.unimi.it/ Accessed in January 2007

[2] R. Albert and A. L. Barab\'{a}si, Emergence of scaling in random networks, Science, 286:509-512, 1999

[3] D. Chakrabarti and C. Faloutsos, Graph mining: Laws, generators, and algorithms, ACM Comput. Surv., 38(1):2, 2006

[4] D. Donato, L. Laura, S. Leonardi, and S. Millozi, Large scale properties of the webgraph, Eur. Phys. J., 38:239-243, 2004

[5] S. Fortunato, M. Boguna, A. Flammini, and F.Menczer, How to make the top ten: Approximating PageRank from in-degree, In Proceeding of WAW2006, 2006

[6] A. N. Langville and C. D. Meyer, Deeper inside PageRank, Internet Math., 1:335-380, 2003

[7] S. I. Resnick, Heavy-tail Phenomena, Springer, New York, 2007

[8] Y.Volkovich, N. Litvak, and B. Zwart, Measuring extremal dependencies in web graphs, Memorandum 1858, http://eprints.eemcs.utwente.nl/11349/, 2007

Measuring Extremal Dependencies in Web Graphs

Yana Volkovich

University of Twente
Enschede, The Netherlands

y.volkovich@ewi.utwente.nl

Nelly Litvak

University of Twente
Enschede, The Netherlands

n.litvak@ewi.utwente.nl

Bert Zwart

Georgia Tech.
Atlanta, Georgia, USA

bertzwart@gatech.edu

ABSTRACT

Categories & Subject Descriptors

General Terms

Keywords

1 Introduction

2 Data Sets

3 Angular Measure

4 Dependence measurements

REFERENCES

Measuring Extremal Dependencies in Web Graphs

Yana Volkovich

University of TwenteEnschede, The Netherlands

y.volkovich@ewi.utwente.nl

Nelly Litvak

University of TwenteEnschede, The Netherlands

n.litvak@ewi.utwente.nl

Bert Zwart

Georgia Tech.Atlanta, Georgia, USA

bertzwart@gatech.edu

ABSTRACT

Categories & Subject Descriptors

General Terms

Keywords

1 Introduction

2 Data Sets

3 Angular Measure

4 Dependence measurements

REFERENCES

University of Twente
Enschede, The Netherlands

University of Twente
Enschede, The Netherlands

Georgia Tech.
Atlanta, Georgia, USA