Web Link Behavior and Consequences for Connectivity Based Authority Measures

Thomas Mandl
University of Hildesheim
Information Science
Marienburger Platz 22
31141 Hildesheim
Germany
mandl@uni-hildesheim.de

ABSTRACT

This study analyzes the link behavior of web page authors and draws conclusions for the design of connectivity based authority measures used in web information retrieval. Back links are increasingly considered an important indicator of the authority of pages. A quantitative analysis of links to and from internet catalogues shows that the probability of a link to a catalogue page decreases drastically when the page is on a low level in the hierarchy. Furthermore, the number of links to a page in an internet catalogue does not correlate with the number of back links of the sites it mentions. As a consequence, link based authority measures need to be further refined in order to better reflect the cognitive processes involved in link creation.

Keywords

link-based information retrieval

1. INTRODUCTION

Links are an important element of human-computer interaction on the web. Links are also a knowledge source for information retrieval systems on the web. Overall, web linking behavior exhibits some surprisingly consistent patterns. The number of pages with a given number of back links or out-links, plotted against that number of links, closely follows a power law distribution [1, 3].
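The power law relationship can be illustrated with a short sketch. It assumes that the number of pages with k links is proportional to k^(-a), so that the exponent a is the negative slope of a straight line in log-log space. The function name and the sample counts are hypothetical and serve only as illustration; a least-squares fit on log-log data is a rough estimator and is not taken from [1, 3].

```python
import math

def fit_power_law_exponent(counts):
    """Estimate a in n(k) ~ C * k**(-a) by least squares on log-log data.

    counts maps a link count k (> 0) to the number of pages with k links.
    """
    points = [(math.log(k), math.log(n))
              for k, n in counts.items() if k > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive data points")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all link counts are identical")
    # The slope of the fitted line is -a.
    return -sxy / sxx

# Hypothetical data generated from n(k) = 10000 * k**(-2), not measured values.
sample_counts = {k: 10000 * k ** -2.0 for k in (1, 2, 4, 8, 16)}
exponent = fit_power_law_exponent(sample_counts)
```

On real link counts, the points scatter around the line, and the fitted exponent depends on how the tail of the distribution is treated.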

The reasons why humans point to other pages from their own pages need to be further investigated, since quantitative link based measures are becoming an important factor in today's web search engines. This paper explores issues arising in the context of link behavior and draws conclusions for link-based authority measures in web information retrieval. The main findings are the following:

These results should be considered in the design of connectivity based web search algorithms.

Furthermore, the analysis shows how web page authors assume that catalogue pages are best used. They prefer linking to pages that offer many further options over pointing to a very specific page. Rather than targeting pages with a narrow topical focus on lower hierarchical levels, they infer that users will benefit from many options for browsing. Web page authors thus stress the importance of navigation by browsing as an information finding strategy in human-computer interaction.

2. EXPERIMENTS

The popularity of the internet search engine Google and of the PageRank algorithm implemented as part of it [2, 7] has led to considerable scientific interest in link analysis and structure mining on the World Wide Web. First evaluation results for these algorithms are available. They show that considering link structure does not lead to better retrieval performance for topical queries. Only for home page finding has an improvement been achieved by integrating PageRank values into a retrieval algorithm [4].
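For readers unfamiliar with the idea behind PageRank, the following minimal sketch shows the commonly published power iteration on a toy graph. It is not the implementation used by Google. The damping value of 0.85, the iteration count, the handling of pages without out-links, and the three-page graph are all assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration for PageRank on a small graph.

    out_links maps each page to the list of pages it links to.
    Pages without out-links spread their score evenly over all pages.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same base share from the random jump.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                share = damping * rank[p] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical three-page site: both sub pages link back to the home page.
toy_graph = {"home": ["a", "b"], "a": ["home"], "b": ["home"]}
scores = pagerank(toy_graph)
```

In this toy graph the home page collects the score passed on by both sub pages and therefore ends up with the highest value, which mirrors the advantage of top-level pages discussed in this paper.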

Web directories or internet catalogues are important services for orientation on the internet. They usually aim to organize information sources by topic and to introduce a certain level of quality control. Human editors monitor the web and evaluate and comment on pages. Our two experiments focus on the following two issues arising in the context of web catalogues and their use:

The experiments reported here are based on two German web catalogues, Yahoo.de and Google-Directory. Both the Google and Altavista search engines were used to determine the number of in-links to each of the pages in the internet catalogues. In the second experiment, we also queried these search engines for the number of back links for the entries contained in the catalogue pages.

The first experiment analyzed the relationship between the number of back links to a catalogue page and the hierarchy level of that page. The information derived during the web mining process shows a drastic decrease in the number of links at lower hierarchy levels. The lower the catalogue pages are positioned in the topical hierarchy, the less likely they are to receive back links. These findings show that humans are much more likely to set a link to a page which is positioned at a higher level of the hierarchy of a web site. The relation is visualized in figure 1.

Figure 1: Relationship between average number of in links and hierarchy position
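The aggregation behind such a figure can be sketched as follows. The sketch assumes that the hierarchy level of a catalogue page can be read from the depth of its URL path, which the text does not state. The function names, the example domain, and the counts are hypothetical and are not the measurements from this study.

```python
from collections import defaultdict
from urllib.parse import urlparse

def hierarchy_level(url):
    """Count non-empty path segments; the catalogue root is level 0."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])

def average_in_links_by_level(in_link_counts):
    """Group catalogue pages by hierarchy level and average their in-link counts."""
    totals = defaultdict(lambda: [0, 0])  # level -> [sum of in-links, page count]
    for url, count in in_link_counts.items():
        level = hierarchy_level(url)
        totals[level][0] += count
        totals[level][1] += 1
    return {level: s / n for level, (s, n) in sorted(totals.items())}

# Hypothetical counts, not the measurements from the study.
sample = {
    "http://catalogue.example/": 900,
    "http://catalogue.example/science/": 120,
    "http://catalogue.example/sports/": 80,
    "http://catalogue.example/science/physics/": 10,
}
averages = average_in_links_by_level(sample)
```

In practice the in-link counts would come from search engine queries, as described above, and the level might instead be taken from the category structure of the catalogue itself.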

The second experiment intended to investigate the adequacy of the number of back links as a quality indicator. If the number of back links is a good indicator, a popular catalogue page should point to popular sites. The indicator should be consistent for catalogue pages and referred pages.

The analysis showed that there is no correlation between the two different counts of in-links. The number of in-links to a catalogue page is independent of the number of in-links to its entries, that is, the sites that the page refers to. This also suggests that there is a rationale for distinguishing between content quality (authority value) and referral quality (hub value), as is done in the Kleinberg algorithm [6].

Surprisingly, a significant correlation was found for another parameter, the number of sub categories. Whereas correlation with all other parameters remains close to zero, the number of sub categories exhibits a correlation over 0.5 for category level two. The results are shown in the following table:

                            in links based   in links based   number of        number of
                            on Google        on Altavista     sub categories   entries
All pages                   -0.05            -0.06            0.34             -0.10
Only pages on level two     -0.10            -0.08            0.58             -0.13
Only pages on level three   -0.05            -0.06            0.04             0.10
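Correlation values of this kind can be computed as in the following sketch. It assumes a Pearson correlation coefficient, which the text does not specify. The function name and the sample lists are hypothetical and do not reproduce the data of the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values: in-links of four catalogue pages and their sub category counts.
page_in_links = [10, 20, 30, 40]
sub_categories = [1, 2, 3, 4]
r = pearson(page_in_links, sub_categories)
```

Each cell of the table would correspond to one such call, with the in-link count of the catalogue page as one sequence and the respective parameter as the other.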

3. OUTLOOK

Further analysis is necessary for non-catalogue sites with a hierarchical organization. If a similar relationship is found, the hierarchical position of a page could be integrated as a factor in link analysis within search engines. An evaluation of the effectiveness of this approach would be necessary. Further research is also required to identify additional indicators for the quality of web pages other than links [5].
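How the hierarchical position would enter a link measure is left open here. One possible reading, given purely as an assumption, is to compare the in-link count of a page with the average count of all pages on the same level, so that pages on lower levels are not penalized for their position alone. The function name and the sample values below are hypothetical.

```python
def level_normalized_in_links(pages):
    """Relate each page's in-link count to the mean count on its hierarchy level.

    pages maps a page id to a (hierarchy level, in-link count) pair.
    A value above 1.0 means the page is linked more often than its level average.
    """
    sums = {}
    for level, count in pages.values():
        total, n = sums.get(level, (0, 0))
        sums[level] = (total + count, n + 1)
    result = {}
    for page, (level, count) in pages.items():
        total, n = sums[level]
        mean = total / n
        result[page] = count / mean if mean > 0 else 0.0
    return result

# Hypothetical pages, not data from the study.
sample_pages = {
    "root": (0, 900),
    "science": (1, 120),
    "sports": (1, 80),
    "physics": (2, 10),
}
normalized = level_normalized_in_links(sample_pages)
```

Whether such a normalization improves retrieval quality is exactly the kind of question the evaluation mentioned above would have to answer.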

4. REFERENCES

  1. Adamic, L. Huberman, B. The Web's Hidden Order. Communications of the ACM 44(9):55-59
  2. Brin, S. Page, L. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30(1-7):107-117
  3. Dill, S. Kumar, R. McCurley, K. Rajagopalan, S. Sivakumar, D. Tomkins, A. Self-Similarity in the web, in Proceedings 27th Intl Conf on Very Large Databases (VLDB 2001)
  4. Hawking, D. Craswell, N. Overview of the TREC-2001 Web Track, in Proc. Ninth Text REtrieval Conference 2001 http://trec.nist.gov/pubs/trec9/t9_proceedings.html.
  5. Ivory, M. Hearst, M. Statistical Profiles of Highly-Rated Sites, in Proc. of CHI '2002 (Minneapolis, USA 2002)
  6. Kleinberg, J. Authoritative Sources in a Hyperlinked Environment, in Proc. 9th ACM-SIAM Symposium on Discrete Algorithms. 1998.
  7. Page, L. Brin, S. Motwani, R. Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web. Manuscript. http://citeseer.nj.nec.com/page98pagerank.html.