Journal reference: Computer Networks and ISDN Systems, Volume 28, issues 7–11, p. 1321.

Automatically Organizing Bookmarks per Contents

Yoëlle S. Maarek
IBM Haifa Research Laboratory
MATAM, Haifa 31905, ISRAEL

Israel Z. Ben Shaul
Dept. of Electrical Engineering
Technion, Israel Institute of Technology
Haifa 32000, ISRAEL

Abstract

The explosive growth in the Web leads to the need for personalized client-based local URL repositories often called bookmarks. We present a novel approach to bookmark organization that provides automatic classification together with user adaption.

1. Introduction and Motivation

The explosive growth in information that becomes available on the Web, and the dynamic nature of the location and contents of such information, has led to the development of various search services (e.g., Yahoo, Lycos, InfoSeek, Excite, Alta Vista) which use "crawler" programs to collect new or updated documents, and information retrieval techniques for indexing and storing information to enable fast and accurate retrieval of these documents. Termed search engines, these are essentially huge URL (Universal Resource Locator) databases that hold characteristic information on each URL to determine its relevance to a given query.

In addition to the server-based global repositories maintained by search engines, there is a need for personalized client-based local URL repositories. Unlike their global counterparts, local repositories store only address and minimal state information per document and are used mainly to remember interesting sites or documents to revisit in the future. But despite the different purposes, they are nontheless URL repositories. These repositories are termed HotLists in Mosaic, Quicklists in WebExplorer, and Bookmarks in Netscape Navigator (we will refer to them as bookmarks from now on) and were initially represented as a simple flat list of URLs. As the use of the Web becomes more prevalent and the size of personal repositories grows, however, adequately organizing and managing bookmarks becomes crucial, somewhat analogous to the need to organize files in a private disk. As a result, a browser such as Netscape Navigator 2.0 has recently added support for tree-like directory (or folder) structure for organizing bookmarks. Such hierarchical organization enables users to browse easily through their repository to find the necessary information, (assuming that the documents were adequately classified and that the user remembers his/her classification scheme). However, manual URL classification and organization can be difficult and tedious when there are more than a few bookmarks to classify or when the user lost track of his classification scheme. Moreover, unlike file systems, the contents of the URLs may not be in the local cache so fetching the document only in order to refresh the user's memory about its contents might be time consuming.

In this paper we present a novel approach to bookmark organization as well as a tool that embodies this approach. The gist of our approach is in combining manual and automatic organization facilities that enable a user to flexibly determine when to apply automatic organization, and on what subsets to apply it. Section 2 describes how bookmarks are typically collected and explains why it is difficult to organize them entirely manually. Section 3 presents our high-level approach to bookmark organization. Section 4 presents the design of the organizer tool, covering analysis, clustering and search techniques used. Section 5 describes a sample scenario, and Section 6 gives some implementation notes. Section 7 concludes the paper. Related work is discussed as relevant throughout the paper.

2. Bookmark Collection and Organization Requirements

The typical paradigm of ``net-surfing'' is characterized by visiting a series of URLs, occasionally spotting an interesting URL and adding it to the local repository on-the-fly. Manual classification of bookmarks is likely to be difficult for the following reasons:

Multi-document classification
Often, a user needs to classify a large set of URLs at once, as opposed to adding a single URL. Reasons for such multi-URL classification include:
- Upgrading a "legacy" flat bookmark repository from a browser that does not support hierarchical organization.
- Using an import facility (such as the one already provided in Netscape Navigator 2.0) to merge bookmark hierarchies.
- Integrating a list of URLs returned from a Web search service. In some cases, the user might simply enter the URLs as new bookmarks without even visiting them.
- Integrating a list of URLs that the user had no time or patience to classify while surfing and that s/he procrastinated.
Lack of informativeness of the bookmark's title or URL
The user has often very little knowledge or remembrance of what the bookmark is about, and how URLs are related to each other. Even if s/he knows roughly what the bookmark is about, s/he can at best perform a high-level classification i.e., store the bookmark in a top category), but cannot make a fine distinction among related bookmarks. Besides the obvious advantage of organizing documents by their relevance to each other, fine-level classification also helps to detect duplicates or highly similar documents and determine whether to keep or delete them (a typical example is when a Web search returns the same RFC under mutiple URLs and in various formats).

Regardless of the origin of the bookmark set, manual classification into an existing hierarchy is a tedious task that one might want to automate. One alternative to automatic classification could be to organize bookmarks per "journey" while surfing, by simply recording the history of traversed URLs and storing them in the same folder. Indeed, some browsers support this organization in addition to a flat bookmark list. For example, the IBM OS/2 WebExplorer uses WebMap to store history alongside the flat Quicklist.

While potentially useful in some cases, classifying bookmarks per journey is not adequate because typically the set of traversed URLs is not homogeneous.

Typical bookmark repositories contain bookmarks which are collected by a single person with a limited and well-defined set of interests (compared to multi-user repositories). This implies that a repository is likely to have an inherent partitioning into groups with high intra-group similarity and low inter-group similarity. Such organization just needs to be discovered. Hence, providing automatic assistance in organizing bookmark repositories by their conceptual categories would be extremely beneficial. This is the goal of the Bookmark Organizer (BO) that we present here.

2.1 Requirements for Bookmark Organization

There have been previous attempts at automatically organizing document collections in the fields of information retrieval and digital libraries, to support not only classical search but also browsing as an information discovery paradigm. Typically, one would use advanced full-text indexing and cluster analysis e.g.,[Cutting et al. 92], [Maarek et al. 94], clustering based on hypertext links, [Botafogo 93] or even categorization techniques [Apte et al. 94] to support such tasks, depending on whether there exists a prior taxonomy of the library, a training corpus ,etc. However, the requirements for bookmark repository organization are quite different than those for ``conventional'' document collections.

The size of a typical bookmark repository is smaller in several orders of magnitude compared with a typical digital document collection.
The collection is highly evolutionary; the bookmark set changes both very frequently and dynamically,i.e., on-line. In contrast, a digital library changes infrequently, and sometimes off-line.
Unlike ``public'' digital libraries which serve many users, personalized repositories must include a tight user-interaction mechanism that allows users to interfere and control the otherwise automatic organization utility at as many stages of the organization as possible, to suite their organization preferences.
The HTML document pointed to by the bookmark is typically small in size compared with a typical digital book. This allows to scan and analyze documents more frequently with much smaller performance penalty.

3. A Hybrid Approach to Bookmark Organization

Given the above requirements, most fully automatic document clustering tools are not suitable for bookmark organization. Even an incremental clustering technique e.g., [Maarek 90] that allows to modify the taxonomy as a second minor phase following a primary automatic phase that generates the taxonomy from scratch, is inappropriate. Indeed, since bookmarks represent personal interests, the user may not want to relinquish control over the organization of its repository to an automated tool. As an analogy, imagine a stranger that organizes your papers every morning on your desk; even if the organization is better and more rational than what you would have done, it is usually not acceptable.

Facing the strong need for automated organization, coupled with the need to retain full control with the user, we advocate a non-intrusive policy and suggest a hybrid approach in which manual and automatic organization are ``equal partners''. A user can employ only manual organization, fully automatic organization, or as in the typical case, interleave manual and automatic phases. Most importantly, to enable fine granularity with respect to the scope of application, a user can apply either mode of organization on a defined subset of the repository, recursively.

Automatic and manual organizations correspond to two modes that can be used alternatively and in a coordinated way. A user can define ``core'' categories manually, or import existing high-level taxonomies such as SmartMarks, and then apply automatic organization only within these categories. Alternatively, a user can automatically generate categories and manually arrange each category. Still, the user is free to (re)define categories and the partitions that are under "automatic clustering" control, but he/she does not have to apply either mode globally, on the entire repository.

In addition to fine granularity, this approach also promotes modularity, in that applying automatic organization can be done on part of the repository without affecting other loosely related parts of the repository.

3.1 Freezing Nodes

One of the key features of BO is to allow users to mark internal nodes as frozen. By freezing nodes that have been manually defined or customized from automatically identified nodes, the user communicates with the automatic tool and ensures that the tool will not interfere with the manually-defined structure. More specifically, by freezing a node N, the user signifies that the subtree SN rooted at N represents a coherent group of documents. In particular, it prevents from automatic clustering to break this boundary by, say, re-clustering it with documents from other subtrees and generating new directory structure; however, it still allows to refine N's internal organization by automatically clustering any unfrozen subtree of SN.

Hence, a BO directory structure can be viewed in general as a tree with some "fat" frozen leaves treated as black-box subtrees. There are no restrictions on where in the hierarchy a frozen node can be placed; frozen nodes may even be nested, i.e., a frozen node can have a subtree which is also marked as frozen. However, in general we expect that frozen nodes will be mostly effective when defined relatively close to the root in the hierarchy, to represent high-level categories that are not likely to evolve as often as lower-level nodes.

3.2 Adding Bookmarks

The most common user-operation in bookmark management is the addition of new bookmarks. Consistent with the hybrid approach, BO allows to add documents in two complementary ways:

Fully automatic
The user does not know in which category to place the new bookmark (or does not care to find it), and requests automatic assistance. In this case, BO applies a very precise search technique (described later in Section 4), and determines the most relevant frozen cluster. It then inserts the document into the proper cluster by applying automatic clustering on it. This in turn generates a new organization in this cluster but does not affect any other part of the hierarchy.
Semi-automatic
The user specifies the node into which to insert the new document. He/she can now request (automatic) re-organization of bookmarks. Consequently, the algorithm climbs up the tree to find the closest frozen ancestor and applies the clustering algorithm on the sub-cluster. Note that this technique is applied at its best when the new bookmark is added to a frozen cluster.

3.2 Deleting and Moving Bookmarks

It may seem that deletion of documents should not impose further actions from BO. However, since clustering is based on inter-document relationships, the absence of several documents might invalidate the raison d'etre of a given cluster. In other terms, since a cluster represents a set of documents that share some commonality, it might happen that after deletion not enough documents remain present to represent any shared abstraction. One could argue that the node could remain even if half empty. This would be the case only if it is frozen by the user. If not, reclustering should be performed to reflect other abstractions since the context has changed. Suppose, for example that document A was initially clustered with document B as the most similar document; document C, which was slightly less related to A but on another topic, was left aside because the clustering technique does not support overlapping. If B is deleted, it means that the user lost interest in it, and its the key topic. But since A was not deleted, it should probably now be clustered with C. Thus to enforce consistency, whenever a document is deleted, reclustering is applied on the containing sub-cluster.

Finally, movement of documents across sub-clusters is handled as an addition in the target sub-cluster and as a deletion in the origin sub-cluster.

4. BO Design

In order to provide generic and browser-independent solution (as well as for pragmatic reasons of not having access to a browser's source code) BO was designed to operate as a separate external utility. The only requirements imposed by the BO algorithms on browsers are to support hierarchical structure and provide means to annotate nodes. Thus, users use their bookmark repository freely from their favorite browser, and when desired, they invoke BO, which in turn:

Parses the bookmark hierarchy and detects the changes which were made to the repository since its last invocation.
Applies the semi-automatic method on documents which were added or deleted inside existing frozen nodes, i.e., it finds the closest ancestor frozen node and reclusters its subtree.
Applies the fully-automatic method on documents which were added at the top-level, outside any subtree. That is, it first finds the most proper frozen subtree, and then reclusters that subtree.

We now turn to discuss each component separately.

4.1 Change Analysis

BO is typically invoked after the user has made several changes to the bookmark repository. As an external entity, it must be able to detect changes that were made to the hierarchy. One straightforward but tedious and expensive solution to detecting changes in the hierarchy is to compare the current tree with an older saved tree, generate the "delta" between the two trees, and recluster frozen sub-clusters. However, given that nodes can store state persistently, we can avoid tree comparison in the following manner:

Each time BO executes on a repository it stores in each internal node a computed checksum that characterizes its children (note that simply keeping the number of children is not sufficient because of possible addition followed by deletion). In addition, each leaf is annotated as "old". Addition of new documents or folders is easily detected since they don't have "old" or checksum notations. To identify movements/deletions, the checksum at each node is recomputed and compared with the previous value to determine whether the contents of its subtree have changed. If they did (whether added or deleted) and the parent node is frozen or a descendant of a frozen node, re-clustering takes place. Note that "addition" may actually be a movement of a pre-existing document from another folder, which explains why a binary flag would not suffice either.

4.2 Automatic Clustering

The automatic organization of bookmarks per contents is done by applying sophisticated document clustering techniques. Clustering of documents has been primarily used in information retrieval (IR) for improving the effectiveness and efficiency of the retrieval process [Rasmussen 92] or more recently, for providing a new access paradigm [Cutting et al. 92]. We are more concerned here with the primary use of cluster analysis which is the visualization of underlying structures for explanation purposes.

Since the bookmark hierarchy is going to be shown to the user not only used internally by, say, a retrieval engine, it has to be more meaningful and more precise than what is usually required by document clustering techniques in IR.

We have described in [Maarek et al. 94] how one could take advantage of sophisticated indexing techniques based on multiple word indexing (namely lexical affinities as defined in [Maarek et al. 91]), not only to generate precise cluster hierarchies but also to associate a great deal of conceptual information with each cluster so as to facilitate the understanding and naming of clusters. We briefly explain below what cluster analysis consists of and the general principle of the algorithms we use. The reader is referred to [Maarek et al. 94] for more details.

Hierarchical Clustering

Cluster analysis offers a wide range of techniques for identifying underlying structures in large sets of objects and revealing links between objects or classes of objects. There is no strict definition of cluster, but it is generally agreed that a cluster is a group of objects whose members are more similar to each other than to the members of any other group. Typically, the goal of cluster analysis is to determine a set of clusters, or a clustering, such that inter-cluster similarity is low and intra-cluster similarity is high. The clustering technique the most used in IR applications is the Hierarchical Agglomerative Clustering (HAC) method. The reader should consult [Everitt 80] for an extensive survey of HAC methods and [Willett 88], [Rasmussen 92] for their application to documents.

The HAC method can be described as follows. Given a set of objects (or documents in IR applications):

Start with a set of singleton clusters, each contains one object
Repeat the following steps iteratively until there is only one cluster.
- Identify the two clusters that are the most similar.
- Merge them together into a single cluster.

An example of the cluster hierarchy or dendogram to use the correct terminology thus generated is shown in Figure 1.

Figure 1

The method described above requires a measure of similarity (which is in most cases a dissimilarity index) over the set of clusters. Typically in the context of document clustering, each document is indexed - i.e., a characterizing profile usually defined as a vector of words and associated numerical score [Salton et al. 83] is generated. The measure of similarity between 2 documents is a direct function of how many indices (which are single words in most indexing schemes) they share.

One major drawback with clustering documents is that wrong clusters might be induced by noisy or ambiguous indices. To reduce the influence of such indices, we use here a highly precise indexing scheme in which indexing units are not single words but pairs of words that are linked by a lexical affinity (LA). An LA between two units of language stands for a correlation of their common appearance [Saussure 49]. It has been described elsewhere how LAs can be extracted from text at a low cost via a sliding window technique and how an LA-based scheme allows to significantly increase precision in domain specific collections such as software documentation collections [Maarek et al. 91], [Maarek 91].

The LAs are represented as pairs of lexically ordered words. Note that a morphological analyzer is used rather than a stemmer to obtain more precise and meaningful indices. The weight associated to each index is a simple cosine normalized frequency. We have demonstrated that using LA-based rather than single word profiles significantly improve the precision of the generated clustering hierarchy. The example below compare a LA-based profile and a single-word based profile for the same document (entitled: " IBM Online library: Softcopy Collection") to show how LAs are less ambiguous and more informative:

normalized frequency /single word profile

0.399 book
0.394 softcopy
0.343 ibm
0.218 library
0.161 cd
0.154 rom
0.147 bookmanager
0.142 pc
0.129 bookshelf
0.129 collection
0.127 program
0.127 kit
0.115 information
0.100 system
0.093 host
0.090 product
0.080 picture

LA-based profile

0.346 book softcopy
0.336 cd rom
0.263 collection kit
0.217 ibm softcopy
0.206 ibm library
0.201 book library
0.191 ibm program
0.181 library softcopy
0.170 bookmanager ibm
0.165 book bookshelf
0.139 kit softcopy
0.139 host system
0.134 collection softcopy
0.108 library program
0.103 library reader
0.103 book print
0.098 cd softcopy

Controlling the size of the hierarchy

Hierarchies generated by the classical HAC are often inconvenient to visualize as they are binary. In many applications of cluster analysis such as classification of animal species, chemical structures etc., experts have to manually select the most adequate level clusterings in the dendogram to ease their discovery task. This is a rather tedious job, we therefore propose to automate it by slicing the hierarchy at meaningful levels. In the previously cited paper [Maarek et al. 94] that introduced this LA-based document clustering, we proposed a slicing method based on the ideal branching factor of nodes so as to control the clusters size. After usability studies, we discovered that in the context of bookmarks organization, users were not comfortable with the notion of controlling clusters size (this was rather a library administrator's concern) as it has no semantic meaning. They rather were interested in getting clusters that have a comparable degree of conceptual similarity. Therefore we introduce here a slicing technique - which we find more adapted to bookmarks organization - that automatically identifies the level clusterings within the tree which have a comparable degree of intra-cluster similarity. This slicing technique is described below.

One can associate with every level in the dendogram a measure of similarity between 0 and 1 that corresponds to the value for which the last cluster merging was performed. For instance, in Figure 1, documents d1 and d2 are merged at level 0.97, and clusters {d1,d2} and {d3} at level 0.81. Thus for every level, we know what the minimum degree of intra-cluster similarity is and it can be expressed as a percentage (0%= not similar whatsoever, 100%= identical). Thus, from the dendogram shown in Figure 1, one can infer that documents d1 and d2 are 97% similar and documents d4 and d5 are 98% similar. It is clear that the user does not need to distinguish between levels of similarity that are so close. S/he would rather know what clusters correspond to a degree of similarity of approximately 90%, or 80%, etc.

One very simple way to select representative levels is therefore to slice the tree at regular intervals of 10% of similarity and collapse into one single level all levels between two slices. The slicing method is illustrated in Figure 2 below.

Figure 2

The 10% value is clearly arbitrary, but it presents the advantage of being intuitive to users. The user thus gets at most 10 levels in the tree. Note that if the tree is still too large, some constraints can be added to slice the tree further more. For instance, one can impose a minimum of intra-cluster similarity of 30% for instance if only precise information is required.

A bookmark collection is thus structured automatically in five steps.

For each URL in the collection, access the source of the corresponding HTML document
Index each document and generate an LA-based profile
Compute the pairwise similarity of all the documents
Cluster by applying the HAC method in order to produce a binary dendogram
Collapse the dendogram into a non-binary cluster hierarchy by applying the slicing technique

Enriching the cluster hierarchy with conceptual information

One major problem of cluster hierarchies is that they are difficult to interpret for non-experts. The problem is twofold. Users have difficulties identifying clusters in the classical dendogram layout (e.g., they have to understand that an horizontal line corresponds to a cluster) as shown in Figure 1. Once this difficulty is overcome, they have trouble understanding what the cluster is about. Indeed, with numerical clustering algorithms, clusters are defined in extension, i.e., by enumeration of their members, rather than in intention i.e., by membership rules. There is for instance no direct way to name a cluster. This is crucial when dealing with bookmark hierarchies since every node correspond with a folder that must receive a name to enable further easy browsing.

The problem of displaying clusters of documents to the user has already been addressed in the past. It has been often proposed to represent the cluster information defined by the dendogram in a different manner. Typical examples are two-dimensional displays, where clusters are represented as "bubbles" [TEWAT], [Botafogo 93]. We believe however that the tree representation should not be dropped. First, it is the historical mean of displaying cluster information since the beginning of taxonomy techniques in the 18th century, but mostly most computer literates are familiar and comfortable with hierarchical structures. To cite just a simple example: most users know how to browse hierarchical directory structures. Yet cluster hierarchies can be made easier to visualize making them dynamic graphical objects which layout can be interactively manipulated. Note that this view seems to be shared by some as the current bookmark windows in Netscape Navigator 2.0 precisely used this hierarchical folder, or directory like, structure (See Figure 4). In addition, such a hierarchy can be made easier to interpret by enriching them with conceptual information (in quantitative or qualitative form) as explained below.

Qualitative information
To help users understand what a cluster (or bookmark folder) is about, we dynamically generate a list of characterizing key concepts for each cluster by computing the centroid of all the vector profiles associated to the documents within the cluster. Using the center of gravity instead of performing a simple intersection operation allows us to keep all the relevant concept information. The cluster indices are then sorted according to their weights, and those with the higher scores are selected as key concepts. We have explained before that using lexical affinities instead of single words to represent concepts is much less ambiguous. However, it might happen that understanding the underlying concept in lexical affinities requires some prior tutoring. Typical examples are triple-word concept. For instance "NCSA Mosaic documentation" would be represented by 2 or 3 binary lexical affinities .e.g., (Mosaic,NCSA) (documentation,Mosaic) (documentation,NCSA) (note that component words are listed in lexicographic order). Thus, if the user sees the 3 following lexical affinities with similar score: (application,development), (development, kit), (application kit), s/he must able to infer that they actually stand for the single concept "application development kit".
We thus compute for every bookmark cluster (including the singular case of a singleton cluster formed by a unique document) a list of key concepts represented by LAs and that form a inherent part of the description of the bookmark folder. In addition, we select the best LAs from this list of key concepts, and use them as a pseudo-title for the folder that the user can leave as such or replace by a more stylistic phrase. It a future stage, we intend to use this information to reanalyze the document taking the extracted LAs into account so as to generate a syntactically correct title (via longer collocations for instance).
Quantitative information
Another way to enrich the cluster hierarchy is to keep the measure of similarity for which each cluster was formed. We have explained above that the intra-cluster similarity notion is intuitive to users. That is the reason for which we use it as a criterion to select representative levels from a binary dendogram. Thus, a cluster with an intra-cluster similarity of 99% is at the same level in the tree as another cluster whose similarity is 95% or yet another one whose similarity is 92%. By collapsing all these former levels into a single one some information is lost that might be useful in some cases. For instance a 99% degree of similarity between documents reveals that the documents are probably perfect duplicates, whereas a degree of similarity of 92% shows that they differ by few sentences.
Thus, we keep this intra-cluster similarity rate, which indicate the quality of the bookmark folder and associate it as one of its properties that the user can consult upon demand.

4.3 Search for Relevant Sub-tree

We have explained in Section 3 that when adding a new bookmark in a fully automated way, BO must be able to detect in which frozen node the new bookmark should be added. This is done in a very simple way using a precise full-text information retrieval engine, Guru [Maarek et al. 91] that supports free-text (aka natural-language) queries

As explained above, we can compute for every cluster/folder a profile (represented as a vector of LAs) by analyzing its members. Whenever, a node is frozen we do not need to index the cluster - as it has already been done - but only to store them in an inverted index format, as done in standard information retrieval methods [Salton et al. 83]. Then, searching for the most relevant node consists of feeding to the retrieval engine the entire contents of the bookmark as a free-text query and getting as result a ranked list of frozen nodes. At this stage, the rest of the mechanism can be again fully or semi automated.

In a fully automatic approach, the best candidate is selected, the bookmark is added to it and reclustering is performed using the technique described in Section 4.2.
In a semi-automatic approach, a short list of the best candidates is presented to the user who can manually select the one(s) in which s/he wants the new bookmarks to be added. Then again the reclustering is performed on these nodes. Note that the user can select more than one node and then obtain overlapping clusters.

5. A typical scenario

In this example, a user who surfed mainly for sites related to his hobbies (Tennis and Cars) and having a very short bookmarks list, decided suddenly to learn everything about Java and using Lycos issued the query "Java ". He scanned the results and from time to time added a bookmark, without deeply reading any of the documents. By doing so, he realizes that his flat bookmark list becomes suddenly pretty long and needs to be organized.

Therefore he decides to use the Bookmark Window Management in Netscape Navigator 2.0 for the first time. He first looks at the window and sees only a flat list. See Figure 3 below.

Figure 3: A Flat Bookmark List

So he decided to at least distinguish between his previous and new interests and created three high level folders that he called: My Hobbies, Java and he simply dragged and dropped (or cut and paste) all his bookmarks into the relevant categories (See Figure4)

Figure 4: A Bookmark List after simple high-level classification

Being exhausted at having done that, he stopped and decided to use BO on his bookmark file. So, he went into the properties of his two folders and entered the keyword FROZEN in the description field, to notify BO not to touch this high level classification, but only to organize what is below it. Then, as an external application, he run BO giving it as input his private bookmarks.html file and generate a new file that he called new_bookmarks.html. Turning again towards his Navigator Bookmarks Window, he imported the file new_bookmarks.html (and thus generated a high-level top folder) to check BO's results as shown in Figure 5 below.

Figure 5: After BO's organization

BO has identified 3 first level categories and attributed automatically a pseudo_title for each of them:

11% (interprets,java) (java,pages) (compile,java)
29% (javaio,package) (guide,package) (javautil,package)
Outliers

For the first two clusters, the user can supposedly easily infer that they respectively deal with "Java compilation/interpretation" and "Java packages". The third cluster, is a fixed title generated by BO to indicate that the bookmarks stored under this folder are not sufficiently similar (less than 10%) to any other bookmark to be clustered. By scanning them, the user will verify that indeed these two outliers consist of the lyrics of a Indian movie song and a plant.

In addition, the user can have more conceptual information about a cluster or document, via a list of key concepts represented as LAs. This information is stored as a property of the URL under the description field as shown in Figure 6

Figure 6: Key concepts of a bookmark after BO's processing

From this point, the user can decide to freeze some sub categories that seem really important and get rid of his previous unorganized falt-list, still using the same interface.

Note that for the sake of the simplicity and in order not to generate too large figures, we reduced the number of bookmarks handled in this example. In practice however, BO can easily handle several hundreds of bookmarks.

6. Implementation Notes

BO has been implemented in C under Unix (more precisely IBM's AIX operating system) using the bookmark repository generated by Netscape Navigator 2.0 for AIX. All manual customization of the bookmarks is done via the user-friendly Bookmarks window interface. State and "frozen" information is kept in the "description" property. If integrated more tightly with a specific browser, a special icon representing "frozen" node and more explicit representation of the state in the bookmark viewer could be easily added.

We used a standard "XOR and bit rotating" checksum function that computes the checksum value over a stream consisting of the concatenation of the state information on all children. Recall that this function is used only to indicate whether or not changes were made in a node's subtree, and not to characterize the changes (if any). Once a change is detected, reclustering is applied as explained above.

Evaluation of the quality of the clusters generated have already been presented in [Maarek et al. 94]. For this bookmark application, we intend to conduct further usability studies in the future.

7. Conclusion

Hierarchical organization of personalized URL repository has the potential to be extremely useful and allow users to browse through their repository to find the information they need. However, without being able to properly classify the URLs into a hierarchical structure, such mechanism becomes practically useless. For various reasons, manual classification of URLs is hard, but fully-automatic (re-)classification is simply unacceptable because most users demand control over their organization.

Our novel approach, embodied in the BO tool, is to provide automatic classification mechanism that uses clustering analysis to organize documents based on their conceptual similarity, but enable the user the freedom to select when and on what part of the hierarchy (which itself has been previously classified either manually or automatically) to apply it, thereby allowing full co-existence and compatibility of manual and automatic classification.

Acknowledgement

Thanks to Ofer Biran for volunteering his bookmarks files for our typical scenario, and to Amir Herzberg for complaining about strangers reordering his desk.

References

[Apte et al. 94]: C. Apte, F. Damereau, and S. Weiss, "Towards Independent Automated Learning of Text Categorization Models". Proceedings of SIGIR'94, 1994.
[Botafogo 93]: R.A. Botafogo, "Cluster Analysis for Hypertext Systems", in the Proceedings of ACM SIGIR'93, 1993.
[Cutting et al. 92]: D.R. Cutting and D.R. Karger and J.O. Pedersen and J.W. Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", in the Proceedings of SIGIR'92, 1992.
[Everitt 80]: B. Everitt, "Cluster Analysis", Halsted Press (John Wiley & Sons), New York, 1980.
[Maarek 90]: Y. S. Maarek, "An Incremental Conceptual Clustering Algorithm with Input-Ordering Bias Correction", in "Advances in Artificial Intelligence, Natural Language and Knowledge Base Systems", MC. Golumbic ed., Springer-Verlag, 1990.
[Maarek et al. 91]): Y.S. Maarek, D.M. Berry and G.E. Kaiser, "An Information Retrieval Approach for Automatically Constructing Software Libraries", in Transactions on Software Engineering, 17:8, August 1991.
[Maarek 91].: Y.S. Maarek, "Software library construction from an IR perspective", in SIGIR Forum 25:2, Fall 1991.
[Maarek et al. 94]: Y.S. Maarek and A.J. Wecker, "The Librarian's Assistant: Automatically Assembling Books into Dynamic Bookshelves", in the Proceedings of RIAO'94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, NY, 1994.
[Rasmussen 92]: E. Rasmussen, "Information Retrieval, Data Structure and Algorithms", Chapter 16: Clustering Algorithms, W. B. Frakes and R. Baeza-Yates, eds., Prentice Hall, 1992.
[Salton et al. 83]: G. Salton and M.J. McGill, "Introduction to Modern Information Retrieval", McGraw-Hill, Computer Series, New York 1983.
[Saussure 49]: F. de Saussure, "Cours de Linguistique Generale", Librairie Payot, Paris, France, 1949. (in French).
[TEWAT]: "TEWAT, Operating and Installing Guide", Technical Report SH11-0954-00, IBM France.
[Willett 88]: P. Willett, "Recent trends in hierarchic document clustering: a critical review", in Information Processing & Management, 24:5, 1988.

About the authors

Yoëlle Maarek is a Research Staff Member at the IBM Research Lab in Haifa, Israel and a project leader in the fields of information retrieval and Internet applications. She received a Doctor of Science degree from the Technion, Haifa, Israel, in 1989. She graduated from the "Ecole Nationale des Ponts et Chaussées", in France, and simultaneously received her D.E.A. in Computer Science from Paris VI University in 1985. Before joining IBM Israel, Dr. Maarek was with IBM T.J. Watson Research Center in the US for about 5 years. She is the author of Guru, an advanced information retrieval system, widely used within IBM. She has published over 20 papers in the fields of information retrieval and software reuse.

Israel Ben-Shaul is an assistant professor in the Electrical Engineering department at the Technion. He received his BSc in Mathematics and Computer Science from Tel Aviv University in 1988, and his MS and PhD in Computer Science from Columbia University in 1991 and 1995, respectively. As a research staff at Columbia (1990–1991) he also served as the project manager in the Programming Systems Lab. Prior to joining the Technion he was a research fellow at the IBM research laboratory in Haifa. His research interests include distributed and heterogeneous systems, software engineering, workflow management systems, groupware, information retrieval, and federated databases.