Conceptually Assisted Web Browsing

Jacek R. Ambroziak
Knowledge Technology Group
Sun Microsystems Laboratories

Abstract

This paper presents a mixed-initiative Browse Guide that assists a person browsing the Web. The Browse Guide operates proactively in real time to construct a dynamic conceptual index of documents visited by the browser and of documents in the immediate neighborhood of those documents. The conceptual index is a hierarchically organized taxonomy of word and phrase concepts found in the indexed material. The Browse Guide provides tools to query and browse the incrementally built conceptual index, which can be seen as a sophisticated "bookmark" structure linking concepts found along the path with the places where they occur. The evolving conceptual index provides two important functions: (1) an automatically assembled conceptual logbook of the user's path through the Web and (2) a facility for conceptual "peripheral vision" that displays concepts in documents one step ahead of the browser while navigating the Web. Early uses of this tool have shown it to be a powerful adjunct to existing Web search engines as well as a way to structure a personal bibliography of explored web pages.

Keywords:

WWW browsing, content indexing, information access, natural language processing, mixed-initiative browsing, conceptual indexing

Introduction

The Browse Guide operates as an adjunct to a browser, such as Netscape Navigator (TM), to provide precise information access while surfing the Web. It builds and displays a conceptual index of documents visited by the browser as well as those linked directly to visited documents. The customized index that it builds, organized in a conceptual taxonomy, serves as a powerful tool to direct a user's exploration and to organize information for future reference.

This paper starts with a brief introduction to conceptual indexing and then describes the Browse Guide and an example of its use. It concludes with a discussion of usage patterns of the tool and a few thoughts for future work.

Conceptual Indexing

Conceptual indexing is a collection of techniques for automatically organizing all of the words and phrases of a body of material into a conceptual taxonomy that explicitly links each concept to its most specific generalizations [Woods1, Woods2]. The taxonomy is a graph structure that orders concepts by generality using ISA links. The taxonomy can be used alone to organize information for browsing, or can be used as an adjunct to search and retrieval techniques to construct better queries.

Conceptual indexing of text involves:

heuristic identification of phrases in the text,
mapping these phrases into internal conceptual structures,
classifying the structures into a taxonomy, and
linking the concept to the location of the phrase in the text.

As concepts are assimilated into the conceptual taxonomy during indexing, a broad-coverage English lexicon is consulted to determine semantic relationships to other concepts, based on recorded knowledge about the meanings of words. If any of the words of an indexed phrase do not yet have conceptual counterparts in the evolving taxonomy, they are also assimilated into the taxonomy using information from the lexicon.

For example, if we encounter a phrase "graphic workstation," we may need to look up "workstation" in the lexicon, learn that it is a kind of "computer," and thus assimilate the relation "workstation" ISA "computer" into the taxonomy. The process may recurse on "computer" to uncover more general relationships, all of which are added to the taxonomy. Thus the phrase "graphic workstation" builds the following taxonomy fragment (neglecting for simplicity any concepts more general than "computer"):


computer
 |-- workstation
      |-- graphic workstation

This example presents a portion of the taxonomy as a tree structure, with more specific concepts indented under their more general parents. Note that the taxonomy does not contain all of the information from the lexicon, but only the information for words and concepts extracted from the indexed text or from other phrases assimilated into the taxonomy.
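The assimilation of "graphic workstation" described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the system's actual algorithm: the `LEXICON` table, the `assimilate` function, and the rule that the last word of a phrase is its head are all assumptions made for the example.

```python
# Toy lexicon: word -> more general word. Entries are illustrative assumptions.
LEXICON = {"workstation": "computer", "laptop": "computer", "server": "computer"}


def assimilate(phrase, taxonomy, lexicon=LEXICON):
    """Add `phrase` and its generalizations to `taxonomy` (child -> set of parents)."""
    words = phrase.split()
    head = words[-1]  # assumption: the last word is the head of the phrase
    if len(words) > 1:
        # Structural evidence: a modified phrase ISA its head.
        taxonomy.setdefault(phrase, set()).add(head)
    # Lexical evidence: walk up the lexicon from the head, recording ISA links.
    concept = head
    while concept in lexicon:
        parent = lexicon[concept]
        taxonomy.setdefault(concept, set()).add(parent)
        concept = parent
    taxonomy.setdefault(concept, set())  # the most general concept has no parents
    return taxonomy


taxonomy = assimilate("graphic workstation", {})
```

Here `taxonomy` ends up holding exactly the two ISA links of the fragment: "graphic workstation" ISA "workstation" and "workstation" ISA "computer". Nothing else from the lexicon is copied in, which mirrors the point that the taxonomy contains only what the indexed text requires.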

After indexing a collection of text, the taxonomy recorded for the concept "computer" might look like:


computer
 |
 |-- new computer
 |    |-- recent toshiba laptop
 |
 |-- toshiba computer
 |    |-- recent toshiba laptop
 |
 |-- workstation
 |    |-- graphic workstation
 |
 |-- server
 |    |-- web server
 |    |    |-- www server
 |    |
 |    |-- sun's new netra-j server
 |
 |-- laptop
      |-- recent toshiba laptop

Note that some of the subsumption relations come from the lexicon:

computer
 |-- workstation
 |-- server
 |-- laptop

and

new
 |-- recent

while others come from structural relationships in phrases:

workstation
 |-- graphic workstation

and still others come from a combination of structural and lexical evidence:

new computer
 |-- recent toshiba laptop
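One way to picture how structural and lexical evidence combine is a simple subsumption test over phrases. The sketch below is a deliberate simplification and should not be read as the classification algorithm of [Woods1]: the `WORD_ISA` table, both function names, and the head-is-last-word rule are assumptions introduced for illustration.

```python
# Toy word-level ISA links (child -> parents); illustrative assumptions only.
WORD_ISA = {
    "recent": {"new"},
    "laptop": {"computer"},
    "workstation": {"computer"},
}


def word_isa(specific, general):
    """True if `specific` equals `general` or reaches it through WORD_ISA links."""
    if specific == general:
        return True
    return any(word_isa(parent, general) for parent in WORD_ISA.get(specific, ()))


def phrase_subsumes(general, specific):
    """True if the head (last word) of `specific` ISA the head of `general`
    and every word of `general` generalizes some word of `specific`."""
    g_words, s_words = general.split(), specific.split()
    if not word_isa(s_words[-1], g_words[-1]):
        return False
    return all(any(word_isa(s, g) for s in s_words) for g in g_words)
```

Under these assumptions, `phrase_subsumes("new computer", "recent toshiba laptop")` is true because "recent" ISA "new" and "laptop" ISA "computer", while `phrase_subsumes("new computer", "graphic workstation")` is false because nothing in "graphic workstation" is a kind of "new".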

Glancing at the tree of the "computer" taxonomy, a user will get a good idea of what is to be found in the underlying documents on the topic of computers. This is similar to walking up to a shelf in a library and, in addition to finding the book we came for (or not finding it), finding other books on similar topics that may attract our attention. Perhaps one of the other books will better meet our needs. Whereas the organization of a library translates topical proximity into physical proximity, the conceptual taxonomy display automatically provides similar groupings, with more conceptual organization and without the one-dimensional constraints of a physical bookshelf.

The taxonomy also aids in formulating queries. In querying the index, terms are treated as concepts and are expanded by their more specific children in the taxonomy. So, for instance, a query for a "fast computer" will also match "fast graphic workstation" because "graphic workstation" is a more specific form of "computer." Moreover, a search for "new Japanese computers" would also find mentions of "recent toshiba laptop." Although all the words in "new Japanese computers" are different from the words of "recent toshiba laptop," a user who formulated the query at a general level will readily recognize "recent toshiba laptop" as a perfectly valid and even obvious hit.

This last example illustrates how conceptual indexing can provide a partial solution to the "paraphrase problem," which occurs when a query uses quite different terminology than the documents being searched. The conceptual taxonomy is also useful in other ways to help refine and sort hits that result from queries [Woods2].
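Query expansion by more specific concepts can be sketched as a walk down the taxonomy followed by a lookup of each concept's recorded locations. The data below is a hypothetical fragment (the document names `doc1.html` and `doc2.html` are invented), and `expand` and `search` are assumed names, not part of the system described here.

```python
# Hypothetical taxonomy fragment (parent -> children) and concept-to-location index.
CHILDREN = {
    "computer": ["new computer", "workstation", "laptop"],
    "new computer": ["recent toshiba laptop"],
    "workstation": ["graphic workstation"],
    "laptop": ["recent toshiba laptop"],
}
LOCATIONS = {
    "graphic workstation": ["doc1.html"],
    "recent toshiba laptop": ["doc2.html"],
}


def expand(concept):
    """Return `concept` plus all more specific concepts, each listed once."""
    seen, stack = [], [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(CHILDREN.get(current, []))
    return seen


def search(concept):
    """Return sorted locations of `concept` or of any concept it subsumes."""
    hits = set()
    for candidate in expand(concept):
        hits.update(LOCATIONS.get(candidate, []))
    return sorted(hits)
```

With this data, `search("new computer")` returns `["doc2.html"]` even though the document never contains the words "new" or "computer", which is the paraphrase effect described above. Because a concept such as "recent toshiba laptop" can have several parents, `expand` records what it has already visited so that each concept is reported once.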

Browse Guide: Dynamic Conceptual Indexing

The Browse Guide is an application of conceptual indexing to the collection of documents that unfolds as a user browses the Web. When a user visits a Web page, the text of the page is conceptually indexed on the fly. Then the system moves on to proactively index text pages immediately accessible from the current page that have not yet been indexed. Other systems have explored aids to search [Keller, Lieberman], but the Browse Guide combines both semantic structures and dynamic updates in a new way. In particular, Lieberman's Letizia deals more with modeling the user's behavior, whereas the Browse Guide is concerned with building structured conceptual indexes from encountered web pages.

The taxonomy is displayed as one or more Active Views which are updated dynamically as the index expands. An Active View may be either:

a Concept Browser, which displays a taxonomy fragment for a specific concept like "computer," or
an Active Query, a stored query that displays a list of hits for a given query phrase such as "new Japanese computer."

Whenever a new text document is indexed, all Active Views are notified about the changes to the taxonomy and update themselves accordingly.
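The notification scheme reads like a classic observer arrangement, which might be sketched as follows. The class and method names (`Index`, `ActiveQuery`, `register`, `refresh`) and the example URLs are assumptions for illustration; the prototype's Active Views are separate processes, whereas this sketch keeps everything in one process for brevity.

```python
class Index:
    """Holds concept -> locations and notifies registered views on change."""

    def __init__(self):
        self.locations = {}
        self.views = []

    def register(self, view):
        self.views.append(view)

    def add_document(self, url, concepts):
        for concept in concepts:
            self.locations.setdefault(concept, []).append(url)
        # Every indexed document triggers a refresh of every Active View.
        for view in self.views:
            view.refresh(self)


class ActiveQuery:
    """A stored query whose hit list is recomputed after each index change."""

    def __init__(self, concept):
        self.concept = concept
        self.hits = []

    def refresh(self, index):
        self.hits = list(index.locations.get(self.concept, []))


index = Index()
view = ActiveQuery("cryptography")
index.register(view)
index.add_document("http://example.com/a", ["cryptography", "key"])
index.add_document("http://example.com/b", ["server"])
```

After both documents are indexed, `view.hits` contains only the first URL: the view was refreshed twice, but only the first document mentions its concept.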

Active Views have controls that let them create other Active Views. For instance, an Active Query may be created to display continuously a list of hits for a concept extracted from a Concept Browser. Active Views can also direct the browser to display a particular document.

Concept Browsers have controls that tailor the information they present. The user may change the "root" concept that the browser displays. The user may also request that certain concepts and their descendants not be displayed. For instance, the concept "mainframe" may be pruned from the "computer" taxonomy. The display will continue to show information about workstations, servers, and laptops, but will no longer show mainframes. These tools allow a user to select just those parts of a taxonomy to track during a browsing session.
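Pruning a concept together with its descendants amounts to skipping a subtree while the display is generated. The following is a minimal sketch under that reading; the `render` function, the four-space indentation, and the "ibm mainframe" child are hypothetical and chosen only to show that descendants disappear along with the pruned concept.

```python
def render(root, children, pruned=frozenset(), depth=0):
    """Return indented display lines for `root`, skipping pruned subtrees."""
    if root in pruned:
        return []
    lines = ["    " * depth + root]
    for child in children.get(root, []):
        lines.extend(render(child, children, pruned, depth + 1))
    return lines


# Hypothetical taxonomy fragment (parent -> children).
children = {
    "computer": ["mainframe", "workstation", "laptop"],
    "mainframe": ["ibm mainframe"],
    "workstation": ["graphic workstation"],
}
lines = render("computer", children, pruned={"mainframe"})
```

Changing the root concept corresponds to calling `render` with a different first argument, for example `render("workstation", children)`.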

Active Queries have similar pruning controls. In a session with an Active Query, a user may "delete" a query hit in order to make room on the display for other high-scoring hits.

In an example Active Query for "active information", the window shows a list of top-rated hits. Each line contains the concepts on which the query matched and the URL of the hit. The user may select a URL and press the "go to URL" button to make the Navigator display the page with the hit. The page might have already been visited by the user or only proactively scanned by the Guide.

Patterns of Use

Searching the Web. The Browse Guide is a superb adjunct to a search service such as AltaVista (TM). AltaVista provides broad Web coverage, which the Guide complements with precision access and conceptual organization. The Guide's proactive indexing of the immediate neighborhood of a focus page becomes especially valuable when that page is the response to an AltaVista query. The pages one hyperlink away from the query results list are not normally visible to us until we choose to explore them, but these are the pages that potentially contain the information we desire.

As an illustration, a user interested in the topic of "cryptography" accessed AltaVista using Netscape Navigator and the Browse Guide. The taxonomy displayed by the Concept Browser:

In response to the "cryptography" query, AltaVista returned a response page with 10 hits, each a hyperlink to a relevant document, and a link to more hit pages. Within a few seconds of receiving this results page, the active view for the "cryptography" taxonomy was automatically updated to include the new information from the hit pages themselves. Any concept in this taxonomy that attracts the user's attention can be used by the Guide to make Netscape jump to the page or pages that contain that concept.

Conceptual bookmarks. The Browse Guide is a convenient way to organize information for a specific topic of interest to either a person or a group project. The Guide can be saved for later reference or augmentation by additional Web searching. Or a person interested in cryptography might obtain a copy of a cryptography guide collected and edited by an expert. Guides are thus a form of structured bookmarks for organizing reference information. One could imagine augmenting a Web site with a conceptual index of that site, allowing visitors a different way to find information embedded within the site's web.

The Browse Guide is an example of a mixed-initiative tool: the user provides important information, but the tool is operating somewhat autonomously as well. It has the feeling of an assistant who is keeping a constant index of your travel through the Web. And because you are steering the search, the taxonomy is rich in concepts you care about; the amount of "noise" in the index seems small and unintrusive.

Implementation

The Browse Guide prototype implementation is an assembly of components that communicate using simple network connections:

BGProxy, a proxy HTTP server that serves all requests from the Web browser being used. When the browser requests a document, the proxy fetches it from the network and forwards it to the browser. The proxy also scans the document for hyperlinks and adds each link to a queue of documents to be fetched. In a separate parallel thread, the proxy fetches the documents listed in the queue and forwards their contents to the Indexer.
Indexer, a program that automatically builds a persistent conceptual index of concepts found in text files.
Active Views, processes that display and update Concept Browser or Active Query windows. These processes interrogate the taxonomy built by the Indexer when told that changes have occurred. They also communicate with a tiny browser "plug-in" that is used to direct the browser to a new page when a user clicks a button in one of the Active View windows.
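The proxy's link scanning and background prefetching might look roughly like the sketch below. It uses Python's standard `html.parser`, `queue`, and `threading` modules; the `LinkExtractor` class, the `prefetch_worker` function, and the `PAGES` dictionary standing in for the network are assumptions made so that the example runs offline. It does not reproduce BGProxy itself.

```python
import queue
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def prefetch_worker(pending, fetch, index_text, seen):
    """Fetch queued URLs and hand their text to the indexer until None arrives."""
    while True:
        url = pending.get()
        if url is None:
            break
        if url not in seen:  # skip pages that were already indexed
            seen.add(url)
            index_text(url, fetch(url))


# Stand-ins for the network and the Indexer, so the sketch runs offline.
PAGES = {"http://example.com/a.html": "alpha", "http://example.com/b.html": "beta"}
indexed = {}
pending = queue.Queue()
worker = threading.Thread(
    target=prefetch_worker,
    args=(pending, PAGES.get, indexed.__setitem__, set()),
)
worker.start()
html = '<a href="a.html">A</a> <a href="/b.html">B</a> <a href="a.html">A again</a>'
for link in extract_links("http://example.com/index.html", html):
    pending.put(link)
pending.put(None)  # sentinel: tell the worker to stop
worker.join()
```

The queue decouples the two activities: the browser's request is answered as soon as the page arrives, while the worker thread drains the queue at its own pace. The `seen` set keeps a page that is linked twice from being indexed twice.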

The Indexer is a set of modules that form a pipeline of text processing stages:

a character source that obtains HTML files from the network,
a markup processor configured for parsing HTML,
a text tokenizer that extracts "words,"
a morphology processor that analyzes unknown words,
a phrase extractor that selects and analyzes phrases to index,
a concept assimilator that assimilates concepts into the taxonomy,
and a retriever that can locate occurrences of concepts.
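The first stages of such a pipeline can be approximated by composing small functions, as in the sketch below. The stop-word heuristic in `extract_phrases` is a crude stand-in for the phrase extractor, the morphology, assimilation, and retrieval stages are omitted, and all function names are assumptions for illustration.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Markup stage: keep character data, drop tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Tokenizer stage: lowercase alphabetic words (apostrophes and hyphens kept)."""
    return re.findall(r"[a-z][a-z'-]*", text.lower())


STOPWORDS = {"a", "an", "the", "of", "is", "and"}  # illustrative stop list


def extract_phrases(tokens):
    """Phrase stage (crude stand-in): maximal runs of non-stopword tokens."""
    run = []
    for token in tokens:
        if token in STOPWORDS:
            if run:
                yield " ".join(run)
            run = []
        else:
            run.append(token)
    if run:
        yield " ".join(run)


def index_html(html):
    """Run the markup, tokenizer, and phrase stages in sequence."""
    return list(extract_phrases(tokenize(strip_markup(html))))
```

For example, `index_html("<p>The <b>graphic workstation</b> is a new computer</p>")` yields `["graphic workstation", "new computer"]`. Each stage consumes only the output of the previous one, so a stage can be replaced without touching the others.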

Some of the stages, especially the morphology processor and assimilator, access the system's modular lexicon. To support incremental indexing, the indexer is configured as a multithreaded server that can serve multiple clients. Clients open network connections to the indexer and make requests, such as:

index a particular text fragment
show results of a query
show a part of the current taxonomy

Future Directions

The Browse Guide is an experimental prototype; much remains to be done. The user interface, and the smooth interaction between the Browse Guide and the browser, are likely to be critical. We plan to investigate ways to present the taxonomy more clearly, to provide better control over what is displayed, and to use graphical cues to highlight useful information, such as those parts of the taxonomy that have changed since the last hyperlink was traversed by the browser. The ultimate design of the user interface will depend on user tests and feedback.

The technology underlying the Browse Guide can also be refined. The current indexing software was designed for batch indexing, and requires some reworking to do efficient incremental work.

Finally, there may be ways to direct the initiatives of the Guide. It performs "research" on behalf of the user by proactively visiting text pages accessible in one step from the current page. We hope to investigate different strategies for exploring the hyper-neighborhood. For example, it may be possible to recognize patterns of links in pages and use different strategies in different cases: search results, document table of contents, home pages, etc. Other systems have ideas to offer in this regard [IBM, Lieberman].

Much of the appeal of the Browse Guide appears to stem from its mixed-initiative character: the user is charting the direction of exploration, but the Guide is providing a concise map of detail that a human can obtain only by lots of tedious "clicking around," scrolling, and skimming.

Acknowledgments

The Browse Guide is a specific demonstration of conceptual indexing technology created by the Knowledge Technology Group at Sun Microsystems Laboratories, led by William A. Woods. Contributors include Gary Adams, Bob Kuhns, Patrick Martin, Phil Resnik, and the author. The author wishes to express special thanks to Bob Sproull and Bill Woods for critical reviews of this paper and a number of important ideas, to Ted Goldstein for inspiration, and to Stuart Adams, Nicole Yankelovich, Derek White and Gary Adams for many helpful suggestions.

References

[IBM] "Web Browser Intelligence," http://www.raleigh.ibm.com/wbi/wbisoft.htm

[Keller] Arthur M. Keller, "CommerceNet Smart Catalogs," http://cit.stanford.edu/cit/commercenet.html

[Lieberman] Henry Lieberman, "Letizia: An Agent That Assists Web Browsing," International Joint Conference on Artificial Intelligence, Montreal, August 1995. http://lcs.www.media.mit.edu/people/lieber/Lieberary/Letizia/Letizia.html

[Woods1] William A. Woods, "Understanding Subsumption and Taxonomy: A Framework for Progress," in John Sowa (ed.), Principles of Semantic Networks: Explorations in the Representation of Knowledge, San Mateo: Morgan Kaufmann, 1991.

[Woods2] William A. Woods, "Conceptual Indexing: a better way to organize knowledge," forthcoming technical report, Sun Microsystems Laboratories. See also http://www.sunlabs.com/research/knowledge
