FindUR: A Hybrid Approach to Search


Beginning with a known platform of well-used search functionality, Verity Search97 in this case, we considered how we might improve search by exploiting knowledge of user patterns, known forms of data, domain knowledge, and information sources. First, we ensured that the indexing tools could be made to understand what is contained in virtual pages. There are two approaches to consider. The first approach uses or augments standard indexing software to index dynamically generated pages. Many spider programs provide basic functionality to support this. For example, Verity's spider supports indexing CGI-generated web pages. The second approach obtains the content that would be on the virtual pages, parses it into typed fields, and then uses the output as input to an indexing tool. This approach typically allows more flexibility and lets more meta-information about the content be exploited. We have used each approach in different applications.
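
As a rough illustration of the second approach, the sketch below parses one raw record into typed fields and flattens it into field/value pairs that an indexing tool could consume. The pipe-delimited source format, the field names, the sample record, and the helper names `parse_event` and `to_index_document` are all assumptions made for this example; they are not Verity's input format.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class EventRecord:
    """One entry that would otherwise appear only on a generated page."""
    title: str
    organizer: str
    category: str
    day: date


def parse_event(raw: str) -> EventRecord:
    """Parse a pipe-delimited line (an assumed source format) into typed fields."""
    title, organizer, category, day = [part.strip() for part in raw.split("|")]
    return EventRecord(title, organizer, category, date.fromisoformat(day))


def to_index_document(event: EventRecord) -> dict:
    """Flatten the typed record into field/value pairs for an indexing tool."""
    doc = asdict(event)
    doc["day"] = event.day.isoformat()  # store the date as a plain string
    return doc


# Made-up sample record, for illustration only.
raw_line = "Science Fair | Example Elementary | Education | 1997-05-12"
doc = to_index_document(parse_event(raw_line))
```

Keeping the category and date as separate fields is what would later allow a search to be restricted by field rather than by free text alone.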

Given an index of content that would appear on static and virtual pages, the next task is to simulate understanding of query terms. This means we should return fewer irrelevant documents and more of the relevant documents that might otherwise be missed. For both problems (as well as for supporting browsing and query refinement), we rely on a domain ontology applied to (potentially previously categorized) content.

To explore this approach further, we first consider application areas where much is known about likely content. We developed search facilities augmented with knowledge representation for four such content areas: a web site containing information about a computer science research organization, an event-based calendar application for a small city, a few community web sites focusing primarily on directory applications, and a health information web site. One common trait among all of these sites is that, for the main content areas, a fair amount of background knowledge is available. For example, in the computer science research web site, we might know that artificial intelligence is a subclass of computer science and a superclass of knowledge representation. Further, we may know that description logic is a subclass of knowledge representation. A typical publication title in description logics will contain the phrase "description logic" but not the phrases "knowledge representation" or "artificial intelligence". Thus, a query for publications on artificial intelligence or knowledge representation would miss these articles. However, a search could return the relevant description logic publication if it uses background knowledge of subclasses as additional evidence for their superclasses.
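
The subclass example above can be sketched as a small query expansion. This is a minimal illustration under stated assumptions, not the Verity topic set mechanism: the `SUBCLASSES` table, the function names, and the two publication titles are hypothetical, and matching is plain substring search.

```python
# Hypothetical ontology fragment: each class maps to its direct subclasses.
SUBCLASSES = {
    "computer science": ["artificial intelligence"],
    "artificial intelligence": ["knowledge representation"],
    "knowledge representation": ["description logic"],
}


def expand(term):
    """Return the term plus all transitive subclasses, used as evidence phrases."""
    seen = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(SUBCLASSES.get(current, []))
    return seen


def search(query, titles):
    """Return titles containing the query or any evidence phrase for it."""
    phrases = expand(query.lower())
    return [t for t in titles if any(p in t.lower() for p in phrases)]


titles = [
    "Reasoning in a Description Logic",  # made-up title for illustration
    "Compiler Optimization Notes",       # made-up title for illustration
]
hits = search("artificial intelligence", titles)
```

Here the first title is returned for the query "artificial intelligence" even though it never mentions that phrase, because "description logic" is reached through the subclass chain.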

Our initial architecture to support retrieval of such related documents uses the Verity topic set tool. This allows one to give phrase "evidence" for another phrase. Thus, subclasses like knowledge representation would be evidence for superclasses like artificial intelligence and computer science. Our initial population of topic sets encodes the superclass-subclass relationship as well as instance relationships and synonyms. This is in support of the following goals:
  1. Reuse generally available ontologies which typically contain subclass-superclass relationships and instances.
  2. Generate easily explainable topic sets so that average users will be able to generate and maintain them.
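
One way to picture such topic sets is as a flat mapping from each topic to its evidence phrases, with the relationship type kept alongside so that a maintainer can see why a phrase is there. The sketch below assumes a simple triple format and hypothetical names (`TRIPLES`, `build_topic_sets`, `explain`, and the placeholder school name); it does not reproduce the syntax of the Verity tool.

```python
from collections import defaultdict

# Hypothetical ontology triples: (evidence phrase, relationship, topic).
TRIPLES = [
    ("knowledge representation", "subclass", "artificial intelligence"),
    ("description logic", "subclass", "knowledge representation"),
    ("AI", "synonym", "artificial intelligence"),
    ("Example Elementary", "instance", "elementary school"),  # placeholder name
]


def build_topic_sets(triples):
    """Group evidence phrases under each topic, keeping the relationship type."""
    topics = defaultdict(list)
    for phrase, relation, topic in triples:
        topics[topic].append((phrase, relation))
    return dict(topics)


def explain(topic, topic_sets):
    """Render one topic set as readable lines for a maintainer."""
    lines = [f"{topic}:"]
    for phrase, relation in topic_sets.get(topic, []):
        lines.append(f"  {phrase} ({relation})")
    return "\n".join(lines)


topic_sets = build_topic_sets(TRIPLES)
report = explain("artificial intelligence", topic_sets)
```

Recording the relationship type costs little and keeps each entry explainable, which matters if non-expert users are expected to maintain the sets.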

Topic sets can include more than just subclasses as evidence terms, so they provide great flexibility. We could include famous artificial intelligence researchers as evidence of AI without being concerned that individual people are not instances of the field of AI. We are working on identifying other principled relationships for inclusion in the topic sets. While we build our initial topic sets and gather data about their effectiveness, we consider the flexibility of topic sets to be a strength. Our goal, however, is to move to a semantically richer representation tool such as description logics (see footnote 1). See Section 4 for comments about future directions with respect to organizing background knowledge. It is also worth noting that others support the notion of topic sets as a valuable asset for search in constrained domains. One company's entire product line [15] consists of topic sets for various search engine languages.

Evidence phrases will undoubtedly reduce the "no matching document" problem and thus increase potentially relevant retrievals. The information retrieval literature [11] has of course noted that, along with more related results, one may also obtain more irrelevant results. Controlled experiments remain to be done on this issue, but our hypothesis is that web sites with constrained domain information that has obvious subclass relationships and other interrelationships will benefit from such an approach. Burke et al. [5] take a similar view in their FindMe project. Additionally, common competing statistical methods (e.g., [8]) may not apply well in many of these sites because of the small number of documents involved. For example, the use of clustering in AltaVista's new LiveTopics [4] benefits greatly from having a large number of documents; in fact, AltaVista's interface does not even offer LiveTopics as an option when only a small number of documents are retrieved.

Another issue related to retrieval is the notion of limiting the scope of a query. In many of our sites, users wanted the ability to limit their search to certain areas of information. For example, in our event-based web site, users might want to search only the calendars of educational institutions; in the health site, they may be interested only in diagnostic information about a disease; in the directory information, they may be interested only in retail stores related to sports, rather than also getting all the sports articles in the local paper. In all of these cases, if some broad categorization information is available, as is the case in some of our sites by way of meta-tagging, then a search including a predefined meta-tag is appropriate. If, on the other hand, no vocabulary has been agreed upon for use by meta-tagging experts, or if the user is unaware of the vocabulary, then the user would not know what term to add to the query. To help users who are unfamiliar with the vocabulary, we expose the high level categories that are supported for tagging. The user does not have to type any of these in; the user simply chooses the phrase (or phrases) from our background knowledge organization and then restricts the search to those categories. See Figure 1 for a view of categories which can be used to limit search.

Figure 1: Initial FindUR Page

The interface provides high level categories of information on our site. Categories may be opened by clicking on the folder icon next to the category name. When categories are opened, their subcategories are exposed. Any of these subcategories may be added to the query by clicking on a checkbox next to its name. Additional evidence phrases may be maintained to further increase the ways that a person can search for a document. This page is taken from our event-based web site in a small city. The Business and Education folders are both active links in this document and open to Figure 2 and Figure 3 respectively. If one opened the Education folder, one would find subclasses of education including local elementary schools, private schools, etc. One could click on elementary school and then search for the phrase "elementary school" and additionally search for the evidence phrases that are instances of elementary schools, i.e., the names of the particular elementary schools in the town. Additionally, we could store synonyms or other related phrases which should be searched for.

If, as is the case in this web site, one also has information tagged with high level content areas, one could limit one's search to only the education-tagged documents in the collection.
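
Restricting a search to tagged categories can be sketched as a filter applied alongside phrase matching. The example below is only an illustration: the document records, tag names, and the functions `exposed_categories` and `scoped_search` are hypothetical, and the phrase match is a plain substring test rather than the behavior of any particular search engine.

```python
# Hypothetical documents, each carrying meta-tags from an agreed vocabulary.
DOCUMENTS = [
    {"title": "Elementary school science fair", "tags": {"Education"}},
    {"title": "Elementary school supply sale", "tags": {"Business"}},
    {"title": "Chamber of commerce lunch", "tags": {"Business"}},
]


def exposed_categories(documents):
    """List the tag vocabulary so users can pick a category instead of guessing."""
    return sorted({tag for doc in documents for tag in doc["tags"]})


def scoped_search(phrase, categories, documents):
    """Match the phrase, keeping only documents tagged with a chosen category.

    An empty category list means the search is not restricted.
    """
    phrase = phrase.lower()
    chosen = set(categories)
    return [
        doc["title"]
        for doc in documents
        if phrase in doc["title"].lower() and (not chosen or doc["tags"] & chosen)
    ]


education_only = scoped_search("elementary school", ["Education"], DOCUMENTS)
unscoped = scoped_search("elementary school", [], DOCUMENTS)
```

With the Education category chosen, only the science fair entry is returned; without a restriction, the supply sale tagged as Business matches as well.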

The content organization serves two purposes: it provides a structured, iterative presentation of the content, and it supports query refinement.

1 For more information see the official description logics home page at http://www.kr.org/dl/.

