FindUR: A Hybrid Approach to Search
Beginning with a known platform of well-used search functionality, Verity Search97 in this case, we considered how we might improve search by exploiting knowledge of user patterns, known forms of data, domain knowledge, and information sources. First, we ensured that the indexing tools could be made to understand what is contained in virtual pages. There are two approaches to consider. The first approach uses or augments standard indexing software to index dynamically generated pages. Many spider programs provide basic functionality to support this. For example, Verity's spider supports indexing CGI-generated web pages. The second approach obtains the content that would be on the virtual pages, parses it into typed fields, and then uses the output as input to an indexing tool. This approach typically allows more flexibility and lets more meta-information about the content be exploited. We have used each approach in different applications.
Given an index of content that would appear on static and virtual pages, the next task is to simulate understanding of query terms. This means we should return fewer irrelevant documents and more of the relevant documents that might otherwise be missed. For both problems (as well as for supporting browsing and query refinement), we rely on a domain ontology applied to (potentially previously categorized) content.
In order to explore this approach further, we first consider application areas where much is known about the likely content. We developed search facilities augmented with knowledge representation for four such content areas: a web site containing information about a computer science research organization, an event-based calendar application for a small city, a few community web sites focusing primarily on directory applications, and a health information web site. One common trait among all of these sites is that, for the main content areas, there is a fair amount of available background knowledge.
For example, in the computer science research web site, we might know that artificial intelligence is a subclass of computer science and is a superclass of knowledge representation. Further, we may know that description logic is a subclass of knowledge representation. A typical publication title in description logics will contain the phrase description logic but it will not contain the phrases knowledge representation or artificial intelligence. Thus, a query for publications on artificial intelligence or knowledge representation would miss these articles. However, a search could return the relevant description logic publication if it uses background knowledge of subclasses as additional evidence for their superclasses. Our initial architecture to support retrieval of such related documents uses the Verity topic set tool. This allows one to give phrase "evidence" for another phrase. Thus, subclasses like knowledge representation would be evidence for superclasses like artificial intelligence and computer science. Our initial population of topic sets encodes the superclass-subclass relationship as well as instance relationships and synonyms. This is in support of the following goals:
Topic sets can include more than just subclasses as evidence terms and thus provide great
flexibility. We could include famous artificial intelligence researchers as evidence of AI
without being concerned that individual people are not instances of the field of AI. We are
working on identifying other principled relationships for inclusion in the topic sets. While we
build our initial topic sets and gather data about their effectiveness, we consider the flexibility
of topic sets to be a strength. Our goal, however, is to move to a semantically richer
representation tool such as description logics (see footnote 1). See Section 4 for comments about future
directions with respect to organizing background knowledge. It is also worth noting that
others support the notion of topic sets as a valuable asset for search in constrained domains.
One company's entire product line [15] consists of topic sets for various search engine languages.
The interface provides high-level categories of information on our site. Categories may be
opened by clicking on the folder icon next to the category name. When categories are
opened, their subcategories are exposed. Any of these subcategories may be added to the
query by clicking on a checkbox next to its name. Additional evidence phrases may be
maintained to further increase the ways that a person can search for a document. This page
is taken from our event-based web site in a small city. The Business and Education folders
are both hot in this document and open to Figure 2 and Figure 3 respectively. If one opened
the Education folder, one would find subclasses of education including local elementary
schools, private schools, etc. One could click on elementary school and then search for the
phrase elementary school and additionally search for the evidence phrases which were
instances of the elementary schools, i.e., the names of the particular elementary schools in the
town. Additionally, we could store synonyms or other related phrases which should be
searched for.
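The evidence relationships described in this section can be illustrated with a minimal Python sketch. It is not Verity's topic set syntax and not the FindUR implementation. The dictionary layout, the function names `expand` and `search`, the transitive expansion through subclasses, and the school names are all assumptions made for illustration. The sketch only shows how a query on a superclass can pick up documents that mention a subclass, instance, or synonym.

```python
from collections import deque

# Hypothetical topic sets: each topic maps to phrases treated as evidence for it
# (subclasses, instances, or synonyms). The school names are invented placeholders.
TOPIC_SETS = {
    "computer science": ["artificial intelligence"],
    "artificial intelligence": ["knowledge representation"],
    "knowledge representation": ["description logic"],
    "education": ["elementary school", "private school"],
    "elementary school": ["Example Elementary School A",
                          "Example Elementary School B"],
}


def expand(topic, topic_sets=TOPIC_SETS):
    """Return the topic followed by every phrase that is direct or
    transitive evidence for it, in breadth-first order."""
    seen = [topic]
    queue = deque([topic])
    while queue:
        current = queue.popleft()
        for phrase in topic_sets.get(current, []):
            if phrase not in seen:
                seen.append(phrase)
                queue.append(phrase)
    return seen


def search(topic, documents, topic_sets=TOPIC_SETS):
    """Return the documents containing the topic or any of its evidence
    phrases (case-insensitive substring match)."""
    phrases = [p.lower() for p in expand(topic, topic_sets)]
    return [doc for doc in documents
            if any(p in doc.lower() for p in phrases)]


if __name__ == "__main__":
    titles = [
        "Reasoning in Description Logics",
        "A Survey of Compiler Optimizations",
    ]
    # A query on the superclass retrieves the description logic title even
    # though the title never mentions artificial intelligence.
    print(search("artificial intelligence", titles))
```

Checking the Education subcategory "elementary school" in the browsing interface would correspond to calling `expand("elementary school")`, which yields the phrase itself plus the instance names stored as evidence. A real topic set tool may also weight evidence phrases; this sketch treats all of them equally.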
1 For more information, see the official description logics home page at http://www.kr.org/dl/.