MIDAS: Towards Rich Site Structure and Content Metadata

Natasa Milic-Frayling
Microsoft Research Ltd
7 J J Thomson Avenue
Cambridge, United Kingdom
+44 (0)1223 479 700
natasamf@microsoft.com
Ralph Sommerer
Microsoft Research Ltd
7 J J Thomson Avenue
Cambridge, United Kingdom
+44 (0)1223 479 700
natasamf@microsoft.com
Gavin Smyth
Microsoft Research Ltd
7 J J Thomson Avenue
Cambridge, United Kingdom
+44 (0)1223 479 700
natasamf@microsoft.com

ABSTRACT

Information about the structure and content of Web sites and individual Web pages can be used by client applications to enhance the user’s experience while browsing and searching the Web. At the moment, applications and services resort to various heuristics to generate such information from impoverished document publishing formats. This information is best captured when the data is authored, with or without the author’s explicit assistance, or, alternatively, when the data is published. We present MIDAS (Meta-Information Delivery and Annotation Service), which captures metadata about the structure and content of Web sites during both authoring and publishing. MIDAS includes a mechanism for supplying flexible views of such metadata to client applications and services. We have also built prototype browser extensions that exploit the content and structure metadata captured by MIDAS. Thus, we demonstrate a complete end-to-end solution, from metadata creation to metadata exploitation, with potential for wide usage and impact. The lack of such application scenarios has been the major weakness of metadata initiatives so far.

Keywords

Metadata, Web publishing, Content structure, Site structure.

1. INTRODUCTION

The effectiveness of Web applications and information services has been constrained by the continued focus of on-line publishing on document layout and rendering requirements. Information that could contribute to flexibility in viewing, searching, and navigating through documents is typically omitted from document formats.

For example, many Web authoring tools enable the author to specify the structure of a site and the navigational elements on a page. However, at publishing time, the site content is stored as a set of individual HTML pages with no explicit information about the site structure. Similarly, the semantics of navigational elements, such as menus and embedded controls, is not preserved in the HTML code. Some of this information can be recovered, with limited reliability, from the graph structure of the site or from page layouts. Likewise, content authoring tools often include spell checking and grammar checking facilities that involve sophisticated linguistic analyses, yet the results of such analyses are discarded once the invoked function has completed. At the same time, information access services, such as search engines and question answering systems, could benefit from ‘ready to use’ sophisticated text analyses of Web content.

Motivated by these observations, we designed and implemented the Meta-Information Delivery and Annotation Service (MIDAS), aimed at enhancing Web content with rich metadata. The objective is to enable applications and services to provide better support for browsing, viewing, reading, and analyzing information on the Web. The current implementation of MIDAS relies on automatic metadata generation or on relatively straightforward capture of metadata from Web authoring applications. We expect that, once the utility of such metadata is demonstrated, some of the analyses will migrate into authoring tools and receive an appropriate level of the author’s attention.

2. METADATA GENERATION

Our selection of metadata types for MIDAS has been motivated by several issues related to information access on the Web. First, in order to be usable in interactive Web scenarios, document content analyzers have to provide a practically instantaneous response. This rules out applying sophisticated and resource-intensive linguistic analyses ‘on the fly’. For that reason, we incorporated into the MIDAS metadata the results of deep syntactic and partial semantic analyses of pages, provided by the MS NLPWin software ([1]).

Second, Web documents are often viewed on devices with various display specifications. That poses a problem for pages that have a complex, fixed layout. Ideally, we would have a fully generic document format that supports flexible and adaptive layout of the document content. However, such a format does not yet exist, and Web browsers rely upon heuristics to provide reasonable viewing of Web pages. With this in mind, we decided to take a step towards a richer characterization of the page structure by automatically identifying the content and organizational units of a page and storing that information as page metadata. More precisely, we apply SmartView technology ([3],[4]) to identify logical units within a page. That involves analyzing layout elements, such as tables, and applying heuristics to identify menus, forms, and similar elements. In many instances it is also useful to have a visual representation of a page. Thus, for static pages we generate thumbnail images that can be used in conjunction with the page structure metadata.

Finally, we observe that the Web is intuitively perceived as an intricate network of Web sites but, in fact, the scope and structure of an individual site are not well defined or understood. Furthermore, the Web hyperlink environment confines the user to a tunnel-vision perspective, and the lack of a site overview may leave the user disoriented. On the other hand, creating a site structure and providing an overview requires time and resources. Thus, within MIDAS, we provide tools that generate the structure of a Web site at the time of content publishing. That is achieved by crawling the site host and inferring the structure based on link analyses. We also illustrate how this process can be facilitated by an authoring tool: we extended the MS FrontPage authoring application to capture the structure of the site template in use and persist it in an XML format compatible with MIDAS.
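The paper does not specify which link analyses are used to infer the site structure. As a minimal sketch of one simple possibility, the example below derives a hierarchy from a crawled link graph by breadth-first search, assigning each page the parent through which it is first reached from the home page. The function names and the sample link graph are hypothetical and serve only as illustration.

```python
from collections import deque


def infer_site_tree(links, root):
    """Infer a parent for every page reachable from root.

    links maps a page URL to the list of URLs it links to.
    Breadth-first search guarantees that each page is attached to the
    parent lying on a shortest click path from the root.
    """
    parent = {root: None}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in parent:  # first discovery wins
                parent[target] = page
                queue.append(target)
    return parent


def children_of(parent):
    """Turn the parent map into a page -> list-of-children map."""
    tree = {}
    for page, par in parent.items():
        tree.setdefault(page, [])
        if par is not None:
            tree.setdefault(par, []).append(page)
    return tree


# Hypothetical link graph, as a crawler might record it.
links = {
    "/": ["/products", "/about"],
    "/products": ["/products/a", "/", "/about"],
    "/about": ["/"],
    "/products/a": ["/products"],
}
parent = infer_site_tree(links, "/")
tree = children_of(parent)
```

Back links and cross links (for example, from "/products" to "/about") are ignored once a page already has a parent, so the result is a tree rather than the full graph.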

In summary, MIDAS includes structure information about individual Web pages and Web sites. It also generates syntactic and semantic analyses of the site content, starting with individual sentences. This content metadata is further aggregated at other organizational levels as needed (e.g., it is used to create searchable indices of the site or its constituent sub-sites).

3. FRAMEWORK

MIDAS is central to a three-component framework that includes authoring tools, the metadata service, and client applications (Figure 1). MIDAS metadata is stored on a dedicated server or on the Web server hosting the site content. We differentiate between the core metadata, which is persisted in a database format, and metadata views, which are generated upon request by an application or a service.

In the current implementation, metadata storage and management are handled by MS SQL Server. Requests for metadata from the client are communicated to the server in the form of function calls using SOAP. In response, the server returns the requested metadata in XML format. Depending on the scenario, the XML is embedded in the HTML code of the content or delivered separately.
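To make the request path concrete, the following sketch builds and parses a SOAP 1.1 envelope for a metadata request using only the Python standard library. The service namespace `urn:example:midas`, the method name `GetSiteStructure`, and its parameters are assumptions made for illustration; the paper does not list the actual MIDAS function calls.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
MIDAS_NS = "urn:example:midas"  # hypothetical service namespace


def build_request(method, **params):
    """Serialize a function call as a SOAP envelope (returned as a string)."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{MIDAS_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{MIDAS_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_request(xml_text):
    """Recover the method name and parameters on the server side."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    call = body[0]  # the single function-call element
    method = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return method, params


# Hypothetical call: ask for two levels of site structure.
request_xml = build_request(
    "GetSiteStructure", siteUrl="http://www.example.com/", depth=2
)
method, params = parse_request(request_xml)
```

In a deployment, the envelope would be sent to the metadata server over HTTP and the XML response parsed in the same way; the transport is omitted here.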

The XML format of the MIDAS site and page structure metadata is essentially an extension of the RSS standard ([5]). In addition, a simple transformation could render appropriate parts of the MIDAS XML in the RDF (Resource Description Framework) syntax ([2]).

Figure 1. Framework for generating and exploiting rich metadata

In conjunction with the standard Web server facilities, this setup provides a flexible and extensible framework for both data and metadata delivery. The sophistication of the metadata service is essentially determined by the quality of the content and structure analyses and by the capability of the Metadata Server to generate the required views upon request. We present two client applications that exploit the types of metadata currently included in MIDAS.

SiteExplorer is a browser extension that exposes the site structure metadata in the form of an interactive, hierarchical map of the site. It combines the site structure information with a fully searchable site index to support both query-based and navigation-based searching through the site. Search-related information, such as relevant pages, is presented as visual annotations on the links within the site map. Furthermore, regions of the pages that contain search terms are highlighted within thumbnail representations of the pages. SiteExplorer also incorporates the user’s navigation history, indicating the current and previously visited locations within the site, as well as the user’s favorite pages, paths, or sections of the site.

In this prototype, the site metadata is transported independently from the content. It is rendered in the browser using a default style-sheet or a style-sheet specific to the site. Our automatic site structure discovery suffers from a number of imperfections, such as the inability to assign meaningful labels to page nodes when the title is missing or incorrect. This problem is best addressed by enhancing the authoring tools to engage the author in the metadata creation process.
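The combination of a site hierarchy with a searchable index, as described for SiteExplorer, can be illustrated with a small sketch. It assumes a site tree given as a page-to-children map and a simple term index, and computes for each node whether the page matches the query and how many matching pages lie in its subtree, which is the kind of information a site map could show as annotations on its links. The data structures, function names, and sample pages are hypothetical and are not taken from the SiteExplorer implementation.

```python
def build_index(page_texts):
    """Build a term -> set-of-pages inverted index from page text."""
    index = {}
    for url, text in page_texts.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(url)
    return index


def annotate(tree, root, hits):
    """Return {url: (is_hit, hits_in_subtree)} for every node under root."""
    annotations = {}

    def visit(url):
        count = 1 if url in hits else 0
        for child in tree.get(url, []):
            count += visit(child)
        annotations[url] = (url in hits, count)
        return count

    visit(root)
    return annotations


# Hypothetical site hierarchy and page text.
site_tree = {
    "/": ["/products", "/about"],
    "/products": ["/products/a"],
    "/products/a": [],
    "/about": [],
}
page_texts = {
    "/": "welcome to our site",
    "/products": "product catalogue",
    "/products/a": "metadata service overview",
    "/about": "about the team",
}
index = build_index(page_texts)
hits = index.get("metadata", set())
annotations = annotate(site_tree, "/", hits)
```

A subtree count greater than zero on a collapsed node tells the user that expanding it leads to a relevant page, which supports the navigation-based search described above.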

DocPrecis is an application that exploits the content metadata of individual pages. It provides summary views of documents in order to facilitate skimming or detailed reading. As the user follows a link to a page, the request for the HTML page is augmented with a request for the associated metadata. The metadata includes prominent concepts, such as person names, and indicators of location and time associated with these concepts. For each concept in the summary, the client application receives the positions of the sentences that contain the concept, including those with pronominal references to it. Location and time indicators are suitably highlighted in the text. The corresponding sentences are presented as a summary that follows the order of exposition in the text, thus providing an outline of events and related locations.

In this application, the metadata is embedded in the HTML code of the document. Our future work will explore efficiency issues related to various types of metadata transport in different scenarios.
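The way DocPrecis assembles a concept summary from positional metadata can be sketched as follows. The example assumes that the metadata arrives as a mapping from each concept to the indices of the sentences that mention it, including sentences with pronominal references; this representation, the function name, and the sample text are illustrative assumptions rather than the actual MIDAS format.

```python
def summarize(sentences, concept_positions, concept):
    """Return the sentences that mention a concept, in document order."""
    positions = sorted(set(concept_positions.get(concept, [])))
    return [sentences[i] for i in positions]


# Hypothetical document, already split into sentences.
sentences = [
    "Ada Lovelace was born in London in 1815.",
    "The city was growing quickly.",
    "She later worked with Charles Babbage.",
    "Babbage designed the Analytical Engine.",
]

# Hypothetical positional metadata; index 2 refers to Ada Lovelace
# only through the pronoun "She".
concept_positions = {
    "Ada Lovelace": [2, 0],
    "Charles Babbage": [2, 3],
}

summary = summarize(sentences, concept_positions, "Ada Lovelace")
```

Sorting the positions preserves the order of exposition, so the summary reads as an outline of the original text regardless of the order in which the metadata lists the sentences.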

4. REFERENCES

  1. Heidorn, G.E. (2000) Intelligent Writing Assistance. In Dale, R., Moisl, H., and Somers, H. (Eds.) Handbook of Natural Language Processing (pp. 187-207). New York: Marcel Dekker.
  2. Lassila, O. and Swick, R. Resource Description Framework (RDF) model and syntax specification, 1999. http://www.w3.org/TR/REC-rdf-syntax
  3. Milic-Frayling, N. and Sommerer, R. SmartView: Flexible Viewing of Web Page Contents. Poster presentation at the 11th World Wide Web Conference, 2002.
  4. Milic-Frayling, N. and Sommerer, R. SmartView: Enhanced Document Viewer for Mobile Devices. Microsoft Technical Report: MSR-TR-2002-114, November 2002.
  5. RSS 1.0 Specification http://web.resource.org/rss/1.0/spec.