From a Web to a linear HTML document


Martin Gagnon
gagnon@vlsi.polymtl.ca
Michel Dagenais
dagenais@vlsi.polymtl.ca
Department of Electrical and Computer Engineering
Ecole Polytechnique of Montreal
PO Box 6079, Succ Centre-ville
Montreal, Quebec, Canada
H3C 3A7

Abstract

The lingua franca of the World Wide Web is the HyperText Markup Language (HTML), which presents information as a Web of small documents linked to each other. While this Web navigation is convenient when searching for specific information, it may be cumbersome when all the information must be read anyway (as with traditional linear texts). The proposed approach adds linearization hints to a Web of HTML documents and provides tools to automatically generate a linear version of a Web. Furthermore, a table of contents, an index and a list of references are automatically produced for the linear version.

Introduction

In the past few years, the number of Internet users and the amount of information available have increased significantly. The World Wide Web is probably the medium which has enjoyed the most significant growth. It can be defined as a set of protocols (Hypertext Transfer Protocol, HTTP) [1], formats (Hypertext Markup Language, HTML) [2], software tools and conventions.

Hypertext documents are nodes of information linked together. This way, each separate topic resides in a separate file and may be used in different contexts. For instance, an HTML document on copying garbage collection may be pointed to from documents on algorithms, object oriented programming, or on the Modula-3 language.

Other document processing formats such as LaTeX [19] may be divided into small sections which are merged into a linear document with "input" commands. However, these systems have little support for reusing a file in several documents. Indeed, the pathnames must be relative to the document root, instead of relative to the current file as in HTML, and the header levels must be appropriate for the context of the including document. Other popular document formats such as GNU info [] and Linuxdoc SGML present similar problems. Moreover, with the importance of the World Wide Web as the most up-to-date source of information, and the wide availability of tools such as Amaya [] for HTML authoring, it is preferable to use HTML as the source format rather than to convert from another format (e.g. LaTeX) to HTML.

The flexibility allowed by HTML to split documents into small reusable nodes is also a source of problems. Indeed, in many cases the information should be available both as an online Web and as a linear printable document. Examples include the Modula-3 language definition and libraries, the Java language definition, libraries and tutorial, and several documents from the World Wide Web consortium (W3C) and the Linux documentation project.


Objectives

The objective of the proposed approach is to enter all the information in HTML, possibly with some linearization hints. Tools may then be pointed at an HTML Web node and recursively follow the appropriate links in order to produce a linear HTML document. Converting to another format (ASCII, LaTeX, Postscript) is a somewhat separate and simpler issue.



Figure 1: Structure of a Web site

A Web site may be represented as a network of nodes where each node represents an HTML file (see Figure 1). However, a number of issues need to be examined for producing a linear document from a Web of HTML nodes. These issues are discussed below.

Linearization Rules and Issues

An HTML document usually consists of:

<HTML>
<HEAD>
<TITLE> This is the title </TITLE>
</HEAD>
<BODY>
detailed text here
</BODY>
</HTML>

where only the TITLE is mandatory. The BODY implicitly starts after the TITLE if no HEAD or BODY section is tagged. The linearization of a Web starts at a node specified as the root. From there, hypertext links are followed in a depth-first search, guided by linearization hints (the REL attribute in links, detailed under Implementation).
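
For illustration, here is a minimal sketch of such hints, assuming the REL values SUBDOCUMENT and INCLUDE used by htmlkit (see the Implementation and Results sections); the file names, and the exact semantics attached to each value, are illustrative:

<!-- hypothetical link: followed and inlined as a subsection of the current node -->
<A HREF="gc.html" REL=SUBDOCUMENT>Copying garbage collection</A>

<!-- hypothetical link: followed and inlined in place, without opening a new section -->
<A HREF="notation.html" REL=INCLUDE>Notation</A>

<!-- ordinary link: kept as an external reference in the linear document -->
<A HREF="http://www.w3.org/">World Wide Web Consortium</A>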

Construction of Derived Nodes

Once the linearization is performed, it is useful to construct derived nodes containing information about the linear document, or about the web of visited nodes. In the first case, the address within the linear document is used, while in the second case the original address within the web is used.

A table of contents lists all the headers. Nested lists are used to represent the section hierarchy. Each list entry is a link containing the section title, and pointing to the relevant position in the linear document or in the web of visited nodes. Thus, each header should have an ID attribute. Similarly, a list of figures or tables may be constructed; note however that the FIGURE element was dropped from the newer HTML 3.2 proposal.
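
As a sketch, here is a section header carrying an ID attribute, together with the table of contents entry that may be derived from it (the title and ID values are invented for illustration):

<!-- hypothetical header in a visited node -->
<H2 ID="derived"> Construction of Derived Nodes </H2>

<!-- corresponding entry in the table of contents node -->
<UL>
<LI> <A HREF="#derived"> Construction of Derived Nodes </A>
  <UL>
  <LI> entries for nested H3 headers... </LI>
  </UL>
</LI>
</UL>

In the variant pointing into the web of visited nodes, the HREF would instead carry the original URL of the node, followed by #derived.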

A reference list is another important but slightly more complex derived node. It should contain a list of all references external to the linear document. However, these references are in URL form while most printed documents prefer traditional bibliographic entries. In some cases the URL may point to a bibliographic entry:

<DIV ID="Here" CLASS=BIBITEM>
Author,<I>book title</I>, O'Connell, 1995
</DIV>
In other cases, the URL may point to an ordinary document. However, a bibliographic entry may be deduced:
<TITLE> book title </TITLE>
<H1> book title </H1>
<P>
<DIV CLASS=AUTHOR> Author </DIV>
<P>
<DIV CLASS=BIBITEM.TAIL> O'Connell, 1995 </DIV>
...
If no such information is found, the entry is simply the URL.

Conversion to ASCII

To produce an ASCII document, the table of contents may be prepended to the document, and the reference list and index appended. The references in the table of contents should be replaced by the page number of the corresponding section. The entries in the reference list should each have a tag assigned (order of first occurrence, e.g. [1], [2], or a symbolic name used in anchors, e.g. [Connolly1996]). Then, external references in the document must be replaced by the tag associated with the corresponding entry in the reference list. The internal references to figures, sections and tables should be replaced by the figure/section/table number. There may be cases where the page number is wanted instead; this may be achieved with the attribute CLASS=PAGEREF in <A HREF=...>.
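
For instance, under these rules, a hypothetical external reference such as

<!-- hypothetical URL and anchor text -->
see the <A HREF="http://example.org/spec.html">language specification</A>

would appear in the ASCII output as

see the language specification [2]

assuming the corresponding entry was assigned the tag [2] in the appended reference list.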

Producing LaTeX documents

LaTeX can automatically generate the table of contents and the index, and takes care of \label and \ref, or \pageref, for sections, tables and figures. Thus, the conversion between HTML and LaTeX is more straightforward. The major item requiring special care is the reference list, which must be generated by the linearization tool. LaTeX then connects each \cite to the corresponding \bibitem in the reference list.
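
As a sketch, reusing the symbolic tag and the bibliographic fragment shown earlier (both illustrative), the generated LaTeX could pair each citation with its reference list entry as follows:

% illustrative key and entry, reusing the fragments shown above
... as described in \cite{Connolly1996} ...

\begin{thebibliography}{9}
\bibitem{Connolly1996} Author, \emph{book title}, O'Connell, 1995.
\end{thebibliography}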

Implementation

The rules defined earlier in this text, for the linearization of a web, have been implemented in an HTML processing framework named htmlkit. Unlike the algorithm of Sharples and Goodlet [3], which relies solely on heuristics to guide the linearization through the web, htmlkit uses both heuristic rules and linearization hint annotations (the REL attribute in links).

The framework is implemented as a set of reusable Modula-3 [4] libraries. The low level of coupling between the different modules in the libraries makes it easy to interchange or upgrade components. The libraries are decomposed as follows:

Libraries

Low level
  • HTTP, Gopher, FTP... communication
  • HTML parsing
  • File access
  • Input/Output streams
  • Low level data structures
  • Text utilities
  • Format matching

Intermediate level
  • Higher level data structures
  • Operations on the data structures
  • Link manipulation utilities

High level
  • Linearization tools
  • ASCII filter
  • LaTeX filter
  • Index builder
  • Table of contents builder
  • Reference list builder
With this organization, it is relatively easy to build custom tools for performing various tasks on a Web site. For example, one may want to construct an index or a table of contents of the Web site without creating a linearization.

Results

The linearization tool implemented in htmlkit was tested on a Sun Sparc 5 workstation with 32 MB of RAM. In each case, a web was linearized; an index, a table of contents and a list of references were built; and the output was converted to ASCII, LaTeX and Postscript.

Table 1 presents the elapsed time as a function of the number of nodes linearized. The processing time grows linearly with the number of nodes involved. In the current implementation, modularity was favored over performance. In any case, in typical applications, the network or disk access delay is likely to be the major limiting factor. Nevertheless, a large document may be linearized in approximately one minute.


Table 1: Linearization time versus number of nodes

Nbr. of nodes   CPU time (s)   User delay (s)   File size (KB)   Memory (KB)
      34            28.31           31.84           34.687          2494
      94            35.11           38.62           53.654          2536
     125            40.81           45.84           90.132          2685
     186            52.72           71.89          101.866          2699
     217            56.75           76.33          107.630          2703
     248            61.17           82.66          128.856          2709
     306            71.03          110.84          166.397          2816


Figure 4: CPU time used versus nbr. of nodes in the network



Figure 5 presents the memory consumption as a function of the number of nodes linearized. The memory usage grows less than linearly with the number of nodes involved. The size of the VISITED NODES list and of the REFERENCES list grows linearly with the number of nodes; the entries in these lists are quite small though. The size of the memory devoted to storing complete nodes is proportional to the number of currently visited nodes, which is typically proportional to the logarithm of the number of nodes (it represents the depth of the tree formed by the links where the REL attribute is SUBDOCUMENT or INCLUDE). Again, the memory consumption for a large document is not a problem, remaining below 3 MB.

Figure 5: Memory used versus nbr. of nodes in the network



Figure 6 presents the memory consumption, for a constant linear document size, as a function of the node size (which in this case is inversely proportional to the number of nodes). The effect of enlarging the granularity of nodes is to slightly reduce the tree depth, and thus the number of currently visited nodes, but mostly to enlarge the size of each visited node. The size of the VISITED NODES and REFERENCES lists shrinks with the smaller number of nodes, but their contribution to the total memory consumption is small. The net result is an almost linear increase of the memory consumption with the node size.

Figure 6: Memory used versus node size (bytes per node) for a constant graph size (94 KB)



Conclusion

This paper presents a framework to linearize HTML documents. By adding a small number of hints to web nodes, as attributes in HTML tags, it is possible to automatically derive a high quality linear printable version from a web. This way, authors of technical documents such as reference manuals can benefit from using HTML (hypertext navigation, widely available browsers and authoring tools, content oriented structure, multi-platform support) while gaining the possibility of producing high quality books starting from different possible root nodes in their web. As this and similar approaches gain acceptance, the tags and attributes used to annotate links, indexing terms, bibliographic references, etc. should become better standardized. Further work on the prototype will aim at providing a more thorough customization layer in the conversion between the linear HTML document and the printable LaTeX/Postscript output. This may be achieved by allowing the redefinition of the LaTeX command associated with each HTML construct. It may also be useful to allow optional LaTeX layout parameters as annotations in HTML attributes, although this would overlap with the planned functionality of style sheets.



References




