AHA: Audio HTML Access

Frankie James Computer Science Department, Stanford University
Stanford, California 94305, USA
fjames@cs.stanford.edu

Abstract

This report discusses the "AHA" system for presenting HTML in audio for blind users and others who wish to access the WWW nonvisually. AHA is a framework and set of suggestions for HTML presentation based on an initial experiment; further experimentation and revision of the system are planned.

1. Introduction

Millions of people access the World Wide Web (WWW) every day for information and entertainment. As the WWW becomes more popular, more businesses and information providers are creating Web sites to display their products and services. However, what many of these providers do not consider is that a portion of their user population has a visual impairment. In the United States, there are 11 million people with some form of visual impairment, and 1.5 million people who are totally blind. [17]

HTML was designed as a markup language, which means that many of the structures in a document, such as headings, lists, and hyperlinks, are represented explicitly in the HTML file for the document. WWW browsers such as Netscape take this markup and present it visually to the user. Traditional access schemes for blind users (screen readers) depend on this visual rendering to decide how to present the document in audio. In this way, audio becomes a second-class interface modality for the presentation of HTML. To create more viable access solutions for blind users, designers must treat audio as a first-class interface technology (see AsTeR [21], Emacspeak [22], pwWebSpeak [19], and Marco Polo [25]).

Figure 1 gives a representation of the creation of a hypertext document in both the visual and auditory realms. If an audio document is designed straight from the author's intentions, it may correspond to the author making an explicit recording of the document or pieces of the document. While this seems like the best strategy for providing the most effective audio presentation, it means that authors must create two documents for everything that they write: one in audio and one in print.

[When to create audio document]

Figure 1. When to create the audio representation.

Another way to create audio documents is to work directly from the visual representation, which is what screen readers do. [2] This is the current solution to Web accessibility for blind users. However, by the time the document has been presented visually, the explicit structural information in it has been made implicit. Recovering this structure is difficult, if not impossible. Screen readers also force blind users to interact spatially with documents since they are based on visual representations. Unfortunately, many blind users lack grounding in spatial and visual metaphors, and interactive screens do not map well to speech or Braille output. [24]

Finally, an audio rendering can be designed from the HTML representation of the document. Although the author's intent is not always truly represented in HTML1, most of the elements important to navigation and structure can be determined directly from the markup tags. This means that audio renderings can also be designed from the markup instead of by trying to determine the significance of visual elements. For example, headings can be identified directly by the tags <H1> through <H6>, rather than by guessing based on type size.
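To make this concrete, a few lines of Python (a present-day sketch, not part of the AHA prototype) show how headings and link targets can be pulled straight from the markup with the standard html.parser module, with no reference to type size or layout:

    from html.parser import HTMLParser

    class StructureScanner(HTMLParser):
        """Collects headings and link targets directly from the tags."""
        def __init__(self):
            super().__init__()
            self.headings = []          # (level, text) pairs
            self.links = []             # href attributes
            self._heading_level = None  # level of the heading being read, if any

        def handle_starttag(self, tag, attrs):
            if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
                self._heading_level = int(tag[1])
            elif tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

        def handle_data(self, data):
            if self._heading_level is not None and data.strip():
                self.headings.append((self._heading_level, data.strip()))

        def handle_endtag(self, tag):
            if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
                self._heading_level = None

    scanner = StructureScanner()
    scanner.feed("<h1>Project Archimedes</h1><p>See <a href='csli.html'>CSLI</a>.</p>")
    print(scanner.headings)   # [(1, 'Project Archimedes')]
    print(scanner.links)      # ['csli.html']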

The AHA (Audio HTML Access) system is based on the principle that HTML files explicitly contain both the textual and structural content of a document, and that these two types of content are both essential for understanding the document. AHA provides a framework for discussing audio marking techniques and how they can relate to different HTML structures to provide an intuitive, easily learnable audio interface to HTML. AHA is not currently being developed into a commercial system for accessing the Web, but is rather a prototype system along with a set of guidelines and a framework which may underlie a commercial system. We hope in the future to see (or rather, hear) AHA become the basis for a commercial audio HTML browser, and have been talking with researchers at Productivity Works [19] and Sonicon [25] regarding this goal.

Audio interfaces can take advantage of what people already know and are familiar with in the real world, such as radio broadcasts, books on tape, and children's audio story books. AHA's framework does just this. The selection of sounds and melodies in the AHA framework is based on familiarity in the real world because familiar sounds are easier to comprehend in an interface than unfamiliar ones. Therefore, AHA can deal with a variety of audio marking techniques, such as multiple voices, "natural" sounds, and musical sounds.

We will first discuss the pilot experiment which we performed to test some of our basic ideas. Then, we will show how the results of this experiment plus research in the related fields of psychoacoustics and psychology came together to form our initial audio framework. Finally, we will discuss future work in this area, including a planned follow-up experiment to strengthen the ideas in AHA.


2. Pilot Experiment

The first stage in the development of AHA took the form of a pilot experiment comparing four different audio interfaces for HTML. The interfaces were chosen to compare various techniques such as using different voices, sound effects, and linguistic cues to mark HTML structures. The four formats used crossed the number of speakers with the amount of nonspeech sound: a single speaker with few effects (OS/FE), a single speaker with many effects (OS/ME), multiple speakers with few effects (MS/FE), and multiple speakers with many effects (MS/ME).

We believe that the use of multiple speakers to present structure, which is common in non-computer interfaces such as radio, is also applicable to computer interfaces. Therefore, we created the interfaces in such a way that we could test the usefulness of multiple speakers as opposed to a single speaker, and also to test the specific circumstances under which speaker changes would be appropriate.

2.1 Experimental Design

The experiment was designed so that each of twenty-four paid subjects (twelve blind and twelve sighted) used the four interfaces in a random order, creating a 2 by 4 mixed design. All subjects had at least a working knowledge of the WWW and Web browsing. The experiment used a "Wizard of Oz" format so we could test different interfaces without having to implement an HTML parser. The interface consisted of recorded2 speech3 and sounds4 in HyperCard running on a Macintosh. The eight HTML pages used were related to Project Archimedes and CSLI at Stanford University.5 They were chosen because they are substantially interlinked and represent a variety of page types found on the WWW, and because they are related to this project.

The sound effects in the interfaces were selected with the idea of auditory icons in mind. [8] An effort was made to choose sounds that seemed intuitively related to the structural element they were meant to represent. If there was no obvious sound, a short abstract sound was used. For example, in the -/ME interfaces, link points were marked with different overlaid sounds to indicate whether the link was to a place within the same document, to another document, or a mailto link. Within-document links used the sound of footsteps, out-of-document links used the sound of a phone ringing, and mailto links used the sound of a doorbell. Also in these two interfaces, simple tones varying in pitch were used as an abstract effect to indicate heading level.
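As a sketch of the kind of mapping involved (the sound-file names and the link-classification heuristic below are illustrative, not the ones implemented for the experiment), the three link types might be distinguished and paired with their auditory icons as follows:

    from urllib.parse import urlparse

    # Hypothetical sound files standing in for the icons used in the -/ME interfaces.
    LINK_SOUNDS = {
        "within-document": "footsteps.aiff",
        "out-of-document": "phone-ring.aiff",
        "mailto":          "doorbell.aiff",
    }

    def classify_link(href, current_page):
        """Decide which of the three link types an href represents."""
        if href.startswith("mailto:"):
            return "mailto"
        parsed = urlparse(href)
        if parsed.path in ("", current_page):   # e.g. "#section2" or a self-link
            return "within-document"
        return "out-of-document"

    print(LINK_SOUNDS[classify_link("#results", "aha.html")])                       # footsteps.aiff
    print(LINK_SOUNDS[classify_link("mailto:fjames@cs.stanford.edu", "aha.html")])  # doorbell.aiff
    print(LINK_SOUNDS[classify_link("http://www-pcd.stanford.edu/", "aha.html")])   # phone-ring.aiff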

The choice of where to change speaker in the MS/FE and MS/ME protocols was inspired partially by Geiselman and Crawley's "Voice Connotation Hypothesis." [10] We used the analogy of a sports broadcast in which there is more than one announcer, each of whom has a specific role and presents only certain information. For example, in MS/ME, there is a "heading commentator" who only presents heading text, much like a color commentator in a hockey game only presents player statistics and analysis rather than play-by-play action.

2.2 Results

The data gained from the pilot experiment allowed us to make both specific observations and more general conclusions. Some of the general conclusions will be discussed in the next section in conjunction with a new framework for classifying audio effects; for more detailed information, see [13] and [14].


3. AHA: An Audio Framework

Most work on audio interfaces currently falls into two categories: interfaces based on earcons [4] and interfaces based on auditory icons. [8] The proponents of earcons believe that events in the interface should be indicated by musical sounds, where aspects of the sound represent aspects of the interface event. For example, a simple musical motif may be used to represent a folder, and the motif may be played with either a crescendo or a decrescendo to indicate the opening or closing of that folder, respectively. Auditory icons, on the other hand, are based on the idea that we make use of many sounds in the everyday world without ever analyzing their musical content. The sounds used in these interfaces are caricatures of everyday sounds, where aspects of the sound's source correspond to events in the interface. Opening a folder in this case might be marked by the sound of a file drawer opening, drawing on the user's knowledge of offices in the real world.

We believe, however, that this split between earcons and auditory icons is not the best way to look at audio interfaces. Both earcons and auditory icons are useful in some respects, but both are also inappropriate in others. Earcons become unusable in interfaces where there are many objects or events, since it is difficult for users to keep track of more than a few musical themes. Earcons may also present difficulties to nonmusicians if they are not selected appropriately. Auditory icons also have limitations, especially in cases where the event to be marked does not have a direct mapping to the natural world. We suggest that both earcons and auditory icons (in their broadest sense) have their place in an interface to HTML.

Figure 2 shows the visual representation of AHA that we have devised based on our pilot experiment and research in psychoacoustics and psychology. In this framework, we have abandoned the distinction between auditory icons and earcons. We are not dividing the world of sounds into musical sounds and nonmusical sounds (as the auditory icon/earcon model does), but, rather, into familiar and unfamiliar sounds, which can include both music and nonmusical sound.

[Sound framework]

Figure 2. Framework of sounds in the world

The diagram for this framework shows how elements from the document to be rendered in audio can be mapped onto sounds, which are divided into three categories: understandable natural language, sounds within the user's inferred world, and sounds outside of the user's inferred world.

The dashed arrows from the document to the sound categories indicate the motivation for the mapping of sounds in that category to tags in the document. The thicker arrows going from "sounds outside of the user's inferred world" to the other two categories of sounds indicate the ability to transfer sounds from one category to another via learning. For example, words in a foreign language are outside the user's inferred world until that language is learned. Similarly, novel sounds or musical themes can move into the user's inferred world over time.6 Results from our pilot study suggest that sounds for an audio interface should in most cases be selected from the "understandable natural language" and "sounds within the user's inferred world" categories.
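One way to read this framework operationally (a schematic sketch of Figure 2 rather than a normative part of AHA; the specific assignments below are illustrative) is as a table that pairs each HTML structure with a sound and the category that sound is drawn from:

    from enum import Enum

    class SoundCategory(Enum):
        NATURAL_LANGUAGE = "understandable natural language"
        INFERRED_WORLD = "sounds within the user's inferred world"
        OUTSIDE_WORLD = "sounds outside of the user's inferred world"

    # Illustrative assignments: auditory icons and familiar jingles sit inside the
    # user's inferred world; purely iconic tone patterns sit outside it.
    AUDIO_MARKS = {
        "img": ("camera shutter", SoundCategory.INFERRED_WORLD),
        "a": ("footsteps, phone ring, or doorbell", SoundCategory.INFERRED_WORLD),
        "form": ("familiar jingle", SoundCategory.INFERRED_WORLD),
        "h1-h6": ("rising or falling tone pair", SoundCategory.OUTSIDE_WORLD),
        "blockquote": ("spoken announcement of the source", SoundCategory.NATURAL_LANGUAGE),
    }

    for tag, (sound, category) in AUDIO_MARKS.items():
        print(f"{tag:10s} -> {sound}  [{category.value}]")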

From our experiment and data from the psychological analysis of musical themes, we feel confident that musical sounds can also be categorized by familiarity (and other nonmusical factors) rather than by qualities of the music itself.7 We contend that when users hear a musical sound or melody which is familiar to them, they do not analyze its musical content (i.e., the instruments being played or the chord changes) but rather recognize and process it as a known sound which just happens to be musical. Andrea Halpern [11] has done research on the grouping of musical themes by users and found that people tend to group tunes not by musical similarities, but rather by what she calls extramusical similarities, which are related to things like the type of tune (such as Christmas songs, patriotic songs, etc.). She found that people are more likely to confuse two tunes if they are close to each other in a conceptual space based on song type, regardless of any musical similarity between the two tunes.

Java et al. [15] also discuss the separation between obscure and familiar musical themes. Their experiment involved the recognition of musical themes and the relation to semantic and episodic memory. They presented subjects with a set of familiar and obscure themes to learn, and then performed a recognition test to determine how well the subjects could remember the two types of themes and also whether the subjects remembered hearing the theme in the familiarity exercise or if they just knew that they had heard it before. Their data suggested that familiar themes are more often remembered than known, since we process familiar or popular themes in a more elaborate and conscious way, and that there are fewer "false alarms" for familiar than obscure melodies. This demonstrates an important distinction between known and unknown melodies in audio user interfaces. In particular, if a familiar theme is used to mark a structure, users may be more likely to remember hearing the theme when learning about the interface and may also be more likely to remember what the theme stands for in the interface than they would be if an unfamiliar theme was used.

Another aspect of AHA which is represented in Figure 2 is the idea of using multiple voices to represent structures. In the pilot experiment, we found that multiple voices (when used appropriately) helped users separate out different HTML structures. In particular, different voices can be used successfully to mark macro-structures, such as all of the headings in a document or a nested list, but not to mark micro-structures such as a stretch of bold text within a sentence. This is intuitively reasonable, since when we hear a new speaker, we expect his or her thought to be separate from the previous speaker's. The new speaker is expected to add to the discussion, but not to complete the previous speaker's statement.8 The applicability of voice changes to various HTML structures is addressed below.


4. Issues in Audio Renderings of HTML

Based on the results of the pilot experiment, we have learned more about the trade-offs between using different voices, nonmusical sound effects, and musical sounds to mark various HTML structures. This section discusses what we have learned so far, as well as other ideas to be tested in the future.

4.1 Voice Changes

Voice changes seem to be appropriate for marking what we call document macro-structures. Figure 3 shows how a document can be represented as a hierarchy of macro-structures including images, lists, headings, tables, special sections, and forms.9 By assigning a different voice to each of these six structures, an HTML document could be effectively separated into its main components. Other possibly useful places for voice changes might be in the presentation of quoted material10 or of a link's URL. Links (see Figure 4) consist of a type, a "visitation status," a URL, and the link text itself. Although changing voice for link text is generally undesirable (since links are often found in a stream of text), using a different voice to present the URL for a link could be appropriate, since this would effectively separate the link's meta-information (URL) from the link text itself.

[HTML Elements]

Figure 3. HTML Elements

[Links]

Figure 4. Links
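A minimal sketch of these two ideas together (the voice names are placeholders, not the casting used in the experiment) might assign one voice per macro-structure and switch voices only for a link's URL, never for its text:

    from dataclasses import dataclass

    # One voice per macro-structure from Figure 3; "narrator" reads unmarked text.
    MACRO_VOICES = {
        "image": "voice-A",
        "list": "voice-B",
        "heading": "voice-C",
        "table": "voice-D",
        "special section": "voice-E",
        "form": "voice-F",
        "body": "narrator",
    }

    @dataclass
    class Link:
        link_type: str   # within-document, out-of-document, or mailto
        visited: bool    # the link's "visitation status"
        url: str
        text: str

    def speak(text, voice):
        # Stand-in for a speech-synthesis or recorded-speech playback call.
        print(f"[{voice}] {text}")

    def present_link(link, surrounding_voice):
        # The link text stays in the surrounding voice; only the URL (the link's
        # meta-information) is handed to a separate voice.
        speak(link.text, surrounding_voice)
        speak(link.url, "url-voice")

    present_link(Link("out-of-document", False, "http://www-pcd.stanford.edu/", "Project Archimedes"),
                 MACRO_VOICES["body"])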

The motivation behind using voice changes to mark structures is that separating information by speaker is very natural and intuitive to users. Geiselman and Crawley [10] found that subjects can remember who spoke a particular statement incidentally while attempting to remember the actual statement. They suggested that this was because who says what is historically important; thus, humans have developed the ability to keep track of this information incidentally. What this means for audio-interface design is that users will generally be able to separate the various types of document structures if they are all presented using different voices. By using the same voice over an entire macro-structure, we can give the user an effective yet subtle reminder of where he or she is within the document (at least to the level of macro-structures).

Documents can first be divided into macro-structures, but each of these structures (as well as unmarked parts of the document) may also contain links and other textual markings, as indicated in Figure 5. If these "micro-structures" are presented using a new voice, the Voice Connotation Hypothesis says that this text will also be remembered separately from the other text. This separation is exactly what we do not want, since we do not expect new speakers to continue a previous speaker's thought for them. An emphasized word within a sentence must be clearly associated with the rest of the sentence's text and therefore should use the same voice. Also, since micro-structures can be a part of any macro-structure, changing voice to indicate micro-structures may cause the user to become confused as to what type of macro-structure is being presented.

[Text Styles]

Figure 5. Text Styles

In the pilot experiment, the most dramatic example we found of the appropriateness vs. inappropriateness of using multiple voices was in the presentation of lists. In MS/FE, we chose to have each level of list nesting read by a different speaker, which corresponds to assigning a new speaker each time we encountered a new macro-structure. In MS/ME, we used two speakers to read lists, but each speaker alternated reading the list items and list nesting was indicated by the volume of the list bell. This method makes more salient the separation between odd and even list items, rather than the list nesting level. However, as is shown in Figure 6, we generally conceptualize lists in terms of levels and make little distinction between the various list items. The macro-structure within a list is the list nesting level itself, but the list items are micro-structures belonging to the nested list. In fact, if we think of the visual analogy for our experimental techniques, the MS/ME technique would correspond to having alternating list items presented in two different colors and list nesting noted by a bullet change. On the other hand, MS/FE corresponds to a widely-used visual list presentation in which list items are separated by a small, unobtrusive bullet and list nesting is made more salient by indenting the subordinate list or presenting it using a different type style.

[Lists]

Figure 6. Lists
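To make the contrast concrete, here is a small sketch (the voice names and the bell cue are placeholders) of the two list treatments: one voice per nesting level, as in MS/FE, versus two voices alternating item by item with a nesting-dependent bell, as in MS/ME:

    from itertools import cycle

    # A nested list: sublists represent deeper nesting levels.
    NESTED_LIST = ["fruit", ["apples", "pears"], "vegetables", ["carrots"]]

    def per_level(items, level=0):
        """MS/FE-style presentation: the nesting level determines the speaker."""
        for item in items:
            if isinstance(item, list):
                per_level(item, level + 1)
            else:
                print(f"[level-{level}-voice] {item}")

    def alternating(items, voices, level=0):
        """MS/ME-style presentation: two speakers alternate item by item, and a
        bell whose volume depends on the nesting level marks each item."""
        for item in items:
            if isinstance(item, list):
                alternating(item, voices, level + 1)
            else:
                print(f"[{next(voices)}] (bell at volume {level}) {item}")

    per_level(NESTED_LIST)
    alternating(NESTED_LIST, cycle(["voice-1", "voice-2"]))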

4.2 Auditory Icons

In our pilot study, an effort was made to choose sound effects which were based on Gaver's auditory icons [8], which are sounds that naturally occur in the world and whose source is evocative of the object or function they are meant to represent. For example, in OS/ME and MS/ME the various link types (see Figure 4) were each represented with a different auditory icon which was played while the link text was being read. We found that the auditory icons were easily distinguished and recognized by their source even though their mappings to the HTML structures were not as clear as we intended. That is, subjects were able to recognize the footsteps (within-document link), doorbell (mailto link), and telephone ring (out-of-document link) easily even if they didn't understand why they were chosen. Other sound effects, such as the camera sound used to indicate images, were both distinguishable to users and clearly mapped to the intended structure.

Auditory icons seem to be useful at a number of levels. First of all, they take advantage of the user's direct knowledge of sounds and their sources in the real world, as is pointed out in [9]. Most people easily distinguish thousands of sounds in the world and do not need any special training to remember them. This parallels the findings of Java et al. [15] on known and remembered melodies: natural sounds are known from earlier life experiences and may have rich associations in users' minds.

Another advantage of auditory icons is that since the natural sounds are remembered in a context which includes their source, the sounds usually already have an attached semantics. If a sound is chosen for an audio interface such that its semantics are compatible with the semantics of the structure it is meant to represent, users should find it easy to learn and remember. However, this also means that natural sounds should be chosen carefully so that any semantics associated with a particular sound are not in conflict with the semantics of the represented structure. As an example, in MS/FE, we used a tone following the anchor points in the text to represent links. One user commented that this sounded like the announcer was saying dirty words and "got bleeped out." Clearly, even sounds which are not natural per se may have social meanings which conflict with the meanings intended in the document.

4.3 Earcons

Earcons [4] attempt to make use of musical sounds and musical differences between sounds to represent objects and events, and are based on musical artifacts such as pitch, timbre, melody, etc. There are conflicting reports from researchers in this area as to the applicability of musical sounds and distinctions in interfaces intended for nonmusicians. [1][5][6][12][18] We found in our pilot study that users had difficulty differentiating between sounds that simply had different pitches or volumes, which is supported in [18].

We stated above that auditory icons can be very useful when we can find a sound that shares the same semantics as the structure which we want to represent, but there are some cases where there may be no sound which maps to the structure's semantics. For example, if we wish to use sounds to distinguish between the various heading levels shown in Figure 7, the semantics would need to include the fact that H1 is in some sense "higher" than H2, or at least more important. How can we select two natural sounds such that one is "higher"? This would be, in general, a musical distinction between the two sounds. In this section, we discuss two uses for earcons, namely, (recognizable) musical themes and what we call "iconic" sounds.

[Headings]

Figure 7. Headings

4.3.1 Musical Themes

There are some structures in HTML, such as forms, for which we may not immediately be able to think of an associated sound in the real world. Unlike images, which immediately call to mind photographs and cameras, forms are generally silent pieces of paper which are filled out in a quiet environment. Therefore, if we are trying to choose a quick audio cue to represent forms, we may not find one in the natural world. Instead, we need to choose a memorable sound from outside the natural world, such as a familiar musical theme.

Radio programs and television commercials make use of this technique all of the time. There may be no natural sound that makes us think specifically of Coke,11 but Coke's advertising jingle is short and reminds us of Coke even when the words aren't sung. People can quickly learn and recall hundreds of jingles. We can make use of this in auditory HTML interfaces by associating jingles with document structures such as forms. In addition, if the jingles chosen come from already familiar songs, the presentation and recognition time should be much shorter. [26]

There are also structures in HTML documents which do not have a clear semantics, such as the special section tags like address, paragraph, and horizontal rule. Most of these tags started out with a semantics, such as a paragraph or section break, but are often used in quite different ways because of the visual rendering they produce in browsers such as Netscape. For example, the intended semantics of the address tag is associated with meta-information about the document or a part of the document, but it is often used simply to create an italicized section. The HR tag doesn't really have an intended semantics, but is most often used to indicate a break between document sections.

The problem with such cases is that even if the 90% case of the tag's usage has a clear semantics, there is still a portion of the time in which the tag is used in another way where the semantics are different or in direct opposition to the 90% case. Therefore, trying to choose a semantically relevant natural sound for any of these tags will yield a cognitive dissonance for outlying cases. By selecting a recognizable melody which is relatively free of imported semantics, we can provide the user with an easy mnemonic without creating a conflict.

4.3.2 Iconic Sounds

We also sometimes need to make distinctions between HTML micro-structures whose differences clearly map to qualities such as higher or lower, dominant or subordinate. Music contains a wealth of techniques for representing these relations using pitch, loudness, duration, etc. Our attempt to use music in the pilot experiment may have failed because of the perceptual and cognitive tasks we required of the users, namely, to recall a dominant/subordinate relationship rather than to recognize one.

"Iconic" sounds can be used to create a recognition task for distinguishing between dominant/subordinate relationships such as can be found when determining heading levels. Research in data sonification has shown that it is possible for users to use pitch contour and rhythm to interpret structure in a data set.[7][20] Similar techniques can be applied to mark different document structures so that they are distinguishable. For example, if we want to distinguish between H1 and H2, we can use a two-tone sequence such that the sequence goes up in pitch for H1 (C to E) and down in pitch for H2 (C to A). No recall of previous sequences is required; the user simply listens to the sequence and decides if the pitch contour goes up or down. Iconic sounds may also be created by varying volume or tone duration over a short tonal sequence.

Iconic sounds, then, create a case where we would choose to select sounds from the category in the AHA framework labeled as "sounds outside of the user's inferred world." Although the sounds used in data sonification and in the heading case described above are easily interpreted by the user, we are by no means saying that the sounds are recognizable. Unless the pattern of tones created by the sonification of a histogram, for example, turns out to be the same pattern in the tune for "Yankee Doodle," a user would not recognize the pattern in the sense that she would be able to name it and say "that tune stands for x." Rather, she could interpret the pattern as being a histogram of a certain shape.


5. Future Testing

Our next goal is to run a follow-up experiment to the pilot. We intend to structure this experiment not as a "Wizard of Oz", but rather as a test of some of the ideas in the current AHA system against two existing systems, namely Emacspeak [22] and pwWebSpeak [19]. Since it should be feasible to implement the various marking schemes within a reference system, we will be able to give subjects a fully functional system which they can use for an extended period, eliminating the novice-user effects we saw in the pilot.

The follow-up experiment will give us a chance to explore several new areas that were not addressed in the pilot, as well as to confirm hypotheses first proposed in that experiment.

There is no doubt that testing these new ideas will uncover new problems and discoveries which will cause us to add to and modify the AHA system as it stands today. We hope that the new findings will allow us to create a more robust framework for selecting audio effects to use in audio interfaces to HTML and for understanding the space of audio for document structural marking.


End Notes

  1. Because of its limitations, HTML's tags are often used "creatively" to produce visual effects desired by authors.
  2. Special thanks to Dave Barker-Plummer, Andrew Beers, Mark Greaves, Stephanie Hogue, Claire James, Connie James, and Dick James for providing the recorded speech. Many thanks also to Cliff Nass for help with the statistical analysis.
  3. Although it is important to understand what effect less natural sounding voices have on the users of an audio browser [23], this study is focused on differentiating between voices. Less natural sounding voices could confound any related results.
  4. Sound effects were obtained from freeware libraries or were recorded via SoundEdit Pro using ordinary household objects.
  5. The pages used in this experiment can be found at http://www-pcd.stanford.edu/~fjames/testpages/
  6. Motion from either the natural language category or the sounds within the inferred world into the category of unfamiliar sounds may occur via forgetting, but we are not dealing with this case right now.
  7. We are dealing here with sounds which are purely musical and are not easily thought of in natural terms. For example, a bird singing can be thought of as either a natural sound (a sound produced by the bird) or a musical sound (the bird's song); therefore, we would not put that sound in the "music" category.
  8. The combination of the words of two speakers into a single thought is found in children's games where each participant adds a new word to the sentence. The intended result is not information transfer, but rather, amusement.
  9. In Figures 3-7, dashed lines indicate a component relationship between nodes, and solid lines indicate an "is-a" relationship.
  10. What we mean by quoted material here is anything which is quoted because its source is from another person, that is, a direct citation from someone else. We are not referring to other more problematic uses of quotation marks, such as scare quotes.
  11. The sound of a can opening may remind us of soda in general, but not necessarily of Coke in particular.

References

[1] Barry Arons. Hyperspeech: Navigating in speech-only hypermedia. In Hypertext '91 Proceedings, pages 133-146, San Antonio, Texas, December 1991.

[2] Berkeley Systems, Inc. outSPOKEN. See http://access.berksys.com/

[3] T.G. Bever et al. The underlying structures of sentences are the primary units of immediate speech processing. Perception and Psychophysics, 5:225-234, 1969.

[4] Meera M. Blattner, Denise A. Sumikawa, and Robert M. Greenberg. Earcons and icons: Their structure and common design principles. In Ephraim P. Glinert, editor, Visual Programming Environments: Applications and Issues, pages 582-606, IEEE Computer Society Press, Los Alamitos, CA, 1990.

[5] S. Brewster, P.C. Wright, and A.D.N. Edwards. Parallel earcons: Reducing the length of audio messages. International Journal of Human-Computer Studies, 43(2):153-175, August 1995.

[6] Stephen A. Brewster, Peter C. Wright, and Alistair D.N. Edwards. An evaluation of earcons for use in auditory human-computer interfaces. INTERCHI '93 Conference Proceedings, pages 222-227, Amsterdam, April 1993.

[7] John Flowers et al. Data sonification from the desktop: Should sound be part of standard data analysis software? ICAD '96 Proceedings, pages 1-7, Xerox PARC, 4-6 November 1996. ICAD.

[8] William W. Gaver. Auditory icons: Using sound in computer interfaces. Human-Computer Interaction, 2:167-177, 1986.

[9] William W. Gaver. The SonicFinder: An interface that uses auditory icons. In Ephraim P. Glinert, editor, Visual Programming Environments: Applications and Issues, pages 561-581, IEEE Computer Society Press, Los Alamitos, CA, 1990.

[10] Ralph E. Geiselman and Joseph M. Crawley. Incidental processing of speaker characteristics: Voice as connotative information. Journal of Verbal Learning and Verbal Behavior, 22(2):15-23, 1983.

[11] Andrea R. Halpern. Organization in memory for familiar songs. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10(3):496-512, 1984.

[12] James Hereford and William Winn. Non-speech sound in human-computer interaction: A review and design guidelines. Journal of Educational Computing Research, 11(3):211-233, 1994.

[13] Frankie James. Presenting HTML Structure in Audio: User Satisfaction with Audio Hypertext. ICAD '96 Proceedings, pages 97-103, Xerox PARC, 4-6 November 1996. ICAD.

[14] Frankie James. Presenting HTML Structure in Audio: User Satisfaction with Audio Hypertext. (Working paper) Stanford University Digital Libraries Working Paper SIDL-WP-1996-0046, 1996.

[15] Rosalind I. Java, Zofia Kaminska, and John M. Gardiner. Recognition memory and awareness for famous and obscure musical themes. European Journal of Cognitive Psychology, 7(1):41-53, March 1995.

[16] P. Ladefoged and D.E. Broadbent. Perception of sequence in auditory events. Quarterly Journal of Experimental Psychology, 12:162-170, 1960.

[17] John M. McNeil. Americans with disabilities: 1991-1992. Current population report, series p70-33, U.S. Bureau of the Census, Washington, D.C., 1993. Published by the U.S. Government Printing Office.

[18] Steve Portigal. Auralization of document structure. Master's thesis, University of Guelph, 1994.

[19] Productivity Works. pwWebSpeak, 1996. See http://www.prodworks.com/pwWebspk.htm

[20] T.V. Raman et al. Congrats: A system for converting graphics to sound. Proceedings of the Johns Hopkins National Search for Computing Applications to Assist Persons with Disabilities, pages 170-172, Laurel, MD, 1-5 February 1992. IEEE Computer Society Press.

[21] T.V. Raman. Audio System for Technical Readings. PhD thesis, Cornell University, May 1994.

[22] T.V. Raman. Emacspeak-direct speech access. ASSETS '96: The Second Annual ACM Conference on Assistive Technologies, pages 32-36, New York, April 1996. ACM SIGCAPH, Association for Computing Machinery, Inc.

[23] Byron Reeves and Clifford Nass. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press, New York, 1996.

[24] Lawrence A. Scadden. Blindness in the information age: Equality or irony? Journal of Visual Impairment and Blindness, pages 394-400, November 1984.

[25] Sonicon. Marco Polo, 1996. See http://www.Webpresence.com/sonicon/marcopolo/.

[26] David W. Stewart, Kenneth M. Farmer, and Charles I. Stannard. Music as a recognition cue in advertising-tracking studies. Journal of Advertising Research, 30(4):39-48, Aug.-Sept. 1990.


URLs

Pilot Experiment Test Pages: http://www-pcd.stanford.edu/~fjames/testpages/

Berkeley Systems' outSPOKEN: http://access.berksys.com/

pwWebSpeak: http://www.prodworks.com/pwWebspk.htm

Sonicon's Marco Polo: http://www.Webpresence.com/sonicon/marcopolo/

