The Speech-Enabled WWW Browser

Emacspeak is a speech-enabled audio desktop (see [4,6]) that provides fluent speech access to a rich collection of authoring and messaging applications. It is tightly integrated with the Emacs W3 WWW browser by William Perry, a powerful, standards-compliant WWW browser implemented on top of Emacs. As a WWW browser, Emacs W3 has the virtue of integrating the WWW into one's day-to-day information environment, and it enables the rich exchange of documents between the WWW browser and other information processing applications running on the desktop.

The integration of Emacspeak and W3 provides a powerful speech environment in which the user can fluently browse the WWW while reading email or Usenet news, or on encountering a URL within a document that is being edited or proofread. In addition to the advantages of a speech-enabled browser described in the section on user experience, this tight integration between the speech-enabled WWW browser and the audio desktop is a crucial factor in making the Emacspeak platform a very productive environment for day-to-day work.

Architecture

The Emacs W3 browser parses the incoming HTML document and draws it to the screen. The browser preserves the parsed structure from which the visual display was generated, and it is this preserved structure that makes it possible to speech-enable the browser. The Emacs W3 browser implements Cascading Style Sheets (CSS) as specified in the CSS1 specification. It also implements the cascading speech style sheet extensions (see [7]).

The techniques used to implement visual and aural styles are analogous. In Emacs, all textual content is displayed and manipulated by placing the text in a buffer. The Emacs system is responsible for managing and displaying text placed in buffers. Such text can be annotated with additional properties that control its visual appearance, e.g., the color and font used. W3 implements the visual style sheet by annotating the text being displayed with the appropriate visual properties. In addition, when running in a speech-enabled context, as when using Emacspeak, W3 annotates the text with the speech properties specified in the speech style sheet. When the displayed document is spoken by Emacspeak, the user hears an audio formatted rendering (see [1]). This implementation strategy has the added advantage of keeping the spoken and visual renderings synchronized; thus, someone who is both looking at the screen and listening to the output perceives the effect of both the visual and the aural style sheets.
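The annotation strategy described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the W3 or Emacspeak implementation: the Run class, the two style tables, and the property names "face" and "voice" are all hypothetical. The point it shows is that one annotated copy of the text serves both renderings, which is why they stay synchronized.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """A stretch of buffer text plus the properties attached to it (hypothetical model)."""
    text: str
    props: dict = field(default_factory=dict)


# Hypothetical style sheets mapping an element name to properties.
VISUAL_STYLE = {"h1": {"face": "bold"}, "em": {"face": "italic"}}
SPEECH_STYLE = {"h1": {"voice": "loud"}, "em": {"voice": "animated"}}


def annotate(element, text, speech_enabled):
    """Attach visual properties always, and speech properties only in a speech-enabled context."""
    props = dict(VISUAL_STYLE.get(element, {}))
    if speech_enabled:
        props.update(SPEECH_STYLE.get(element, {}))
    return Run(text, props)


def render_visual(runs):
    """Read only the visual property of each run."""
    return [(r.text, r.props.get("face", "default")) for r in runs]


def render_speech(runs):
    """Read only the speech property of the same runs."""
    return [(r.text, r.props.get("voice", "default")) for r in runs]


runs = [
    annotate("h1", "Coffee order", speech_enabled=True),
    annotate("p", "Choose an amount.", speech_enabled=True),
]
```

Because both renderers walk the same list of runs, a change to the text or its order is reflected in the visual and the spoken output alike.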

Fill Out Forms

The Emacs W3 browser implements interactive document elements such as HTML forms. The underlying implementation relies on the Emacs widget library. While creating the widget objects that make up the controls in an HTML form, the WWW browser adds sufficient contextual information to each widget object for Emacspeak to produce fluent spoken dialogues from these widgets. The user experience when interacting with such fill out forms is described in the subsequent section. Here, we illustrate the process by which HTML markup for a specific form control (a radio button group) is transformed into an appropriate widget object that encapsulates all of the contextual information available to the browser from parsing the HTML source. We also point out some of the shortcomings of the current forms specification in HTML 3.2 and suggest possible extensions that would improve the usability of fill out forms with speech. Note that some of these suggested modifications have already been implemented in the Emacs W3 browser as a proof of concept. These and other extensions have also been proposed in a draft specification entitled Design Issues for HTML Forms, available at http://www.w3.org/pub/WWW/TR/WD-forms-970203 .

We use a sample radio button group that might appear in a coffee order form to demonstrate the mapping of HTML form elements to contextually rich user interface widgets. Notice that the radio button group has been encoded using some of the enhancements detailed in the section on proposed extensions. See the section on user experience for details on the user experience when working with forms that use the current HTML standard encoding, and on the improvements that result from the enhanced encoding.

When W3 parses an HTML document containing interactive form elements, it maps these to appropriate user interface widgets such as checkboxes or radio groups; in this sense, the processing is no different from that of other WWW browsers. However, when processing the radio group shown in the example, W3 annotates each radio button with its associated name (e.g., a0 in the example) as well as its label, if available (e.g., 5 pounds). The interface treats the entire group of radio buttons as a single logical entity. This is easier when the group tag is used, but is still possible in the standard case by applying heuristic techniques. When providing speech feedback, Emacspeak examines this additional contextual information stored in the interface widgets to produce spoken dialogues that approximate what a human would say. Thus, for example, the user hears:

Group how much coffee would you like is currently set to 5 pounds
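As a minimal sketch of how such an utterance could be assembled from the contextual information stored in the widgets, consider the following Python model. It is an assumption-laden illustration, not the Emacs widget library or Emacspeak code: the RadioButton and RadioGroup classes, the spoken_prompt function, the button values, and every button other than a0 with the label 5 pounds are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RadioButton:
    """One radio button, annotated with its name and (if available) its label."""
    name: str
    value: str
    label: Optional[str] = None
    checked: bool = False


@dataclass
class RadioGroup:
    """The whole group treated as a single logical entity with a caption."""
    caption: str
    buttons: List[RadioButton]

    def selected(self) -> Optional[RadioButton]:
        """Return the checked button, or None when nothing is selected."""
        return next((b for b in self.buttons if b.checked), None)


def spoken_prompt(group: RadioGroup) -> str:
    """Build the utterance from the group's caption and the current selection."""
    current = group.selected()
    if current is None:
        return f"Group {group.caption} is currently not set"
    # Fall back to the raw value when the markup supplied no label.
    return f"Group {group.caption} is currently set to {current.label or current.value}"


# Hypothetical coffee-order group; only a0 / "5 pounds" comes from the text above.
coffee = RadioGroup(
    caption="how much coffee would you like",
    buttons=[
        RadioButton(name="a0", value="5", label="5 pounds", checked=True),
        RadioButton(name="a1", value="10", label="10 pounds"),
    ],
)

print(spoken_prompt(coffee))
```

Without the label and caption, the only information left to speak would be the button's name and raw value, which is why the contextual annotation matters for producing a fluent prompt.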

T. V. Raman
Email: raman@adobe.com
Last modified: Wed Feb 19 11:55:57 1997