Beyond Point and Click - A Conversational Interface to a Browser

Raymond Lau, Giovanni Flammia, Christine Pao and Victor Zue

Spoken Language Systems Group
MIT Laboratory for Computer Science
545 Technology Square
Cambridge, MA 02139
United States of America

raylau@sls.lcs.mit.edu, flammia@sls.lcs.mit.edu,
pao@sls.lcs.mit.edu, zue@sls.lcs.mit.edu


This paper presents WebGALAXY, a flexible multi-modal user interface system that allows wide access to selected information on the World Wide Web (WWW) by integrating spoken and typed natural language queries and hypertext navigation. WebGALAXY extends our GALAXY spoken language system, a distributed client-server system for retrieving information from on-line sources through speech and natural language. WebGALAXY supports a spoken user interface via a standard telephone line as well as a graphical user interface via a standard Web browser using either a Java/JavaScript or a cgi-bin/forms front end. Natural language understanding is performed by the system, and information servers retrieve the requested information from various on-line resources including WWW servers, Gopher servers and CompuServe. Currently, queries about three domains are supported: weather, air travel, and points of interest around Boston.

Table of Contents

1. Introduction
2. GALAXY Architecture
3. WebGALAXY Architecture
4. Java Interface
4.1. Java Implementation
5. Forms Interface
5.1. Forms Implementation
6. Conclusions and Future Directions
7. Acknowledgments
8. References

1. Introduction

We are witnessing an explosion in the quantity of information and services available on-line, brought on by the Internet and the World Wide Web boom. One can now obtain a plethora of on-line data, ranging from New York Times news stories to "Dilbert" trivia, and services, such as purchasing airline tickets and scheduling package pickups. Today, there are nearly 300 thousand Web servers hosting in excess of 30 million publicly accessible homepages. While the content of information is growing exponentially, the access mechanism for it has remained relatively primitive; the search engines are only capable of fetching and showing the information as is, and the user interface is restricted to typing on a keyboard and pointing/clicking with a mouse. Traversing Web space is tedious and time-consuming, often requiring users to expend valuable and scarce cognitive capacity to keep track of the links.

We believe, as do many others, that a speech interface for a browser is ideal for naive users because it is the most natural, flexible, efficient, and economical form of human communication. However, providing a speech interface is much more than simply being able to "speak" the icons and hyperlinks that are designed for keyboard and mouse. This is because replacing one modality by another, while undoubtedly useful in hands-busy environments and for disabled users, does not necessarily expand the system's capabilities or lead to new interaction paradigms. Instead, we need to explore how spoken language technology can significantly expand the user's ability to obtain the desired information from the Web easily and quickly. In our view, speech interfaces should be an augmentation of, rather than a replacement for, mouse and keyboard. A user should be able to choose among many input/output modalities to achieve the task in the most natural and efficient manner.

Spoken language interaction is particularly appropriate when the information space is broad and diverse, or when the user's request contains complex constraints. Both of these situations occur frequently on the Web. For example, finding a specific homepage or document now requires remembering a URL, searching through the Web for a pointer to the desired document, or using one of the keyword search engines available. The heart of the problem is that the current interface presents the user with a fixed set of choices at any point, of which one is to be selected. Only by stepping through the offered choices and conforming to the prescribed organization of the Web can the user reach the document they desire. The multitude of indexes and meta-indexes on the Web is testimony to the reality and magnitude of this problem. The power of spoken language in this situation is that it allows the user to specify what information or document is desired (e.g., "Show me the MIT homepage," "Will it rain tomorrow in Seattle," or "What is the zip code of Santa Clara, California"), without having to know where and how the information is stored. Complex requests can arise when a user is interested in obtaining information from on-line databases. Constraint specifications that are natural to users (e.g., "I want to fly from Boston to Hong Kong with a stopover in Tokyo," or "Show me the hotels in Boston with a pool and a Jacuzzi") are both diverse and rich in structure. Menu or form-based paradigms cannot readily cover the space of possible queries. A spoken language interface, on the other hand, offers a user significantly more power in expressing constraints, thereby freeing them from having to adhere to a rigid, preconceived indexing and command hierarchy.

In fact, many tasks that a user would like to perform on the Web - browsing for the cheapest airfare, for example, or looking for a reference article - are exercises in interactive problem-solving. The solution is often built up incrementally, with both user and computer playing active roles in the "conversation." Therefore, several language-based technologies must be developed and integrated to reach this goal. On the input side, speech recognition must be combined with natural language processing so the computer can understand spoken commands (often in the context of previous parts of the dialogue). On the output side, some of the information provided by the computer - and any of the computer's requests for clarification - must be converted to natural sentences, and perhaps delivered verbally.

Since 1989, our group has been conducting research leading to the development of conversational interfaces to computers. The most recent system we developed, called GALAXY, is a speech-based interface that enables universal information access using spoken dialogue. The initial demonstration of GALAXY is in the domain of travel planning and knowledge navigation, making use of many on-line databases, most of them available on the Web. Users can query the system in natural English (e.g., "What is the weather forecast for Miami tomorrow," "How many hotels are there in Boston," and "Do you have any information on Switzerland," etc.), and receive verbal and visual responses.

The GALAXY conversational interface was a client application running under the X Window System. A major constraint of the X-based client was that a user must have access to an X server in order to use it. Unfortunately, most personal computer users do not have X server software, and furthermore the X protocol requires a high-bandwidth Internet connection to function acceptably. This paper describes the WebGALAXY project, whose goal is to integrate the client into a Web browser, which is available on almost any platform today, with no need for the user to download any additional software or plug-ins. More importantly, WebGALAXY serves as an illustration of a possible new paradigm for interfaces that are rich, flexible, and intuitive.

2. GALAXY Architecture

At the Spoken Language Systems group, we have been exploring natural language technologies for real-world applications for several years. Our primary research testbed is the MIT GALAXY system ([1], [2], [3]). GALAXY allows speech and natural language access to a variety of online information and services. A distributed client-server architecture is employed. A voice recognition server, using the SUMMIT speech recognizer ([4]), converts spoken input into hypothesized text strings, which are then passed on to a natural language (NL) server using the TINA understanding component ([5]). User requests can also be entered by typing or by clicking, in which case the input is passed directly to the NL server. The NL server returns a semantic frame with slots and values representing the user request. This frame is then forwarded to an appropriate information server to obtain the necessary information or to initiate the appropriate transaction. Information servers then communicate with a combination of local databases, HTTP, Gopher and SQL servers, proprietary commercial networks like CompuServe, and potentially other online resources, to fulfill the user request conveyed by the semantic frame. The information server returns a semantic frame that contains a response, which may include an HTML display, a URL, and/or a natural language response. The GALAXY client is responsible for conveying the reply to the user using the most appropriate modality (graphic display, synthesized speech), completing the current round of user interaction. Information about the current state of the dialog is maintained between rounds to allow references to previous information (e.g., "Give me more information for the first one"). Currently, the demonstration GALAXY system handles requests about three domains: weather, air travel, and points of interest around Boston. An example of user interaction is shown in Figure 1. We have also developed information servers for other domains, including automobile classifieds ([6]) and restaurants ([7]).

User: I would like to go from San Francisco to Boston tomorrow.
GALAXY: Here are the flights that leave from San Francisco to Boston tomorrow.
(GALAXY shows a list of flights.)
User: (User clicks on United flight 32.)
Can I see more information about this flight?
GALAXY: Here is the information for United flight 32.
(GALAXY shows schedule, aircraft type, meals served, etc.)
User: Show me the fares.
GALAXY: Here are the fares.
(GALAXY shows the fares for United flight 32.)
User: What is the weather going to be in Boston?
GALAXY: Here is the extended forecast for Boston.
(GALAXY shows the four-day forecast for Boston.)
User: How many hotels are there in Boston?
GALAXY: There are 46 hotels in Boston listed.
(GALAXY provides a listing of hotels.)
User: Are there any Chinese restaurants on Mass Avenue in Cambridge?
GALAXY: Here are the Chinese restaurants on Massachusetts Avenue in Cambridge.
(GALAXY provides a listing of restaurants.)
User: Which one is closest to MIT?
GALAXY: Here is the Chinese restaurant in Cambridge on Massachusetts Avenue closest to MIT.
(GALAXY provides a list with the closest restaurant.)
User: How do I get there?
GALAXY: Starting from MIT, follow the traffic on Massachusetts Avenue...
(GALAXY provides driving directions.)
Figure 1. An example of interaction between a user and GALAXY.
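The round of interaction described above - recognized text parsed into a semantic frame, the frame routed to a domain server, and a reply frame returned - can be sketched in miniature as follows. The frame fields, parsing heuristics, and server functions below are invented for illustration and do not reflect GALAXY's actual protocol.

```python
# Sketch of GALAXY-style frame routing (illustrative field names, not the
# actual GALAXY representation).

def parse(utterance):
    """Stand-in for the TINA NL server: map text to a semantic frame."""
    text = utterance.lower()
    if "weather" in text or "forecast" in text:
        return {"domain": "weather", "city": text.split()[-1].title()}
    if "flight" in text or "fly" in text:
        return {"domain": "air_travel", "raw": utterance}
    return {"domain": "unknown", "raw": utterance}

def weather_server(frame):
    # An information server consumes a frame and returns a reply frame
    # that may carry HTML, a URL, and/or a natural-language response.
    return {"nl_response": f"Here is the forecast for {frame['city']}.",
            "html": f"<p>Forecast for {frame['city']}: ...</p>"}

# Only the weather server is wired up in this sketch.
SERVERS = {"weather": weather_server}

def handle(utterance):
    """One round of interaction: parse, route, and return the reply frame."""
    frame = parse(utterance)
    server = SERVERS.get(frame["domain"])
    if server is None:
        return {"nl_response": "Sorry, I did not understand that."}
    return server(frame)
```

For example, `handle("What is the forecast for Boston")` routes to the weather server and returns a reply frame whose natural-language response names Boston.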

Integrating a new information server into the system requires three stages. First, we have to define the appropriate semantic frame representation and the needed access protocols to the online resources. Frequently, a local database containing pointers to various resources along with auxiliary information is created to help with this step (e.g. a list of city names and airport codes for air travel). Then, we must add new entries to the pronunciation lexicon of the voice recognition component for the additional words in the new domain. Finally, we need to add a new set of grammar and discourse rules along with new lexical entries for the natural language understanding and the speech synthesis components. When a new server is integrated into GALAXY, the system performance can be iteratively improved by running several usability tests. Collecting speech data from user sessions is necessary to improve the voice recognition accuracy, while collecting natural language data allows refining and improving the coverage of the grammar rules that guide the natural language component.
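The three integration stages might be collected into a single domain definition, as in the sketch below; the data structures, slot names, and pronunciation notation are hypothetical, not GALAXY's actual file formats.

```python
# Hypothetical organization of the three integration stages for a new
# information server (illustrative only).

class DomainDefinition:
    def __init__(self, name):
        self.name = name
        self.frame_slots = []      # stage 1: semantic frame representation
        self.pronunciations = {}   # stage 2: recognizer lexicon additions
        self.grammar_rules = []    # stage 3: NL grammar rules for parsing

# Example: a hotels domain with one new lexicon entry and one rule.
hotels = DomainDefinition("hotels")
hotels.frame_slots = ["city", "amenities", "price_range"]
hotels.pronunciations["jacuzzi"] = "jh ax k uw z iy"   # illustrative phones
hotels.grammar_rules.append(
    ("HOTEL_QUERY", "show me the hotels in CITY with AMENITY"))
```

Keeping all three stages in declarative data like this is what makes the iterative improvement described above practical: usability-test transcripts feed back into the lexicon and grammar rules without touching server code.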

Adding support for a new language requires the definition of new acoustic-phonetic units, a new lexicon and a new set of grammar rules. The semantic frame representation is language-independent, and allows for the translation of the information retrieved from the database from one language into the other. Prototype versions of GALAXY are currently available for Spanish and Mandarin Chinese.

3. WebGALAXY Architecture

As we have mentioned, the original GALAXY system employed a client program running under the X Window System. One of the motivations behind the WebGALAXY project was to foster universal access to the GALAXY technology by moving the client functionality into a standard Web browser and adopting Internet standards, such as HTTP, HTML, and Java, in the client. Today, anyone with an Internet connection and a Web browser can in theory use WebGALAXY and access information via any natural combination of speaking, typing, pointing and clicking. No additional software or plug-ins are required.

The development of WebGALAXY required several changes in the original GALAXY architecture. The previous GALAXY client's functionality was split into two parts: a new WebGALAXY hub and a standard Web browser. The hub maintains the state of the current discourse with the user and also mediates the information flow between the various servers and the Web browser. The Web browser provides the entire graphical user interface to WebGALAXY. Figure 2 outlines the WebGALAXY architecture.

Figure 2. The WebGALAXY client/server architecture utilizes the telephone network for speech input/output and World Wide Web standard protocols for the graphical interface, making GALAXY accessible to a global audience.

Two graphical user interfaces are currently supported: a Java/JavaScript interface with rich interactivity and a more austere cgi-bin/forms interface for browsers which do not support Java. Additionally, WebGALAXY is designed to also support a displayless interface using only spoken language interaction. To start WebGALAXY, the user simply goes to the WebGALAXY homepage, selects either the Java or forms interface, optionally enters a phone number for spoken interaction, and clicks the Start button. If a phone number was entered, the user is called shortly afterwards, and spoken language interaction can occur over the phone. With or without a phone number, the user can always interact with the system through typing and clicking.

4. Java Interface

An example display from the Java version is shown in Figure 3. The top area is where the Java applet resides. There is a status display ("Ready"), a box for either the recognized spoken input or the typed input ("the forecast for Boston"), a paraphrase of the parsed input ("give me a weather report for Boston."), buttons for disconnecting from the system and aborting the current request, a combination button/status display for controlling and indicating the system's listening state ("Listening"), and finally, an iconic indication of the domain of the last request ("Weather"). The lower portion of the browser window is used to display WebGALAXY's response. For spoken input, automatic endpoint detection is employed to determine when a user has started and finished speaking. Audio tones and visual cues in the displayed listening status indicate when the system is listening to the user.

In the display shown here, the user asked for the forecast for Boston orally. The request was correctly handled by the natural language server. The reply with the forecast was generated by the Weather domain server and displayed by WebGALAXY. The user could have also typed the same request. Certain requests generate lists. For example, the request "Show me Chinese restaurants in Cambridge" would generate a list as a reply. The user can then continue to interact verbally, using the names of the restaurants or their ordinal positions ("the second one"). The user can also click on an item in the list and then say "Give me the phone number," referring to the clicked item. For certain types of lists, clicking twice on an item gives more detailed information. Requests for homepages, such as "Show me the homepage for MIT," will retrieve and display the target homepage in the lower area. The user is free to continue browsing with the mouse and keyboard from that page, such as by clicking a link.
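Follow-up references like "the second one," or a click followed by "Give me the phone number," can be resolved against the maintained dialog state in roughly the following way. The state layout and the restaurant names below are illustrative assumptions, not WebGALAXY's internal representation.

```python
# Sketch of resolving follow-up references against the list currently on
# screen (illustrative dialog-state layout).

ORDINALS = {"first": 0, "second": 1, "third": 2, "fourth": 3}

def resolve_reference(phrase, dialog_state):
    """Return the displayed-list item a follow-up request refers to."""
    items = dialog_state["displayed_list"]
    for word, index in ORDINALS.items():
        if word in phrase and index < len(items):
            return items[index]
    # No ordinal in the phrase: fall back to the last clicked item, if any.
    return dialog_state.get("clicked_item")

# Hypothetical state after "Show me Chinese restaurants in Cambridge".
state = {"displayed_list": ["Royal East", "Mary Chung", "Pu Pu Hot Pot"],
         "clicked_item": None}
```

With this state, "the second one" resolves to the second listed restaurant, while after a click `resolve_reference("give me the phone number", state)` returns whichever item was clicked.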

Figure 3. The Java-based graphical interface. At the top, an Applet connected to the WebGALAXY hub displays the speech recognition output and the system status in real time. The Applet directs the display of HTML formatted responses from WebGALAXY to frames below.

4.1. Java Implementation

A key requirement for a fully interactive interface is for the hub to be able to push a new reply to the user's browser. To accomplish this, we take advantage of the ability to have a Java applet maintain control channel communications with the hub through a TCP socket. An API was developed which describes the set of control channel messages. An example message might be for the hub to tell the applet to load a new URL. This Java applet API is the only proprietary protocol employed between the hub and the browser. We decided that the actual reply is best kept in standard HTML. This allows full use of the browser's capabilities, including embedded Java applets, animated GIFs, tables, etc. We include rudimentary HTTP server functionality in the hub to support delivery of the HTML content to the Web browser. When a new reply is available, we cache it for later access by the hub's HTTP server and send a request over the control channel, directing the Java applet to load a specific URL from the hub's HTTP server. The HTTP server is also used to deliver the Java class files along with the accompanying artwork.
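A control channel of this kind can be as simple as line-oriented messages over the TCP socket. The message framing and the command name below (LOAD_URL) are invented for illustration, since the actual hub-applet API is not specified here.

```python
# Sketch of a line-oriented control-channel message layer of the sort the
# hub could use to push a "load this URL" command to the applet.
# (Message names are invented for illustration.)

def encode_message(command, *args):
    """Frame one control message as a newline-terminated text line."""
    return " ".join([command, *args]) + "\n"

def decode_message(line):
    """Split a received line back into (command, argument list)."""
    parts = line.strip().split(" ")
    return parts[0], parts[1:]

# The hub caches each reply under a URL served by its built-in HTTP
# server, then tells the applet to fetch it:
msg = encode_message("LOAD_URL", "http://hub.example/reply/42.html")
command, args = decode_message(msg)
```

A text framing like this keeps the proprietary surface between hub and applet to a single, easily versioned message set, while everything the user sees stays in standard HTML.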

There is also a limited amount of interactive input permitted within GALAXY. For example, when a list is displayed (e.g., "Show flights from Boston to Los Angeles tomorrow."), the user is permitted to click on items in that list. Normally, a single click only modifies the dialog state, so GALAXY knows that "item X has been selected." The user may then follow up with a request affecting the clicked-on item. If we were to create normal links, i.e., standard HREF tags, then such clicks would necessarily generate an annoying visual update as the browser loads a new page. Instead, we have decided to use JavaScript to send a message to the Java applet, which then gets communicated back to the hub over the control channel.

In both the Java and Forms interfaces, access to the hub's control channel is regulated through the use of magic cookies. Before the hub will initiate a session over the control channel, it must receive a valid cookie. The hub's HTTP server does not currently impose any access restrictions. Thus any user can grab the WebGALAXY artwork and the Java class files used to implement the applet, but we do not currently consider this to be a serious risk. We merely want to restrict the initiation of new WebGALAXY sessions with the hub.
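A magic-cookie gate of the kind described might look like the following sketch, assuming the hub issues a random single-use token per session and validates it before opening a control channel; the exact WebGALAXY scheme is not specified here.

```python
# Sketch of magic-cookie session gating for the control channel
# (single-use random tokens; an assumption, not the documented scheme).
import secrets

_valid_cookies = set()

def issue_cookie():
    """Hand out a fresh random cookie when a new session page is served."""
    cookie = secrets.token_hex(16)
    _valid_cookies.add(cookie)
    return cookie

def accept_session(cookie):
    """Allow one control-channel session per issued cookie."""
    if cookie in _valid_cookies:
        _valid_cookies.discard(cookie)   # single use: replay is rejected
        return True
    return False
```

Note that, as the text observes, this gates only session initiation: artwork and class files remain world-readable from the hub's HTTP server, which is an accepted risk.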

5. Forms Interface

The forms interface to WebGALAXY is essentially a stripped-down version of the Java interface. We created the forms interface for use behind firewalls which may not permit the Java applet's socket connection, and also for situations where Java is not available or the communications bandwidth is too low. Many of the features here are analogous to those in the Java version. The user can type in a request and click the Submit button. If a phone number was provided, the user can also click a Record button to enter recording mode, at which point they may speak their request. Automatic endpoint detection is employed to detect the end of the spoken request. Audio cues help indicate when the user may start speaking and when the end of their speech has been detected. Unlike the Java case, the forms interface does not continuously listen and automatically detect a request. The reason for this is that without the help of Java, the system requires a user action through the browser (e.g., clicking on Record or Submit) before it can push updated information to the browser. If it were continually listening, it could detect and handle a spoken request, but would have no way of sending the reply to the user's browser. (It is possible with some browsers to implement a server push model, but since we intend the forms version to require minimal communications, computational and browser support, we have elected not to do so.)
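A minimal energy-threshold endpointer of the kind referred to above - declaring end-of-utterance once speech has begun and is followed by a run of low-energy frames - can be sketched as follows. The threshold and silence-run length are illustrative, not the values used in the actual system.

```python
# Minimal energy-threshold endpointer sketch (illustrative parameters).

def find_endpoint(frame_energies, threshold=0.1, trailing_silence=5):
    """Return the index of the first frame of trailing silence (the
    utterance endpoint), or None if no endpoint was detected."""
    speaking = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            speaking = True      # speech has begun (or resumed)
            silent_run = 0
        elif speaking:
            silent_run += 1
            if silent_run >= trailing_silence:
                return i - trailing_silence + 1
    return None
```

Real endpointers adapt the threshold to background noise and impose minimum speech durations, but the control flow - arm on speech, fire after sustained silence - is the same.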

5.1. Forms Implementation

For the forms version, the exact same hub is used, but we created a translation gateway to convert between our hub-applet API and a simpler cgi-bin API. In this interface, the user clicks on a Record button to speak an input to the system. This invokes a cgi-bin script which communicates with the gateway. The translation gateway then tells the hub, via the same hub-applet control channel API used by the Java applet, to start listening, and then it waits to receive a reply-done message from the hub. At that point, it retrieves the response from the hub's HTTP server, applies some translations to the response, and passes it back to the browser. The translations referred to remove the JavaScript protocol used to handle interactive lists in the Java case. Each such link gets converted into a simple HREF tag, which talks back to our cgi-bin script and from there gets passed to the gateway and then the hub. As mentioned previously, this causes an annoying browser refresh, but without a control channel as in the Java case, we have to make this compromise. The design of the forms interface as a gateway using the exact same hub was made to reduce the support costs involved with handling multiple APIs in the hub. We note that the same strategy can be used to support a displayless, speech-only interface.
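The gateway's link translation can be approximated with a simple rewrite that turns JavaScript item-click links into ordinary HREFs pointing back at the cgi-bin script. The handler name (`selectItem`), the URL shape, and the query parameter below are all assumptions made for illustration.

```python
# Sketch of the gateway's link translation for the forms interface:
# JavaScript item-click links in the hub's HTML are rewritten as plain
# HREFs aimed at the cgi-bin script. (selectItem and the ?select=
# parameter are hypothetical names.)
import re

def translate_links(html, cgi_url="/cgi-bin/webgalaxy"):
    """Rewrite javascript: item links as ordinary cgi-bin HREFs."""
    return re.sub(
        r'href="javascript:selectItem\((\d+)\)"',
        lambda m: f'href="{cgi_url}?select={m.group(1)}"',
        html)
```

For example, a list entry such as `<a href="javascript:selectItem(3)">United 32</a>` becomes a plain link back to the cgi-bin script, at the cost of the full-page refresh the text mentions.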

6. Conclusions and Future Directions

The GALAXY system demonstrates that advances in voice recognition, natural language understanding, and other technologies are beginning to make human language access to information and services on the World Wide Web and other online sources a reality, albeit in limited domains such as weather and air travel. The WebGALAXY effort demonstrates that it is possible to use this system within any Web browser and represents a tremendous step towards universal access to this technology. Additionally, the WebGALAXY architecture is designed to support a displayless interface using only spoken language interaction.

We have successfully tested WebGALAXY with Netscape Navigator 3.0 running under Windows, MacOS, Linux, SunOS, and Solaris, and with Microsoft Internet Explorer 3.0 running under Windows. We have demonstrated WebGALAXY from locations within the United States, Italy and Sweden, and with Internet connections as slow as 28,800 bps. However, due to constraints on the number of telephone interfaces we have (one), on the computational resources dedicated to the information, natural language, and recognition servers, and also to the constantly changing developmental nature of the system, we are not yet able to make WebGALAXY available to the general public.

We are witnessing a shift in the types of interfaces to the Web. Desktop browsers are being replaced by browsers that reside in many types of devices such as hand-held personal digital assistants, smart digital telephones, and television set-top boxes. Each of these devices has specific input and output interfaces and limitations. A multi-modal user interface that supports typed and spoken natural language could provide easy and universal access to the Web from different devices and in multiple languages, reaching a much wider audience.

On the content side, we recognize that the type of semantic units on the Web is rapidly shifting from one of static modality (ASCII text and graphics files) to multiple dynamic modalities (text and graphics, responses generated on the fly, speech, audio, and video data). Clearly, all these diverse types of content and modalities require a paradigm shift in the content organization and underlying communication protocols.

WebGALAXY is a small step in the direction of shifting the paradigm of the user interface to the Web from simple point-and-click navigation in a deep forest of HTML documents towards richer, more flexible and intuitive navigation. The WebGALAXY servers and clients are being designed to handle multiple input and output modalities in a systematic way. Currently, WebGALAXY encodes the communication between the various servers and clients with specific protocols that are defined at the software level, i.e., in software source code and in text files that list program arguments and parameters. To allow for rapid application development and portability to new domains and new languages, we are trying to minimize the need for writing software by specifying many of the domain parameters, such as lexica, dictionaries, grammar rules and dialogue management rules, in text files that are easily edited. We would like to push this approach further by specifying a uniform semantic content communication protocol for the information that is currently scattered across a variety of open and proprietary protocols such as HTTP, CORBA, and SQL. We are especially interested in standardizing the protocol for accessing the natural language server and the speech recognition server. A standard, generic and simple communication protocol modeled after HTTP and HTML would foster the rapid deployment of a multitude of natural language and speech recognition servers all across the Web. In addition, we would like to create an authoring tool for rapid development of information servers which rely primarily on Web-based resources.

Extensions to the HTML markup language coupled with an easy-to-use authoring tool could specifically handle semantic content across different media, natural language queries and responses, and dialogue management for maintaining the discourse state between interactions.

Finally, WebGALAXY, and in general spoken language interaction with the Web, would clearly benefit from advances in Internet and telephone technology that allow for simultaneous transmission of HTML data and voice input/output carried over the same connection. We are also interested in extending HTML to explicitly represent mixed media, such as a speech signal and ASCII text that reside in the same document and possibly the same data transmission channel.

7. Acknowledgments

The GALAXY system, upon which the present work is based, represents the research efforts of many other current and former members of the Spoken Language Systems group. It is they who have really made human language access to the information space a reality. GALAXY developers include Eric Brill, Jim Glass, David Goddeau, Lee Hetherington, Ed Hurley, Helen Meng, Mike Phillips, Joe Polifroni, and Stephanie Seneff. WebGALAXY would not have been possible without the efforts of Rafael Schloming (assistance with the hub), Sally Lee (artwork) and Stephanie Kwong (assistance with the Java Applet).

8. References

[1] D. Goddeau, E. Brill, J. Glass, C. Pao, M. Phillips, J. Polifroni, S. Seneff and V. Zue, "GALAXY: A Human-Language Interface to On-Line Travel Information," Proc. Int'l Conference on Spoken Language Processing '94, Yokohama, Japan, pp. 707-710, September 1994. URL http://www.sls.lcs.mit.edu/ps/SLSps/icslp94/galaxy.ps

[2] V. W. Zue, "Navigating the Information Superhighway Using Spoken Language Interfaces," IEEE Expert, vol. 10, no. 5, pp. 39-43, October 1995.

[3] "The Galaxy's Guide to the Hitch-Hiker," Economist, p. 77, May 11, 1996. URL http://www.economist.com/issue/11-05-96/st1.html

[4] V. Zue, J. Glass, M. Phillips, and S. Seneff, "The MIT SUMMIT Speech Recognition System: A Progress Report," Proc. DARPA Speech and Natural Language Workshop, Philadelphia, PA, pp. 179-189, February 1989.

[5] S. Seneff, "TINA: A Natural Language System for Spoken Language Applications," Computational Linguistics, vol. 18, no. 1, pp. 61-86, March 1992.

[6] H. Meng, S. Busayapongchai, and V. Zue, "WHEELS: A Conversational System in the Automobile Classifieds Domain," Proc. Int'l Conference on Spoken Language Processing '96, Philadelphia, PA, vol. 1, pp. 225-228, October 1996.

[7] S. Seneff and J. Polifroni, "A New Restaurant Guide Conversational System: Issues in Rapid Prototyping for Specialized Domains," Proc. Int'l Conference on Spoken Language Processing '96, Philadelphia, PA, vol. 2, pp. 248-251, October 1996.
