WebCutter: A System for
Dynamic and Tailorable Site Mapping

Yoelle S. Maarek
Michal Jacovi, Menachem Shtalhaim
Sigalit Ur, Dror Zernik  
IBM Haifa Research Laboratory
MATAM, Haifa 31905, ISRAEL
yoelle@haifa.vnet.ibm.com

Israel Z. Ben Shaul  
Dept. of Electrical Engineering
Technion, Israel Institute of Technology
Haifa 32000, ISRAEL
issy@ee.technion.ac.il

Abstract

Conventional information discovery tools can be classified as being either search oriented or browse oriented. In the context of the Web, search-oriented tools employ text-analysis techniques to find Web documents based on user-specified queries, whereas browse-oriented ones employ site mapping and visualization techniques to allow users to navigate through the Web. This paper presents a unique approach that tightly integrates searching and browsing in a manner that improves both paradigms. When browsing is the primary task, it enables semantic content-based tailoring of Web maps in both the generation and the visualization phases. When search is the primary task, it enables one to contextualize the results by augmenting them with the documents' neighborhoods. The approach is embodied in WebCutter, a client-server system fully integrated with Web software. WebCutter consists of a map generator running off a standard Web server and a map visualization client implemented as a Java applet runnable from any standard Web browser and requiring no installation or external plug-in application. WebCutter is in beta stage and is in the process of being integrated into the Lotus Domino.Applications product line.

1. Introduction and Motivation

The incredible size of the Web, coupled with its inherent lack of structure both at the inter- and intra- document levels, introduces great challenges for information discovery. To provide assistance in this daunting task, several approaches and corresponding tools have been proposed, mostly revolving around the two basic discovery paradigms --- searching and browsing.

The main limitation in most search engines and site mappers is that they are essentially search-only or browse-only, respectively. In other terms, they support one paradigm and effectively ignore the other. This approach introduces major scalability problems. At some point, information found by search engines becomes too big to be meaningful, due to the unrestricted global scope of the issued queries; and the generated maps tend to be too complicated due to lack of filtering or other content-based parametrization for both generation and visualization.

To address this limitation, several attempts were made to integrate browsing and searching. Some search engines, such as Yahoo, support search within a scope specified by a predefined browsing hierarchy. Other engines, such as WebGlimpse [Manber et al. 97], allow users to dynamically search the direct neighborhood of a visited page from within the Web browser. Other engines automatically create a browsing hierarchy from search results [Maarek 95], or at least a flat grouping for browsing, e.g., Excite.

Conversely, most site mapping tools provide primitive search facilities over the generated maps. Overall, however, most tools are either primarily search or primarily browse tools, with only simple cross-paradigm integration.

Our new approach, presented in this paper, does not make such a sharp distinction, and employs a tight and wide integration of searching and browsing, both in the underlying technology as well as in the overall purpose and functionality. The general idea is to integrate full-text search techniques inside the various functions of site mapping. This enables us to provide content-based tailoring of Web maps, not only in the visualization but also in the generation of the map itself. Moreover, the same tool can also serve for contextualized search, i.e., to find information along with its neighborhood, which may at times be as important as the found elements. So, depending on the user's goals, either browsing or searching or both can be employed. In short, our approach advocates the convergence of both paradigms in order to improve the overall support for information discovery.

The general approach of intertwining browsing and searching was embodied in the WebCutter system, presented in the rest of this paper. In Section 2, we outline the main concepts underlying WebCutter and its general functionality. In Section 3, we present WebCutter's unique and open architecture, which is by itself another important contribution of this work. Section 4 provides an operational view of WebCutter by walking through typical scenarios. Section 5 compares WebCutter to related work, and Section 6 summarizes the contributions of this paper.

2. System Concepts and Technical Overview

WebCutter is both a mapping and a visualization tool, enabling users to create, maintain and view "Web maps", i.e., an organization of a set of nodes (URLs) and their interconnections (hyperlink references). This is in contrast with visualization-only tools that read a configuration file to construct a graphical view of a site but do not actually generate the maps. Furthermore, WebCutter is a dynamic site mapping tool, in the sense that the browse structure is automatically generated by a crawler that starts off from a set of initial "seed" URLs and constructs the map dynamically (although it is possible of course to create the map off-line, separately from the viewing phase). WebCutter's functionality rests mostly on two main powerful enabling technologies that set it apart from most other existing site mapping tools:

  1. Advanced text analysis technology that implements the tailoring of Web maps to specific interests.
  2. A Java-based visualization technology that brings to the client multiple views of a Web map.

Based on these technologies, WebCutter provides several innovative features and properties, in addition to the conventional features found in most mapping tools.

2.1 Maps Tailored to Users' Interests

WebCutter enables users to tailor, or "cut" the map according to the concepts they are interested in, expressed in natural language. Thus, instead of indiscriminately crawling through a large volume of hyperlinks, WebCutter "shapes" the map so at to explore more thoroughly in directions where relevant information is found. This is done by employing the "fish search" paradigm, [De Bra et al.], whereby each crawler agent explores "nutritious" areas which are likely to contain more "food", i.e. relevant documents, and neglects "dry" directions with no relevant findings.

A key aspect in determining areas with probable relevant information is the ability to evaluate and rank the relevance of documents. This is done by using a dedicated text-analysis component derived from the Guru full-text search engine. Guru applies a statistical information-retrieval approach and uses an original indexing unit based upon the notion of lexical affinities (i.e., multiple-word, co-occurrence based indexing units). The Guru system is discussed at length elsewhere [Maarek et al. 91], [Maarek 91].

Relevance information (or search information) is not used exclusively in the map construction stage, however. It is also used in the visualization stage to highlight documents differently according to their relevance to the user-specified interests. For example, Figure 1, shows a map of www.Lotus.com tailored to "Domino". Highlighting of relevant documents is done by coloring, i.e., the darker the shade of blue, the more relevant the document is. WebCutter thus employs similar but distinct usages of the user-specified tailoring information: one for affecting the generation of the map, and the other one for showing this information in context during visualization. These two phases are interleaved, though, as the user alternates between firing (incrementally) automatic exploration and manually selecting relevant nodes from which to continue exploration, recursively. In addition, users can exploit the highlighting by applying special operations that apply to (relevant) sub-maps. Both the issues of visualization and of subset selections are discussed later in this section.

Figure 1. Star-like view.

2.2 Multi-Perspective Visualization

Another important feature of WebCutter involves its multi-perspective visualization. WebCutter employs sophisticated layout algorithms [Rudich et al. 92] and multiple presentations of maps in 2D and 3D. Each of these presentations is designed to reflect different aspects and additional manipulations of the map. Examples of views provided by WebCutter are the traditional tree-control view, particularly useful for abstraction and ellipsis; a star-like layout (shown in Figure 1), useful for pursuing incremental exploration; and a fisheye view [Sarkar & Brown 92], shown in Figure 2., which enables users to focus on different regions of the graph. Additional content-based views can be provided as well: for example, we are constructing a view based on the conceptual similarity between documents, by utilizing a variant of Guru for finding the similarity between documents and using the clustering technology developed in [Maarek & Ben Shaul 96]. Some of the layout algorithms were specifically customized to site map visualization, while others used conventional techniques [Di Battista et al. 93]. Finally, each of the views can be further customized manually by the user, by allowing him/her to change the position of the nodes in the graph.

The screen dumps shown in this paper are in GIF format, but could be provided here as actual Java applets with which users could interact, provided that the paper be only read on-line rather than as a hardcopy. When a user clicks on a node, Webcutter reacts as following: (1) it shows its URL in the bottom of the applet, a la Netscape; (2) it expands its full-title (when in ellipsis form); and (3) it shows backlinks to other nodes (shown as blue arrows in color displays, or as gray arrows in monochrome hard copies of this paper), as shown in Figure 3. Clicking on the yellow spot on an edge makes the HREF label (i.e., the label that holds the document's URL, which is often different from the node's title) appear on the visual map. Double clicking on a node makes the page contents appear in the Web browser. Various other interactive features are available that are not discussed here for space reasons.

In order to ease navigation and manipulation of the graph, several "overviewing" tools were designed specifically for the WebCutter environment. These mini-maps (displayed as a small black square at the bottom left part of Figure 2) provide a high-level, condensed presentation of the map. By dragging the mouse in this mini-map, the user changes the point of view of the graph.

Figure 2. Fisheye view.

The interaction with the graph is based on the support for subgraph selection and group operations. This enables hiding nodes of lesser importance (for example, the set of nodes that could not be reached), hiding edges of lesser importance or redundant (backlink) edges, highlighting relevant nodes, and performing various operations on these groups as discussed below.

Submap Selection and Group Operations

A prominent feature-set of WebCutter is the ability of selecting submaps based on graph topology, relevance to a topic, object type, etc. thereby providing an abstraction facility for group operations. Sample operations include:

 

 

2.3 Long-Term vs. Short-Term Maps

Another interesting aspect of WebCutter concerns the intended use of the generated maps.

Although used for different purposes, there is no inherent difference in the structure of, and underlying mechanism that supports short-term and long-term maps. The difference is mainly in their intended use and expected lifetime. Nevertheless, certain operations and system defaults are different in the two modes. As an example, consider WebCutter's viewing options. Using the view menu, a user can choose whether s/he wants to see, in addition to regular nodes, nodes which are timed-out (i.e., the crawler could not establish connection with the node's site in a given time), broken links (i.e., a node pointed to by a hyperlink is not reachable, possibly because this node does not exist any more or the hyperlink is misspelled), out of site nodes, and other non-HTML nodes (e.g., images). Broken links, for example, are crucial for web-management purposes (to identify and subsequently fix the links), although they are of little interest for an average short-term explorer. Similarly, a Webmaster would typically be interested in "cutting" a map of his own site only, following only external links that are directly pointed-to from within the site. In contrast, a "search" oriented user would not care in general whether the exploration spans several sites.

3. Architecture

The WebCutter system consists of two components:

The major characteristics of this architecture are its support of open standards and the decoupling between the view and model of the site map.

3.1 Implementation and Status

The WebCutter system has been fully implemented and is in beta stage. It is currently available from within the IBM firewall. The server has been implemented in ANSI C, and runs on AIX 4.1 and Windows NT, and is in the process of being ported to Sun Solaris and HP/UX. It comes with a set of CGI scripts and forms, as an add-on to regular HTTP servers (Apache, NCSA) as well as Lotus Domino. The client part, i.e., the visualization applet, comes as a set of Java classes that is stored on the HTTP server or locally on the client to work in disconnected mode.

WebCutter is in the process of being ported and integrated into Lotus Domino.Applications product line.

4. Applications and Typical Scenarios

We present two scenarios that demonstrate both uses of WebCutter. The first "search-oriented" scenraio gives the perspective of the end user who wants to discover information and builds dynamically a short-term map for this purpose. The second "browse-oriented" use exemplifies the case of a Webmaster who wants to manage his/her site (check for broken links for instance) as well as to build in advance ("pre-cut") long-term maps for specific groups of users or communities of interests.

4.1 End-User Scenario: A Short-Term Map

In this example, the user, Frank, is trying to find out what his former officemate in Grad School is doing these days. Frank knows that his friend, David, is working at Microsoft Research, but nothing more. So Frank accesses a WebCutter site, fills out an HTML form asking for a "tailored map", and indicating what site to explore, how long he is ready to wait, how big a map he wants, and his search string. After his request is issued, a tailored map pops up with relevant pages highlighted in blue, as shown in Figure 3. At the first level of the exploration tree, WebCutter discovered one relevant page, and automatically pursued in that direction to find more relevant information, it thus discovered not only the home page of his friend (entitled "David K." here) but also in which group he works (Frank clicked on it to get the full title of the department.) If Frank wanted more information, he could pursue the dedicated exploration from that point and incrementally discover relevant pages and links (nodes and edges) to be added to the map.

In real life, Frank first used the local search service at Microsoft Research (based on WAIS), but he got very bizarre results on that very day probably because of a temporary bug. By using WebCutter here, Frank not only finds a way to discover information on a site on which the search service is "out-of-order", but he gets at one glance both search results and contextual information. He is also guaranteed that every page identified is available at the very moment the map is generated, which is not necessarily the case with prebuilt search databases.

4.2 Webmaster Scenario: A Long-Term Map

In this example, the Webmaster of www.Lotus.com wants to attract the attention of visitors to the recent Domino efforts. He builds a large long-term map of his site tailored to "Domino", as shown in Figure 1.

As might be expected, the Domino site is the one that gets the darkest shade of blue, i.e., it is regarded as the most relevant document, but other related pages on the Lotus site itself are shown too (with a lighter shade). So the map provides a subject-map across sites boundaries. The Webmaster can simply store the map on his site and have interested users either use it for browsing, or as a basis for pursing exploration incrementally from one node or another.

Figure 3. Short-term Map.

5. Related Work

This section presents various tools and products that can be used for site mapping. We do not list here the site overview component that is an inherent part of authoring/publishing tools (e.g., Microsoft FrontPage, Netscape Livewire, SoftQuad HotMetal, etc.) nor do we discuss the more general visualization-only tools in which Web maps are constructed manually (e.g., Apple's HotSauce). Rather, we focus on dynamic mapping and visualization tools.

As can be seen from the above list, there is an increasing number of site mapping tools available for end-users for browsing and discovery purposes as well as for Webmasters for site management purposes. All of these tools, however, provide exclusively structural Web maps as identified by a crawler. They apply mainly the browsing paradigm and exploit the search paradigm only minimally. In particular, none of these tools provide user-customizable maps that take into account the user's interests. The only tool we found that provides more contents-based maps is Visual Site Map Research Prototype from the University of Kentucky which automatically organizes the site by using a neural network system. However, because of the high computation cost of neural network training, maps cannot be generated on the fly. The organization is done off-line after the map has been generated.

6. Conclusions

In this paper, we have presented WebCutter, a system for dynamically generating tailored Web maps. The interesting and unique aspect of WebCutter is the weaving of search and browsing technologies, which is reflected in many of its features. The idea is to utilize the "semantics" of the documents as drawn by the text-analysis component, to form the tailored visual representations. This is in contrast with most conventional mappers, which rely only on the "syntax" of the documents set, i.e., the physical structure of the nodes and links. In addition, WebCutter is a client/server system integrable into complete Web site management solutions. It supports sub-maps selection, map management and open standards. In particular, maps are viewed by Java applets, and are fully integrated with the Web-browser.

Acknowledgements

Thanks to Frank Smadja for playing the role of the average end-user and to David Kurlander for being the involuntary object of our experimental (but real!) search scenario. Our students, Yohai Makbily and Ethy Benaiche helped on the implementation of some of the WebCutter visualization features, and Daniel Rozenshein participated actively in the testing stage.

We wish to thank Ron Fagin for his emergency reviewing. We also wish to thank the supporters of WebCutter at the IBM Internet Division (Lisa Morley), at Lotus (Keith McCall) and at the IBM Research Division (Stu Feldman, Yoav Medan, Pnina Vortman) who are helping making a product out of what was a mere research idea only a year ago.

Note: Product and company names herein may be trademarks of their respective owners.

References

[De Bra 94]
P. De Bra, G-J Houben, Y. Kornatzky, and R. Post, "Information Retrieval in Distributed Hypertexts" in the Proceedings of RIAO'94, Intelligent Multimedia, Information Retrieval Systems and Management, New York, NY, 1994.
[Di Battista et al 93]
G. DiBattista, P. Eades and R. Tamassia and I. G. Tollis, "Algorithms for Drawing Graphs: an Annotated Bibliography", Computer Science Dept., Brown, Technical Report, November, 1993.
[Maarek 91].
Y.S. Maarek, "Software library construction from an IR perspective", in SIGIR Forum 25:2, Fall 1991.
[Maarek et al. 91]
Y.S. Maarek, D.M. Berry and G.E. Kaiser, "An Information Retrieval Approach for Automatically Constructing Software Libraries", in Transactions on Software Engineering, 17:8, August 1991.
[Maarek 95]
Y. S. Maarek, "Organizing documents to support browsing in Digital Libraries". In SIGOIS Bulletin, Special Issue on Digital Libraries, Guest Editor: Ann P. Bishop, 16:2, December 1995, pp 2-44.
http://edfu.lis.uiuc.edu/allerton/95/s4/s4index.html
[Maarek & Ben Shaul 96]
Y S. Maarek and I. Z. Ben Shaul, "Automatically Organizing Bookmarks per Contents", Computer Networks and ISDN System, 28, Elsevier Science, B.V., 1996, pp 1321-1333
http://www5conf.inria.fr/fich_html/papers/P37/Overview.html
[Maarek et al. 96]
Y S. Maarek, I. Z. Ben Shaul, M. Jacovi, M. Shtalhaim, S. Ur and D. Zernik, "Coping with Java Applets Security Restrictions in WebCutter"
http://www-ee.technion.ac.il/~issy/eilat97.html
[Manber et al. 97]
U. Manber, M. Smith and B. Gopal. , Usenix Technical Conference, Jan 6-10 1997. To appear.
Article at ftp://ftp.cs.arizona.edu/people/udi/webglimpse.ps.Z
WebGlimpse home page at http://glimpse.cs.arizona.edu:1994/webglimpse/index.html
[Rudich et al. 92]
A. Rudich, D. Zernik and G. Zodik., "Visage - Visualization of Attribute Graphs: A Foundation for a Parallel Programming Environment", In: Environments and Tools for Parallel Scientific Computing North Holland Pub., Ed. J.J. Dongarra and B. Tourancheau, Part 3. pp 171-192, 1992.
[Sarkar & Brown 92]
Manojit Sarkar and Marc H. Brown. "Graphical Fisheye Views of Graphs", Dec Src Tech report nu. 84. March 17, 1992.

URL References

Lycos
http://www.lycos.com
InfoSeek
http://www.infoseek.com
Alta Vista
http://www.altavista.digital.com
Yahoo
http://www.yahoo.com
WebGlimpse
http://glimpse.cs.arizona.edu:1994/webglimpse/index.html
Excite
http://www.excite.com
Microsoft FrontPage
http://www.microsoft.com/frontpage
SoftQuad HotMetal
http://www.softquad.com/hip/hip-sneakpeek.htm
Apple's HotSauce
http://hotsauce.apple.com
NetCarta
http://www.netcarta.com
WebAnalyzer
http://www.incontext.com/products/analyze.html
CLEARWeb
http://www.clearweb.com
Dynamic Diagrams
http://www.dynamicdiagrams.com
MAPA
http://www.dynamicdiagrams.com/products.htm#map
Visual Site Map Research Prototype
http://lislin.gws.uky.edu/Sitemap/Sitemap.html




Return to Top of Page
Return to Technical Papers Index