SiteHelper: A Localized Agent that Helps Incremental Exploration of the World Wide Web

Daniel Siaw Weng Ngu
Xindong Wu
Department of Software Development, Monash University
900 Dandenong Road, Melbourne, VIC 3145, Australia
daniel@insect.sd.monash.edu.au, xindong@insect.sd.monash.edu.au

Abstract

The World Wide Web (the Web for short) is rapidly becoming an information flood as it continues to grow exponentially. This makes it difficult for users to find relevant pieces of information on the Web. Search engines and robots (spiders) are two popular techniques developed to address this problem. Search engines index facilities over searchable databases. As the Web continues to expand, search engines are becoming redundant because of the large number of Web pages they return for a single search. Robots are similar to search engines; rather than indexing the Web, they traverse (`walk through') the Web, analyzing and storing relevant documents. The main drawback of these robots is their high demand on network resources that results in networks being overloaded. This paper proposes an alternate way to assist users in finding information on the Web. Since the Web is made up of many Web servers, instead of searching all the Web servers, we propose that each server do its own housekeeping. A software agent named SiteHelper is designed to act as a housekeeper for the Web server and as a helper for a Web-user to find relevant information at a particular site. In order to assist the Web-user finding relevant information at the local site, SiteHelper interactively and incrementally learns about Web-users' areas of interest and aids them accordingly. To provide such intelligent capabilities, SiteHelper deploys enhanced HCV with incremental learning facilities as its learning and inference engines.

1. Introduction

During the past four to five years, the Internet and the World Wide Web have been grown exponentially [Berners-Lee et al 94, Porterfield 94, McMurdo 95, Wiggins 95]. According to the Internet Domain Survey conducted by [Zakon 96], the Internet has grown from 617,000 hosts in October 1991 to over 9 million hosts in January 1996 and in excess of 50 million Web pages or Universal Resource Locators (URLs) in November 1995 [Bray 96]. The amount of information available on the Web is immense. Commercial sites like [AltaVista], [Excite], [Infoseek], [Lycos], [Webcrawler] and many others are search engines that help Web users find information on the Web given a certain criteria. These commercial sites use indexing software agents to generate indexes which cover as much of the Web as possible. For example, [Lycos] claims that it has indexed more than 90 per cent of the Web [Loke et al 96].

However, the enormous growth of the Web makes these search engines less favourable to the user because of the large number of pages they return for a single search. Thus it is time consuming for the user to traverse the lists of pages just to find the relevant information.

To remedy the above problem, many researchers are currently investigating the use of robots (or ``spiders'', ``Web wanderers'' or ``Web worms'' [Koster 95a]) that are more efficient than search engines. These robots are software programs also known as agents, like WebWatcher [Armstrong et al 95], Letizia [Lieberman 95], CIFI [Loke et al 96], BargainFinder [Krulwich 95], Web learner [Pazzani et al 95] or Syskill & Webert [Pazzani et al 96], MOMspider [Fielding 94] and many others. Some of theses agents are called intelligent software agents [Riecken 94] because they have integrated machine learning techniques. The Web page titled [Database of Web Robots Overview] lists 130 of these robots/agents as of 24th February 1997.

The advantages of the robots are that they can perform useful tasks like statistical analysis, maintenance, mirroring and most important of all; resource discovery. However, there are a number of drawbacks, such as: they normally require considerable bandwidth to operate, thus resulting in network overload, bandwidth shortages and increase in maintenance costs. Due to the high demand of robots, network facilities are required to be upgraded - consequently resulting in budget increases. Robots generally operate by accessing external servers or networks to retrieve information. This raises ethical issues as to whether people should improve their system just because too many robots are accessing their sites. This was addressed in a paper by Koster who mentioned in his paper titled ``Robots in the Web: threat or treat?'' where it detailed a robot visited his site using rapid fire requests and after 170 retrievals from the server, the server crashed [Koster 95a].

Because of many practical, fundamental and ethical issues surrounding the use of robots on the Web, Koster produced a ``Guideline for Robot Writers'' [Koster 95b] and ``A Standard for Robot Exclusion'' [Koster 95c]. The ``Guideline for Robots Writers'' gives the following points to consider:

Reconsider: Do we really need a new robot?
Be accountable: Make certain that the robot can be identified by server maintainers, and the author is easily contactable.
Test the robots fully using local data.
The robots should have moderate resource consumption.
Follow the Robot Exclusion Standard [Koster 95c].
Continually analyze the robot logs.

Koster went on to recommend that developed robots or agents operate locally in assisting external users. Within the paper he gave a description of two such agents, ALIWEB [Koster 94] and HARVEST [Bowman et al 94].

With these drawbacks of Web robots in mind, this paper proposes an alternative way to assisting users in finding information on the Web using incremental machine learning techniques. A software agent named SiteHelper is designed to act as a housekeeper for the Web server and a helper for the user to find relevant information on the site. In order to assist the user finding relevant information at the local site, SiteHelper learns about the user's areas of interest and aids them accordingly. In the following section, we provide a review of some related research work, followed by a detailed design of the SiteHelper agent, in particular its learning and inference engines. In Section 4, the potential application domains of using SiteHelper are described.

2. Related Work

Assisting Web users by identifying their areas of interest has attracted the attention of quite a few recent research efforts. Two research projects reported in [Balabonovic & Shoham 95] at Stanford University and [Yan et al 96] at Stanford University cooperation with Hewlett-Packard laboratories are along this line. Two other projects, WebWatcher [Amrstrong et al 95] and Letizia [Lieberman 95] also share similar ideas.

[Balabonovic & Shoham 95] developed a system that helped Web users to discover new and interesting sites that are of the users' interest. The system uses artificial intelligence techniques to present users with a number of documents that it thinks the users would find interesting. Users evaluate the documents and provide feedback for the system. From the feedback, the systems knows more about the users' areas of interest in order to better serve the users in the following searches. This systems adds learning facilities to existing search engines, and does not avoid the general problems mentioned in the introduction associated with search engines and Web robots.

[Yan et al 96] investigates a way to record and learn user access patterns in the area of on-line shopping. The system identifies and categorises a user's access pattern using unsupervised clustering techniques. After the user's pattern have been identified, the system dynamically reorganizes itself to suit the user by putting similar pages together.

WebWatcher [Armstrong et al 95] is an assistant agent that helps the user by using visual representations of links that guide the user to reach a particular target page or goal. It learns to assist by creating and maintaining a log file for each user and from the user feedback it improves its guidance.

Letizia [Lieberman 95] learns the areas that are of interest to a user by recording the user's browsing behaviour. It performs some tasks at idle times (when the user is not reading a document and is not browsing). These tasks include looking for more documents that are related to the user's interest or might be relevant to future requests.

Our localized Web agent, SiteHelper's design is detailed in the following section. The design starts with the same idea of assisting Web users by learning and identifying their areas of interest. However, SiteHelper works with a local Web server and indexes the Web pages on the Web server by using a keyword dictionary local to the Web server. Furthermore, based on the indexing of the Web pages, SiteHelper supports interactive and silent incremental learning using HCV (Version 2.0) [Wu95].

3. SiteHelper: A Localized Web Helper

3.1 Overview

SiteHelper is a software agent with incremental machine learning capabilities to help the user explore the Web. It first learns about a user's areas of interests by analyzing the user's visit records at particular Web sites, and then assists the user retrieving information by providing the user with updated information about the Web site. SiteHelper carries out two types of incremental learning: interactive learning and silent learning.

Interactive incremental learning functions in cycles that interact with the user. SiteHelper prompts the user with a set of keywords which are likely to be in the area of the user's interest, and asks for feedback. Considering the feedback SiteHelper makes changes to its search and selection heuristics and improves its performance.

Many Web site servers implement a log file system that records user access information to their sites. The log files normally consist of the computer's logical or numerical address on the Internet, the time of access and the accessed Web pages. Silent incremental learning will use the log information as its starting point. SiteHelper extracts a log file for each user from a local Web site server. From the log file, SiteHelper learns the user's interest areas. It extracts a set of key words about the user's interest areas according to the Web pages the user has visited in the past. In addition, SiteHelper examines the time the user spends on each page. If the user spends little or no time on certain pages, these pages will not be counted as interesting pages to the user. SiteHelper modifies its search and selection heuristics to improve its performance.

SiteHelper is a site agent that learns a Web user's areas of interest and assists the user in finding information on one localized Web site. It works differently from search engines and other kind of agents like WebWatcher [Joachims et al 95], [World Wide Web Worm] that help the user on the global Web. However, other Web sites can deploy SiteHelper to assist users finding information in the same way. This design of SiteHelper has followed the ``guidelines for Robot Writers'' [Koster 95b], therefore it avoids the drawbacks of existing robots outlined in the introduction. In addition, there are other advantages of having SiteHelper at a local Web site.

Through incremental learning of the user's characteristic or interest areas SiteHelper becomes an assistant to the user in retrieving relevant information. The idea of each Web site doing its own housekeeping and assisting its clients (Web users who visit the site) would have significant potential to become a good methodology or technique for information retrieval on the Web.
SiteHelper has the potential to reduce user's accessing and retrieval time. For example, six months ago, a user visited the [Software Development departmental web site] looking for a particular paper. At the relevant Web page, the Postscript file of the paper was not yet available. During the user's next access, at the point of entry to the [Software Development departmental web site], SiteHelper displays a list of changes that have been made since the user last visited. By viewing these changes, the user will know if the Postscript file is available or not, rather than accessing the Web page again.
SiteHelper can be easily adopted for many other services, such as library sites, Internet music stores, and archives. With library sites, for example, different users have interests in different topics: SiteHelper can retrieve the books related to the user's topics during their visit. In their next visit, the system will let the user know what other additional books or materials have become available since their last visit. We will discuss this further in Section 4.

3.2 Indexing Web pages

Since SiteHelper is designed as a local agent, we have incorporated information about a local site to improve SiteHelper's efficiency. The local information is provided in the form of a dictionary, containing key words in an hierarchy that describe staff and postgraduate students' areas of interests and related detailed topics. In the Department of Software Development located at Monash University(Caulfield), there are three main research groups; Artificial Intelligence (AI), Object Oriented Software Engineering (OOSE) and Distributed Object Technology (DOT). AI, OOSE and DOT are the starting keywords on the hierarchy of the local Web site. Each of these starting keywords is expanded, and some sub-area keywords are shared in the hierarchy. For example, `knowledge objects' is a sub-area which is common to all the research groups.

When a member in a research group creates a new or modifies a Web page, SiteHelper scans the Web page and identifies some keywords from the dictionary to index the Web page. The group title(s) of the member are used to start the search of the keywords - A researcher can belong to more than one research group. This way, the research profile of the researcher is taken into account in the indexing of the Web pages. If no keywords from the dictionary match the content of the new Web page, the new Web page is then classified as a non-technical Web page. Non-technical Web pages are not indexed in SiteHelper.

3.3 Interactive incremental learning

SiteHelper aids user interaction by providing a graphical user interface allowing the user to input a set of keywords as their areas of interest. The keywords are then used to match the keywords in the dictionary and search the Web pages at the local site. For example, if a user inputs ``Artificial Intelligence'' as keywords, SiteHelper will search the network for Web pages containing ``Artificial Intelligence'' and topics related to ``Artificial Intelligence''. SiteHelper returns a list of Web pages to the user and asks for approval, during the process it records the keywords the user inputs. After the user has approved some of the Web pages and disapproved the others, the learning aspect starts to identify the actual keywords of interest to the user. The approved Web pages are treated as positive examples, and all others are negative examples of the user's areas of interest. HCV [Wu 95] is then run on these examples to induce a set of rules describing the user's areas of interest. For example, a user might be interested in areas of combining data mining and the World Wide Web, rather than specific references on either data mining or the World Wide Web. Logic conditions such as AND, OR and NOT are embedded in HCV rules. The Web pages satisfying these rules are then returned to the user for further approval.

SiteHelper continues performing the above cycle until the user is satisfied with the list returned. Thus SiteHelper conducts incremental learning during the cycles by modifying the HCV rules for searching and selecting the Web pages according to user approval.

If the user does not input a set of keywords, the system starts with the main research areas of a local Web site. As mentioned in Section 3.2, AI, OOSE and DOT are used in the case of the Monash Software Development Department Web server. It then continues the cycles as described above.

Interactive incremental learning designed for SiteHelper is similar to a system being developed by Balabanovic and Shoham at Stanford University in California, which also helps the user discover new Web sites that relate to the user's interest [Balabonovic & Shoham 95]. The Stanford system presents users with a number of documents that it thinks the users would find interesting. Users evaluate the documents and the system learns from their feedback and modifies its heuristics to improve its next search. Induction is not carried out in the Stanford system.

3.4 Silent learning

Most local Web sites provide global user access to their Web pages, and have logging facilities in place [Pitkow & Bharat 94] to record users' access details. The logging facilities normally have a weekly log file containing records of one week's activities. At the Monash Software Development Department, the log files consist of three main elements for each access: the machine name with its Internet address from which the access is performed, the time of access and the Web page being accessed.

SiteHelper extracts a log file for each user accessing a local Web site. The log file keeps records of all the Web pages a user has accessed. These accessed Web pages are treated as positive examples for SiteHelper to run HCV [Wu 93] and learn about the user's areas of interest. For example, if the user has accessed the Artificial Intelligence group's Web pages regularly, SiteHelper picks up AI as one of the user's interest areas.

3.5 Using learning results to assist the user

Silent learning in Section 3.4 and interactive incremental learning in Section 3.3 are both used to learn about a user's areas of interest. Learning results are given in the form of logic rules.

When a new Web page is constructed or modified, SiteHelper runs its indexing facilities to identify a set of keywords to index it. The creation or modification date is also recorded in the indexing file. After silent learning and/or incremental incremental learning has taken place, these keywords indexing the Web page can be used to match the HCV rules to check whether it would be of the user's interest. When the user visits the local site again, SiteHelper can list all these Web pages matching the user's HCV rules which have been created or modified since the user's last visit. If the user disapproves some of these Web pages, incremental interactive facilities in Section 3.3 can be run to improve the HCV rules.

4. Deploying SiteHelper for Other Applications

SiteHelper is currently designed as a local agent for the Monash Software Development Department Web server. As mentioned in Section 3.1, it can be plugged in other Web sites in the same way by revising its keyword dictionary. In addition, the idea of having a localized agent to help the user find interesting information can be applied to many other domains. This section outlines some of these domains.

4.1 On-line libraries

A library can have its catalogue put on the Web to make it an on-line library. Journal papers, books and other paper based materials can be indexed, digitized and placed in such an on-line library on the Web.

When a particular user arrives at an on-line library site, the site allows the user to search for particular subjects, books, authors and other items. It then logs the user's searches as well as their browsing behaviour, based on the section of the library the user browses. For example, the reference section, the computer science section, and the medical section. From the logging information, SiteHelper can be used to learn the user's areas of interest, and at the user's following visit, SiteHelper may prompt the user to look at these new books, new articles and other materials that match their areas of interest.

Potential advantages of deploying SiteHelper in this case:

One of the frustrating things we have experienced with many library cataloguing systems is that we often find many books are ``on order''. especially in the case of computer science books because of the fast pace of the field. Therefore, we have to check back periodically to see if the books have been processed. With SiteHelper, a user may expect an on-line library to outline the arrival of new, relevant books when the user visits the on-line library next time.

Many university libraries have hundreds of journals being added every week or month. In each journal there are articles categorised by different topic keywords. With SiteHelper, a user may expect an on-line library system to learn about his/her areas of interest at first visit and automatically prompt relevant journal articles afterwards.

4.2 On-Line shopping centers

This scenario is similar to that described in [Yan et al 96] in that when a client arrives at the on-line shopping, the shopping system starts to log the client's browsing behaviour. From the browsing behaviour, the system can determine the characteristics of the client (for example, whether they are male or female). If the client is male, then the system can dynamically reconstruct the whole Web pages to suit his need. Since males often want to buy male products, the system can put those Web pages closer together so that the client can browse more efficiently. For example, in the case of male clients, sports Web pages and business clothing or suite Web pages can be put closer together.

4.3 On-line archives

As the Internet continue to grow, many sites are being set up everyday, each of which has a collection of certain things, like research papers, photos archives, movies archives, free software packages, and so on. In the case of a software company, say Microsoft, which has hundreds of different services and products. It would be a good idea if a company site logs and analyses users' areas of interest, such as word processing, spreadsheet or development packages like Visual Basic, Access and JAVA.

For a games site, the system can categorise a user into different certain kinds of games such as educational games, adventure games and combat games. For a movies site, the system can help the user by knowing whether they like horror movies, comedy, cartoon, or drama.

5. Conclusion

The localized Web agent SiteHelper designed in this paper is different from existing search engines and robots on the World Wide Web in that it does not traverse the global Web, but acts as a housekeeper for a local Web server and as a helper for the Web user who reaches the site to find relevant information. SiteHelper interactively and incrementally learns about the Web user's areas of interest and aids them by matching what it has learnt with the newly developed and modified Web pages. Information about a local site is given in the form of a keyword dictionary to improve SiteHelper's efficiency. We have analyzed in detail the advantages of having such a local Web agent in Section 3 and its potential wider applications in Section 4.

HCV has been integrated into SiteHelper to carry out the designed learning. Future work will include the evaluation of HCV's performance with other incremental learning techniques: C4.5 [Quinlan 93], ID4 [Schlimmer & Fisher 86] and ID5R [Utgoff 89].

References

[Armstrong et al 95] R. Armstrong, D. Freitag, T. Joachims & T. Mitchell, WebWatcher: A Learning Apprentice for the World Wide Web, 1995, http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-6/web-agent/www/webagent-plus/webagent-plus.html

[Balabanovic & Shoham 95] M. Balabanovic & Y. Shoham, Learning Information Retrieval Agents: Experiments with Automated Web Browsing, In On-line Working Notes of the AAAI Spring Symposium Series on Information Gathering from Distributed, Heterogeneous Environments, 1995.

[Berners-Lee et al 94] T. Berners-Lee, R. Gailliau, A. Luotonen, H. F. Nielsen & A. Secret, The World-Wide Web, Communication of the ACM, Vol 37, No 8, August 1994.

[Bray 96] T. Bray, Measuring the Web, Proceedings of the Fifth International World Wide Web Conference, 6-10 May 1996, Paris, France. Also appear in Computer Networks and ISDN Systems, Vol 28, 1996, 993-1005.

[Bowman et al 94] C.M. Bowman, P.B. Danzig, D.R. Hardy, U. Manber & M.F. Schwartz, The Harvest Information Discovery and Access System, Proceedings of the Second International World-Wide Web Conference, Chicago, Illinois, Oct 1994.

[Fielding 94] R. T. Fielding, Maintaining Distributed Hypertext Infrastructures: Welcome to MONspider's Web, Proceedings of the First International World-Wide Web Conference, CERN, Geneva Switzerland, May 1994.

[Joachims et al 95] T. Joachims, T. Mitchell, D. Freitag, & R. Armstrong. WebWatcher: Machine Learning and Hypertext. http://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/webwatcher/mltagung-e.ps.Z; To appear in Fachgruppentreffen Maschinelles Lernen, Dortmund, Germany, August 1995.

[Koster 94] M. Koster, ALIWEB - Archie-Like Indexing in the Web. Proceedings of the First International World-Wide Web Conference, Geneva Switzerland, May 1994.

[Koster 95a] M. Koster, Robots in the Web: threat or treat? ConneXions, Volume 9, No 4, April 1995.

[Koster 95b] M. Koster, Guidelines for Robot Writers, http://info.webcrawler.com/mak/projects/robots/guidelines.html

[Koster 95c] M. Koster, A Standard for Robot Exclusion, http://info.webcrawler.com/mak/projects/robots/norobots.html

[Krulwich 95] B.T. Krulwich, An Agent of Change, Andersen Consulting, http://bf.cstar.ac.com/bf/article1.html

[Lieberman 95] H. Lieberman, Letizia: An Agent That Assists Web Browsing, Proceedings of the 1995 International Joint Conference on Artificial Intelligent, Montreal, Canada, August 1995.

[Loke et al 96] S. W. Loke, A. Davison & L. Sterling, CIFI: An Intelligent Agent for Citation, Technical Report 96/4, Department of Computer Science, The University of Melbourne, Parkville, Victoria 3052, Australia.

[McMurdo 95] G. McMurdo, How the Internet was indexed, Journal of Information Science, Vol 21, 1995, 479-489.

[Pazzani et al 95] M. Pazzani, L. Nguyen & S. Mantik, Learning from hotlists and coldists: Towards a WWW information filtering and seeking agent. Proceedings of IEEE 1995 Intl.Conference on Tools with AI, 1995.

[Pazzani et al 96] M. Pazzani, J. Muramatsu & D. Billsus, Syskill & Webert: Identifying interesting web sites, AAAI Spring Symposium on Machine Learning in Information Access, Technical Papers, Stanford, March 25-27, 1996. Also at http://www.parc.xerox.com/istl/projects/mlia/papers/pazzani.ps

[Pitkow & Bharat 94] J. E. Pitkow & K. A. Bharat, WebViz: A Tool for WWW Access Log Analysis, Proceedings of the First International World-Wide Web Conference, Geneva Switzerland, May 1994.

[Riecken 94] D. Riecken, Intelligent Agents, Communication of the ACM, Vol 37, No 7, July 1994.

[Porterfield 94] K.W. Porterfield, WWWW (What's a WorldWideWeb?), Internet World, May 1994, 20-22.

[Quinlan 93] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[Schlimmer & Fisher 86] J.C. Schlimmer & D. Fisher, A Case Study of Incremental Concept Induction, Proceedings of the Fifth National Conference on Artificial Intelligence, Morgan Kaufmann, 1986, 496-501.

[Utgoff 89] P.E. Utgoff, Incremental Induction of Decision Trees, Machine Learning, Vol 4, 1989, 161-186.

[Wiggins 95] R.W. Wiggins, Webolution: The evolution of the revolutionary World-Wide Web, Internet World, April 1995, 35-38.

[Wu 93] X. Wu, The HCV Induction Algorithm, Proceedings of the 21st ACM Computer Science Conference, S.C. Kwasny & J.F. Buck (Eds), ACM Press, 1993, 169-175.

[Wu 95] X. Wu, Knowledge Acquisition from Databases, Ablex Publishing Corp., USA, 1995.

[Yan et al 96] Tak Woon Yan, Matthew Jacobsen, Hector Garcia-Molina and Umeshwar Dayal, From user access patterns to dynamic hypertext linking, Proceeding of the Fifth International World Wide Web Conference, Paris, France, May 1996.

[Zakon 96] R.H. Zakon, Internet Timeline v2.3a February 1996. http://info.isoc.org/guest/zakon/Internet

URLs

[AltaVista] http://www.altavista.com

[Database of Web Robots Overview] http://info.webcrawler.com/mak/projects/robots/active/html/index.html

[Excite] http://www.excite.com

[Infoseek] http://www.infoseek.com

[Lycos] http://www.lycos.com

[Software Development departmental web site] http://www.sd.monash.edu.au

[Webcrawler] http://www.webcrawler.com

[World Wide Web Worm] http://www.cs.colorado.edu/home/mcbryan/WWWW.html

Return to Top of Page
Return to Technical Papers Index