Using A Data Fusion Agent for
Searching the WWW

Alan F. Smeaton and Francis Crimmins
School of Computer Applications, Dublin City University
Glasnevin, Dublin 9, IRELAND
asmeaton@CompApp.DCU.ie

Abstract

This paper describes a meta search engine for searching the WWW. It is based upon a data fusion approach wherein a user’s query is broadcast to six different WWW search engines (AltaVista, Excite, InfoSeek, Lycos, OpenText and WebCrawler) and the output from these searches is combined into a unified ranking of WWW pages. Unlike other meta search engines such as MetaCrawler, SavvySearch and ProFusion, our meta search engine is written in Java and launches a Java applet on the user’s machine to handle the user interface, which runs in a separate window from the user’s normal WWW browser. The user’s WWW browser is used to display individual WWW pages selected for viewing. This paper presents details of the system we have developed, concentrating on the client-server architecture for the fusion operation, and outlines the extensions we plan for this work in order to deliver a WWW search agent using more sophisticated and effective information retrieval techniques than are currently available via today’s popular WWW search engines.

1. Introduction

As a result of the explosive growth of the WWW, coupled with our inability so far to document or catalogue the web at a fast enough rate, navigating the WWW presents a difficult task to the non-specialist users who make up the greatest proportion of the WWW’s audience. Apart from serendipitous browsing (ever more difficult given the growth rate of the web and the inconsistent and often poor linking techniques used), the common way to find information on the WWW is to use one of the many keyword-based search engines available. These search engines operate by continually crawling through the web seeking text pages that are new or have been updated since last visited, and adding these to their respective catalogues. As new or updated pages are discovered they are indexed, typically by all the words which appear in them, and this information is added to a central index. Users’ queries sent to a search engine are matched against this index in order to generate pointers, or URLs, to the original web pages, and this is the output of a search. The disadvantages of this situation are that the sets of pages indexed by the different search engines overlap yet none is complete, and that the information retrieval techniques used by the current generation of search engines are relatively unsophisticated and ineffective.

In this paper we describe our work in developing a meta search engine which uses existing search engines as an underlying implementation layer. We begin by giving a brief introduction to techniques from information retrieval research which have been shown to be more effective in satisfying users’ information needs. In section 3 we present the characteristics of some of the current crop of WWW search engines and in section 4 we describe our data fusion search agent. Section 5 gives details of some performance evaluations of our fusion-based search agent and in section 6 we outline some other meta search engines and how our work differs from them. Finally, we present our conclusions and an outline of future work.

2. Current Approaches to Information Retrieval

The task of matching a user’s information need, expressed as a query, against a collection of texts or documents is called information retrieval and has been the subject of much research for almost four decades. In that time the techniques that have been proposed and refined have gradually improved the effectiveness of the information retrieval operation.

Initial IR research and implementations were based on extending the boolean model, where users’ queries were boolean combinations of search terms and the retrieval criterion was that documents either matched queries and were retrieved, or did not and were discarded. Extensive work on mathematically modelling the retrieval process using vector space, probabilistic and other approaches led to the introduction of automatic search term weighting based on term frequency information, and to the ranking of documents rather than the presentation of an unordered set of documents to a user. Here, each document is given a score by which documents are ranked, the score being some function of the weights of the search terms which index the document. The modelling work continues to this day and is an important component of IR research, as it still yields incremental improvements to the effectiveness of the retrieval operation in areas like new term weighting formulae [Robertson95] and normalising document scores by document lengths [Singhal96].

In addition to the basic paradigm of search term weighting and document ranking there are variations which contribute positively to retrieval effectiveness. Relevance feedback is one such technique: once a user has seen some documents and judged their relevance to the query, these relevance judgments are fed back into the system to dynamically re-adjust term weights, causing a re-ranking of the as yet unseen documents. This re-ranking normally allows the system to refocus the search, as a clearer picture of the user’s information need is now available to the system. Relevance feedback can also be used in conjunction with query expansion, whereby the user’s original query is augmented with additional search terms, manually or automatically, to make it a more precise statement of the user’s information need. The query may be augmented using known relevant documents as a source of candidate extra search terms [Efthimiadis95] or using terms from a static thesaurus [Ginsberg93]. Finally, the notion of using phrases as indexing units in conjunction with words or word stems is appealing and has been shown to improve retrieval effectiveness [Jing95]. Besides those mentioned above there are other so-called "smarts" which can improve retrieval effectiveness, such as different ways of handling word variants, using passages within documents, automatically recognising objects in texts such as company and place names, etc. [Croft95].
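As one concrete illustration (the classic Rocchio formulation from the IR literature, not a description of any particular system discussed here), relevance feedback can be expressed as a re-weighting of the query vector:

$$\vec{q}_{new} = \alpha\,\vec{q}_{orig} + \frac{\beta}{|R|}\sum_{\vec{d} \in R}\vec{d} \;-\; \frac{\gamma}{|N|}\sum_{\vec{d} \in N}\vec{d}$$

where R and N are the sets of documents the user judged relevant and non-relevant, and the constants alpha, beta and gamma control the balance between the original query and the feedback evidence.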

The evaluation of the effectiveness of information retrieval systems is typically carried out on a collection of documents with a set of user queries for which the relevant documents are known in advance. For many years IR test collections were of the order of some thousands of documents, but these days IR systems are evaluated on collections of at least hundreds of thousands of documents constituting gigabytes of text. Probably the greatest drive for this has come from TREC, the annual evaluation and benchmarking exercise coordinated by the National Institute of Standards and Technology (NIST) [Harman96, Smeaton97]. In TREC, as in most IR research, evaluation of effectiveness is done in terms of precision (the percentage of retrieved documents that are relevant) and recall (the percentage of relevant documents that are retrieved) at various points in a document ranking.
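Formally, for the documents retrieved down to a given cut-off in the ranking:

$$\text{precision} = \frac{|\,\text{retrieved} \cap \text{relevant}\,|}{|\,\text{retrieved}\,|}, \qquad \text{recall} = \frac{|\,\text{retrieved} \cap \text{relevant}\,|}{|\,\text{relevant}\,|}$$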

Apart from pushing IR research to scale up to large collections, TREC has also been instrumental in facilitating the development of IR techniques such as data fusion. In data fusion, the results of running a search on a document collection using two or more independent retrieval strategies are merged by combining, in some way, the relative performance of a document in each of the independent document rankings. The fusion can be based on the rank positions of the documents or on the scores each document is assigned in the rankings, either normalised or not, and the contributions of document rankings from different retrieval strategies can be weighted differently. When applied to information retrieval, data fusion is something of a paradox: it has been shown consistently to yield improvements in retrieval effectiveness when the retrieval strategies whose document rankings are merged are independent of each other (see [Lee95] as just one example).
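To make the fusion operation concrete: one common score-based formulation from the fusion literature (a weighted linear combination, sometimes called weighted CombSUM; this is a general formulation, not a description of any particular system) assigns a document d the fused score

$$RSV_{fused}(d) = \sum_{k=1}^{m} w_k \cdot RSV_k(d)$$

where RSV_k(d) is the (possibly normalised) score that retrieval strategy k assigns to d and w_k weights that strategy’s contribution. Rank-based variants, such as the one we use later in this paper, substitute a function of d’s rank position in ranking k for RSV_k(d).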

In summary, we can see that information retrieval can be more than just simple term weighting and document ranking, both in the functionality the user sees (e.g. relevance feedback and query expansion) and in the implementation the user need not see (e.g. term weighting variants and data fusion). Such extensions of the basic paradigm are important as they consistently and demonstrably yield improvements in the effectiveness of the information retrieval operation. A more detailed overview can be found in [Smeaton96].

3. Searching the WWW

The WWW search engines underlying our work are AltaVista, Excite, InfoSeek, Lycos, OpenText and WebCrawler. All rank documents/pages by their retrieval status values (RSVs), a document score computed by summing some variant of a tf*IDF weighting of search terms, and it is the term weighting variations, as well as the sets of pages in the respective indexes, that set the search engines apart from each other. tf*IDF weighting is a term weighting function where the occurrence of search term i in a document causes that document’s RSV to be incremented by a term weight defined as tf * log(N/n_i), where tf is the frequency of occurrence of the term in the document, N is the total number of documents and n_i is the number of those documents indexed by term i. Typically, each engine represents each page in its catalogue by the words or word stems which appear in the page. Some allow a user to specify a phrase as part of a query, and the occurrence of a phrase in a page causes extra weight to be assigned to such a page.
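To make the weighting concrete, the following is a minimal Java sketch of tf*IDF document scoring exactly as defined above; the class and method names are ours and illustrative, not taken from any of the engines.

    import java.util.Map;

    /** Minimal sketch of tf*IDF document scoring as defined in the text. */
    public class TfIdfScorer {

        private final long totalDocs; // N: total number of documents in the index

        public TfIdfScorer(long totalDocs) {
            this.totalDocs = totalDocs;
        }

        /**
         * RSV of one document: the sum over query terms of tf * log(N / n_i).
         * termFreqs maps each query term to its frequency in the document (tf);
         * docFreqs maps each term to the number of documents it indexes (n_i).
         */
        public double score(Map<String, Integer> termFreqs, Map<String, Long> docFreqs) {
            double rsv = 0.0;
            for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
                Long ni = docFreqs.get(e.getKey());
                if (ni == null || ni == 0) continue; // term not in the index
                rsv += e.getValue() * Math.log((double) totalDocs / ni);
            }
            return rsv;
        }
    }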

Apart from variations in search technology, the search engines differ in the amount of the WWW that they claim to have in their catalogues. We could say that a PR battle exists among search engines, each claiming to have indexed the largest number of pages, in an attempt to attract customers and advertising revenue. The size of the index, rather than the sophistication or effectiveness of the search technology, is currently the dominant factor in marketing these tools.

The characteristics of the search engine operations are summarised in Table 1.

Search Engine | Indexing                       | Use of Phrases? | Scoring Range
AltaVista     | Full Document                  | Uses " "        | No score returned
Excite        | "Concept based" indexing       | No              | 0-100 (%)
InfoSeek      | Full Document                  | Uses " "        | 0-100
Lycos         | Title, Headings, 100 Key Words | No              | 0-1
OpenText      | Full Document                  | Phrase assumed  | No upper limit
WebCrawler    | Key Words                      | Uses " "        | 0-100 (%)

Table 1: Characteristics of each search engine

It is important to note that these characteristics are based on the versions in operation as of June 1996, and some engines may have changed their search functionality since then. Even if this were to happen, the tools and techniques used in our data fusion would remain valid and operational.

It is clear from this overview that current search engines lack many of the techniques which, as we saw earlier, are known to yield improvements in retrieval effectiveness. The obvious reason is that techniques such as relevance feedback and query expansion are more computationally expensive to implement than the simple term weighting and ranking currently on offer, and WWW search engine developers are more concerned with increasing the coverage of their indexes than with improving their search effectiveness. In time, as search engines compete for advertising revenue on other criteria, improved search technology will appear, but not just yet.

While waiting for these developments in search technology to happen, we can build more elaborate and effective searching techniques on top of existing WWW search engines without having to change them. This is analogous to the situation that existed about 15 years ago, when online searching of bibliographic databases was available only by formulating queries as boolean combinations of keywords. At that time term weighting and document ranking had just been demonstrated to be more effective than boolean information retrieval, yet system developers were reluctant to embrace this and throw away their investments in boolean search systems. This led to the development of "intelligent terminals" [Morrissey82], the equivalent of what would now be called agent software: programs that took a user’s natural language query, broadcast a series of independent searches to an underlying boolean IR system, and processed the results that came back to effectively provide search term weighting and document ranking. Such intelligent terminals were eventually phased out when vendors built term weighting and related features directly into their IR systems. In the present situation we can consider developing search agents which interact with a user and, based on the user’s query, issue search commands to multiple WWW search engines. We are developing such WWW search agents and what we report in this paper is our first prototype system. In the next section we describe our data fusion agent for searching the WWW.

4. A Data Fusion Agent for Searching the WWW

4.1 Overall System Architecture

The Fusion system was designed and implemented using a client-server architecture. The Fusion server is a multi-threaded server which receives requests from clients known as Fusion applets, creating a new thread to deal with each connection request. The architecture of the system is shown in Figure 1.

Figure 1. Fusion System Architecture

The system operates in the following manner. On start-up, the fusion server dynamically loads search engine classes from the local file system. The server is then ready to accept requests from clients. Users launch a fusion applet, or client, when they load the applet’s page into a Java-enabled browser such as Netscape, and this applet creates its own window as its user interface. A connection is made to the fusion server, which creates a fusion connection thread to handle it. The fusion applet is now connected to the server.

When the user inputs a query and it arrives at the fusion server, a data fusion thread is created to deal with it. Search engine classes are used to query the different search engines in parallel, with the results returned by the engines being parsed to extract the relevant information and then fused together as described later. The fused results are then sent back to the client for display.
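In outline, the server side of this flow looks like the sketch below. This is a minimal illustration using standard Java sockets and one thread per connection; the port number and the echo-style handling are our assumptions, not details of the actual Fusion implementation.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.ServerSocket;
    import java.net.Socket;

    /** Sketch of a multi-threaded server: one connection thread per client. */
    public class FusionServerSketch {

        /** Handles all communication with one connected applet. */
        static class ConnectionThread implements Runnable {
            private final Socket client;
            ConnectionThread(Socket client) { this.client = client; }

            public void run() {
                try (BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    String message;
                    while ((message = in.readLine()) != null) {
                        // A real server would hand a query message to a data
                        // fusion thread here; this stub just acknowledges it.
                        out.println("ack=" + message);
                    }
                } catch (IOException ignored) {
                    // client went away; the thread simply ends
                }
            }
        }

        public static void main(String[] args) throws IOException {
            // Search engine classes would be loaded from the file system here.
            ServerSocket server = new ServerSocket(4444); // port is an assumption
            while (true) {
                new Thread(new ConnectionThread(server.accept())).start();
            }
        }
    }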

When the fusion applet on the user’s machine sends a query to the fusion server, a fusion listener thread is created by the fusion applet window to handle replies from the server. When this receives a reply it displays the results in the applet’s window. The user can then select URLs from this display and the applet will load these into a browser window.

Our choice of a client-server architecture, rather than embedding all the fusion functionality in a single Java applet, is motivated by considerations of future development. By developing our client as a Java applet which is downloaded every time a user starts a query session, we can enhance system functionality without having to distribute new releases of our system. In addition, we can build in further information retrieval features that take advantage of processing on our server, processing which may use local resources in running a query. Such resources could include static thesauri or word lists as well as dynamic information generated during query processing. Our architecture is what Eriksson [Eriksson96] calls a "knowledge-server" approach; he describes a similar Java client-server system for a non-IR application.

4.2 System Components

The main components of the fusion system are described in more detail below. The implementation language is Java from Sun Microsystems [Java96].

Fusion Connection

Fusion connection threads handle all communications with clients of the fusion server and are created when connection requests arrive. A simple communication protocol involving name/value pairs is used between the client and the server. Incoming messages are passed to a message handler which deals with them based on their type. This approach allows new messages and functionality to be added to the system if needed.
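A minimal sketch of such a handler follows; the "name=value&name=value" wire syntax and the message type names are our assumptions, as the exact protocol is not specified here.

    import java.util.HashMap;
    import java.util.Map;

    /** Sketch of a name/value message handler that dispatches on message type. */
    public class MessageHandlerSketch {

        /** Parse a message such as "type=query&terms=irish+music" into pairs. */
        static Map<String, String> parse(String message) {
            Map<String, String> pairs = new HashMap<>();
            for (String field : message.split("&")) {
                int eq = field.indexOf('=');
                if (eq > 0) pairs.put(field.substring(0, eq), field.substring(eq + 1));
            }
            return pairs;
        }

        void handle(String message) {
            Map<String, String> pairs = parse(message);
            String type = pairs.get("type");
            if ("query".equals(type)) {
                // hand pairs.get("terms") to a data fusion thread
            } else {
                // unrecognised types are ignored; new ones can be added here
            }
        }
    }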

Data Fusion

A data fusion thread is created in response to a query message from a client. It queries the search engines in parallel using the search engine objects loaded at start-up. If all the search threads return before a specified time-out, the data fusion thread is woken and retrieves the results; otherwise it wakes when the time-out expires and processes whatever results are available. Any engines which return after this are ignored.
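The parallel querying under a single time-out can be sketched as below. We use the modern java.util.concurrent facilities rather than the raw thread mechanics of the original implementation, and the SearchEngine interface stands in for the search engine classes described later.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.CancellationException;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    /** Sketch of querying several engines in parallel under one time-out. */
    public class ParallelSearchSketch {

        interface SearchEngine {
            List<String> search(String query); // ranked URLs, best first
        }

        static List<List<String>> searchAll(List<SearchEngine> engines,
                                            String query, long timeoutMillis)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(engines.size());
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (SearchEngine engine : engines) {
                tasks.add(() -> engine.search(query));
            }
            // invokeAll returns when every task has finished or the time-out
            // expires; unfinished tasks are cancelled, i.e. late engines are
            // ignored, as described above.
            List<Future<List<String>>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                try {
                    results.add(f.get()); // engine returned in time
                } catch (CancellationException | ExecutionException e) {
                    results.add(new ArrayList<>()); // timed out or failed
                }
            }
            pool.shutdownNow();
            return results;
        }
    }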

The data fusion operation is performed on the retrievable objects by their rank position. Before fusing the respective rankings, we first perform some simple processing to convert URLs to a canonical form; for example, we treat a pathname and the same pathname with the file index.html appended as identical. The ranked objects are stored in a hash table, with their URL used to generate the hash code. Duplicate objects thus have their ranks summed, and objects are penalised for each search engine which did not retrieve them. Once all objects have been inserted into the table they are sorted into ascending order of rank, using Quicksort. The fused results can then be sent to the client which queried the fusion server.
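Under these rules the fusion step itself reduces to a few lines, sketched below. The penalty charged per missing engine is our assumption (missing pages are penalised, but the exact value is a design choice), and the sort uses the standard library rather than the hand-written Quicksort of the implementation.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Sketch of rank-based data fusion over the rankings the engines return. */
    public class RankFusionSketch {

        static class Page {
            final String url;     // canonical URL, used as the hash key
            int rankSum = 0;      // summed rank positions; lower is better
            int enginesSeen = 0;  // how many engines returned this page
            Page(String url) { this.url = url; }
        }

        /** Strip a trailing index.html so duplicate URL forms hash together. */
        static String canonical(String url) {
            return url.endsWith("/index.html")
                    ? url.substring(0, url.length() - "index.html".length())
                    : url;
        }

        /**
         * rankings: one list of URLs per engine, best first.
         * penalty: rank charged for each engine that missed a page, e.g. one
         * past the deepest rank requested (the exact value is our choice).
         */
        static List<Page> fuse(List<List<String>> rankings, int penalty) {
            Map<String, Page> table = new HashMap<>();
            for (List<String> ranking : rankings) {
                for (int pos = 0; pos < ranking.size(); pos++) {
                    Page p = table.computeIfAbsent(
                            canonical(ranking.get(pos)), Page::new);
                    p.rankSum += pos + 1; // rank positions are 1-based
                    p.enginesSeen++;
                }
            }
            for (Page p : table.values()) { // penalise pages some engines missed
                p.rankSum += penalty * (rankings.size() - p.enginesSeen);
            }
            List<Page> fused = new ArrayList<>(table.values());
            fused.sort(Comparator.comparingInt(p -> p.rankSum)); // ascending
            return fused;
        }
    }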

In fusing the ranked results from search engines we encountered some problems which do not arise when fusing in traditional IR environments. As we saw earlier in Table 1, AltaVista does not return an RSV or document score for a document, only a rank position. In addition, the scores returned by some search engines can be very skewed: for example, the first few documents returned in response to a query might each receive a score of 100%, though they would be ranked uniquely and not tied. The differing criteria used by the engines in assigning scores also contribute to this problem, resulting in incompatibilities between the scores received. It is for these reasons that the fusion system performs data fusion based on rank position and not on a document’s score [Kantor95].

Search Engine Classes & Retrievable Objects

A search engine class is created for each of the engines queried by the fusion server. An engine object can be queried by sending it search terms, and it in turn queries the relevant search engine using a form object. The results returned by a search engine are then parsed in order to extract retrievable objects. For our system, a retrievable object represents a document on the World Wide Web: it consists of a document title, its URL and the rank assigned to it by a search engine. These elements are extracted from the query results by parsing the HTML tags in the output generated by the search engine. The fact that the search engine classes are loaded by the Fusion server on start-up facilitates what we call a "plug-and-search" approach. This means that if a search engine radically changes the way it operates or the format of its output, an upgraded version of its class can be plugged into the system. Similarly, if a new search engine appears on the web and we decide to use it, a new class can be written and used by the fusion server.
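The plug-and-search idea amounts to a common contract that every engine class implements. A sketch of that contract (the names are ours, for illustration) might be:

    import java.io.IOException;
    import java.util.List;

    /** A retrievable object: one WWW document extracted from engine output. */
    class Retrievable {
        final String title; // document title parsed from the result page
        final String url;   // the document's URL
        final int rank;     // position assigned by the engine, 1 = best

        Retrievable(String title, String url, int rank) {
            this.title = title;
            this.url = url;
            this.rank = rank;
        }
    }

    /**
     * The contract each search engine class satisfies. Supporting a new or
     * changed engine means plugging in a fresh implementation of this
     * interface, loaded by the server at start-up.
     */
    interface SearchEngine {
        String name();

        /** Query the remote engine and parse its HTML into ranked objects. */
        List<Retrievable> search(String queryTerms) throws IOException;
    }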

Form Objects

An HTML form includes a template for a form data set, which is a sequence of name/value pairs. This data set is sent to a CGI (Common Gateway Interface) program when a user submits the form, and the server returns the result of the form submission to the user’s browser. This is how search engines are queried.

A form object has the ability to connect to a search engine, download its page and derive the default data set. This can then be used to construct a query based on the search terms entered to our fusion applet. For example, the following shows the query syntax for two different search engines, AltaVista and OpenText:

AltaVista: pg=q&what=web&fmt=.&q=______
OpenText: SearchFor=______&mode=and

The "___" in the query indicates where the search terms would be inserted. On start-up, the search engine classes use a form object to derive their data set and this allows them to continue operating even if a search engine changes the format of its form.

Fusion Applet

The fusion applet acts as a client front end to the fusion server. The windowing interface was created using Java’s Abstract Windowing Toolkit (AWT). This allows the interface to assume the ‘look and feel’ of whichever platform the user is currently running (e.g. Motif, Windows 95 or Macintosh).

Due to Java’s security model, the applet is unable to write to the client machine or connect to any host except the machine that it came from. Thus it acts purely as an interface, relaying queries to the server and displaying the results returned.

5. Evaluating Data Fusion Searching of WWW

There have been very few attempts to measure the effectiveness of searching for information on the WWW in the way effectiveness is evaluated in traditional information retrieval research. Gauch and Wang, in describing their ProFusion meta search engine [Gauch96], have conducted some evaluation experiments on some of the large WWW search engines and, although their query set was small, found their approach of combining the outputs of search engines to be better than any individual search engine. In subsequent work [Gauch96] they have developed ProFusion further so that it weights the impact each search engine has on the final ranking depending on the domain of a user query and the a priori performance of the search engines for queries in those domains.

In [Yuwono96] there is a report of an evaluation of four document or page scoring and ranking algorithms, based on boolean spread activation, most-cited, tf*IDF term weighting and vector spread activation. These were implemented on an index of a small number of WWW pages in a local domain (CUHK.HK) and evaluated using a set of 56 user queries; the tf*IDF term weighting and vector spread activation approaches were found to be the most effective.

When compared to evaluations of traditional information retrieval techniques, evaluations like those reported above can only be taken as indicators of effectiveness. However, there is no reason to suspect that retrieval techniques developed and evaluated in information retrieval research for more traditional applications will not transfer directly to searching the WWW. Our work is based on this assumption.

Shown in Appendix A are the results of running twenty-five sample queries on our WWW fusion agent. Some of these queries were taken directly from [Yuwono96], some are our localised equivalents of them and some are real queries to WWW search engines executed by us over the last few months. The engines used by the Fusion server were AltaVista, Excite, InfoSeek, Lycos, OpenText and WebCrawler. On average all six engines returned within the specified time-out (15 seconds) and the average overall query time was 8.2 seconds.

From the results in Appendix A it can be seen that there is very little overlap in the WWW pages returned by the different search engines. Each query returns an average of 50.08 documents, including duplicates, for the top 10 rank positions across the 6 search engines used. The reason this figure is not 60 (6 search engines, top 10 per engine) is that some engines may not have returned results before the time-out or may have returned fewer than 10 documents in total. The number of duplicated documents in this set is only 2.88 on average, clearly showing a need to retrieve more than the top 10 documents from each search engine in order to get reasonable duplication. The low level of duplication also calls into question the overall effectiveness of any approach to combining results from different search engines which does not go beyond the top 10 per engine.

The average percentages for the numbers of pages returned per search engine are shown in Appendix B, which is based on a sample of 3900 of the searches conducted by the Fusion system over a period of weeks since Fusion was released to the internet community during the Summer of 1996.

As expected, the largest entries are for engines returning 10 documents for a query, except for Excite and InfoSeek, which return 8 and 9 documents respectively more often than any other number.

6. Related Work

There are a number of different resources available on the WWW which use more than one source of information for a search, and they can be divided into two main groups. The first group are the so-called all-in-one pages, such as All-In-One, CUSI, Find-It! and Search.com. These are basically a compilation of the form interfaces of different search tools found on the web. They cover a number of general and specialised engines, divided into categories (e.g. web, software, people, technical reports). There is no parallelism or combination of results involved; they simply redirect the browser to the relevant engine with the appropriate query.

The second group includes the meta-search engines, of which Fusion is one. Examples of this type would be Highway 61, Inference Find!, Mamma, MetaCrawler, ProFusion and SavvySearch. These systems all operate in essentially the same way, querying their underlying engines in parallel in order to answer user queries. Where they differ is in the processing they perform on the results before presenting them to the user.

Highway 61, Mamma, MetaCrawler and ProFusion combine their results by fusing based on document score, i.e. they sum the scores given to a document by the different engines. MetaCrawler and ProFusion offer broken link detection as well, although this results in an increase in query time.

Inference Find! clusters the documents into groups based on their location, i.e. the WWW site they come from; MetaCrawler also offers this as an alternative to ranking based on score. SavvySearch knows about a large number of underlying engines and concentrates on selecting the engines to route a user’s query to.

The main difference between Fusion and these other meta-search services is that the others all use HTML forms as their user interface, which limits the functionality and interaction they can offer users. Fusion’s client-server architecture and its use of Java allow us to offer an improved and more interactive interface to the user and, most importantly, give us more scope for future developments.

7. Future Work

Clearly the data fusion agent we have developed and described in this paper is a building block for future work, but even as it presently stands it is a useful tool for searching the WWW and is representative of what we believe will be the future trend for WWW search engines. Since its release to the internet community, the Fusion agent has served over 12,000 user searches. The client-server architecture of our data fusion system and its implementation in Java allow us to build in more sophisticated and effective information retrieval features. In order to improve the effectiveness of the data fusion operation we are extending the search engine classes to allow them to retrieve more than the top 10 ranked WWW pages per search engine, which is the default provided by most engines. By going further down the ranked list of URLs returned by each engine we will find more duplicated pages, which will yield a more effective data fusion result. This will require a series of requests in parallel, rather than a single request, from the data fusion server to each search engine. We also plan to download from the WWW the actual pages that are highly ranked and post-process them to generate candidate additional search terms for the user to select from and add to the query. Relevance feedback will also be incorporated to adjust the relative weights of search terms, and these relative weights will be passed back to the search engines by padding the query with multiple occurrences of highly weighted query terms, as sketched below.
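For instance, the query padding just mentioned could look like the following sketch, where a term of weight 3 is simply repeated three times in the flat query string; the integer weights and their derivation from relevance feedback are our assumptions for illustration, since this extension is not yet implemented.

    import java.util.Map;
    import java.util.StringJoiner;

    /** Sketch of approximating term weights by repeating terms in a query. */
    public class QueryPaddingSketch {

        /** weights maps each search term to a relative integer weight >= 1. */
        static String pad(Map<String, Integer> weights) {
            StringJoiner padded = new StringJoiner(" ");
            for (Map.Entry<String, Integer> e : weights.entrySet()) {
                for (int i = 0; i < e.getValue(); i++) {
                    padded.add(e.getKey()); // one occurrence per weight unit
                }
            }
            return padded.toString();
        }

        public static void main(String[] args) {
            // e.g. prints "fusion fusion fusion java" (map order may vary)
            System.out.println(pad(Map.of("fusion", 3, "java", 1)));
        }
    }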

All these extensions will slow down the overall time taken to execute a search, simply because of the number of interactions we must have with the search engines. Even though this may lose the interactive feel a user has when using a single search engine, we believe users will be prepared to tolerate it given the anticipated improvement in retrieval effectiveness.

As we mentioned earlier, we do expect the developers of WWW search engines to build directly into their systems the kind of retrieval functionality discussed above, but even this may be implemented using the kind of client-server architecture we have developed. One glaring drawback in current approaches to searching WWW pages, which are linked through hypertext links, is that the links themselves are not used as part of retrieval, except perhaps to locate new pages to be incorporated into search engine catalogues. Our previous work on searching through hypertexts has developed search techniques which use hypertext links as part of the search [Guinan92]; similar approaches have been used by Savoy [Savoy95] and by Yuwono and Lee [Yuwono96]. The information retrieval techniques used in present WWW searching approaches, including our own, carry baggage from the document corpus application, where documents are assumed to be independent. In the WWW, pages are not independent and some of the dependencies are manifest through hard-coded information links. Notwithstanding the poor and inconsistent hypertext authoring techniques used on the WWW, information links do contain information, information which is being disregarded at present. In the long term we would like to develop our data fusion system further to include such links as part of retrieval.

References

[Croft95]

"Effective text Retrieval Based n Combining Evidence from the Corpus and Users", W.B. Croft, IEEE Expert: Intelligent Systems and their Applications, 10(4), 59-63, 1995.

[Efthimiadis95]

"User Choices: A New Yardstick for the Evaluation of Ranking Algorithms for Interactive Query Expansion", E. N. Efthimiadis, Information Processing and Management, 31(4), 605-620, 1995.

[Eriksson96]

"Expert Systems as Knowledge Servers", H Eriksson, IEEE Expert: Intelligent Systems and their Applications, 11(3), 14-19, 1996.

[Gauch96]

"Information Fusion with ProFusion", S. Gauch and H Wang, in Proceedings of WebNet’96: The First World Conference of the Web Society, San Francisco, CA, USA, October 1996.

[Ginsberg93]

"A Unified Approach to Automatic Indexing and Information Retrieval", A. Ginsberg, IEEE Computer, 8(5), 46-56, 1993

[Guinan92]

"Information Retrieval from Hypertext Using Dynamically Planned Guided Tours", C. Guinan and A.F. Smeaton, in Proceedings of ECHT92, Milan, Italy, 1992, D. Lucarella et al. (Eds.), 122-130.

[Java96]

JavaSoft. http://java.sun.com/ April 1996.

[Jing95]

"An Association Thesaurus for Information Retrieval", Y. Jing and W.B. Croft, in: Proceedings of RIAO 1994I, C.I.D., Paris, 1994, 146-160.

[Harman96]

"Overview of TREC-4", D. Harman, NIST Special Publication, 500-236, 1996.

[Kantor95]

"Combining the Evidence of Multiple Query Representations for Information Retrieval", P. Kantor et al., Information Processing & Management, 31(3), pp 431-448, 1995.

[Lee95]

"Combining Multiple Evidence from Different Properties of Weighting Schemes", J. Lee, in Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, 1995, pp180-188.

[Morrissey82]

"An Intelligent Terminal for Implementing Relevance Feedback on Large Operational Retrieval Systems", J. Morrissey, in Lecture Notes in Computer Science No 146, Research and Development in Information Retrieval, Berlin, 1982,

[Robertson95]

"OKAPI in TREC 4", S. E. Robertson, S. Walker, S. Jones, M. M. Beaulieu, M. Gatford, A. Payne, in Proceedings of TREC-4, D. Harman, (Ed.), NIST Special Publication, 500-226, 1995.

[Savoy95]

"A New Probabilistic Scheme for Information Retrieval in Hypertext", J. Savoy, The New Review of Hypermedia and Multimedia, 1, 107-, 1995.

[Singhal96]

"Pivoted Document Length Normalisation", A. Singhal, C. Buckley and M Mitra, in: Proceedings of the 19th International ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR96), Zurich, Switzerland, August 1996, pp 21-29

[Smeaton96]

"An Overview of Information Retrieval", A.F. Smeaton, in Information Retrieval and Hypertext, M. Agosti and A.F. Smeaton (Eds.), Kluwer Academic Publishers, 1996.

[Smeaton97]

"The Text Retrieval and Evaluation Conferences (TREC) and Europe: Impact to Date", A.F. Smeaton and D. Harman, Journal of Information Science (in press), 1997.

[Yuwono96]

"Search and Ranking Algorithms for Locating Resources on the World Wide Web", B. Yuwono and D.K. Lee, in Proceedings of the 12th International Conference on Data Engineering, New Orleans, 1996, 164-171.

Appendix A: Sample Queries and their Returned Pages

Query                                        | Pages Returned | Duplicates
Arlington National Cemetery                  | 53             | 6
Dublin Airport                               | 58             | 3
Bank of Ireland                              | 59             | 7
Arthouse Multimedia Centre for the Arts      | 45             | 3
Blackburn Rovers Football Club               | 43             | 4
Celtic studies                               | 58             | 10
Sociocultural factors of marketing           | 49             | 2
Washington Redskins football team            | 50             | 1
Postgraduate course in Information Retrieval | 47             | 2
Weather forecast for Ireland                 | 50             | 6
New York subway map                          | 39             | 3
Walt Disney's Toy Story                      | 49             | 1
The TREC conference                          | 43             | 2
Dynamic load balancing                       | 59             | 7
Political anthropology                       | 56             | 2
Partial differential equations               | 59             | 1
Irish music theory                           | 48             | 0
Multimedia production techniques             | 47             | 0
London transportation                        | 56             | 1
The Lincoln memorial monument                | 48             | 2
Irish musical instruments                    | 50             | 3
Information filtering techniques             | 39             | 0
Image deblurring techniques                  | 50             | 2
European Championship soccer tournament      | 50             | 1
Computer chess championships                 | 47             | 3
Totals:                                      | 1252           | 72
Average:                                     | 50.08          | 2.88

Appendix B: Average percentages for numbers of pages returned per search engine

Total number of searches in sample: 3900. The row marked *** counts searches where the engine did not return before the time-out.

Pages Returned | AltaVista    | Excite       | InfoSeek     | Lycos        | OpenText     | WebCrawler
10             | 2531 (64.9%) | 972 (24.9%)  | 164 (4.2%)   | 1382 (35.4%) | 1761 (45.1%) | 2625 (67.3%)
9              | 452 (11.6%)  | 2 (0.05%)    | 2493 (63.9%) | 41 (1.0%)    | 654 (16.8%)  | 16 (0.4%)
8              | 174 (4.4%)   | 2176 (55.8%) | 57 (1.5%)    | 53 (1.3%)    | 13 (0.3%)    | 16 (0.4%)
7              | 68 (1.7%)    | 4 (0.1%)     | 25 (0.6%)    | 52 (1.3%)    | 21 (0.5%)    | 20 (0.5%)
6              | 36 (0.9%)    | 4 (0.1%)     | 20 (0.5%)    | 74 (1.9%)    | 25 (0.6%)    | 20 (0.5%)
5              | 59 (1.5%)    | 3 (0.08%)    | 30 (0.8%)    | 97 (2.5%)    | 18 (0.4%)    | 25 (0.6%)
4              | 21 (0.5%)    | 6 (0.2%)     | 38 (1.0%)    | 96 (2.5%)    | 19 (0.4%)    | 31 (0.8%)
3              | 40 (1.0%)    | 4 (0.1%)     | 35 (0.9%)    | 118 (3.0%)   | 29 (0.8%)    | 56 (1.4%)
2              | 48 (1.2%)    | 4 (0.1%)     | 54 (1.4%)    | 153 (3.9%)   | 88 (2.3%)    | 48 (1.2%)
1              | 42 (1.1%)    | 6 (0.2%)     | 80 (2.0%)    | 263 (6.7%)   | 55 (1.4%)    | 92 (2.4%)
0              | 300 (7.7%)   | 314 (8.1%)   | 745 (19.1%)  | 1182 (30.3%) | 469 (12.0%)  | 463 (11.9%)
***            | 129 (3.3%)   | 405 (10.3%)  | 159 (4%)     | 389 (9.9%)   | 748 (19.2%)  | 488 (12.5%)

Appendix C: Web Site Locations

Search Engines

All-In-One Pages

Meta-Search Engines




