Web mining with a genetic algorithm

F. Picarougne, N. Monmarché, A. Oliver, G. Venturini
Laboratoire d'Informatique, Université de Tours,
64, Avenue Jean Portalis, 37200 Tours, France.
Phone: +33 2 47 36 14 14, Fax: +33 2 47 36 14 22
fabien.picarougne@etu.univ-tours.fr, monmarche@univ-tours.fr, oliver@univ-tours.fr, venturini@univ-tours.fr

ABSTRACT

We present in this work a genetic search strategy called GeniMiner for a search engine in the context of strategic watch. We highlight the relations that exist between Web statistical studies, search engines and optimization techniques. The user query is used to mathematically define a fitness function of Web pages. The genetic algorithm (GA) manages a population of pages and aims at maximizing this fitness function. We define a creation operator which uses the results given by standard search engines. The mutation operator consists in exploring links. We present an empirical evaluation.

Keywords

search engines, genetic algorithms, meta-search, optimization

1. INTRODUCTION

Searching the web is very similar to optimization problems: the web is a search space S structured as a graph with a neighborhood relation (Albert et al. 1999) (Broder et al. 2000). A fitness function f: S Ž R⁺ can evaluate web pages (like for instance in Google (Brin and Page 1998)). A search engine tries to output pages which maximize f. The web can be explored with operators which are related to those used in optimization: creation operators that generate pages in S from scratch either randomly like for instance random IP generation (Lawrence and Giles 1999) or with heuristic principles like in meta-search engines which use themselves standard index-based search engines. Local search operators are also used. They consist in exploring the neighborhood of a given point in S by choosing links going out of that point, as local search operators do in optimization.

We propose in this work a search engine that is complementary to standard ones. We suppose that the user can wait for his results during several hours but provided that he spends a shorter time in manual analysis of the results. This is a crucial point in strategic watch. We are especially interested in the following problem: given a fixed number of pages evaluations (i.e. pages downloading, content analysis, etc), what strategy can hopefully maximize the user's gain (i.e. which pages/links should be explored in order to find interesting documents that will minimize the time devoted to manual analysis)? We use the fact that the GA solves (almost) optimally such a decision problem as demonstrated in (Holland 1975) on the 2-armed bandit problem.

2. A GENETIC ALGORITHM FOR WEB SEARCHING

Define the evaluation function f from the user query,
Pop Ź Ć (Po is progressively filled in through steps 3, 4, 5)
Generate an offspring page O:
1. With probability (1-P_mut) (or if |Pop| < 2) Then O Ź next page from standard search engines (creation operator)
2. With probability P_mut Then Select one parent page P from Pop and let O Ź page indicated by a link in P (local exploration with a mutation operator),
Evaluate f(O)
Insert O in Pop if (|Pop| < Pop_Max) or if f(O) is greater than the fitness of the worst page in Pop which is deleted,
Go to step 3 or Stop (Pop is the output given to the user).

Table 1: the GA used in GeniMiner

The GA that we propose for Web mining is represented in table 1. An individual in the population is a Web page which can be numerically evaluated with the fitness function f. This function represents the user query and requires to download the page and analyze its content. It can take into account for instance keywords, regular expressions, links, size of document, referenced files (images, mp3, etc).

The creation operator queries standard search engines (Altavista, Google, Lycos, Voila, Yahoo) with the user's keywords. The mutation of a parent page P consists in 1) selecting P in the population (choosing the best between two randomly chosen individuals), 2) choosing one link l going out of P, and 3) proposing the so-pointed page as an offspring. We have sorted P's links according to their distance to the keywords in the text so that the most promising links are explored first.

3. RESULTS AND PERSPECTIVES

Num	Keywords (K₁, K₂, ...)	Should (S₁, S₂, ...)	Should not (Sn₁, Sn₂, ...)
1	telecom crisis analysis	mobile future	France
2	fiber optic technology information	network
3	text poet flower wind	Baudelaire	Rimbaud
4	wine excellent price good	Bourgueil	Bordeaux
5	buy cd music michael jackson	download mp3
6	mouse disney movie animation	DVD
7	artificial ant algorithm
8	genetic algorithm artificial ant	experimental comparison
9	javascript window opener	tutorial free
10	dll export class template	code example

Table 2: the queries used in our tests

We have compared GeniMiner with P_mut = 0.5 (GA dominates the search) to GeniMiner with P_mut = 0 (a meta-search where pages are generated with standard search engines only). We have tested also two evaluation functions for a page P: f₁ is closer to the evaluation functions used in standard search engines (presence of all keywords (K₁, K₂, ...) + maximization of the number of keywords). f₂ is more complex and takes into account the keywords but also the number of links going out of P (with a more important weight when they are close to keywords), additional words (S₁, S₂, ...) and (Sn₁, Sn₂, ...) which should (resp. should not) be present in P, the fact that the proportions of keywords are uniform. Both algorithms are given 1000 pages evaluations for each tested query (see table 2) and Pop_Max = 100 . A run represents from one to two hours of downloading and computing time on a very standard computer. We have performed a parameter study (Pop_max, P_mut) which is not presented here.

Query	GeniMiner (P_mut = 0.5)			Meta-search (P_mut = 0)
	Min	Max	Mean	Min	Max	Mean
1	18	177	35.4	20	177	39.5
2	37	856	119.0	42	856	131.0
3	33	922	144.4	46	8011	196.8
4	68	908	163.7	81	908	170.0
5	45	10006	324.8	27	10006	267.6
6	66	819	137.9	90	869	188.9
7	33	317	76.1	32	715	73.5
8	52	932	145.1	55	1017	171.1
9	68	672	132.3	68	672	140.1
10	42	489	94.9	54	2504	156.6

Table 3: comparative results for function f₁ (last population)

Query	GeniMiner (P_mut = 0.5)			GeniMiner (P_mut = 0.3)			Meta-search (P_mut = 0)
	Min	Max	Mean	Min	Max	Mean	Min	Max	Mean
1	37.2	1034.4	83.2	39.5	1034.4	88.2	41.2	1034.4	91.3
2	102.7	810	212.9	109.8	825.5	222.5	99.3	794.5	205.4
3	48.8	1223.7	195.7	53.4	1839.1	182.1	54.6	1102.7	147.4
4	100.6	977.4	234.1	108.2	977.4	204.0	115.4	992.0	243.3
5	277.4	2229.7	439.6	219.8	2107.7	418.5	172.3	2107.7	375.2
6	150.3	841.4	256.7	151.8	1037.2	284.3	155.2	1037.2	281.7
7	63.7	575.3	129.3	59.7	575.3	120.3	51.4	575.3	96.6
8	122.7	1462.3	271.6	107.7	1510.8	273.5	129.3	1510.8	310.3
9	87.3	1097.0	198.2	78.8	1097.0	177.6	88.5	1097.0	198.4
10	99.2	2091.7	258.7	130.4	2091.7	328.8	141.5	3371.7	430.3

Table 4: comparative results for function f₂

Results are presented in tables 3 and 4. Our conclusions are the following: when the evaluation function (like f₁) is close to those used in standard search engines, then the results obtained by the GA are identical to those obtained with a meta-search. But when the evaluation function is different (like f₂), then the results obtained by the GA can be more relevant compared to a meta-search. This means that the genetic search can be valuable when 1) the user can wait for a longer time than in standard search engines, 2) queries are more complex or more precise than a list of keywords. We argue that strategic watch is such a domain.

The main perspectives that can be derived from this work are the followings: the comparison to other agent-based search like (Menczer et al. 1995), the use of a parallel GA to speed up the search, the self-adaptation of parameters which will improve the GA performances, the clustering of the results, the interactive personalization of the search, the use of a large disk cache, the study of a real world application in the context of strategic watch.

4. REFERENCES

Albert R., Jeong H. and Barabasi A.-L. (1999), Diameter of the World Wide Web. Nature, 401:130-131, 1999.
Broder A., Kumar R., Maghoul F., Raghavan P., Rajagopalan S., Stata R., Tomkins A. and Wiener J. (2000), Graph structure in the Web, Proceedings of the Ninth International World Wide Web Conference, Elsevier, 2000.
Brin S. and Page L. (1998), The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, 30, 1-7, pp 107-117, 1998.
Holland J.H. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.
Lawrence S. and Giles C.L. (1999), Accessibility of information of the Web, Nature 400, pp 107-109.
Menczer F., Belew R.K., Willuhn W. (1995), Artificial life applied to adaptive information agents, Spring Symposium on Information Gathering from distributed, Heterogeneous Databases, AAAI Press, 1995.

File translated from T_EX by T_TH, version 1.0.