Web mining with a genetic algorithm

Web mining with a genetic algorithm

F. Picarougne, N. Monmarché, A. Oliver, G. Venturini
Laboratoire d'Informatique, Université de Tours,
64, Avenue Jean Portalis, 37200 Tours, France.
Phone: +33 2 47 36 14 14, Fax: +33 2 47 36 14 22
fabien.picarougne@etu.univ-tours.fr, monmarche@univ-tours.fr, oliver@univ-tours.fr, venturini@univ-tours.fr

ABSTRACT

We present in this work a genetic search strategy called GeniMiner for a search engine in the context of strategic watch. We highlight the relations that exist between Web statistical studies, search engines and optimization techniques. The user query is used to mathematically define a fitness function of Web pages. The genetic algorithm (GA) manages a population of pages and aims at maximizing this fitness function. We define a creation operator which uses the results given by standard search engines. The mutation operator consists in exploring links. We present an empirical evaluation.

Keywords

search engines, genetic algorithms, meta-search, optimization

1. INTRODUCTION

Searching the web is very similar to optimization problems: the web is a search space S structured as a graph with a neighborhood relation (Albert et al. 1999) (Broder et al. 2000). A fitness function f: S ® R+ can evaluate web pages (like for instance in Google (Brin and Page 1998)). A search engine tries to output pages which maximize f. The web can be explored with operators which are related to those used in optimization: creation operators that generate pages in S from scratch either randomly like for instance random IP generation (Lawrence and Giles 1999) or with heuristic principles like in meta-search engines which use themselves standard index-based search engines. Local search operators are also used. They consist in exploring the neighborhood of a given point in S by choosing links going out of that point, as local search operators do in optimization.

We propose in this work a search engine that is complementary to standard ones. We suppose that the user can wait for his results during several hours but provided that he spends a shorter time in manual analysis of the results. This is a crucial point in strategic watch. We are especially interested in the following problem: given a fixed number of pages evaluations (i.e. pages downloading, content analysis, etc), what strategy can hopefully maximize the user's gain (i.e. which pages/links should be explored in order to find interesting documents that will minimize the time devoted to manual analysis)? We use the fact that the GA solves (almost) optimally such a decision problem as demonstrated in (Holland 1975) on the 2-armed bandit problem.

2. A GENETIC ALGORITHM FOR WEB SEARCHING

  1. Define the evaluation function f from the user query,

  2. Pop ¬ Æ (Po is progressively filled in through steps 3, 4, 5)

  3. Generate an offspring page O:

    1. With probability (1-Pmut) (or if |Pop| < 2) Then O ¬ next page from standard search engines (creation operator)

    2. With probability Pmut Then Select one parent page P from Pop and let O ¬ page indicated by a link in P (local exploration with a mutation operator),

  4. Evaluate f(O)

  5. Insert O in Pop if (|Pop| < PopMax) or if f(O) is greater than the fitness of the worst page in Pop which is deleted,

  6. Go to step 3 or Stop (Pop is the output given to the user).

Table 1: the GA used in GeniMiner

The GA that we propose for Web mining is represented in table 1. An individual in the population is a Web page which can be numerically evaluated with the fitness function f. This function represents the user query and requires to download the page and analyze its content. It can take into account for instance keywords, regular expressions, links, size of document, referenced files (images, mp3, etc).

The creation operator queries standard search engines (Altavista, Google, Lycos, Voila, Yahoo) with the user's keywords. The mutation of a parent page P consists in 1) selecting P in the population (choosing the best between two randomly chosen individuals), 2) choosing one link l going out of P, and 3) proposing the so-pointed page as an offspring. We have sorted P's links according to their distance to the keywords in the text so that the most promising links are explored first.

3. RESULTS AND PERSPECTIVES

Num

Keywords (K1, K2, ...)

Should (S1, S2, ...)

Should not (Sn1, Sn2, ...)

1

telecom crisis analysis

mobile future

France

2

fiber optic technology information

network

 

3

text poet flower wind

Baudelaire

Rimbaud

4

wine excellent price good

Bourgueil

Bordeaux

5

buy cd music michael jackson

download mp3

 

6

mouse disney movie animation

DVD

 

7

artificial ant algorithm

   

8

genetic algorithm artificial ant

experimental comparison

 

9

javascript window opener

tutorial free

 

10

dll export class template

code example

 

Table 2: the queries used in our tests

We have compared GeniMiner with Pmut = 0.5 (GA dominates the search) to GeniMiner with Pmut = 0 (a meta-search where pages are generated with standard search engines only). We have tested also two evaluation functions for a page P: f1 is closer to the evaluation functions used in standard search engines (presence of all keywords (K1, K2, ...) + maximization of the number of keywords). f2 is more complex and takes into account the keywords but also the number of links going out of P (with a more important weight when they are close to keywords), additional words (S1, S2, ...) and (Sn1, Sn2, ...) which should (resp. should not) be present in P, the fact that the proportions of keywords are uniform. Both algorithms are given 1000 pages evaluations for each tested query (see table 2) and PopMax = 100 . A run represents from one to two hours of downloading and computing time on a very standard computer. We have performed a parameter study (Popmax, Pmut) which is not presented here.

Query

GeniMiner (Pmut = 0.5)

Meta-search (Pmut = 0)

 

Min

Max

Mean

Min

Max

Mean

1

18

177

35.4

20

177

39.5

2

37

856

119.0

42

856

131.0

3

33

922

144.4

46

8011

196.8

4

68

908

163.7

81

908

170.0

5

45

10006

324.8

27

10006

267.6

6

66

819

137.9

90

869

188.9

7

33

317

76.1

32

715

73.5

8

52

932

145.1

55

1017

171.1

9

68

672

132.3

68

672

140.1

10

42

489

94.9

54

2504

156.6

Table 3: comparative results for function f1 (last population)


Query

GeniMiner (Pmut = 0.5)

GeniMiner (Pmut = 0.3)

Meta-search (Pmut = 0)

 

Min

Max

Mean

Min

Max

Mean

Min

Max

Mean

1

37.2

1034.4

83.2

39.5

1034.4

88.2

41.2

1034.4

91.3

2

102.7

810

212.9

109.8

825.5

222.5

99.3

794.5

205.4

3

48.8

1223.7

195.7

53.4

1839.1

182.1

54.6

1102.7

147.4

4

100.6

977.4

234.1

108.2

977.4

204.0

115.4

992.0

243.3

5

277.4

2229.7

439.6

219.8

2107.7

418.5

172.3

2107.7

375.2

6

150.3

841.4

256.7

151.8

1037.2

284.3

155.2

1037.2

281.7

7

63.7

575.3

129.3

59.7

575.3

120.3

51.4

575.3

96.6

8

122.7

1462.3

271.6

107.7

1510.8

273.5

129.3

1510.8

310.3

9

87.3

1097.0

198.2

78.8

1097.0

177.6

88.5

1097.0

198.4

10

99.2

2091.7

258.7

130.4

2091.7

328.8

141.5

3371.7

430.3

Table 4: comparative results for function f2

Results are presented in tables 3 and 4. Our conclusions are the following: when the evaluation function (like f1) is close to those used in standard search engines, then the results obtained by the GA are identical to those obtained with a meta-search. But when the evaluation function is different (like f2), then the results obtained by the GA can be more relevant compared to a meta-search. This means that the genetic search can be valuable when 1) the user can wait for a longer time than in standard search engines, 2) queries are more complex or more precise than a list of keywords. We argue that strategic watch is such a domain.

The main perspectives that can be derived from this work are the followings: the comparison to other agent-based search like (Menczer et al. 1995), the use of a parallel GA to speed up the search, the self-adaptation of parameters which will improve the GA performances, the clustering of the results, the interactive personalization of the search, the use of a large disk cache, the study of a real world application in the context of strategic watch.

4. REFERENCES

  1. Albert R., Jeong H. and Barabasi A.-L. (1999), Diameter of the World Wide Web. Nature, 401:130-131, 1999.

  2. Broder A., Kumar R., Maghoul F., Raghavan P., Rajagopalan S., Stata R., Tomkins A. and Wiener J. (2000), Graph structure in the Web, Proceedings of the Ninth International World Wide Web Conference, Elsevier, 2000.

  3. Brin S. and Page L. (1998), The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, 30, 1-7, pp 107-117, 1998.

  4. Holland J.H. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.

  5. Lawrence S. and Giles C.L. (1999), Accessibility of information of the Web, Nature 400, pp 107-109.

  6. Menczer F., Belew R.K., Willuhn W. (1995), Artificial life applied to adaptive information agents, Spring Symposium on Information Gathering from distributed, Heterogeneous Databases, AAAI Press, 1995.


File translated from TEX by TTH, version 1.0.