WebSkimming: An Automatic Navigation Method along Context-Path for Web Documents

Kazutoshi Sumiya
Graduate School of Informatics, Kyoto University
Yoshida-honmachi, Sakyo, Kyoto 606-8501, JAPAN
sumiya@i.kyoto-u.ac.jp

Minori Takahashi
Graduate School of Science and Technology, Kobe University
Rokkodai-cho, Nada, Kobe 657-8501, JAPAN
minori@ai.cs.kobe-u.ac.jp

Katsumi Tanaka
Graduate School of Informatics, Kyoto University
Yoshida-honmachi, Sakyo, Kyoto 606-8501, JAPAN
ktanaka@i.kyoto-u.ac.jp

ABSTRACT

In recent years, the growth of the WWW has rapidly expanded and it is not always easy to retrieve information. Search engines are very useful but they treat each page as an independent document. In this paper, we propose an automatic navigation method following the user's requirement in Web sites. This method detects the essential pages in the site and generates the order list of the pages to present results by linkage information of the pages. Furthermore, we have designed and implemented a prototype system based on our proposed method.

Keywords

Navigation; Browsing; Skimming; Context; User Interface.

1. INTRODUCTION

Generally, users have to view many pages by browsing one by one in a Web site. However, in such a situation, it is not easy to understand the structure and the contents of the site and to specify portions which fit the user's requirement.

To deal with the problem, Cha-Cha[1] and ODiN[2] have been proposed. Cha-Cha and ODiN are the system which analyzes the link structure of sites and adds the shortest course from a top page to reference result pages to a list of results. However, in these systems, it can not be said that it is easy to understand the structure of the sites. Moreover, since the lists are presented as static order, user requirements are not reflected.

In this paper, we propose an automatic navigation method following the user's requirement in Web sites. This method detects the essential pages in the site and generates the order list of the pages to present results by linkage information of the pages. We have designed and implemented a prototype system based on our proposed method.

2. Automatic Navigation

2.1 Essential Pages in Web Sites

The set of essential pages in a Web site must be the abstract of contents. Many kinds of content are included in a Web site. It is, therefore, very difficult to determine essential pages in the site, however, some techniques may be adapted to this problem. One is applying the results of search engines which are the answers of a keyword in a specific site. For example, if the name of a sports player is given as a keyword, the engine returns some pages on which the name is included. The pages may consist bibliography of the player, game results, speech es for show business and so on.

Another technique is automatic topic extracting. Several methods are proposed. In this research, because it is only required to detect the essential pages in a site, any of the methods are adaptable. Moreover, since topic extraction is not the scope of this research, we omit a description of it.

Figure 1 - Essential Pages.

2.2 Common Ancestor Page

It is not enough to show only essential pages in order. So as to show each essential page in order, we have to decide a starting point. We calculate a shared ancestry point for all essential pages. We call this page the common ancestor page. In Fig.1, the common ancestor page of Q, M, and U is E. The common ancestor page is found by following procedure:

step 1. If there are target pages in which the common ancestor page is not found, Step 2 will be performed. Otherwise, it ends.

step 2. Pages linked to each target page are taken out.

step 3. Paths are deleted when pages obtained by Step 2 are pages which it had already followed. When the obtained pages are pages which have already been followed, except for themselves or those visited simultaneously, the pages are a common ancestor page. Go to Step 1.

2.3 Navigation-Path

A navigation-path is calculated by the following procedure:(1) take out pages linked to each target page, and compute the reciprocal of out links of each page. (2) choose a path including the link where the calculated value is the largest. This procedure is repeated to the common ancestor page.

If navigation-paths are found for all target pages, the paths set showing the target page in a certain domain can be found.

2.4 Context-Path

Context-path is a set of the navigation-path and yet gives a presentation order to it. We generate context-paths based on similarity between pages. The degree of similarity between pages is calculated using tf-idf (term frequency-Einverse document frequency). The calculated degree of similarity is given to each path, and a depth priority search is performed using the degree. A page order is determined by this operation. The left hand side of Fig.2 shows the calculated degrees of similarity, and the right hand side shows the generated page order. The composition of the site is taken into consideration in this order of the page, and the context of a page is formed. In this case, the context-path is set as E -> I -> N -> U -> M -> G -> Q.

Figure 2 - Context-Path.

3. Presentation of Context-Path

The essential pages stated the previous section will contain meaningful part and other useless parts. So, we propose a method of main portion extraction from Web pages and a linkage method of main portions in each path. Generally, tags of HTML documents are classified as follows: structure tags, visualizing tags, embedding tags and so on (Structure tags are <H[1-6]>,<UL>,<OL>,<DL>,<TABLE>,<BLOCKQUOTE>and <P>). Structure tags are classified into two types: heading tags and other tags. When a page and a keyword are given, the main portion of the page is the neighbor of the position where the keyword exists in the page. The neighbor can be calculated as the maximum portion where it is divided by the nearest heading tags vertically, both up and down. If there are no heading tags in the page, the maximum portion is divided by nearest other structure tags.

Next, we describe how to present elements for context-paths. We generate the new path for this purpose which skim version of context-path to delete useless part of each page. The elements of the path are titles, anchors and partial documents of page. The path is called presentation-path.

If it shows the paths in order, the same scenes may be reproduced repeatedly. Therefore, we think that these paths must be merged. It is combining the path containing the same anchor as merges here.

The idea of merging is shown as follows. If there are paths with the same combination of path elements from a title of common ancestor page, those paths are called same-level-path-group. Each path in the same-level-path-group is compared from the head. If a path element which is different from the same class appears, the element will be considered as a key path. The relation of these key paths determines whether it is shown in parallel.

a) Structure tag of a key path is the same:

A designer of sites is considered to carry out the same semantic attachment to those anchors into the pages. Therefore, we thought that having shown the anchor in parallel made it to understand easier for users. In order to show it in parallel, the same level path group was combined and one reproduction path was generated.

b) Other cases:

When the structure tag differs, a designer is considered to have thought that he wanted to show the difference semantically by designing those anchors as different structures. Therefore, it reproduces as a separate path in the case.

4. Structure Pattern for Presentation Effect

The kinds of presentation elements included in presentation-paths are three: title, anchor, and content. There are five cases where different kinds of presentation elements are connected by presentation-paths. We call these structure patterns.

(1) anchor to anchor: An anchor points to another page. The anchor included in the page points to a different page.

(2) parallel anchors: An anchor points to another page. The anchors included in the page points to a different page.

(3)anchor to contents:An anchor points to another page and content exists on that page.

(4)parallel contents:An anchor points to other page and several contents exists on that page.

(5)content to anchor: An anchor including a content points to another page and another content exists on that page.

Figure 3 - Structure Pattern.

The animation which moves an elements is prepared for every pattern.

5. Prototype System

5.1 Outline

We call the system “e-dward” which is a prototype system that are based on our proposal method. In this section, we explain the outline of e-dward. First, a user specifies a website and their interested keyword. Then, processing which proposed in Section 2 and Section 3 is performed to the specified website, and a presentation-path is generated. The process is described in Perl language.

Next, we mount visual effects as presentation effects this time. The Flash[6] movies of Macromedia is used for the mounting. The templates by Generator are prepared so that Flash movies can be automatically created for each structure pattern. And Flash movies which present the specified contents are created by combining presentation-paths and templates dynamically. Here, since Flash movie are created for each structure pattern, they must reproduce this continuously. We use SMIL for this implement. By covering all Flash movie files with the <seq> tag of SMIL, it can be shown as it is reproduced one after another and each movie is reproducing one path.

Finally, when showing a presentation-path, different background music for each navigation-path are put in, and it is possible for users to understand the switchover of paths.

5.2 System Architecture

e-dward was applied to web site to verify the ability. The pages in which partial documents exist were 5 among 56 pages that exist in this site. After path candidates from those pages to the common ancestor page are generated, paths are merged and a presentation order is generated.

Figure 4 - Example of Path Navigation.

After path candidates from those pages to the common ancestor page were generated, a merger of paths was performed and a presentation order, as shown in Fig.4, was generated. ``T'' means a title of a site, ``A'' expresses an anchor and ``C'' expresses a partial document.

As a result of applying to the site, the appearance of the performed presentation screen is shown in Fig.5

Figure 5 - Screen Images using Presentation Effects.

6. Conclusion

We discussed a technique of extracting only a required portion from page groups which exists in a certain website. We proposed a presentation system called e-dward, which automatically adds performance in consideration of hyperlink structure of operation when showing them.

7. REFERENCES

Michael Chen, et al. : Cha-cha: a system for organizing intranet search results. : In Proc. of the 2nd USENIX Symposium on Internet Technologies and SYSTEMS : (1998).
K. Kazama, et al. : Multi-level Grouping of Search Results for the Search Engine(in Japanese) : In Proc. of The Second Workshop on Internet Technology : (1999).
Wen-Syan Li, et al. : Retrieving and Organizing Web Pages by “Information Unit” : The 10th International WWW Conference: pp.1375-1389 (2001).
N. Shinagawa, et al. : Dynamic Generation and Browsing of Virtual WWW Space Based on User Profiles : In Proc. of 5th International Computer Science Conference : Vol.1749 pp.93-108 (1999).
Y. Mizuuchi, et al : Finding Context-paths for Web Pages : In Proc. of ACM Hypertext ‘99 :pp.13-22 (1999).
Macromedia Flash: http://www.macromedia.com/flash/

Demo:

Example of Web site:
http://www.ai.cs.kobe-u.ac.jp/~minori/demo/index.html
Automatic Navigation for the site:
http://www.ai.cs.kobe-u.ac.jp/~minori/demo/WWW2002.html