Image Ranking Schemes Using Link-Structure Analysis Algorithm

Chik Ching Yiu and Ip Che Yin
Department of Computer Science and Engineering
Chinese University of Hong Kong
Shatin, Hong Kong
cychik, cyip@cse.cuhk.edu.hk

ABSTRACT

The existing image search engines do not use link-structure analysis algorithms to analyze and classify images according to the link-structure on the Web. The image-ranking methods using link-structure analysis algorithms proposed by the paper [4] may cause the problem of the "Tightly-knit" community (TKC) effect as described in [3, 5, 7]. We propose three combinations of image-ranking schemes to solve the problem of TKC effect on image search. Our experiments demonstrate that the image-ranking methods proposed by [4] have the TKC effect, but the three proposed combinations of image-ranking schemes can solve the TKC effect on image search.

Keywords

Image Ranking Scheme, PageRank, Kleinberg's mutual reinforcement approach, LinkSink, TKC effect

1. INTRODUCTION

The most famous link-structure analysis algorithms are PageRank [6] and Kleinberg's mutual reinforcement approach [2]. However, PageRank has the problem of Link Sink [6] while Kleinberg's mutual reinforcement approach has the problem of "Tightly-knit" community (TKC) effect [3, 5, 7].

Link Sink occurs when page a and page b point to each other but have no links link to other pages. If page a is pointed to by an external page, during the iteration of PageRank, the loop accumulates the weight and never distributes the weight to other pages. This causes the oscillation of the algorithm and the algorithm cannot converge.

The paper about SALSA [5] describes the TKC and TKC effect as follows: "A 'tightly-knit' community is a small but highly interconnected set of sites. Roughly specking, the TKC effect occurs when such a community scores high in link-analyzing algorithms, even though the sites in the TKC are not authoritative on the topic, or pertain to just one aspect of the topic".

To solve these problems, [6] modifies the PageRank equation and [3] introduces a modification of Kleinberg's algorithm called Hub-Threshold Kleinberg's algorithm.

For image search using link-structure analysis algorithms, the paper on PicASHOW [4] defines a t-relevance score r_t(p) of page p and suggests that the authority scores calculated in Kleinberg's Mutual Reinforcement approach can be the t-relevance score. Then, it suggests three image ranking schemes. They are PageRank-influenced ranking scheme, Co-citation based analysis and Enhanced co-citation based analysis.

However, they still have the TKC effect if the t-relevant score is calculated using Kleinberg's algorithm. To solve the TKC problem, we propose there combinations of image-ranking schemes.

In this paper, we first propose three combinations of image-ranking schemes to solve the TKC problem. Then, we describe the experiments to evaluate the image-ranking schemes and demonstrate that the three combinations of image-ranking schemes proposed can solve the TKC problem. Beside the TKC problem, we also evaluated whether the algorithms had the link sink problem and showed the image-to-image relations. Moreover, we used large random generated link structures to evaluate the speed of the algorithms.

2. PROPOSED IMAGE RANKING SCHEMES

The image-ranking schemes we proposed are based on PageRank-influenced ranking scheme, enhanced co-citation based analysis, Hub-Threshold Kleinberg's algorithm and PageRank algorithm.

2.1 PageRank-influenced ranking scheme with PageRank algorithm

It uses the PageRank score of page p, instead of the authority score of page p calculated by the original Kleinberg's algorithm as the t-relevant score r_t(p) since PageRank does not have the problem of TKC effect.

2.2 PageRank-influenced ranking scheme with Hub-Threshold Kleinberg's algorithm

It uses the authority score of page p calculatied by Hub-Threshold Kleinberg's algorithm, instead of original Kleinberg's algorithm as the t-relevant score r_t(p) since Hub-Threshold Kleinberg's algorithm does not have the problem of TKC effect.

2.3 Enhanced co-citation based analysis with Hub-Threshold Kleinberg's algorithm

The enhanced co-citation based analysis is same as that suggested by the paper [4]. The difference is the t-relevant score r_t(p) used is the authority score of page p calculated by Hub-Threshold Kleinberg's algorithm, instead of original Kleinberg's algorithm since Hub-Threshold Kleinberg's algorithm does not have the problem of TKC effect.

3. EVALATION ON IMAGE RANKING SCHEMES

This section explains the evaluation of the image-ranking methods using synthetic link structures. We evaluated the image-ranking methods that may cause link sink problem or TKC problem on image search. The image-ranking methods are:

PageRank-influenced ranking scheme using the authority score of Kleinberg's algorithm as the t-relevant score. (PrinfKM)
Co-citation based analysis using Kleinberg's algorithm to calculate image scores. (CoKM)
Co-citation based analysis using Hub-Threshold Kleinberg's algorithm to calculate image scores. (CoHT)
Original enhanced co-citation based analysis using authority score from Kleinberg's algorithm as the t-relevant score. (ECKM)

Moreover, we also evaluated the three solutions to the TKC effect on image ranking to verify that they can solve TKC effect:

PageRank-influenced ranking scheme with PageRank algorithm (PrinfPR)
PageRank-influenced ranking scheme with Hub-Threshold Kleinberg's algorithm (PrinfHT)
Enhanced co-citation based analysis with Hub-Threshold Kleinberg's algorithm (ECHT)

Beside the results on link sink problem and TKC effect, we also evaluated whether the algorithms showed the image-to-image relations and used large random generated link structures to evaluate the speed of the algorithms.

3.1 Link Sink

The web structure in Figure 1 is used to evaluate the link sink problem on the image-ranking methods. In Figure 1, page 2 and page 3 form a link sink structure. Figure 2 shows the results of each method on the structure.

Figure 1: Page-image structure with link sink structure

Figure 2: The resultant score (rank) of each image in figure 1 from the image-ranking method

The result shows that PrinfPR is affected by PageRank algorithm; it has the link sink problem while the others do not. Images 3 and 4 of Figure 1 should not be highly ranked as shown by the other methods. But PrinfPR thinks that images 2, 3 and 4 should be highly ranked.

Note that PageRank-Influenced method sums the authority scores of the pages containing the image. The effects of Link Sink are the same if the page contains distinct image or multiple images.

3.2 TKC effect

The web structure in Figure 3 is used to evaluate the problem of TKC effect on the image-ranking schemes.

Figure 3: Web structure shows the relation of pages and images and the TKC structure.

In Figure 3, that the pages inside the dotted rectangle are relevant to the topic t. But the pages outside the rectangle may be little or not relevant to the topic t. The page 5 is relevant to topic t but it does not contain the keyword on t. But P4 which is a page or a small set of pages is irrelevant to topic t. Users want to find the images contained in page 5 (images 18 and 19) and the pages inside the rectangle (images 1-15). To emphasize the results and for simplification, Figure 4 shows the results of the each method on the structure considering P4 as a page.

Figure 4: The resultant score (rank) of each image in figure 3 from the image-ranking method

The results show the problem of TKC effect on PrinfKM, CoKM, CoHT and ECKM. Images 16 and 17 obtain very high scores from PrinfKM, CoKM, CoHT and ECKM. But images 16 and 17 may not be relevant to the topic.

PrinfPR, PrinfHT and ECHT solved the problem of TKC effect. The irrelevant images 16 and 17 do not obtain high scores from these methods. The other images become significant on PrinfPR, PrinfHT and ECHT.

Also, no TKC effect in page domain does not imply that there is no TKC effect in image domain. Hub-Threshold Kleinberg's algorithm does not induce TKC effect in our example, but CoHT has TKC effect and ECHT does not.

Moreover, the ranking scores of image 1 and images 2 to 15 are the same on the results produced by PrinfKM, PrinfPR and PrinfHT. But they are different on CoKM, CoHT, ECKM and ECHT. The ranking score of image 1 is higher than that of images 2 to 15. The ranking scores are same on PrinfKM, PrinfPR and PrinfHT because PageRank-influenced ranking scheme only analyzes the page-to-page relations in the link structure. It does not analyze the image-to-image relations.

On the other hand, co-citation based analysis and enhanced co-citation based analysis analyze the image-to-image relations. The ranking score of image 1 is higher because it is pointed to by page 6, which is a good image container [4] that also points to highly ranked image 19.

Likewise, for considering P4 as a small set of sites, the three ranking schemes proposed can still solve the problem of TKC effect while CoKM and CoHT cannot. Figure 5 shows the results of the each method on the structure considering P4 as a small set of sites.

Figure 5: The resultant score (rank) of each image in figure 3 from the image-ranking method

3.3 Speed

We also evaluated the speed of the three image-ranking schemes. We generated a large link structure by random and then counted the time used for performing PageRank-influenced ranking scheme, co-citation based analysis and enhanced co-citation based analysis on the structure.

The link structure generated had 1000 pages and 1000 images. The average out-degree of each page was 20 and each page contained 20 images. Figure 5 shows the calculation time of each scheme.

Figure 6: Calculation time of the three image-ranking schemes.

The result shows that enhanced co-citation based analysis and co-citation based analysis required much more time than PageRank-influenced ranking scheme. This was because the multiplication of the adjacency matrix of page-to-page relation and the adjacency matrix of page-to-image relation required a lot of time. The time complexity of multiplication of two matrices is at least O(N²), but that of PageRank-influenced ranking scheme is only about O(N), where N is the number of web pages or images. The difference is more significant if N increases.

3.4 Conclusion

PageRank-influenced ranking scheme with PageRank algorithm, PageRank-influenced ranking scheme with Hub-Threshold Kleinberg's algorithm and Enhanced co-citation based analysis with Hub-Threshold Kleinberg's algorithm can solve the problem of TKC effect. However, the first one may cause link sink problem and both first and second do not analyze the relations between images.

The speed of the third method is a problem when the number of pages or images is large because it is based on enhanced co-citation based analysis. The solution to this is to set a threshold. If the number of pages or images is less than the threshold, the search engine calculates the image scores using the third method. Otherwise, it calculates the image scores using the second method.

4. ACKNOWLEDGEMENTS

This project is supported in part by grants from the Hong Kong's Research Grants Council (RGC) under CUHK 4407/99E and #2050259.

5. REFERENCES

S. Brin and L. Page. The anatomy of a large scale hypertextual web search engine. Proc. 7th International WWW Conference, 1998. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm .
Jon Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46, 1999. http://www.cs.cornell.edu/home/kleinber/auth.pdf .
Allan Borodin, Gareth O. Roberts, Jeffrey S. Rosenthal and Panayiotis Tsaparas. Finding Authorities and Hubs From Link Structures on the World Wide Web. Proc. 10th International WWW Conference, 2001. http://www10.org/cdrom/papers/314/index.html.
Ronny Lempel and Aya Soffer. PicASHOW: Pictorial Authority Search by Hyperlinks on the Web. Proc. 10th International WWW Conference, 2001. http://www10.org/cdrom/papers/289/index.html.
R. Lempel and S. Moran. The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect. Proc. 9th International WWW Conference, 2000. http://www9.org/w9cdrom/175/175.html.
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. 1998. http://citeseer.nj.nec.com/page98pagerank.html.
David Cohn and Huan Chang. Learning to probabilistically Identify Authoritative Documents. Preprint, 2000. http://citeseer.nj.nec.com/cohn00learning.html.