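     3. Conducted an in-depth comparison of internal images, external images, thumbnails, and Visual Snippets. Using human-labeled data, we compared how well each kind of visual summarization performs on different web pages. For example, internal images with high dominance scores are reliable visual summarizations for web pages that have internal images, while thumbnails suit web pages that have a small visible area, contain important images within the snapshot area, or show the logo of a well-known website within the snapshot area. In addition, we conducted user studies to analyze the usefulness of visual summarizations in two applications: understanding web pages and re-finding web pages.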
With the rapid development of the Internet, search engines have become the primary way for users to seek information. Among users' needs, accuracy and speed are the most important. However, the accuracy of current search engines does not fully satisfy users, so it is essential to help users quickly find the information they need with current search technologies.
     Web pages contain visual content such as images, animations, and videos. A picture is worth a thousand words: information search would become much more efficient if visual information were shown in the search result page, since users grasp an image more quickly than they read text. Visual content that helps users search in this way is called a visual summarization. Because the image is the basic component of animations and videos, we discuss the key technologies for using images as visual summarizations.
     For a specific web page, the images in the page itself, called "internal images", are generally reliable visual summarizations. For these images, we proposed a dominance model to measure how well they represent the web page. The more dominant an internal image is, the more appropriate it is as a visual summarization. However, many web pages have no dominant internal images, so we proposed a scheme to obtain images relevant to the target web page from the Internet, called "external images". We also compared these two kinds of natural-image-based visual summarizations with synthesized images, such as thumbnails. Based on the comparisons, we further proposed an algorithm to select the best visual summarization from the internal and external images. The main contents and contributions of this dissertation are as follows:
     1. Proposed a dominance model for internal images. Since web pages contain advertisement and decoration images, we proposed an algorithm to measure the dominance of internal images based on feature extraction and machine learning. Image features were extracted on four levels, and the LambdaMART algorithm, which is based on boosted trees and optimized for NDCG, was applied in our system to establish the dominance model.
     2. Proposed algorithms to obtain external images and measure their relevance to the target web page. Relevant external images were obtained from the Internet through key phrase extraction and image search, and their relevance was then calculated using the textual and visual information of these images. Our system can find relevant external images for almost half of the web pages without dominant internal images, and it achieves high precision.
     3. Performed comparisons between internal images, external images, thumbnails, and visual snippets. With a human-labeled data set, we analyzed the characteristics of the web pages that were well represented by each kind of visual summarization. For example, internal images with high dominance scores are reliable visual summarizations, and thumbnails are good visual summarizations for web pages with small page sizes, or with dominant images or logos of well-known sites in the snapshot area. In addition, we conducted user studies to compare the visual summarizations in web page understanding and re-finding tasks.
     4. Proposed an algorithm to jointly select the best visual summarization from the internal and external images. To combine the respective advantages of internal and external images, we proposed a clustering-based algorithm to select the best visual summarization. This algorithm used relevance and dominance as prior information and captured typicality through the affinity propagation clustering algorithm. The best exemplar found by the clustering algorithm was selected as the best visual summarization. Experimental results show that our algorithm achieves an NDCG@1 of about 0.6. Our user study also indicated that the images selected by our algorithm were useful as visual summarizations of web pages.
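     To make the NDCG@1 figure above concrete, the following is a minimal sketch of how NDCG@k can be computed for the ranked candidate images of one web page. It assumes the common exponential-gain form of DCG; the abstract does not state which gain function or label scale the dissertation uses, so the function names, the graded labels, and the example ranking are all hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items.

    Assumes the exponential-gain form (2^rel - 1) / log2(rank + 1),
    with ranks starting at 1. This choice is an assumption, not
    something the abstract specifies.
    """
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)  # position is 0-based
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical human labels (0 = poor, 2 = best) for the candidate
# images of one page, listed in the order a selector ranked them.
ranked_labels = [1, 2, 0]
print(ndcg_at_k(ranked_labels, 1))
```

     With k = 1 the metric only looks at the top-ranked image, so it measures how close the single selected image is to the best available candidate for that page. Averaging this value over all labeled pages would give a data-set-level score of the kind reported above.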
