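     1. We propose a document similarity measure that embeds distribution information. Most existing Web mining techniques are based on the traditional vector space model (VSM). Although they achieve reasonable results, they ignore other useful information in Web documents. To address this problem, this thesis introduces the distribution information of the words in a document and proposes a new similarity measure. Experimental results show that the new similarity measure improves clustering performance.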
With the rapid development of the Internet, information resources on the Web have come to cover every field of society, and the problem of information overload grows more serious by the day, which has driven the development of Web data mining techniques. From the viewpoint of Web document clustering, this thesis studies the representation of a document's distribution information and the corresponding similarity measure, multi-view clustering, and kernel-based multi-view learning. The main contributions of this thesis are as follows:
     1. We propose a similarity measure that incorporates distribution information. Most existing Web data mining techniques are based on the vector space model (VSM), which achieves reasonable results but ignores other useful information contained in Web documents. In this thesis, we introduce a new similarity measure that uses the distribution information of the words in a document and extends the traditional similarity measure. Experiments show that the new similarity measure yields better clustering performance than the traditional one.
     2. We propose a new multi-view algorithm. In this method, different algorithms are applied to different views, which expresses the distributional features of the documents in the data set more clearly. Experimental results show that classification accuracy is improved.
     3. We propose a kernel-based co-training clustering algorithm. Different kernel functions induce different distances between the samples of the original space. In this thesis, extensive experiments were performed with the polynomial kernel and the Gaussian kernel; the results show that adopting kernel methods clearly improves the multi-view clustering algorithm.
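To illustrate the remark in the third contribution that different kernels induce different distances, the following minimal sketch uses the standard identity d(x, y)^2 = k(x, x) - 2k(x, y) + k(y, y) for the distance in the feature space of a kernel k. It is not the algorithm of this thesis: the function names, the parameter defaults (`degree`, `coef0`, `sigma`) and the toy document vectors are assumptions chosen only for illustration.

```python
import math


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel k(x, y) = (<x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_distance(x, y, kernel):
    """Distance between x and y in the feature space induced by `kernel`."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    # Clamp at zero to guard against tiny negative rounding errors.
    return math.sqrt(max(sq, 0.0))


# Hypothetical term-weight vectors for two documents (illustrative values).
doc_a = [1.0, 0.0, 2.0]
doc_b = [0.0, 1.0, 2.0]

d_poly = kernel_distance(doc_a, doc_b, polynomial_kernel)
d_gauss = kernel_distance(doc_a, doc_b, gaussian_kernel)
print(d_poly, d_gauss)
```

The same pair of documents ends up at a different distance under each kernel, so a distance-based clustering step run on each kernel's feature space may group the documents differently.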
