Research on Web Structure Mining and High-Dimensional Data Mining
Abstract
Data mining is one of the frontier research directions in artificial intelligence, machine learning, pattern recognition, and decision support. With the rapid development of the Web and the growing capacity for collecting data, Web mining and high-dimensional data mining have become two important branches of data mining.
     The Web is one of the most important platforms for people to disseminate and obtain information. It now contains more than one billion web pages, a number that keeps growing day by day, and the amount of information on the Web is growing explosively. Because information on the Web is unstructured and self-organized, classical information retrieval techniques can hardly be applied effectively to Web data. Besides web pages, the Web contains a huge number of hyperlinks, and hyperlinks carry information for evaluating the importance of web pages. Web structure mining (also called hyperlink analysis) has therefore become an important way to improve the performance of Web information retrieval.
     Clustering is one of the basic methods of data mining and is widely used in many domains. In recent years, the data in many clustering problems have exhibited high-dimensional characteristics, such as transaction data, document-word frequency data, user rating data, Web logs, and multimedia data. Most classical clustering algorithms assume that the data lie in a low-dimensional space, so they cannot cluster high-dimensional data effectively. High-dimensional data clustering is now one of the key research problems in cluster analysis. Manifold clustering is a high-dimensional data clustering technique that has developed quickly and been widely studied in recent years.
     In this work we focus on Web link analysis and high-dimensional data clustering, two classical research problems in data mining. We study page-ranking algorithms based on link analysis in search engines, maximum-flow algorithms for identifying web communities, effective dissimilarity measures for manifold clustering, and sampling-based low-rank approximation schemes that reduce the computational burden of large-scale manifold clustering. The major contributions are summarized as follows:
     (1) We analyze the characteristics of the classical link-analysis ranking algorithms PageRank and HITS. For PageRank, which is query-independent, we propose a multi-level attenuation framework for the static ranking of web pages: direct and indirect hyperlinks between pages are assigned different weights according to the attenuation model. Experiments show that the modified PageRank framework improves the accuracy of search results. For HITS, which is query-dependent, we weight links according to the pages' similarity to the query topic and the popularity of the links; the modified HITS algorithm effectively alleviates the topic-drift problem.
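     For reference, the following is a minimal sketch of the standard PageRank power iteration that the first contribution builds on. It is written in plain Python for a toy adjacency-list graph; it does not include the proposed multi-level attenuation weighting, which would replace the uniform per-link share used below.

        # Minimal sketch of standard PageRank power iteration (baseline only).
        # graph: dict mapping every page to the list of pages it links to;
        # every link target must also appear as a key.
        def pagerank(graph, damping=0.85, max_iter=100, tol=1e-8):
            pages = list(graph)
            n = len(pages)
            rank = {p: 1.0 / n for p in pages}
            for _ in range(max_iter):
                new_rank = {p: (1.0 - damping) / n for p in pages}
                for p in pages:
                    targets = graph[p] or pages   # dangling page: spread its rank uniformly
                    share = damping * rank[p] / len(targets)
                    for q in targets:
                        new_rank[q] += share
                if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
                    return new_rank
                rank = new_rank
            return rank

        # Tiny example: three pages linking to each other.
        print(pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))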
     (2) We study the relationship between edge capacities and the size of the extracted community in the maximum-flow method for identifying web communities, and analyze the characteristics of the link structure from the viewpoint of community identification. We improve the original maximum-flow algorithm by exploiting the power-law distribution of web pages' in-degrees and out-degrees to differentiate the links among pages and assign edge capacities accordingly. The improved algorithm extracts fewer noise pages and improves the quality of the identified communities.
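     A rough sketch of the idea behind this contribution is given below: edge capacities in a Flake-style max-flow community extraction are made to depend on page degrees before the minimum cut is computed. The capacity formula and the networkx-based setup are illustrative assumptions, not the exact scheme described above.

        # Sketch: max-flow community extraction with degree-dependent capacities.
        # The capacity formula below is illustrative, not the exact scheme above.
        import math
        import networkx as nx

        def find_community(web_graph, seeds):
            """web_graph: nx.DiGraph of hyperlinks; seeds: known member pages."""
            g = nx.DiGraph()
            source, sink = "__source__", "__sink__"
            for u, v in web_graph.edges():
                # Capacity shrinks as the target's in-degree grows, so extremely
                # popular portal pages (likely noise for a focused community)
                # are cheap to cut away.
                g.add_edge(u, v, capacity=1.0 + 1.0 / math.log(2 + web_graph.in_degree(v)))
                if not web_graph.has_edge(v, u):   # make every link traversable both ways
                    g.add_edge(v, u, capacity=1.0 + 1.0 / math.log(2 + web_graph.in_degree(u)))
            for page in web_graph.nodes():
                g.add_edge(page, sink, capacity=1.0)   # every page drains to a virtual sink
            for s in seeds:
                g.add_edge(source, s)                  # no capacity attribute = unbounded
            # Pages on the source side of the minimum cut form the community.
            _, (source_side, _) = nx.minimum_cut(g, source, sink)
            return source_side - {source}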
     (3) We propose a neighborhood-path-based effective dissimilarity that strengthens the cluster structure of the low-dimensional representations obtained by manifold learning algorithms and consistently improves clustering performance. We further analyze how the approximation quality of the Nyström method depends on the choice of landmark points, and how the matrix approximation error affects the clustering performance of manifold clustering algorithms. Based on this analysis, we propose an incremental sampling scheme for Nyström-based manifold clustering, which improves the clustering quality of fast manifold clustering accelerated by the Nyström approximation.
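     As an illustration of the Nyström machinery behind the third contribution, the sketch below approximates the leading eigenvectors of a Gaussian (RBF) affinity matrix from a small set of landmark points using numpy. The uniform random sampling shown here is only the plain baseline, not the proposed incremental sampling strategy, and the function and parameter names are placeholders.

        # Sketch: Nystrom approximation of the top eigenvectors of an RBF kernel
        # matrix. Uniform landmark sampling is only the baseline scheme.
        import numpy as np

        def rbf(a, b, gamma=1.0):
            d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * d2)

        def nystrom_eigenvectors(x, n_landmarks=50, n_components=5, gamma=1.0, seed=0):
            rng = np.random.default_rng(seed)
            idx = rng.choice(len(x), size=n_landmarks, replace=False)
            w = rbf(x[idx], x[idx], gamma)              # landmark-landmark kernel block
            c = rbf(x, x[idx], gamma)                   # all points vs. landmarks
            evals, evecs = np.linalg.eigh(w)
            evals, evecs = evals[::-1], evecs[:, ::-1]  # sort eigenpairs in descending order
            evals = np.maximum(evals[:n_components], 1e-12)
            evecs = evecs[:, :n_components]
            u = c @ evecs / evals                       # Nystrom extension to all points
            return u / np.linalg.norm(u, axis=0)        # column-normalized eigenvector estimates

        # The columns of the returned matrix give a low-dimensional embedding that
        # can be clustered with k-means, as in spectral / manifold clustering.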