Research on Theories and Methods of Internet Information Filtering in the Web 2.0 Environment

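English title: Research on Theories and Methods of Information Filtering under Web 2.0
Author: Li Dongfang (李东方)
Degree level: Doctoral
Discipline: Signal and Information Processing
Chinese keywords: Web 2.0 ; 信息过滤 ; 广告检测 ; 大规模聚类算法 ; 谱聚类 ; 热度扩散模型 ; 热点话题检测
English keywords: Web 2.0 ; Information Filtering ; Advertising Detection ; Large-Scale Clustering ; Spectral Clustering ; Heat Diffusion Model ; Hot Topic Detection
Degree year: 2009
Advisor: Yu Nenghai (俞能海)
Discipline code: 081002
Degree-granting institution: University of Science and Technology of China (中国科学技术大学)
Submission date: 2009-05-10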

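Abstract

The Internet has developed rapidly in recent years. With the steady progress of technologies such as Web 2.0, it carries more and more applications and information activities, and people depend on it more than ever. In the Web 2.0 era, on the one hand, media types on the Internet have become diverse. The auditory and visual information carried by multimedia complements traditional text, enriching both the content of the Internet and the browsing experience of its users. Filtering multiple media types effectively is therefore an important task for information filtering under Web 2.0. On the other hand, users are at the center of the Internet in the Web 2.0 era. The Internet shows social and dynamic characteristics, and large amounts of dynamic data emerge. These data greatly enrich Internet content and give people numerous sources of information. How to learn user habits from this user-created data and filter out the hot information in it has become an important research topic. In addition, massive user participation brings massive data, and how to adapt traditional algorithms to data at this scale has also become an important research topic. This thesis focuses on information filtering under Web 2.0. It analyzes the challenges the task faces, studies the integrated filtering of multiple media types, learning algorithms applicable to massive data, and the mining of the rich feedback data of Web 2.0 users, and proposes theories and methods to address these problems.
     The main research content and contributions of the thesis are as follows:
     To address the coexistence of multiple media types in the Web 2.0 era, this thesis proposes an information filtering method that integrates features from multiple media. For the problem of filtering advertising images on the Internet, it combines the text and image content of web pages with the SVM and AdaBoost learning algorithms and filters advertising images effectively. The thesis extracts rich media content features together with related page layout features and text features. It proposes an AdaBoost-based feature selection method that screens and integrates the feature set. It also builds a large-scale experimental dataset to validate the algorithm. The results confirm that the chosen feature set is reasonable and that the feature selection algorithm is feasible. The thesis further compares the classification performance and effectiveness of the individual features.
     Based on Normalized Cut, this thesis proposes a fast spectral clustering algorithm, FSC, for quickly clustering massive text data on the Internet. It analyzes the difficulties of applying spectral clustering to large-scale text clustering and gives solutions. FSC first uses the GSASH algorithm to represent large-scale, high-dimensional text data as a graph quickly, and then uses the AMG numerical method to iteratively reduce the large eigenvalue system arising in the spectral analysis to a smaller one, from which an approximate solution is obtained. The thesis also analyzes the validity of this approximation theoretically. Experimental results show that FSC retains the advantages of spectral clustering while reducing the complexity to O(n log n), so that it can be applied to large-scale text clustering.
     Based on a heat diffusion model, this thesis proposes an algorithm for evaluating and mining the popularity ("heat") of information in the Web 2.0 environment. It models the Web 2.0 Internet according to its social and dynamic characteristics. User information activities are treated as heat activities, an Internet heat diffusion model is built, user feedback is used to evaluate the heat of information on the Internet, and the hot topics are mined. The thesis defines the heat model in detail and proves its stability and the convergence of the algorithm. Experimental results show that the algorithm simulates information activities on the Internet well.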
The Internet has developed rapidly in recent years. As technologies such as Web 2.0 advance, more and more information activities and applications are carried on the Internet, and people have become more dependent on it than ever. In the Web 2.0 era, on the one hand, media formats on the Internet are diversified. Auditory and visual information, combined with traditional text, greatly enriches the content of the Internet and improves the user experience. Filtering multimedia information has therefore become an important task in Web 2.0 information filtering. On the other hand, users have become the center of the Internet. A vast amount of information is consumed and created by users. This user-created information enriches the content of the Internet and provides people with many information sources.
     Besides, the huge number of users and user actions has brought the Internet vast amounts of data. How to modify traditional machine learning algorithms to fit large-scale computing is a difficult research topic.
     We focus on information filtering in the Web 2.0 era. We analyze the challenges of information filtering under Web 2.0 and study three problems: filtering of various media types, large-scale machine learning algorithms, and mining of user feedback. We propose theoretical analyses and solutions to these problems. The main research content and contributions of this thesis are as follows:
     1. We propose a unified information filtering algorithm based on multiple features of multiple media types in the Web 2.0 era. For the advertising image detection problem, we use features such as image content and the text surrounding an image, and integrate machine learning algorithms such as SVM and AdaBoost. The filtering results demonstrate the effectiveness of our algorithm. The feature set combines media content features, web page visual layout features, and text features. These features are verified to be useful in classifying advertising images. Moreover, we propose a feature selection algorithm based on AdaBoost, which selects useful features from the original full feature set. We construct a large dataset to verify our algorithm. The experimental results demonstrate that our feature selection algorithm is feasible and reasonable. In addition, we compare the classification effectiveness of each feature.
     2. We propose a fast spectral clustering algorithm (FSC) based on Normalized Cut, which can perform clustering on large-scale text corpora. We analyze the bottlenecks of applying spectral clustering to large-scale text corpora and propose solutions. First, FSC uses the GSASH method to build a graph from a large-scale text corpus. Second, FSC uses the AMG method to iteratively reduce a large-scale eigenvalue system to a smaller one and obtains an approximate solution. We verify FSC both theoretically and experimentally. The experimental results demonstrate that the complexity of FSC is reduced to O(n log n) while the good performance of spectral clustering is preserved.
     3. We propose a hot topic evaluation and mining algorithm based on a heat diffusion model for the Web 2.0 environment. First, we model the Internet under Web 2.0 according to its dynamic and social properties. Second, we regard information activities on the Internet as heat activities and use a heat diffusion model to describe them. We use the feedback of web users as heat input, evaluate how hot each piece of information on the Internet is, and mine the hot topics. The thesis gives a detailed definition of the heat diffusion model and proves its stability and convergence. The experimental results demonstrate that our algorithm can simulate information activities on the Internet.
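
The idea of AdaBoost-based feature selection mentioned in the first contribution can be illustrated with a minimal sketch. It is a generic discrete AdaBoost over single-feature decision stumps, where the features used by the chosen stumps form the selected subset. It is not the thesis's actual procedure: the function names (`stump_predict`, `best_stump`, `adaboost_select`) and the toy data are assumptions made for illustration only.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Predict +1/-1 by comparing one feature against a threshold."""
    return polarity if row[feature] > threshold else -polarity


def best_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for feature in range(len(X[0])):
        values = sorted({row[feature] for row in X})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, feature, threshold, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def adaboost_select(X, y, rounds):
    """Run discrete AdaBoost; return the features its stumps used, in order."""
    n = len(X)
    w = [1.0 / n] * n
    selected = []
    for _ in range(rounds):
        err, feature, threshold, polarity = best_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        if feature not in selected:
            selected.append(feature)
        # Raise the weight of misclassified samples, lower the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, feature, threshold, polarity))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return selected


# Toy data (hypothetical): column 0 separates the classes,
# column 1 is noise, column 2 is constant.
X = [[0.1, 5, 1], [0.2, 1, 1], [0.8, 4, 1], [0.9, 2, 1]]
y = [-1, -1, 1, 1]
print(adaboost_select(X, y, rounds=3))
```

On real data the selected features would then be passed to a stronger classifier such as an SVM; that step is omitted here.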
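
The relaxed Normalized Cut problem behind the second contribution can be illustrated on a toy graph. The sketch below is not FSC: it skips both GSASH graph construction and AMG coarsening and instead finds the relevant eigenvector by dense power iteration in plain Python, which only suits very small graphs. The function name `ncut_bipartition`, the start vector, and the example weights are assumptions made for illustration.

```python
import math


def ncut_bipartition(W, iters=500):
    """Split a weighted graph in two using the relaxed Normalized Cut.

    W is a symmetric affinity matrix (list of lists) with positive degrees.
    """
    n = len(W)
    d = [sum(row) for row in W]           # node degrees
    s = [math.sqrt(di) for di in d]
    # Normalized affinity M = D^{-1/2} W D^{-1/2}; eigenvalues lie in [-1, 1].
    M = [[W[i][j] / (s[i] * s[j]) for j in range(n)] for i in range(n)]
    # The top eigenvector of M is proportional to D^{1/2} 1 and carries no cut.
    norm = math.sqrt(sum(d))
    u = [si / norm for si in s]
    # Deterministic start vector with mixed signs.
    v = [math.sin(i + 1) for i in range(n)]
    for _ in range(iters):
        # Deflate: remove the component along the trivial eigenvector.
        dot = sum(a * b for a, b in zip(u, v))
        v = [a - dot * b for a, b in zip(v, u)]
        # Apply (M + I) / 2 so all eigenvalues are non-negative and the
        # iteration converges to the second-largest eigenvector of M.
        Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [(a + b) / 2 for a, b in zip(Mv, v)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # Map back to the generalized eigenvector y = D^{-1/2} v and split by sign.
    y = [vi / si for vi, si in zip(v, s)]
    return [0 if yi < 0 else 1 for yi in y]


# Two triangles joined by one weak edge (hypothetical weights).
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = ncut_bipartition(W)
print(labels)
```

Which group receives label 0 depends on the sign of the eigenvector, so only the grouping itself is meaningful.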
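
The heat diffusion idea in the last contribution can be sketched with the standard discrete heat equation on a graph. This is a generic textbook form, not the model defined in the thesis: the function name `diffuse`, the step size `alpha`, and the toy graph are assumptions, and user feedback is represented simply as initial heat on a node.

```python
def diffuse(neighbors, heat, alpha=0.1, steps=10):
    """Explicit-Euler heat diffusion on a weighted undirected graph.

    neighbors[i] maps each neighbor j of node i to the edge weight w_ij.
    heat maps every node to its initial heat (for example, feedback counts).
    alpha times the largest weighted degree should stay below 1 for stability.
    """
    h = dict(heat)
    for _ in range(steps):
        nxt = {}
        for i, hi in h.items():
            # Heat flows along each edge in proportion to the heat difference.
            flow = sum(w * (h[j] - hi) for j, w in neighbors[i].items())
            nxt[i] = hi + alpha * flow
        h = nxt
    return h


# Path graph a - b - c (hypothetical); all feedback initially lands on "a".
neighbors = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0},
}
heat = {"a": 3.0, "b": 0.0, "c": 0.0}
result = diffuse(neighbors, heat)
print(sorted(result, key=result.get, reverse=True))
```

With symmetric weights the total heat is conserved, so ranking nodes by heat after a finite number of steps reflects both where feedback arrived and how close a node is to that feedback.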
