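Research on Key Problems in Web Text Mining

Original title: Web文本挖掘中若干问题的研究
Author: Wang Zhanyi (王占一)
Degree level: Doctoral
Discipline: Signal and Information Processing
Keywords: text categorization; spam SMS filtering; text retrieval; complex network; activation force
Degree year: 2012
Advisor: Guo Jun (郭军)
Discipline code: 081002
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)
Thesis submission date: 2012-04-26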

Abstract

With the rapid development of the Internet and telecommunication networks, web text has become an important carrier and an indispensable source of information. Web text mining draws on data mining, pattern recognition, information retrieval, natural language processing and related fields to extract comprehensible, easy-to-use knowledge from large volumes of complicated text. This dissertation studies several key problems in web text mining: text categorization, SMS filtering, information retrieval and complex networks.
     (1) Multiclass text categorization. To address the weaknesses of Error Correcting Output Code (ECOC) decoding, this dissertation proposes a multiclass text categorization method based on Support Vector Machines (SVM) and probabilistic ECOC. Several binary classifiers are trained according to a suitable encoding matrix, and their decision values are converted to probabilities with a sigmoid function. Two decoding algorithms are introduced for determining the class of a test document: one computes the probability of each class's code sequence, the other solves for the pseudo-inverse of the encoding matrix. Experiments on standard Chinese and English datasets show that the proposed methods outperform traditional ECOC decoding and other classic classification algorithms. Moreover, accuracy remains fairly stable when the samples are unevenly distributed across classes.
     (2) Evolutionary SMS filtering. Spam SMS filtering faces rapidly changing content, strongly individual user preferences and scarce training samples. To address these difficulties, this dissertation proposes evolutionary spam SMS filtering algorithms and systems. First, a basic evolutionary filtering system based on a Naive Bayes classifier is introduced; its innovations are flexible user feedback, adaptive learning and evolutionary learning. Three personalized ways of feeding back training samples and class labels are proposed according to users' habits in using mobile phones. Adaptive learning updates the weights of the features in the SMS model, while evolutionary learning updates the features themselves. Moreover, to cope with few training samples together with a demand for high precision, an interlayer mapping-based SMS filtering algorithm is proposed. Experimental results show that the evolutionary method can effectively handle short messages arriving as a data stream and update the filter automatically. The interlayer mapping-based algorithm converges rapidly in accuracy and, once training samples are sufficient, can be combined with traditional classification algorithms to further improve filtering accuracy.
     (3) Web entity-oriented search. Built around the Entity Track of the Text REtrieval Conference (TREC), in which the author participated, this dissertation proposes a set of algorithms for mining and retrieving entities in web pages. Entity extraction combines manual assistance with automatic processing and rules with statistics, producing entity lexicons that cover dozens of types. A Document-Centered Model (DCM) and an Entity-Centered Model (ECM) are proposed for entity ranking, and semantic category labels are then introduced to improve retrieval accuracy. In addition, on the assumption that each entity in web pages should have a unique identifier, a rule-based homepage allocation algorithm is presented. A first-place result in the official evaluation confirms the effectiveness of the proposed methods. Furthermore, tests on the semi-structured English Wikipedia dataset indicate that semantic category labels improve the NDCG of DCM and ECM by 12.1% and 25.6%, respectively.
     (4) Modeling and applications of complex networks based on activation force and affinity measure. Taking natural language text as an example, statistics such as word frequency, co-occurrence and distance are combined to model the word activation effect known from biology and psychology, yielding the Word Activation Force (WAF). A word affinity measure is then computed from WAF and used to build an undirected word network for studying the semantic similarity between words. On this basis, WAF and the word affinity measure are applied to text representation, feature selection and text categorization. The same methods are also suitable for modeling Protein-Protein Interaction (PPI) networks and analyzing protein associations. In addition, an entity affinity measure helps improve re-ranking in entity retrieval. Experimental results demonstrate that complex network modeling based on activation force and affinity measure is of great significance for web text mining.
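
Item (1) describes converting SVM decision values to probabilities and then decoding either by class-sequence probability or by a pseudo-inverse of the encoding matrix. A minimal NumPy sketch of that idea follows. The abstract gives no formulas, so the Platt-style sigmoid parameters `a` and `b`, the independence assumption behind the product, the 0/1 membership matrix used for the pseudo-inverse, the toy code matrix and all function names are assumptions made for illustration, not the dissertation's actual formulation.

```python
import numpy as np

def sigmoid_probs(decisions, a=-1.0, b=0.0):
    """Map raw decision values to P(bit = +1) with a Platt-style sigmoid.

    a and b are hypothetical parameters; in practice they would be fitted
    on held-out data for each binary classifier.
    """
    return 1.0 / (1.0 + np.exp(a * np.asarray(decisions, dtype=float) + b))

def decode_by_codeword_probability(code, probs):
    """Score each class by the probability of its whole codeword,
    treating the binary classifiers as independent (an assumption)."""
    code = np.asarray(code)
    # Row k, column l: probability that classifier l emits the bit class k expects.
    per_bit = np.where(code == 1, probs, 1.0 - probs)
    return int(np.argmax(per_bit.prod(axis=1)))

def decode_by_pseudo_inverse(code, probs):
    """Least-squares estimate of class posteriors q from T q = p, where
    T[l, k] = 1 if class k lies on the positive side of classifier l."""
    membership = (np.asarray(code).T == 1).astype(float)  # shape (bits, classes)
    posteriors = np.linalg.pinv(membership) @ probs
    return int(np.argmax(posteriors))

# Toy code matrix: 3 classes (rows) by 3 binary classifiers (columns).
CODE = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])

decisions = [-1.2, 2.0, -0.4]  # illustrative SVM outputs for one document
probs = sigmoid_probs(decisions)
print(decode_by_codeword_probability(CODE, probs))
print(decode_by_pseudo_inverse(CODE, probs))
```

With a one-vs-rest code matrix the membership matrix is the identity, so the two decoders agree trivially; longer, redundant codewords are where the error-correcting behaviour and the difference between decoders would show up.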
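
Item (2) describes a Naive Bayes filter that keeps learning from user feedback, updating both feature weights and the feature set itself. The following is a rough sketch of how such incremental updating could look, not the system described in the dissertation. The class name, the whitespace tokenizer (real Chinese SMS text would need word segmentation), the `"spam"`/`"ham"` labels and the add-one smoothing are all assumptions.

```python
import math
from collections import Counter

class EvolvingNaiveBayesFilter:
    """Multinomial Naive Bayes whose vocabulary and counts grow with feedback."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = Counter()
        self.vocabulary = set()

    def feedback(self, text, label):
        """Incorporate one user-labelled message: unseen tokens extend the
        feature set, known tokens have their counts (and so weights) updated."""
        tokens = text.lower().split()
        self.word_counts[label].update(tokens)
        self.message_counts[label] += 1
        self.vocabulary.update(tokens)

    def spam_log_odds(self, text):
        """Return log P(spam | text) - log P(ham | text), add-one smoothed."""
        total = sum(self.message_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            # Smoothed prior, so a class without feedback keeps non-zero mass.
            score = math.log((self.message_counts[label] + 1) / (total + 2))
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in text.lower().split():
                if token in self.vocabulary:  # tokens never seen carry no evidence
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            scores[label] = score
        return scores["spam"] - scores["ham"]

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0

spam_filter = EvolvingNaiveBayesFilter()
spam_filter.feedback("win a free prize now", "spam")
spam_filter.feedback("are we still meeting for lunch", "ham")
spam_filter.feedback("free prize claim now", "spam")
print(spam_filter.is_spam("claim your free prize"))
print(spam_filter.is_spam("lunch meeting"))
```

Because the model is stored as raw counts, each feedback call costs time proportional to the message length, which suits messages arriving as a stream.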
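
Item (3) reports its Wikipedia results as relative NDCG improvements. For readers unfamiliar with the metric, here is a small sketch of one common NDCG definition (linear gain, log2 rank discount). The abstract does not state which variant or cutoff was used, so the gain function, the optional cutoff `k` and the example relevance grades are assumptions.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains, k=None):
    """DCG of the ranking as given, normalized by the DCG of the ideal ordering."""
    k = len(gains) if k is None else k
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0

# Relevance grades of the entities returned for one query, in ranked order.
baseline_ranking = [1, 0, 3, 2]
reranked = [3, 1, 2, 0]
print(ndcg(baseline_ranking), ndcg(reranked))
```

A relative improvement such as those quoted in the abstract would be computed as `(new - old) / old` on NDCG values averaged over all test queries.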
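
Item (4) says that WAF is computed from word frequency, co-occurrence and distance, but the abstract does not give the formula. The sketch below assumes one plausible form, `waf_ij = (f_ij / f_i) * (f_ij / f_j) / d_ij**2`, where `f_i` and `f_j` are word frequencies, `f_ij` counts how often word `i` precedes word `j` within a window, and `d_ij` is their mean distance in those co-occurrences. The formula, the window size and the function name are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def word_activation_forces(sentences, window=5):
    """Return {(i, j): waf} for ordered pairs where i precedes j within `window` tokens.

    Assumed form: waf_ij = (f_ij / f_i) * (f_ij / f_j) / d_ij ** 2.
    """
    freq = Counter()               # f_i: how often each word occurs
    co_count = Counter()           # f_ij: how often i is followed by j in the window
    dist_sum = defaultdict(int)    # summed forward distances, used for the mean d_ij
    for tokens in sentences:
        freq.update(tokens)
        for pos, left in enumerate(tokens):
            for offset in range(1, window + 1):
                if pos + offset >= len(tokens):
                    break
                pair = (left, tokens[pos + offset])
                co_count[pair] += 1
                dist_sum[pair] += offset
    forces = {}
    for (i, j), f_ij in co_count.items():
        mean_dist = dist_sum[(i, j)] / f_ij
        forces[(i, j)] = (f_ij / freq[i]) * (f_ij / freq[j]) / mean_dist ** 2
    return forces

corpus = [["new", "york", "city"], ["new", "york", "times"]]
forces = word_activation_forces(corpus)
for pair, value in sorted(forces.items()):
    print(pair, value)
```

The resulting forces are directed, since `(i, j)` and `(j, i)` are counted separately. An affinity measure that compares the in-links and out-links of two words, as the abstract describes, would then turn them into an undirected word network.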

