A Rough-Set-Model-Based Clustering Method and Its Application in a Document Filtering System

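English title: The Application of Rough-Set-Model Based Text Clustering Algorithm in the Text Filtering
Author: Gu Bo (谷波)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords (translated): information filtering; user interest model; text clustering; vector space model; rough set
English keywords: Information Retrieval; User Profile; Text Clustering; Vector Space Model; Rough Set
Degree year: 2004
Advisor: Zhang Yongkui (张永奎)
Discipline code: 081203
Degree-granting institution: Shanxi University (山西大学)
Submission date: 2004-06-01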

Abstract

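Information filtering is a personalized, proactive information service mechanism and a useful complement to traditional information retrieval services. Information filtering covers many kinds of content, such as audio, images, and text; in this thesis it refers mainly to the filtering of documents. Clustering groups the objects of a problem space by similarity, placing similar objects in the same class so that the average distance between objects within a class is as small as possible and the distance between classes is as large as possible. In essence, clustering is a form of unsupervised learning. Applying clustering to information filtering can improve the system's filtering efficiency to some extent and also benefits the precision and recall of filtering. Applying clustering to text information filtering falls, in essence, within the scope of text mining.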
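     The uncertainty and vagueness of natural language make it difficult for computers to process. Because rough sets can not only describe imprecise concepts but also provide a measure of that imprecision, it is reasonable to apply rough set theory to the description of natural language.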
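The imprecision measure mentioned above can be illustrated with the standard Pawlak definitions of lower approximation, upper approximation, and accuracy. The sketch below shows general rough set theory only; it is not the text representation model of the thesis, which the abstract does not specify. The function names and the term classes are invented for illustration.

```python
from fractions import Fraction


def approximations(partition, target):
    """Return (lower, upper) approximations of `target` under `partition`.

    partition: iterable of disjoint sets (the equivalence classes).
    target:    the set X to approximate.
    """
    target = set(target)
    lower, upper = set(), set()
    for block in partition:
        block = set(block)
        if block <= target:   # class lies entirely inside X
            lower |= block
        if block & target:    # class overlaps X
            upper |= block
    return lower, upper


def accuracy(partition, target):
    """Pawlak accuracy |lower| / |upper|; 1 means X is exactly definable."""
    lower, upper = approximations(partition, target)
    if not upper:
        return Fraction(1)    # empty target: treated as exact here
    return Fraction(len(lower), len(upper))


# Hypothetical data: terms grouped into equivalence classes of near-synonyms.
term_classes = [{"car", "automobile"}, {"cluster", "group"}, {"filter"}]
concept = {"car", "cluster", "group"}

lower, upper = approximations(term_classes, concept)
measure = accuracy(term_classes, concept)
```

In this example the concept contains "car" but not "automobile", so that class falls in the upper approximation only, and the accuracy drops below 1.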
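     Against the background of rough set theory, this thesis studies in depth a rough-set representation model for text and the application of clustering under this model in an information filtering system. The work and contributions are summarized as follows:
     1. A new text representation model is proposed. The model builds on the rough-set idea of partitioning knowledge into equivalence classes and attempts to preserve the conceptual information of the text. A rough similarity is defined under the model, and a method for computing text similarity based on the model is proposed.
     2. Text clustering is applied to information filtering. The documents are clustered; at retrieval time, the user's query terms are first compared with the centroid of each cluster to find the nearest cluster, and only the documents in that cluster are then scored against the query terms. This narrows the search scope, improves retrieval efficiency, and to some extent reduces deviation in the retrieval results.
     3. Text clustering is applied to information filtering. Borrowing the idea of collaborative filtering, users are no longer treated as independent individuals but as groups linked by similar interests. The user models are clustered, so that when documents are delivered the unit of computation is the user-interest class rather than the individual user model; document recommendations are likewise made to user-interest classes, with the aim of improving filtering efficiency and accuracy.
     Experimental results show that the information filtering system incorporating the rough-set-based clustering method proposed in this thesis performs better than the original system.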
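The cluster-restricted retrieval in contribution 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the clusters are taken as already computed, documents and queries are sparse term-weight dictionaries, and cosine similarity stands in for the thesis's rough similarity, whose definition the abstract does not give. All names and data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Mean vector of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def search(query, clusters, docs):
    """Rank only the documents of the cluster whose centroid is nearest the query.

    clusters: dict cluster_id -> list of doc ids (assumed precomputed).
    docs:     dict doc_id -> sparse vector.
    """
    centroids = {cid: centroid([docs[d] for d in ids])
                 for cid, ids in clusters.items()}
    best = max(centroids, key=lambda cid: cosine(query, centroids[cid]))
    ranked = sorted(clusters[best],
                    key=lambda d: cosine(query, docs[d]), reverse=True)
    return best, ranked


# Hypothetical toy collection.
docs = {
    "d1": {"rough": 1.0, "set": 1.0},
    "d2": {"rough": 1.0, "cluster": 1.0},
    "d3": {"image": 1.0, "audio": 1.0},
}
clusters = {"theory": ["d1", "d2"], "media": ["d3"]}
best, ranked = search({"rough": 1.0, "set": 1.0}, clusters, docs)
```

The query is compared with one centroid per cluster and then only with the members of the winning cluster, so document "d3" is never scored. The trade-off is that relevant documents in other clusters are missed if the nearest centroid is a poor match.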
Information filtering is a personalized and active information service mechanism and a useful supplement to traditional information retrieval services. Clustering groups a set of objects in a problem space by similarity, placing similar objects in one cluster so that the average distance between objects within a cluster is as small as possible while the distance between clusters is as large as possible. Applying clustering to information filtering improves the filtering efficiency of the system to a certain degree and has a positive effect on the precision and recall of filtering.
    The uncertainty and vagueness of natural language make natural language processing (NLP) difficult. Rough sets can describe vague concepts and measure the degree of vagueness, so it is reasonable to describe natural language with rough sets. With rough set theory as background, this thesis studies in depth the rough set representation model of text and clustering based on this model. The main contributions and work of this thesis are as follows.
    (1) It proposes a new text representation model, which originates from the rough-set idea of partitioning knowledge into equivalence classes, defines a rough similarity under this model, and proposes a method for calculating text similarity under this model.
    (2) It applies text clustering to information filtering. After the documents are clustered, the user's query terms are compared at retrieval time with the center of each document cluster to find the cluster most similar to the query. Only the documents in the selected cluster are then scored against the query terms, so the retrieval range is reduced, retrieval efficiency is increased, and deviation in the retrieval results is overcome to a certain extent.
    (3) It applies text clustering to information filtering. Drawing on the idea of collaborative filtering, this thesis no longer regards users as separate individuals but as groups of people who share similar interests. User profiles are clustered, so that when documents are delivered the unit of calculation is no longer the individual user profile but the user-interest cluster, which also serves as the target of document recommendation, in order to improve filtering efficiency and precision.
    The experimental results show that the information filtering system with the rough-set-based clustering method performs better than the original system.
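The user-side idea in contribution (3) can be sketched in the same spirit: one similarity is computed per user-interest cluster instead of one per user. The sketch assumes the user clusters already exist, uses cosine similarity in place of the thesis's rough similarity, and introduces a delivery threshold that is purely hypothetical. Profiles, names, and values are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def class_centroid(profiles):
    """Mean of a non-empty list of sparse user-profile vectors."""
    total = {}
    for profile in profiles:
        for term, weight in profile.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(profiles) for term, weight in total.items()}


def recipients(doc, user_classes, profiles, threshold=0.3):
    """Return the users whose interest class matches `doc`.

    user_classes: dict class_id -> list of user ids (assumed precomputed).
    threshold:    hypothetical cut-off on class-level similarity.
    """
    selected = []
    for members in user_classes.values():
        center = class_centroid([profiles[u] for u in members])
        if cosine(doc, center) >= threshold:
            selected.extend(members)  # deliver to the whole class
    return sorted(selected)


# Hypothetical profiles and classes.
profiles = {
    "alice": {"clustering": 1.0, "rough": 1.0},
    "bob": {"clustering": 1.0, "filtering": 1.0},
    "carol": {"graphics": 1.0},
}
user_classes = {"mining": ["alice", "bob"], "graphics": ["carol"]}
result = recipients({"clustering": 1.0, "filtering": 1.0}, user_classes, profiles)
```

Here "alice" receives the document because her class matches it, even though her own profile does not mention filtering. That is the collaborative effect the paragraph describes, and also the source of possible mismatches for atypical members of a class.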

