Text Mining and Its Application to Intelligent Retrieval in UDDI Registry

Author: 谭德坤 (Tan Dekun)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Text Mining; Intelligent Retrieval; UDDI Registry; Concept Space; Concept Retrieval; Personalization
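Degree year: 2004
Advisor: 丁志强 (Ding Zhiqiang)
Discipline code: 081203
Degree-granting institution: 昆明理工大学 (Kunming University of Science and Technology)
Thesis submission date: 2004-05-21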

Abstract

As Web Services technology matures, the volume of Web Service information stored in UDDI Registry will keep growing, and it becomes increasingly important to retrieve, quickly, conveniently, and accurately, the Web Services that meet a user's needs from this mass of information. Traditional retrieval based on keyword matching can no longer locate information as accurately and completely as users require. This thesis therefore takes the textual descriptions of Web Services as its object of study and proposes an intelligent information retrieval technique for UDDI Registry.
    Characterizing the document collection is the prerequisite and foundation of both text mining and information retrieval. This thesis uses a frequent sequential pattern mining algorithm to discover extended phrases, which serve as the feature terms of a document, and then applies a concept rank algorithm and the HITS algorithm to extract each document's theme concepts. A document's features are represented by these theme concepts.
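The phrase-discovery step can be illustrated with a minimal sketch. It assumes that phrases are contiguous word sequences and that support is counted once per document. The abstract does not give the thesis's actual mining algorithm, so the function name `frequent_phrases` and its parameters are hypothetical.

```python
from collections import Counter

def frequent_phrases(docs, min_support=2, max_len=3):
    """Return contiguous word sequences (length 2..max_len) that occur in
    at least `min_support` documents, mapped to their document counts."""
    support = Counter()
    for doc in docs:
        words = doc.lower().split()
        seen = set()  # count each phrase at most once per document
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                seen.add(tuple(words[i:i + n]))
        support.update(seen)
    return {" ".join(p): c for p, c in support.items() if c >= min_support}

# Illustrative service descriptions (made up for this example).
docs = [
    "weather forecast web service for cities",
    "currency exchange web service",
    "stock quote web service for traders",
]
phrases = frequent_phrases(docs, min_support=2)
# phrases holds "web service" (3 documents), "service for" (2),
# and "web service for" (2).
```

A real pipeline would also need word segmentation for Chinese text and stop-word filtering, which this sketch leaves out.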
    The core of intelligent retrieval is concept retrieval and personalized service. Concept retrieval over documents requires discovering the concepts of a domain and the relations among them, that is, building a concept space. This thesis applies text mining techniques to the documents a user has accessed in order to build that user's private concept space; the core algorithms are an improved K-Means document clustering algorithm and the FP-tree frequent pattern discovery algorithm (FP-growth). Because the concept space is mined from the user's accessed documents, it also captures the user's personal preferences, so concept retrieval delivers personalized service as well.
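One way to picture a concept space is as a set of weighted, directed links between concepts that frequently co-occur in the documents a user has visited. The sketch below is a simplification under stated assumptions: it counts concept pairs directly instead of building an FP-tree, it skips the clustering step, and the weight formula support(a, b) / support(a) is an illustrative choice rather than the thesis's definition.

```python
from collections import Counter
from itertools import combinations

def association_weights(doc_concepts, min_support=2):
    """Build directed weights w[(a, b)] = support(a, b) / support(a)
    for concept pairs that co-occur in at least `min_support` documents."""
    single = Counter()
    pair = Counter()
    for concepts in doc_concepts:
        items = sorted(set(concepts))  # sorted so each pair has one key
        single.update(items)
        pair.update(combinations(items, 2))
    weights = {}
    for (a, b), n in pair.items():
        if n >= min_support:
            weights[(a, b)] = n / single[a]
            weights[(b, a)] = n / single[b]
    return weights

# Theme concepts of documents one user visited (made up for this example).
visited = [
    {"weather", "forecast", "city"},
    {"weather", "forecast"},
    {"weather", "temperature"},
    {"currency", "exchange"},
]
weights = association_weights(visited)
# "forecast" always appears with "weather" (weight 1.0), while "weather"
# appears with "forecast" in two of its three documents (weight 2/3).
```

Because the counts come from one user's access history, the resulting weights differ from user to user, which is the sense in which such a concept space carries personalization.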
    Concept retrieval is the concrete embodiment of intelligent retrieval. To help users express their query intent more accurately, this thesis uses a Hopfield neural network algorithm to perform concept association on the user's set of query keywords and returns the associated concepts to the user for another round of feedback. It then gives a concept-matching method for comparing the query representation obtained after feedback with the document feature representations, and discusses how to organize the retrieval results.
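Concept association of this kind is often implemented as a Hopfield-style spreading-activation loop over the concept space. The following is a minimal sketch of that general idea, not the update rule used in the thesis, which the abstract does not state. The threshold `theta`, the sigmoid `slope`, the clamping of query concepts at full activation, and the example link weights are all assumptions.

```python
import math

def associate(links, query, theta=0.5, slope=0.1, eps=1e-4, max_iter=50):
    """Spread activation from the query concepts until activations settle.
    `links[(a, b)]` is the strength of the link from concept a to b."""
    query = set(query)
    nodes = {c for edge in links for c in edge} | query
    act = {c: (1.0 if c in query else 0.0) for c in nodes}
    for _ in range(max_iter):
        new = {}
        for j in nodes:
            if j in query:
                new[j] = 1.0  # assumption: query concepts stay fully active
                continue
            net = sum(w * act[i] for (i, k), w in links.items() if k == j)
            new[j] = 1.0 / (1.0 + math.exp(-(net - theta) / slope))
        delta = sum(abs(new[c] - act[c]) for c in nodes)
        act = new
        if delta < eps:  # activations have stopped changing
            break
    # Rank the non-query concepts by final activation, strongest first.
    return sorted(((c, a) for c, a in act.items() if c not in query),
                  key=lambda item: -item[1])

# Hypothetical link weights for illustration.
links = {
    ("weather", "forecast"): 0.9,
    ("forecast", "temperature"): 0.8,
    ("weather", "currency"): 0.1,
}
ranked = associate(links, {"weather"})
```

With these weights, "forecast" is activated directly by the query, "temperature" is reached one step later through "forecast", and "currency" stays near zero. The top-ranked concepts are what a system would offer back to the user as candidate query expansions.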
    Finally, to validate these results, the thesis proposes an intelligent retrieval system model that integrates the components above and works through a concrete retrieval example.
