Research and Application of Data Mining Based on Web Literature

Original title: 基于web文献的数据挖掘研究应用
Author: Gong Zhenping (龚真平)
Degree level: Master's
Discipline: Educational Technology
Keywords: data mining; incremental clustering algorithm; literature focused crawler; text similarity
Degree year: 2011
Supervisor: Huang Wenpei (黄文培)
Discipline code: 040110
Degree-granting institution: Southwest Jiaotong University (西南交通大学)
Thesis submission date: 2011-05-01

Abstract

With the popularization of higher education, university enrollment has risen from several hundred thousand to several million, and the state funds a large number of research projects, so tens of thousands of documents are produced every year. As Web literature accumulates, it becomes difficult to find useful information in the mass of literature data, and the literature therefore does little to improve working efficiency. The main purpose of this thesis is to use data mining techniques to extract useful information from large volumes of literature data in order to guide further work.

To select data mining algorithms suited to large volumes of literature data, this thesis first gives a brief introduction to the theory of data mining, presents the general procedure and formulas for computing text similarity, and analyzes and compares several clustering algorithms, identifying some of their shortcomings. Based on the principles for evaluating clustering quality and on the idea of incremental clustering, a cohesion-based incremental clustering algorithm is designed to address those shortcomings, and its parameters are then tuned through experiments. By reviewing the relevant literature and analyzing the detection results of the PaperPass software, a method for computing the similarity between documents is derived so that plagiarism can be checked. The similarity algorithm is further improved on the basis of the vector space approach to computing text similarity. Finally, to obtain large amounts of Web literature data, crawler techniques are studied and a literature focused crawler is designed and implemented.

To apply the above algorithms and give users an operational platform, a data mining system based on Web literature is designed. The thesis analyzes the goals and characteristics of the system, selects the technical approach, completes the design of the system architecture, its functions, and the division of its main modules, and designs the system database. Finally, the deployment and operation of the system are described and the relevant functions are demonstrated.
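
The abstract names two techniques without giving their formulas: text similarity computed in a vector space, and incremental clustering. As a rough illustration only, the sketch below uses cosine similarity over term-frequency vectors and a generic single-pass clustering rule. It is not the thesis's cohesion-based algorithm, whose definition does not appear in this record. The function names, the whitespace tokenizer, the use of average similarity to a cluster's members as a stand-in for cohesion, and the `threshold` parameter are all assumptions made for the example. Chinese text would first need a word segmenter.

```python
import math
from collections import Counter


def tf_vector(text):
    """Build a term-frequency vector from whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def incremental_cluster(docs, threshold=0.5):
    """Single-pass clustering (illustrative, not the thesis's algorithm).

    Each document joins the cluster whose members it is most similar to
    on average, or starts a new cluster when that average is below
    `threshold` (a hypothetical parameter).
    """
    clusters = []  # each cluster is a list of (doc_index, vector) pairs
    for i, text in enumerate(docs):
        vec = tf_vector(text)
        best, best_score = None, 0.0
        for cluster in clusters:
            score = sum(cosine_similarity(vec, v) for _, v in cluster) / len(cluster)
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best.append((i, vec))
        else:
            clusters.append([(i, vec)])
    return [[i for i, _ in cluster] for cluster in clusters]


docs = [
    "data mining clustering algorithm",
    "incremental clustering algorithm for data mining",
    "focused web crawler design",
    "web crawler for literature pages",
]
clusters = incremental_cluster(docs, threshold=0.3)
print(clusters)  # prints [[0, 1], [2, 3]]
```

Because documents are processed one at a time and existing assignments are never revisited, new literature can be added without reclustering the whole collection, which is the general appeal of incremental approaches. The result depends on document order and on the chosen threshold.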

