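Research on XML-Based Web Text Mining and Association Rule Mining Algorithms

Original title: 基于XML的Web文本挖掘及关联算法的研究
Author: Wang Yan (王燕)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: XQuery; Apriori; XML documents; association rules; data mining
Degree year: 2011
Advisor: Su Yong (苏勇)
Discipline code: 081203
Degree-granting institution: Jiangsu University of Science and Technology (江苏科技大学)
Thesis submission date: 2011-01-05
Chair of the defense committee: Zhang Zaiyue (张再跃)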

Abstract

In recent years, with advances in computer technology and the spread of the Internet, the volume of data held on web servers at every level has grown enormously, and the kinds of data have become increasingly varied. How to use this data effectively and mine information of value to different fields has become an active research topic.

Although traditional database and data mining technologies have developed rapidly and continue to mature, Web data is semi-structured or unstructured, so traditional techniques face many difficulties when mining it. XML is a semi-structured data model, and as it has matured it has become widely used to represent information on the Internet. XML is extensible, platform-independent, and flexible, and it has strong expressive power for data, which has steadily strengthened its role in representing and exchanging information. Effectively extracting valuable information from the vast quantity of XML data is therefore an urgent problem.

Apriori is the classic algorithm for association rule mining and has been highly influential in the field. Because it scans the database repeatedly and consumes considerable space, many researchers have proposed improvements to it. Existing XQuery-based Apriori algorithms still leave room for improvement. For example, when the volume of XML data is very large, related data may be stored across several documents that have no inherent connection to one another. Current association rule algorithms mainly mine a single XML document, so mining multiple documents requires modifying the algorithm.

This thesis combines XQuery, the query language for XML, with association rule mining to implement an XQuery-based Apriori algorithm, and studies association rule mining across multiple XML documents. Without reducing mining efficiency, the algorithm is improved by introducing the XQuery collection function, which can access a collection of XML documents, so that multiple XML documents can be mined together. The improved algorithm is applied in an XML-based Web text mining model, which verifies its feasibility and effectiveness.
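
The idea of mining several XML documents as one pool of transactions can be illustrated with a minimal sketch. It is written in Python rather than XQuery and is not the thesis's implementation: the `<transaction>`/`<item>` layout, the sample documents, and the helper names `load_collection`, `apriori`, and `confidence` are all assumptions made for illustration. `load_collection` only loosely mimics what the XQuery collection function provides, namely treating several documents as a single input.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical sample data: two unrelated XML documents holding transactions.
DOC_A = """<transactions>
  <transaction><item>a</item><item>b</item><item>c</item></transaction>
  <transaction><item>a</item><item>b</item></transaction>
</transactions>"""
DOC_B = """<transactions>
  <transaction><item>a</item><item>c</item></transaction>
  <transaction><item>b</item><item>c</item></transaction>
  <transaction><item>a</item><item>b</item><item>c</item></transaction>
</transactions>"""


def load_collection(xml_texts):
    """Pool the transactions of several documents into one list."""
    transactions = []
    for text in xml_texts:
        root = ET.fromstring(text)
        for t in root.iter("transaction"):
            transactions.append(frozenset(i.text.strip() for i in t.iter("item")))
    return transactions


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets whose count >= min_support."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent from support counts."""
    return frequent[antecedent | consequent] / frequent[antecedent]


transactions = load_collection([DOC_A, DOC_B])
frequent = apriori(transactions, min_support=3)
```

With the sample data, the pair {a, b} appears in only two transactions of either document alone, so it reaches the support threshold of 3 only when both documents are pooled. This is the situation the abstract describes, where related data is split across documents.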
