Research and Application of Association Rules in Data Mining (数据挖掘中关联规则的研究与应用)

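English title: Research and Application of Association Rule in Data Mining
Author: Zhang Youzhi (张友志)
Degree level: Master's
Discipline: Signal and Information Processing
Keywords (translated from Chinese): knowledge discovery; data mining; association rules; mining of Web user access logs; browsing path
Keywords (English): Knowledge Discovery from Database; Data Mining; Association Rule; Web Usage Data Mining; Traversal Path
Degree year: 2004
Advisors: Zhou Xixiang (周熙襄); Zhong Benshan (钟本善)
Discipline code: 081002
Degree-granting institution: Chengdu University of Technology (成都理工大学)
Thesis submission date: 2004-05-01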

Abstract

With the development of computer technology and the spread of the Internet, the volume of Web content and of Web usage information has grown explosively, sharpening the conflict between this flood of information and the limits of human attention. Web data mining is an effective way to resolve this conflict, but because Web data and Web applications have their own characteristics, traditional techniques cannot be applied directly to mining information on the Web. Web log data records users' visits to a Web site and holds a large amount of path information. Analyzing it helps site designers understand how users behave when they access the Web, and the results can be used to optimize the site structure and reorganize its pages.
Traditional association rule mining discovers relationships among transaction items in a database made up of a set of transaction records. This work introduces the notion of association rules into a Web mining system and represents user traversal paths in the form of association rules, with the aim of discovering users' access patterns from their behavior in a hypertext system. The thesis examines association rules in data mining systematically. After surveying the categories, research content, and current state of Web data mining, it gives heuristic rules and a formal description for deriving user paths from raw log data. On that basis it presents two methods for discovering users' access association rules. The first uses maximal forward references (MF), and its final step resembles the Apriori algorithm of traditional data mining. The second treats the hypertext system as a weighted directed graph; with confidence and support redefined so that they suit the representation of user access paths, it leads to a composite association rule mining algorithm.
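The maximal forward reference (MF) step named in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the thesis's own procedure, so the code follows the commonly described idea: a forward path is closed off whenever the user moves back to a page already on the current path. The function names `maximal_forward_references` and `frequent_subpaths`, the sample session, and the support threshold are all hypothetical. The counting step is a simplified stand-in for an Apriori-like pass and is not the thesis's algorithm.

```python
from collections import Counter


def maximal_forward_references(session):
    """Split one user's page sequence into maximal forward references.

    Assumption: revisiting a page already on the current path is a
    backward move; the forward path built so far is then emitted.
    """
    path = []               # pages on the current forward path
    moving_forward = False  # True if the previous move extended the path
    results = []
    for page in session:
        if page in path:
            # Backward move: emit the path only if we had been moving forward.
            if moving_forward:
                results.append(tuple(path))
            # Cut the path back to the revisited page (inclusive).
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    # The session may end while still moving forward.
    if moving_forward and path:
        results.append(tuple(path))
    return results


def frequent_subpaths(paths, length, min_support):
    """Count contiguous sub-paths of a fixed length across all paths.

    Support here is simply the number of paths containing the sub-path
    (an illustrative definition, not the redefinition used in the thesis).
    """
    counts = Counter()
    for path in paths:
        # A set ensures each path contributes at most once per sub-path.
        seen = {path[i:i + length] for i in range(len(path) - length + 1)}
        counts.update(seen)
    return {sub: n for sub, n in counts.items() if n >= min_support}


# Hypothetical session: each letter stands for one visited page.
session = list("ABCDCBEGHGWAOUOV")
paths = maximal_forward_references(session)
print(paths)
print(frequent_subpaths(paths, length=2, min_support=2))
```

For the sample session the sketch yields five forward paths: ABCD, ABEGH, ABEGW, AOU and AOV. Counting sub-paths over the maximal forward references, rather than over the raw click stream, keeps backward navigation (for example, pressing the browser's Back button) from being mistaken for an access pattern.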

