Research on Web Usage Mining Based Users Consuming Patterns Discovery

Original title: 基于Web使用挖掘的用户消费模式发现研究
Author: Qu Yifei (曲义飞)
Degree level: Master's
Discipline: Systems Engineering
Keywords: Web usage mining; users consuming patterns; pages clustering; user most frequent navigation patterns; users clustering
Degree year: 2006
Advisor: Zhou Kuanjiu (周宽久)
Discipline code: 081103
Degree-granting institution: Dalian University of Technology
Submission date: 2006-04-22

Abstract

With the Web now widely used for information sharing, e-commerce, and online services, many companies invest heavily in building their own websites to publish information, in advertising their products and services on other sites, or in running e-commerce operations online. They urgently need to understand the returns and effects of these investments so that they can improve their strategies, capture more business opportunities, and offer users better service. Understanding user behavior is therefore critical for these companies.
     Taking Web log records as its basis, this thesis systematically analyzes and studies the Web usage mining process. Building on earlier models, it proposes four new models and methods, introduces them into the Web usage mining process, and designs and implements a Web usage mining system (WUMS) that discovers users' consumption patterns.
     The main contributions are as follows:
     1. A feasibility analysis of Web usage mining is carried out and the current difficulties are identified. The steps of data preprocessing are described in detail, and a new algorithm for path completion, called 'father nodes finding for complement', is proposed.
     2. For Web page clustering, a new model for building the page similarity matrix is proposed. When computing the reference similarity between pages, the model takes the user's browsing process into account, which makes the resulting page clusters more reasonable.
     3. For finding users' most frequent navigation paths, most traditional Web usage mining models consider only the distance between pages and ignore the structural hierarchy of the site, so mining precision is low and the results are unsatisfactory. This thesis proposes a new model for mining frequent user browsing paths that takes the hierarchical structure of the website into account. Experiments show that it overcomes the shortcomings of the traditional models.
     4. For Markov-based user clustering, a new model for building the user Markov transition matrix is proposed on the basis of the traditional model. The new model takes the topology of the website into account, which improves the precision of Web usage mining. The user clustering results are also combined with the users' most frequent navigation paths to discover the interests and preferences of user groups, which supports companies in making business decisions.
     Finally, the new models are applied to Web usage mining in practice. Drawing on the characteristics of relational databases, a Web usage mining system (WUMS) with visualization features is designed and implemented with JBuilder. Mining tests on nearly one month of log data from the laboratory's own website (http://202.118.69.137:8000) verify the feasibility and effectiveness of the proposed models.
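
As a rough illustration of the user Markov transition matrix mentioned in item 4 of the abstract, the sketch below estimates first-order transition probabilities from page sequences. It is a generic baseline and not the thesis's model: the abstract does not say how the site topology enters the matrix, so that part is left out. The function name `transition_matrix`, the sample sessions, and the page paths are all hypothetical.

```python
from collections import defaultdict


def transition_matrix(sessions):
    """Estimate first-order Markov transition probabilities.

    `sessions` is a list of page sequences (one list per user session).
    Returns a nested dict: matrix[src][dst] = P(next page is dst | current page is src).
    """
    # Count every observed consecutive page pair (src -> dst).
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            counts[src][dst] += 1

    # Normalize each row so that its probabilities sum to 1.
    matrix = {}
    for src, row in counts.items():
        total = sum(row.values())
        matrix[src] = {dst: n / total for dst, n in row.items()}
    return matrix


# Hypothetical sessions, as they might look after log preprocessing.
sessions = [
    ["/index", "/products", "/cart"],
    ["/index", "/products", "/index"],
    ["/index", "/about"],
]
matrix = transition_matrix(sessions)
```

With these sample sessions, the row for `/index` assigns probability 2/3 to `/products` and 1/3 to `/about`. A clustering step could then compare such matrices built per user, but how the thesis does this is not stated in the abstract.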

