Research on Data Pre-processing Algorithm in Web Log Mining

Original title: Web日志挖掘中数据预处理算法的研究
Author: Zhu Hexiang (朱鹤祥)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Web Log Mining; Data Pre-processing; User Identification; Session Identification
Degree year: 2010
Advisor: Li Rui (李瑞)
Discipline code: 081203
Degree-granting institution: Dalian Jiaotong University (大连交通大学)

Abstract

The rapid growth of the Internet, and in particular the worldwide spread of the Web, has made an enormous amount of information available online. Web mining extracts useful knowledge from this data: analyzing the overall behavior, frequency, and content of user visits yields general knowledge about how groups of users access a site, which can be used to improve the design of Web services. More importantly, understanding and analyzing these user characteristics helps in conducting targeted e-commerce activities.
     Web log mining applies data mining techniques to Web server logs to discover valuable patterns of site usage, which are applied to personalized services, Web site design, and business decision-making. Data preprocessing plays a crucial role in Web log mining; user identification and session identification are its main stages and form the foundation of the whole process. This thesis studies how to improve the accuracy of user identification and session identification algorithms.
     This thesis systematically describes the progression from data mining through Web data mining to Web log mining, focusing on Web log mining techniques and their steps, and examines the process and methods of data preprocessing, including user identification and session identification. The main contributions are as follows. First, an active-user-based user identification algorithm is proposed, which uses the IP address together with a finite user inactivity time (the access cutoff time) to distinguish users in the log. Experimental results show that it performs better than the basic user identification algorithm, even on small log files. Second, a definition of session identification is given and the traditional method based on a preset time interval is optimized; the algorithm is described in detail on the basis of its data structures, and experiments show that session quality improves.
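
The two preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the record layout `(ip, timestamp, url)`, the function names `identify_users` and `split_sessions`, and both thresholds (60 and 30 minutes) are assumptions chosen for illustration. The first function follows the idea of identifying users by IP address plus a finite inactivity time; the second shows only the traditional preset-interval session split that the thesis sets out to optimize, not the optimized version.

```python
from datetime import datetime, timedelta


def identify_users(records, max_inactive=timedelta(minutes=60)):
    """Label each (ip, timestamp, url) record with a user ID.

    Assumption: a request from an IP whose previous request is older than
    max_inactive is treated as coming from a new user.
    """
    active = {}    # ip -> (user_id, time of last request)
    labeled = []
    next_id = 0
    for ip, ts, url in sorted(records, key=lambda r: r[1]):
        entry = active.get(ip)
        if entry is None or ts - entry[1] > max_inactive:
            user_id = next_id          # unseen IP, or IP inactive too long
            next_id += 1
        else:
            user_id = entry[0]         # IP still active: same user
        active[ip] = (user_id, ts)
        labeled.append((user_id, ts, url))
    return labeled


def split_sessions(labeled, timeout=timedelta(minutes=30)):
    """Traditional preset-interval split: a gap longer than timeout between
    two consecutive requests of one user starts a new session."""
    sessions = {}   # user_id -> list of sessions, each a list of URLs
    last_seen = {}  # user_id -> time of that user's previous request
    for user_id, ts, url in labeled:
        user_sessions = sessions.setdefault(user_id, [])
        if user_id not in last_seen or ts - last_seen[user_id] > timeout:
            user_sessions.append([])
        user_sessions[-1].append(url)
        last_seen[user_id] = ts
    return sessions


# Hypothetical sample log, already parsed into (ip, timestamp, url) tuples.
sample = [
    ("10.0.0.1", datetime(2010, 1, 1, 9, 0), "/index"),
    ("10.0.0.1", datetime(2010, 1, 1, 9, 5), "/a"),
    ("10.0.0.2", datetime(2010, 1, 1, 9, 6), "/index"),
    ("10.0.0.1", datetime(2010, 1, 1, 9, 50), "/b"),
    ("10.0.0.1", datetime(2010, 1, 1, 11, 30), "/index"),
]
for user_id, user_sessions in split_sessions(identify_users(sample)).items():
    print(user_id, user_sessions)
```

In this sample the 45-minute gap before `/b` exceeds the assumed session timeout but not the assumed user inactivity limit, so it opens a second session for the same user, while the 100-minute gap before the last request is attributed to a new user sharing the same IP address.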

