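Research on User Session Identification and Cluster Analysis in Web Logs

English title: Research on User Session Identification and Clustering Technology of Web Log Mining
Author: Zhu Jinhua (朱晋华)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Web log mining; session identification; interest degree transaction; user clustering
Degree year: 2008
Supervisor: Chen Junjie (陈俊杰)
Discipline code: 081203
Degree-granting institution: Taiyuan University of Technology (太原理工大学)
Thesis submission date: 2008-05-01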

Abstract

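With the rapid growth of the Internet in traffic, scale, and complexity, the network has become a platform on which people exchange and process information. Faced with such a huge volume of online information, users find it hard to discover personalized information effectively. Web mining emerged in response. Web log mining is an important branch of Web mining: it applies data mining techniques to Web server logs and analyzes the log files to discover the browsing patterns of users visiting a site. Web log mining generally consists of three stages: data preprocessing, pattern discovery, and pattern analysis.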
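     Data preprocessing comes first in Web log mining, because real-world data are mostly incomplete, noisy, and inconsistent, and come in many formats. Incorrect input can lead a data mining algorithm to wrong or inaccurate results, and such algorithms usually expect data in a fixed format, so the raw data must be processed into a form the mining algorithm can use. The tasks of data preprocessing are to repair incomplete and inconsistent data, remove noise, convert existing data into a format usable by the mining algorithm, extract the useful data, and integrate multiple data sources. Data preprocessing is a major part of the whole data mining process: its output is the input of the mining algorithm and directly affects the quality of the mining results. It is therefore an important research direction in Web log mining.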
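     Data preprocessing is carried out while the log files are converted into database files. It comprises four stages: data cleaning, user identification, session identification, and transaction identification.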
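The first two stages, data cleaning and user identification, can be illustrated with a minimal Python sketch. It is not the procedure used in the thesis: the combined log format, the list of ignored file suffixes, and the (IP address, user agent) heuristic for telling users apart are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed input: Apache-style common/combined log lines.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)
# Assumed list of embedded resources that do not count as page views.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def clean(lines):
    """Keep successful GET requests for pages; drop everything else."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # unparseable line
        rec = match.groupdict()
        if rec["method"] != "GET" or not rec["status"].startswith("2"):
            continue  # failed or non-GET request
        path = rec["url"].lower().split("?")[0]
        if path.endswith(IGNORED_SUFFIXES):
            continue  # image, stylesheet, or script
        records.append(rec)
    return records


def identify_users(records):
    """Group records by (IP, user agent), a common heuristic for user identity."""
    users = defaultdict(list)
    for rec in records:
        users[(rec["ip"], rec["agent"])].append(rec)
    return dict(users)
```

Each value returned by `identify_users` is one user's request list in log order, which is the input a session identification step would need.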
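     This thesis studies the main tasks of data preprocessing in depth and proposes a new method for session identification in Web log preprocessing, together with transaction identification based on users' browsing interests. The method combines several parameters, namely the download time, the user's degree of interest in the page content, the amount of information on the page, and the numbers of in-links and out-links of the page, to obtain an access-time threshold for each user on each page, and then identifies user sessions according to this personalized threshold. After session identification, pages and link pages that the user is not interested in are deleted according to the time the user spent on each page and the page's interest degree, and the user's Web access transactions are redefined as the final, effective sequences of Web page visits.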
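The idea of splitting sessions with a per-page threshold can be sketched as follows. The abstract names the inputs but not how they are combined, so `page_threshold`, its weights, and the `base` and `mean_info` defaults are hypothetical placeholders rather than the thesis formula.

```python
from dataclasses import dataclass


@dataclass
class Page:
    download_time: float  # seconds needed to fetch the page
    interest: float       # user's interest in the page content, in [0, 1]
    info_size: float      # amount of information, e.g. kilobytes of text
    in_links: int         # number of links pointing to the page
    out_links: int        # number of links leaving the page


def page_threshold(page, base=600.0, mean_info=10.0):
    """Hypothetical per-page time threshold in seconds (illustrative weights)."""
    # Pages with few out-links relative to in-links are treated as content
    # pages and get more time; link-heavy navigation pages get less.
    content_ratio = (page.in_links + 1) / (page.in_links + page.out_links + 2)
    size_factor = page.info_size / mean_info
    return page.download_time + base * (0.5 + page.interest) * size_factor * content_ratio


def split_sessions(visits, pages):
    """visits: one user's (timestamp_seconds, url) pairs, sorted by time."""
    sessions, current = [], []
    for i, (ts, url) in enumerate(visits):
        current.append(url)
        is_last = i + 1 == len(visits)
        # Close the session when the gap to the next request exceeds
        # the threshold of the page currently being viewed.
        if is_last or visits[i + 1][0] - ts > page_threshold(pages[url]):
            sessions.append(current)
            current = []
    return sessions
```

With these placeholder weights, a 250-second stay on a content page keeps the session open, while a 40-second stay on a link-heavy navigation page already closes it, which is the kind of behavior a single fixed timeout cannot express.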
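     Experiments show that the proposed method can identify sessions containing pages with long viewing times, and can also assign pages that fall below the fixed threshold to the next session. A large proportion of the sessions it finds are real sessions that are close to the users' actual purposes. In addition, irrelevant link pages are deleted according to the users' degree of interest in the pages they browse, forming new Web access transactions that provide good data for the subsequent cluster analysis and improve the efficiency of clustering.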
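Forming a transaction by dropping pages of low interest can be sketched in a few lines. The interest measure used here (seconds spent per unit of page content) and the `min_interest` cutoff are assumptions for illustration; the abstract does not give the thesis's definition of interest degree.

```python
def interest_degree(view_time, info_size):
    """Hypothetical interest measure: seconds spent per unit of page content."""
    return view_time / info_size if info_size > 0 else 0.0


def to_transaction(session, info_sizes, min_interest=2.0):
    """session: list of (url, view_time_seconds) pairs in visit order.

    Returns the URLs whose interest degree reaches min_interest,
    preserving the original visit order.
    """
    return [
        url
        for url, view_time in session
        if interest_degree(view_time, info_sizes[url]) >= min_interest
    ]
```

For example, a session in which the user skims `/index` and `/nav` for a few seconds but reads two news pages for most of a minute would reduce to a transaction containing only the two news pages.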
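     After preprocessing, a mining technique such as clustering or classification can be chosen according to the specific requirements. This thesis studies clustering techniques and the content and methods of current Web clustering, and finds groups of similar users by clustering the Web transactions that users access.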
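Grouping users by the similarity of their transactions can be illustrated with a small sketch. The abstract does not say which clustering algorithm the thesis uses, so the greedy leader clustering below, the Jaccard similarity measure, and the `min_similarity` value are stand-ins chosen only for brevity.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of page URLs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_users(transactions, min_similarity=0.5):
    """transactions: dict mapping user id to the list of URLs in the transaction.

    Greedy leader clustering: each user joins the first cluster whose
    leader is similar enough, otherwise starts a new cluster.
    """
    clusters = []  # each entry is (leader_pages, member_ids)
    for user, pages in transactions.items():
        for leader_pages, members in clusters:
            if jaccard(pages, leader_pages) >= min_similarity:
                members.append(user)
                break
        else:
            clusters.append((set(pages), [user]))
    return [members for _, members in clusters]
```

The result depends on the order in which users are processed, which is a known limitation of leader clustering; an iterative method such as k-means or fuzzy c-means would avoid this at the cost of more code.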

Abstract (author's English version)

With the rapid growth of the Internet in traffic, scale, and complexity, the Web has become an effective platform on which people communicate and process information. Given the tremendous amount of information on the network, discovering personalized information effectively has become a difficulty for users. Web mining techniques emerged to meet this need, and Web log mining is an important part of that research field. It applies data mining techniques to Web server logs and analyzes the log files to discover users' patterns of visiting sites. Web log mining has three phases: data preprocessing, pattern discovery, and pattern analysis.
     The first phase in Web log mining is data preprocessing, because most real-world data are incomplete, noisy, and inconsistent, and their formats vary. For a data mining algorithm, incorrect input may produce faulty or inaccurate results. Such algorithms also usually process data in a fixed format, so the varied real-world data must be converted into a form the mining algorithm can use. Data preprocessing has to repair incomplete and inconsistent data, eliminate noisy data, transform existing data into a format usable by the mining algorithm, extract useful data, and integrate multiple data sources. Data preprocessing is a major part of the whole data mining process. Its result is the input of the mining algorithm and directly influences mining quality, so data preprocessing is an important research topic in Web log mining. It is performed when log files are transformed into database files and includes four phases: data cleaning, user identification, session identification, and transaction identification.
     This thesis studies the main tasks of data preprocessing and puts forward a new method for session identification in Web log preprocessing and for transaction identification according to users' browsing interests. The method integrates parameters such as the download time, the user's interest in the page, the amount of information on the page, and the numbers of links into and out of the page to calculate each user's access-time threshold for each Web page, and then divides sessions according to this individual threshold. After session identification, it deletes the pages and link pages that the user is not interested in, according to the user's viewing time and interest in each page, and redefines the Web transaction as the effective sequence of page visits.
     Experiments show that the method can identify sessions in which users spend a long time on pages, and can assign pages that fall below the fixed threshold to the next session. The real sessions it discovers account for a large proportion of all sessions and are close to the users' actual intentions. The method also deletes irrelevant link pages according to users' interest in the pages and forms new Web transactions, which provides valuable data for cluster analysis and improves clustering efficiency.
     After data preprocessing, a mining technique such as clustering or classification is selected according to the specific requirements. This thesis analyzes clustering techniques and the content and methods of current Web clustering. By clustering Web transactions, groups of similar users can be found.

