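Research on a User Clustering Algorithm Based on Selected Paths and Browsed Pages

Author: Huang Xiang (黄翔)
Degree level: Master's
Discipline: Software Engineering
Keywords (Chinese, translated): Web usage mining; personalization; clustering
Keywords (English): Web log mining; personalization; clustering
Degree year: 2010
Supervisor: Fei Hongxiao (费洪晓)
Discipline code: 081202
Degree-granting institution: Central South University (中南大学)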

Abstract

We have entered the Internet era, and advances in network technology have opened up new possibilities for online teaching. Existing online teaching systems hold a wealth of information, but teachers know little about how their students are actually learning, so these systems cannot meet students' individual learning needs. Applying Web log mining to students' online learning behavior reveals groups of similar students and the paths they are interested in browsing, which helps teachers adjust their teaching plans promptly and update the site structure.
This thesis studies Web log mining systems, following the steps of the Web log mining process. First, it examines Web log preprocessing, which is divided into six steps: data collection, data cleaning, user identification, session identification, path completion, and transaction identification. After reviewing the relevant theory and algorithms, it proposes an improved transaction identification algorithm that omits the path completion step and derives transactions directly from sessions. Second, it studies user clustering algorithms. Existing clustering algorithms based on Hamming distance consider only how many times a user visits a URL; they ignore how long the user stays on that URL and what operations the user performs on the page during that time. To address this shortcoming, the thesis proposes a user interest measure that combines path-selection interest with page-browsing interest, builds a corresponding clustering algorithm on it, and applies that algorithm to user clustering and to extracting browsing interest paths.
Building on this work, a Web log mining system based on the users' combined interest measure was designed and implemented. The system is implemented in JSP and helps administrators and teachers understand how students access the site and improve the site structure.
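
The contrast drawn in the abstract between a Hamming-distance comparison and an interest measure that also accounts for dwell time can be illustrated with a small sketch. This is not the thesis's formula, which the abstract does not give. It assumes a hypothetical per-user record of visit counts and dwell seconds per URL, a Hamming distance over binarized visited/not-visited vectors, and an illustrative interest score that mixes each URL's share of visits with its share of dwell time using an assumed weight `alpha`. All names and numbers are made up for illustration.

```python
from math import sqrt

# Hypothetical data: user -> {URL: (visit count, total dwell seconds)}.
# Illustrative only; none of these values come from the thesis.
USERS = {
    "u1": {"/intro": (3, 30), "/ch1": (1, 600)},
    "u2": {"/intro": (1, 20), "/ch1": (3, 610)},
    "u3": {"/intro": (3, 500), "/ch1": (1, 15)},
}
URLS = ["/intro", "/ch1"]


def hamming_distance(a, b, urls):
    """Count the URLs visited by exactly one of the two users (binarized view)."""
    return sum((u in a) != (u in b) for u in urls)


def interest_vector(visits, urls, alpha=0.5):
    """Assumed score per URL: alpha * visit share + (1 - alpha) * dwell-time share."""
    total_hits = sum(count for count, _ in visits.values()) or 1
    total_time = sum(secs for _, secs in visits.values()) or 1
    vector = []
    for u in urls:
        count, secs = visits.get(u, (0, 0))
        vector.append(alpha * count / total_hits + (1 - alpha) * secs / total_time)
    return vector


def cosine_similarity(x, y):
    """Cosine of the angle between two interest vectors (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = sqrt(sum(p * p for p in x)) * sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    vectors = {name: interest_vector(v, URLS) for name, v in USERS.items()}
    for a, b in [("u1", "u2"), ("u1", "u3")]:
        print(a, b,
              "hamming:", hamming_distance(USERS[a], USERS[b], URLS),
              "interest similarity:", round(cosine_similarity(vectors[a], vectors[b]), 3))
```

With this toy data every pair of users has Hamming distance 0, because all three visited both pages. Users u1 and u3 also have identical visit counts, yet the dwell-weighted score rates u1 as closer to u2, since both spent most of their time on `/ch1`. A clustering step (for example, grouping users whose similarity exceeds a chosen threshold) could then be run on these vectors; the threshold and the value of `alpha` would need to be tuned on real logs.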

