Web Log Mining and Its Implementation (Web日志挖掘及其实现)

English title: Research and Realization on Web Log Mining
Author: Liu Bin (刘滨)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: web log mining; association rule; data preprocessing; Apriori algorithm; reduced database
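Degree year: 2007
Advisor: Yang Jing (杨静)
Discipline code: 081203
Degree-granting institution: Harbin Engineering University (哈尔滨工程大学)
Thesis submission date: 2007-02-05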

Abstract

With the development of Internet technology, the World Wide Web is used ever more widely and websites have become commonplace. In today's fiercely competitive network economy, only those who win users gain a competitive advantage. Because customers' browsing behavior is now recorded digitally, it has become possible to collect large volumes of browsing data and study customer behavior in depth. How to extract valuable knowledge and information from these seemingly meaningless and tedious data is one of the most pressing problems at present. Web data mining technology emerged to address this problem.
This thesis focuses on web log mining and its steps. It studies the data preprocessing process and the solutions to its main difficulties, including user identification and path completion. The classic association rule algorithm, Apriori, is described in detail. Building on a study of several improved Apriori algorithms, the thesis improves Apriori by reducing the database and by modifying the join method, and proposes the I_Apriori algorithm. It also proves theoretically that the space complexity and time complexity of I_Apriori are lower than those of Apriori. To verify this result and to apply the studied techniques in practice, the thesis takes the website for the 50th anniversary celebration of Harbin Engineering University as the log mining target and analyzes the preprocessed log files with both Apriori and I_Apriori. The experimental results show that I_Apriori improves on Apriori in both space complexity and time complexity. To make the comparison more general, both algorithms were also run on the same log files under different minimum support thresholds; the results show that I_Apriori is more efficient than Apriori at each threshold tested. Finally, analyzing the log files with I_Apriori revealed problems in the structure and content of the website, and solutions to them are given.
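The abstract names database reduction as one of the two ideas behind I_Apriori but does not give the algorithm itself. The following is therefore only a minimal sketch of plain Apriori combined with a generic, well-known form of transaction reduction, not the author's I_Apriori. It relies on one fact: a transaction that contains no candidate k-itemset cannot contain any frequent (k+1)-itemset, so it can be dropped before the next pass. The function name `apriori_with_reduction`, the page names, and the use of an absolute count for `min_support` are all assumptions made for illustration.

```python
from itertools import combinations


def apriori_with_reduction(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    min_support is an absolute transaction count (an assumption of this sketch).
    """
    # Work on a private copy so that pruning never touches the caller's data.
    db = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets,
        # keeping a candidate only if all of its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Counting step with database reduction: a transaction that supports
        # no candidate k-itemset is dropped from all later passes.
        counts = dict.fromkeys(candidates, 0)
        reduced = []
        for t in db:
            if len(t) < k:
                continue  # too short to contain any k-itemset
            hit = False
            for c in candidates:
                if c <= t:
                    counts[c] += 1
                    hit = True
            if hit:
                reduced.append(t)
        db = reduced

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1

    return result


# Hypothetical sessions: each set holds the pages visited in one user session.
sessions = [
    {"index", "news", "history"},
    {"index", "news"},
    {"index", "history"},
    {"news", "history", "index"},
    {"contact"},
]
for itemset, support in apriori_with_reduction(sessions, min_support=3).items():
    print(sorted(itemset), support)
```

In this sketch each session produced by preprocessing plays the role of a transaction, and the scanned database shrinks from pass to pass, which is the general effect the abstract attributes to database reduction.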
