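K-means Clustering Algorithm and Its Application in College Library Web Log Mining

Original title (Chinese): K-均值聚类算法及其在高校图书馆日志挖掘中的应用研究
Author: Kang Yaolong (康耀龙)
Degree level: Master's
Discipline: Systems Engineering
Keywords: Web log mining; Clustering; K-means algorithm; Library
Degree year: 2010
Advisor: Lu Caiwu (卢才武)
Discipline code: 081103
Degree-granting institution: Xi'an University of Architecture and Technology (西安建筑科技大学)
Thesis submission date: 2010-05-01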

Abstract

With Internet use now widespread, people leave behind a large amount of valuable information that can be analyzed. As these information stores keep growing, finding knowledge that is useful but not easily discovered has become an important research topic. Web log mining, applied to user access logs, is one way to address this problem.
     Based on the characteristics of library users' access behavior, this thesis applies clustering to mine the access logs of a university library. In the K-means clustering algorithm the initial cluster centers are chosen at random, which can reduce both clustering accuracy and efficiency. To address this, the thesis combines K-means with grid-based and related methods and proposes an improved K-means clustering algorithm, abbreviated IKM, which considerably improves clustering accuracy, efficiency, and robustness. For the log mining stage, a visual log mining support tool was designed and implemented. The tool can directly generate the table of input data vectors and compile statistics on the results of the clustering.
     Finally, the improved K-means algorithm is used to build the I-Weka mining tool: I-Weka is implemented on the Java platform by packaging the IKM clustering algorithm into Weka. I-Weka is then used to cluster the preprocessed university library log data. Analysis of the final results yields users' degree of interest in different categories of books, showing which categories attract more attention and which are held in insufficient numbers. This gives the library's acquisitions department a reference for purchasing, with the aim of using funds sensibly, improving the collection, and raising the library's service quality.
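
As a rough illustration of the initialization problem described in the abstract, the following Python sketch seeds K-means from the densest cells of a uniform grid instead of drawing random centers. The abstract does not say how IKM uses the grid, so this heuristic, the function names (`grid_seeds`, `kmeans`), the `bins` parameter, and the toy session vectors are all assumptions made for illustration; it is not the thesis's algorithm.

```python
import math
from collections import defaultdict


def centroid(members):
    """Mean of a non-empty list of equal-length tuples."""
    n = len(members)
    return tuple(sum(p[d] for p in members) / n for d in range(len(members[0])))


def grid_seeds(points, k, bins=4):
    """Return k initial centers: the centroids of the k most populated grid cells.

    Assumed heuristic (not taken from the thesis): split each dimension into
    `bins` equal intervals and rank the occupied cells by point count.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def cell_of(p):
        key = []
        for d in range(dims):
            span = highs[d] - lows[d]
            idx = int((p[d] - lows[d]) / span * bins) if span > 0 else 0
            key.append(min(bins - 1, idx))  # the maximum value falls in the last cell
        return tuple(key)

    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p)].append(p)
    if len(cells) < k:
        raise ValueError("fewer occupied grid cells than clusters; increase bins")
    # Densest cells first; ties are broken by cell index so the result is deterministic.
    ranked = sorted(cells.items(), key=lambda item: (-len(item[1]), item[0]))
    return [centroid(members) for _, members in ranked[:k]]


def kmeans(points, centers, max_iter=100):
    """Standard Lloyd iterations starting from the given centers."""
    def nearest(p):
        return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))

    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p)].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, [nearest(p) for p in points]


# Hypothetical input vectors: each tuple is one user session, holding the share
# of its requests that went to two made-up book categories.
sessions = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.1),
            (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
centers, labels = kmeans(sessions, grid_seeds(sessions, k=2))
print(labels)
```

Because the seeds are computed from the data rather than drawn at random, repeated runs produce the same clustering, which is the kind of stability that motivates replacing random initialization. One known weakness of this simple heuristic is that two neighbouring dense cells may belong to the same true cluster, so a practical variant would also need to keep the chosen seeds apart.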

