Research and Implementation of Mobile Phone Virus Mining Technology Based on Incremental Clustering

Original English title: Research and Implementation Based the Incremental Clustering Mobile Phone Virus Mining Technology
Author: Meng De (孟德)
Degree level: Master's
Discipline: Computer Technology (professional degree)
Keywords: Mobile phone virus detection; Data mining; Incremental clustering; Comparison of algorithms; K-means
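Degree year: 2013
Advisor: Song Junde (宋俊德)
Discipline code: 0852
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)
Thesis submission date: 2012-12-15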

Abstract

With the steady progress of information technology and the continuing decline in communication charges, mobile phones have become increasingly indispensable in people's lives. Beneath this bright surface, mobile phone viruses have quietly entered people's lives as well. As computer viruses evolve rapidly, mobile phone viruses have not stood still either, and hybrid infection methods have emerged. Users face serious security risks both when they obtain a new phone and when they install new applications. To address this problem, this work studies clustering-based mobile phone virus mining and aims to implement a clustering algorithm that improves mining efficiency and lowers time complexity while preserving mining accuracy. The topic originates from a mobile phone virus mining engine project at a large foreign-owned enterprise; the main task was the development and testing of the project's clustering mining module.
     This thesis first presents the basic concepts and common types of mobile phone viruses, analyzes the behavior and attack mechanisms of each type as well as the harm they cause, and introduces several common virus prevention techniques. It then introduces the fundamentals of data mining and the commonly used clustering algorithms, examines clustering techniques in depth, and describes the storage structures that clustering algorithms typically use. The common K-means and DBSCAN algorithms are compared in terms of logic, performance, and other aspects. This provides the theoretical and technical groundwork for the subsequent research and implementation, and leads to the choice of a K-means-based incremental algorithm for the incremental mining of mobile phone viruses.
     Building on prior work and on the specific requirements of mobile phone virus mining, this thesis improves the K-means algorithm: normalizing the data raised virus mining accuracy by 15 percentage points on average. It also proposes an incremental algorithm based on K-means that can effectively apply incremental updates to data already mined by K-means. Memory usage and other aspects of the algorithm were optimized as well; under the same conditions, memory consumption fell by about 50% without changing the mining results. Finally, drawing on the experimental results, the thesis identifies the scenarios to which the algorithm is suited and the factors that affect clustering quality, providing guidance for later use of the algorithm.
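
The two ideas named in the abstract, normalizing the data before K-means and updating existing clusters incrementally, can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual formulas, so this assumes min-max scaling and a running-mean centroid update; the function names and sample values are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence

Vector = List[float]


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[Vector]:
    """Scale every feature column to [0, 1]; constant columns map to 0.0."""
    if not rows:
        return []
    lows = [min(col) for col in zip(*rows)]
    highs = [max(col) for col in zip(*rows)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def incremental_update(centroids: List[Vector], counts: List[int],
                       point: Sequence[float]) -> int:
    """Assign one new point to its nearest centroid and move that centroid
    to the running mean of its members. Returns the chosen cluster index.

    Assumption: counts[i] holds how many points cluster i already contains.
    """
    k = min(range(len(centroids)),
            key=lambda i: squared_distance(centroids[i], point))
    counts[k] += 1
    centroids[k] = [c + (p - c) / counts[k]
                    for c, p in zip(centroids[k], point)]
    return k


# Hypothetical feature rows with very different scales per column.
samples = [[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]]
scaled = min_max_normalize(samples)

# Two existing clusters with one member each, then one newly arrived point.
centroids = [[0.0, 0.0], [1.0, 1.0]]
counts = [1, 1]
chosen = incremental_update(centroids, counts, [0.5, 0.75])
```

Without scaling, the second column would dominate the distance computation, which is one plausible reason normalization helps a distance-based method such as K-means. In an incremental setting, newly arriving points would need to be scaled with the same column ranges as the original batch; this sketch leaves that bookkeeping out.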
