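Mining and Anomaly Analysis of Massive Telecommunications Data

English title: The Mining and Analysis of the Aberration of Mass Telecommunications Data
Author: 廖凡迪 (Liao Fandi)
Degree level: Master's
Discipline: Computer Science and Technology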
Keywords: outlier mining; parallel computing; multi-label classification; ETL
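Degree year: 2013
Advisor: 吴斌 (Wu Bin)
Discipline code: 0812
Degree-granting institution: Beijing University of Posts and Telecommunications
Thesis submission date: 2012-12-22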

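Abstract

With the rapid development of scientific research, communication technology, and IT, the volume of telecom service data has grown sharply, while increasingly fierce competition within the telecom industry forces operators to focus on network and service quality to stay competitive. Extracting anomalous yet useful latent information from large volumes of data is the main task of outlier mining, and it is also key to optimizing communication networks and achieving good quality of service.
This thesis studies the relevant data mining and parallel computing techniques, aiming to mine anomalous information from massive telecom data in order to guide network optimization and service quality improvement.
First, based on the characteristics of over-frequency users, the thesis proposes an anomaly analysis algorithm that combines outlier detection with the clustering coefficient. The outlier detection component improves the density-based LOF algorithm, chiefly by using SimHash to replace the k-nearest-neighbor search that is the performance bottleneck of the original LOF. Next, based on the characteristics of ping-pong handover, the thesis proposes using multi-label classification to predict solutions for ping-pong handover; the classifier builds on a random-walk-based multi-label classification algorithm and is implemented using the law of total probability and stochastic processes. So that the improved algorithms scale to big data, all of them are written in the MapReduce programming framework, and they trade space for time to reduce time complexity and achieve parallel computation. Extensive experiments on several datasets show that the proposed parallel over-frequency analysis algorithm and parallel ping-pong handover solution prediction algorithm achieve high accuracy and a substantial performance advantage.
Finally, the thesis presents the design of a prototype anomaly analysis system. It combines Hive and MapReduce programming to preprocess the raw data and performs ETL and statistics according to different business logic. Analysis with the various parallelized data mining algorithms yields results with business meaning, which are displayed in the front-end interface.
By introducing different machine learning algorithms into topic-specific applications, the thesis overcomes the drawbacks of manual anomaly detection, namely low efficiency and accuracy that is easily affected by subjective factors. Extensive experiments show that the proposed anomaly analysis method and system have considerable advantages over traditional anomaly analysis systems.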
With the rapid development of scientific research, telecommunication technology, and IT, the volume of telecom service data is growing significantly. Meanwhile, fierce competition among telecom service providers makes them pay more attention to network and service quality in order to increase their competitiveness. Obtaining anomalous but useful latent information from large amounts of data is one of the most important parts of ensuring quality of service, and it is also the main task of outlier mining.
This thesis undertakes a series of studies on the relevant data mining and parallel computing techniques in order to improve network and service quality by mining anomalous information from large amounts of telecom data.
Based on the characteristics of abnormal users, this thesis proposes an anomaly analysis algorithm that combines outlier detection with the clustering coefficient. The outlier detection algorithm is an improvement of the density-based LOF algorithm that replaces its k-nearest-neighbor search with the SimHash algorithm. In addition, based on the characteristics of ping-pong handover, the thesis proposes using a multi-label classification algorithm to predict solutions for ping-pong handover. Building on random-walk-graph multi-label classification, it combines the law of total probability and stochastic processes to implement the multi-label classifier. To enable the improved algorithms to cope with large volumes of data, all of them are implemented in the MapReduce framework. Moreover, the design reduces time complexity by trading space for time and thereby achieves parallel computation. Extensive experimental results show that the anomaly analysis algorithm and the parallel ping-pong handover prediction algorithm are relatively accurate and efficient.
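For readers unfamiliar with LOF, the density-based score that the outlier detection step builds on can be sketched as follows. This is a minimal single-machine illustration of the standard LOF definition, not the thesis's implementation: the function names `knn` and `lof_scores` and the sample points are assumptions made here, the points are assumed to be distinct, and the neighbor search is plain brute force, which is the step the thesis reports replacing with a SimHash-based search.

```python
import math


def knn(points, i, k):
    """Brute-force k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (assumes distinct points)."""
    n = len(points)
    neighbours = [knn(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]

    def lrd(i):
        # Local reachability density: inverse of the mean reachability
        # distance from point i to its neighbours, where
        # reach-dist(i, j) = max(k-distance(j), d(i, j)).
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        return len(reach) / sum(reach)

    density = [lrd(i) for i in range(n)]
    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [
        sum(density[j] for _, j in neighbours[i]) / (k * density[i])
        for i in range(n)
    ]


# Hypothetical data: five points in a tight cluster and one far away.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = lof_scores(sample, k=3)
# Scores near 1 indicate inliers; the isolated point scores far above 1.
```

Computing the neighbor lists dominates the cost here, since every point is compared with every other point, which is why an approximate neighbor search is an attractive replacement at scale.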
Finally, the thesis presents a prototype design of an anomaly analysis system that combines Hive and MapReduce to perform data preprocessing, ETL, and statistics according to different service logic. Analyzing the data in parallel with different data mining algorithms yields meaningful results, which are shown in the user interface.
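The kind of MapReduce-style statistic described above can be illustrated with a small in-memory emulation of the map, shuffle, and reduce phases. This is a sketch under stated assumptions, not the thesis's system: the record layout `(user_id, timestamp, src_cell, dst_cell)`, the 10-second window, and the names `map_record`, `reduce_user`, and `run_job` are all hypothetical, and a real job would run on Hadoop rather than in a single Python process.

```python
from collections import defaultdict


def map_record(record):
    """Map step: key each handover event by user so one reducer sees a user's history."""
    user_id, timestamp, src_cell, dst_cell = record
    yield user_id, (timestamp, src_cell, dst_cell)


def reduce_user(user_id, events, window=10):
    """Reduce step: count A->B handovers followed by B->A within `window` seconds."""
    ordered = sorted(events)  # sort by timestamp
    count = 0
    for (t1, a1, b1), (t2, a2, b2) in zip(ordered, ordered[1:]):
        if a1 == b2 and b1 == a2 and t2 - t1 <= window:
            count += 1
    yield user_id, count


def run_job(records, window=10):
    """Emulate the shuffle phase in memory: group mapped values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    results = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_user(key, values, window):
            results[out_key] = out_value
    return results


# Hypothetical handover log: u1 bounces between cells A and B, u2 does not.
records = [
    ("u1", 0, "A", "B"),
    ("u1", 4, "B", "A"),
    ("u1", 7, "A", "B"),
    ("u2", 0, "A", "B"),
    ("u2", 60, "B", "A"),
]
ping_pong_counts = run_job(records)
```

Keying by user means each reducer needs only one user's events, so the work partitions cleanly across machines; per-user counts like these could then feed the downstream analysis algorithms.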
The thesis applies different machine learning algorithms to specific applications, overcoming the shortcomings of manual detection, namely low efficiency and susceptibility to subjective factors. A large number of experiments show that this anomaly analysis system has an advantage over traditional ones.
