Research on Key Technologies of Network Service Identification Based on Stream Clustering
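Abstract
With the rapid development of the Internet, network applications have flourished in variety. While this has raised social efficiency and enriched people's cultural life, it has also made the network environment more complex: large volumes of P2P traffic occupy bandwidth, causing congestion, degrading operators' quality of service, and making security problems increasingly prominent. There is therefore an urgent need for network management and monitoring to optimize network resources, resolve security problems, improve transmission capacity, and provide a sound basis for network planning and expansion. Network service identification is the foundation of, and an effective means for, such management and monitoring. Traditional identification techniques, which rely heavily on port numbers and packet payloads, can no longer cope with today's complex network environment. Data-mining-based identification extracts statistical information from network service flows and classifies or clusters them; it is better suited to identifying today's complex traffic and has therefore become one of the key research directions in network service flow identification.
     Considering the data-stream nature of network service flows, this thesis studies data stream clustering algorithms and network service identification schemes. The main contents and contributions are as follows:
     An arbitrary-shape data stream clustering method with an adaptive grid time-weight threshold: Grid techniques are fast, and their processing time depends only on the granularity of the grid partition. Because network service flows are distributed in arbitrary shapes in the data space and are skewed in both time and space, this thesis proposes a grid-based clustering algorithm for arbitrary-shape data streams. Based on a decay function, the method introduces the concepts of potentially dense grids and outlier grids and defines an adaptive grid time-weight threshold that reflects both the temporal skew and the spatial skew of network service flows. An online maintenance algorithm periodically checks and updates the two kinds of grids and deletes degraded grids, which improves storage and time efficiency during clustering. Experiments show that the algorithm identifies arbitrarily shaped, spatially skewed clusters in noisy data, and that it achieves good clustering quality and speed on network service flow data.
     A data stream evolution clustering method based on grid density: When analyzing network service flows, operators often want to know not only the traffic characteristics at a given moment but also how those characteristics change within a time period or between two periods. This thesis proposes a grid-density-based data stream clustering algorithm. It uses a density coefficient for data points to handle the temporal skew of network service flow data, defines a grid feature vector centered on grid density to reduce memory usage, and uses a pyramidal time frame to store snapshots of the online-maintained grid set according to fixed rules. This supports clustering of the current data, clustering of the data within the current time period, and analysis of how the data stream evolves over a given period. Experiments show that the algorithm is robust to noise, produces final clusters of arbitrary shape in response to different user requests, analyzes data stream evolution well, and achieves good clustering quality and processing speed on network service flows.
     A semi-supervised multi-level network service identification scheme based on stream clustering: The imbalance between long and short flows in network traffic, and their differing characteristics, mean that no single identification method can cover all network service traffic. This thesis applies different long/short flow criteria to flows carried over TCP and UDP and, by combining several identification techniques, proposes an online multi-level identification architecture that splits traffic by flow type: short flows are identified in stages using port-based, payload-based, and data-mining-based methods, while long flows are identified using data-mining-based methods. An analysis of traditional data-mining-based identification shows that methods based on traditional classification are limited by the training data set used to learn the classifier and are unsuitable for traffic that changes in real time; methods based on traditional clustering can discover natural clusters in the data, but their multiple passes over the data set are likewise unsuitable for dynamic traffic, and interpreting the resulting clusters is itself a research challenge. Taking the characteristics of network service flows fully into account, this thesis proposes a semi-supervised network service identification scheme based on stream clustering. The scheme uses a two-layer processing framework that scans online real-time traffic in a single pass; the resulting micro-clusters are stored in an offline time-snapshot database and maintained according to fixed rules. Offline macro-clustering selects the clustering algorithm and data according to user requests and produces the final clusters. The thesis also proposes building a mapping-rule database from the real-time data stream and updating and maintaining it at regular intervals: sampled flows are identified by other techniques, and mappings between the corresponding micro-clusters and network application types are established to help label the clusters with application types. In addition, the concept of sub-flows is introduced for long flows: sub-flow attribute features are extracted, and the best feature subset is selected for use in the identification scheme.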
With the development of the Internet, the number of network applications has increased rapidly. This has improved social efficiency and enriched people's cultural life, but it has also complicated the network environment. Congestion occurs as bandwidth is occupied by vast amounts of P2P traffic, service quality declines, and network security has become a serious problem. Hence there is an urgent need for network management and monitoring, which can optimize network resources, address security problems, improve transmission capacity, and provide a sound basis for network planning and expansion. Network service traffic identification is the foundation of, and an effective means for, such management and monitoring. However, traditional identification techniques rely excessively on port numbers and packet payloads and can no longer cope with complex network traffic. Data-mining-based identification extracts statistical information from network service traffic and classifies it using supervised or unsupervised methods. It is better suited to identifying complex network traffic and has become one of the key research directions.
     Considering the data stream characteristics of network service flows, our research concentrates on data stream clustering algorithms and network service traffic identification schemes. The main contents and contributions of this thesis are as follows:
     Clustering of arbitrary-shape data streams based on an adaptive grid time-weight threshold: Grid techniques offer high processing speed, and their processing time depends only on the granularity of the grid. Because network data streams form clusters of arbitrary shape and are skewed in both time and space, the thesis proposes a grid-based clustering algorithm for arbitrary-shape data streams. Based on a fading function, the algorithm introduces the concepts of potentially dense grids and outlier grids and defines an adaptive grid time-weight threshold that accounts for both the temporal and the spatial skew of network service data streams. An online maintenance procedure periodically checks the two kinds of grids and deletes degraded ones, which improves storage and time efficiency. Experiments show that the algorithm identifies arbitrarily shaped, spatially skewed clusters in noisy data and clusters network data streams with good quality and speed.
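The fading-weight and pruning idea described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's algorithm: the fading function `2 ** (-decay_rate * dt)`, the class name `DecayedGrid`, and the fixed `prune_below` threshold are all assumptions made for the example (the thesis defines an adaptive threshold whose form is not given in the abstract).

```python
import math


class DecayedGrid:
    """Toy grid summary whose cell weights fade exponentially over time."""

    def __init__(self, cell_size, decay_rate, prune_below):
        self.cell_size = cell_size      # edge length of each cell (assumed uniform)
        self.decay_rate = decay_rate    # lambda in the assumed fading function
        self.prune_below = prune_below  # fixed threshold; the thesis uses an adaptive one
        self.cells = {}                 # cell index -> (weight, time of last update)

    def _cell_of(self, point):
        # Map a point to the integer index of the cell that contains it.
        return tuple(math.floor(x / self.cell_size) for x in point)

    def _faded(self, weight, last, now):
        # Assumed fading function: weight halves every 1 / decay_rate time units.
        return weight * 2.0 ** (-self.decay_rate * (now - last))

    def insert(self, point, now):
        # Fade the cell's old weight to the current time, then add the new point.
        key = self._cell_of(point)
        weight, last = self.cells.get(key, (0.0, now))
        self.cells[key] = (self._faded(weight, last, now) + 1.0, now)

    def weight(self, key, now):
        # Current (faded) weight of a cell; 0.0 for cells that are not stored.
        weight, last = self.cells.get(key, (0.0, now))
        return self._faded(weight, last, now)

    def prune(self, now):
        # Periodic maintenance: drop cells whose faded weight is below the threshold.
        stale = [k for k, (w, t) in self.cells.items()
                 if self._faded(w, t, now) < self.prune_below]
        for key in stale:
            del self.cells[key]
        return stale
```

Calling `insert` for each arriving record and `prune` at a fixed interval keeps memory bounded by the number of cells that still carry recent weight; clusters of arbitrary shape would then be formed by joining adjacent dense cells, a step this sketch leaves out.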
     Evolution clustering of data streams based on grid density: In practice, users often want to know not only the characteristics of network data streams at a specific time, but also their characteristics over a given time horizon and how network traffic evolves between different periods. The thesis proposes a grid-density-based clustering algorithm for evolving data streams. A density coefficient for data records is used to handle the temporal skew of network traffic, and a pyramidal time frame is used to store snapshots of the grid set at specific times. The algorithm supports clustering at a specific time, clustering over a time horizon, and evolution analysis. Experiments show that the algorithm is robust to noise and performs well in data stream evolution analysis and processing speed.
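As a rough illustration of how a pyramidal time frame bounds snapshot storage, the sketch below follows the general scheme used in the CluStream framework: a snapshot taken at tick `t` belongs to the highest order `i` for which `alpha ** i` divides `t`, and only the most recent `alpha ** l + 1` snapshots of each order are kept. The class and parameter names are hypothetical, and the storage rules actually used in the thesis are not stated in the abstract.

```python
from collections import deque


class PyramidalSnapshots:
    """Keeps recent snapshots densely and older snapshots sparsely."""

    def __init__(self, alpha=2, l=1):
        self.alpha = alpha
        self.capacity = alpha ** l + 1  # snapshots retained per order
        self.orders = {}                # order -> deque of (time, snapshot)

    def order_of(self, t):
        # Largest i such that alpha**i divides t; ticks are assumed to start at 1.
        if t < 1:
            raise ValueError("clock ticks must be positive integers")
        i = 0
        while t % self.alpha ** (i + 1) == 0:
            i += 1
        return i

    def store(self, t, snapshot):
        # A bounded deque silently evicts the oldest snapshot of the same order.
        frames = self.orders.setdefault(self.order_of(t), deque(maxlen=self.capacity))
        frames.append((t, snapshot))

    def stored_times(self):
        return sorted(t for frames in self.orders.values() for t, _ in frames)

    def closest_before(self, t):
        # Latest stored (time, snapshot) taken at or before t, or None if none remains.
        candidates = [(ts, s) for frames in self.orders.values()
                      for ts, s in frames if ts <= t]
        return max(candidates, key=lambda pair: pair[0]) if candidates else None
```

To analyze a horizon of length `h` ending at the current time `now`, one would fetch `closest_before(now - h)` and compare that stored grid set with the current one; how the two grid sets are differenced depends on the grid feature vector and is not shown here.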
     Semi-supervised network service identification scheme based on data stream clustering: A single identification technique cannot analyze network service traffic comprehensively, because mice flows (short flows) and elephant flows (long flows) occur in unbalanced proportions and have different properties. The thesis applies different elephant thresholds to TCP flows and UDP flows and proposes a multi-level network traffic identification system that combines several identification techniques. In this system, mice flows are identified step by step using port-based, payload-based, and data-mining-based methods, while elephant flows are identified using data-mining-based methods only. Among data-mining-based approaches, traditional supervised methods are limited by the training data set used to learn the classifier and are not suitable for real-time traffic identification. Unsupervised methods can find natural clusters in traffic, but they require multiple passes over the data, and mapping the clusters to service applications efficiently remains difficult. Taking the features of network traffic into account, the thesis presents a semi-supervised network service traffic identification scheme based on data stream clustering. The scheme applies a two-phase framework that processes online real-time traffic in a single pass and periodically stores the set of micro-clusters in an offline time-snapshot database. In response to user requests, the offline component chooses a clustering algorithm and the relevant data from the time-snapshot database and generates the final clusters. The thesis also maintains an offline mapping-rule database, built by identifying sampled real-time flows with port-based or payload-based techniques and mapping the corresponding micro-clusters to application types.
In addition, the thesis uses different elephant thresholds to extract sub-flows from TCP and UDP elephant flows. Sub-flow features are extracted, and the best feature subset is chosen by a feature selection algorithm.
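The multi-level scheme and the micro-cluster mapping rules can be pictured with the following sketch. Everything concrete in it is an assumption made for illustration: the packet-count thresholds, the port table, the payload signatures, the flow dictionary fields, and the use of a simple majority vote to label a micro-cluster are not taken from the thesis.

```python
from collections import Counter, defaultdict

# Hypothetical values; the abstract gives no thresholds, ports, or signatures.
ELEPHANT_THRESHOLD_PACKETS = {"tcp": 20, "udp": 10}
PORT_RULES = {53: "dns", 25: "smtp"}
PAYLOAD_SIGNATURES = {b"BitTorrent protocol": "bittorrent", b"GET ": "http"}


def build_cluster_labels(sampled):
    """Build mapping rules from (micro_cluster_id, application) pairs.

    The pairs stand for sampled flows already labelled by port or payload
    inspection; each micro-cluster takes its most frequent label (assumed rule).
    """
    votes = defaultdict(Counter)
    for cluster_id, app in sampled:
        votes[cluster_id][app] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}


def identify(flow, cluster_labels):
    """Label one flow; `flow` has keys protocol, packets, dst_port, payload, cluster_id."""
    by_cluster = cluster_labels.get(flow["cluster_id"], "unknown")
    # Elephant flows go straight to the clustering-based (data mining) stage.
    if flow["packets"] >= ELEPHANT_THRESHOLD_PACKETS[flow["protocol"]]:
        return by_cluster
    # Mice flows: stage 1, port lookup.
    if flow["dst_port"] in PORT_RULES:
        return PORT_RULES[flow["dst_port"]]
    # Mice flows: stage 2, payload signature match.
    for signature, app in PAYLOAD_SIGNATURES.items():
        if signature in flow["payload"]:
            return app
    # Mice flows: stage 3, fall back to the micro-cluster mapping.
    return by_cluster
```

In a running system the `cluster_id` would come from the online micro-clustering stage, and `build_cluster_labels` would be rerun periodically on freshly sampled flows so that the mapping follows changes in the traffic.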
