Quick Knowledge Reduction Based on Divide and Conquer Method in Huge Data Sets
Abstract
The problem of processing huge data sets has been studied for many years, and many effective technologies and methods have been developed for it. Random sampling [1] was proposed by Catlett in 1991, but it fails when the number of samples exceeds 32,000. K. C. Philip first divided a big data set into subsets that fit in memory, and then developed a classifier for each subset in parallel [2]; however, its accuracy is lower than that of methods processing the data set as a whole. SLIQ [3] and SPRINT [4], developed by the IBM Almaden Research Center in 1996, are two important algorithms able to deal with disk-resident data directly, with performance equivalent to that of classical decision tree algorithms. Many other improved algorithms, such as CLOUDS [5] and ScalParC [6], were developed later. RainForest [7] is a framework for fast decision tree construction on large data sets; its speed and effectiveness are better than SPRINT's in some cases. L. A. Ren, Q. He and Z. Z. Shi used hyper-surface separation and the HSC classification method to classify huge data sets and achieved good performance [8, 9].

Rough Set (RS) theory [10] is a valid mathematical theory for dealing with imprecise, uncertain, and vague information. Since it was proposed by Professor Z. Pawlak in 1982, it has been applied successfully in fields such as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. Attribute reduction is a key issue in rough set based knowledge acquisition, and many researchers have proposed algorithms for it. These reduction algorithms can be classified into two categories: reduction without an attribute order and reduction with an attribute order. In 1992, A. Skowron proposed an algorithm for attribute reduction based on the discernibility matrix; its time complexity is t = O(2^m × n^2) and its space complexity is s = O(n^2 × m), where m is the number of attributes and n is the number of objects [11].
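To make Skowron's discernibility-matrix approach concrete, the following is a minimal sketch (not the paper's own implementation): a decision table is assumed to be a list of condition-attribute tuples plus a decision list, the matrix records which attributes distinguish each pair of objects with different decisions, and a smallest attribute subset hitting every matrix entry is found by exhaustive search, which is what makes the method exponential in m.

```python
from itertools import combinations

def discernibility_matrix(objects, decisions):
    """Skowron's discernibility matrix: for each pair of objects with
    different decisions, record the set of condition attributes on
    which the two objects differ (O(n^2 * m) pairwise comparisons)."""
    n = len(objects)
    matrix = []
    for i in range(n):
        for j in range(i + 1, n):
            if decisions[i] != decisions[j]:
                diff = {a for a, (x, y)
                        in enumerate(zip(objects[i], objects[j])) if x != y}
                if diff:
                    matrix.append(diff)
    return matrix

def covers(attrs, matrix):
    # An attribute set is a reduct candidate if it intersects every entry.
    return all(attrs & entry for entry in matrix)

def minimal_reduct(objects, decisions):
    """Search attribute subsets in order of size for the smallest one
    that discerns all decision-differing pairs -- exponential in the
    number of attributes m, matching the O(2^m * n^2) time bound."""
    m = len(objects[0])
    matrix = discernibility_matrix(objects, decisions)
    for size in range(1, m + 1):
        for attrs in combinations(range(m), size):
            if covers(set(attrs), matrix):
                return set(attrs)
    return set(range(m))

# Toy decision table: 3 condition attributes, binary decision.
objs = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 0)]
dec = [1, 1, 0, 0]
print(minimal_reduct(objs, dec))  # → {0}
```

In this toy table, attribute 0 alone separates every pair of objects with different decisions, so the single-attribute reduct {0} is returned before any larger subset is tried.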
In 1995, X. H. Hu improved Skowron's algorithm and proposed a new attribute reduction algorithm with complexities t = O(n^2 × m^3) and s = O(n × m) [12]. In 1996, H. S. Nguyen proposed an algorithm for attribute reduction by sorting the decision table, with complexities t = O(m^2 × n × log n) and s = O(n + m) [13]. In 2002, G. Y. Wang proposed an algorithm for attribute reduction based on information entropy, with complexities t = O(m^2 × n × log n) and s = O(n × m) [14]. In 2003, S. H. Liu proposed an algorithm for attribute reduction by sorting and partitioning the universe, with complexities t = O(m^2 × n × log n) and s = O(n × m) [15]. In 2001, using Skowron's discernibility matrix, J. Wang proposed an algorithm for attribute reduction based on an attribute order, with complexities t = O(m × n^2) and s = O(m × n^2) [16]. In 2004, M. Zhao and J. Wang proposed an attribute reduction algorithm with a tree structure based on an attribute order, with complexities t = O(m^2 × n) and s = O(m × n) [17].
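The sorting-based algorithms above ([13], [15]) share one core step: partitioning the universe into equivalence classes of the indiscernibility relation by sorting object keys, which costs O(n log n) comparisons per attribute set instead of the O(n^2) pairwise checks a discernibility matrix needs. A minimal sketch of that step, under the same assumed table representation as above (not taken from any of the cited papers):

```python
def partition_by_sorting(objects, attrs):
    """Partition the universe into equivalence classes of IND(attrs)
    by sorting object indices on their attribute-value keys; objects
    with identical keys end up adjacent and form one class."""
    order = sorted(range(len(objects)),
                   key=lambda i: tuple(objects[i][a] for a in attrs))
    classes, current, prev_key = [], [], None
    for i in order:
        key = tuple(objects[i][a] for a in attrs)
        if key != prev_key and current:
            classes.append(current)  # key changed: close the class
            current = []
        current.append(i)
        prev_key = key
    if current:
        classes.append(current)
    return classes

objs = [(1, 0), (1, 0), (0, 1), (1, 1)]
print(partition_by_sorting(objs, [0]))     # → [[2], [0, 1, 3]]
print(partition_by_sorting(objs, [0, 1]))  # → [[2], [0, 1], [3]]
```

Running this once per candidate attribute set is what yields the O(m^2 × n × log n) time bounds quoted for the sorting-based reduction algorithms.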