Quick Knowledge Reduction Based on Divide and Conquer Method in Huge Data Sets
Abstract
The problem of processing huge data sets has been studied for many years, and many effective technologies and methods have been developed for it. Random sampling [1] was proposed by Catlett in 1991, but it fails when the number of samples exceeds 32,000. K. C. Philip first divided a big data set into subsets that fit in memory, and then developed a classifier for each subset in parallel [2]; however, its accuracy is lower than that of methods processing the data set as a whole. SLIQ [3] and SPRINT [4], developed by the IBM Almaden Research Center in 1996, are two important algorithms able to deal with disk-resident data directly, with performance equivalent to that of classical decision tree algorithms. Many other improved algorithms, such as CLOUDS [5] and ScalParC [6], were developed later. RainForest [7] is a framework for fast decision tree construction on large data sets; its speed and effectiveness are better than SPRINT's in some cases. L. A. Ren, Q. He and Z. Z. Shi used hyper-surface separation and the HSC classification method to classify huge data sets and achieved good performance [8, 9].

Rough Set (RS) theory [10] is a valid mathematical theory for dealing with imprecise, uncertain, and vague information. Since it was proposed by Professor Z. Pawlak in 1982, it has been applied successfully in fields such as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. Attribute reduction is a key issue in rough set based knowledge acquisition, and many researchers have proposed algorithms for it. These reduction algorithms can be classified into two categories: reduction without an attribute order and reduction with an attribute order. In 1992, A. Skowron proposed an algorithm for attribute reduction based on the discernibility matrix; its time complexity is t = O(2^m × n^2) and its space complexity is s = O(n^2 × m), where m is the number of attributes and n is the number of objects [11].
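To make Skowron's discernibility-matrix approach concrete, the following is a minimal sketch (not the paper's own implementation): a decision table is assumed to be a list of condition-attribute tuples plus a decision list, the matrix records which attributes distinguish each pair of objects with different decisions, and a smallest attribute subset hitting every matrix entry is found by exhaustive search, which is what makes the method exponential in m.

```python
from itertools import combinations

def discernibility_matrix(objects, decisions):
    """Skowron's discernibility matrix: for each pair of objects with
    different decisions, record the set of condition attributes on
    which the two objects differ (O(n^2 * m) pairwise comparisons)."""
    n = len(objects)
    matrix = []
    for i in range(n):
        for j in range(i + 1, n):
            if decisions[i] != decisions[j]:
                diff = {a for a, (x, y)
                        in enumerate(zip(objects[i], objects[j])) if x != y}
                if diff:
                    matrix.append(diff)
    return matrix

def covers(attrs, matrix):
    # An attribute set is a reduct candidate if it intersects every entry.
    return all(attrs & entry for entry in matrix)

def minimal_reduct(objects, decisions):
    """Search attribute subsets in order of size for the smallest one
    that discerns all decision-differing pairs -- exponential in the
    number of attributes m, matching the O(2^m * n^2) time bound."""
    m = len(objects[0])
    matrix = discernibility_matrix(objects, decisions)
    for size in range(1, m + 1):
        for attrs in combinations(range(m), size):
            if covers(set(attrs), matrix):
                return set(attrs)
    return set(range(m))

# Toy decision table: 3 condition attributes, binary decision.
objs = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 0)]
dec = [1, 1, 0, 0]
print(minimal_reduct(objs, dec))  # → {0}
```

In this toy table, attribute 0 alone separates every pair of objects with different decisions, so the single-attribute reduct {0} is returned before any larger subset is tried.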
In 1995, X. H. Hu improved Skowron's algorithm and proposed a new attribute reduction algorithm with complexities t = O(n^2 × m^3) and s = O(n × m) [12]. In 1996, H. S. Nguyen proposed an algorithm for attribute reduction by sorting the decision table, with complexities t = O(m^2 × n × log n) and s = O(n + m) [13]. In 2002, G. Y. Wang proposed an algorithm for attribute reduction based on information entropy, with complexities t = O(m^2 × n × log n) and s = O(n × m) [14]. In 2003, S. H. Liu proposed an algorithm for attribute reduction by sorting and partitioning the universe, with complexities t = O(m^2 × n × log n) and s = O(n × m) [15]. In 2001, using Skowron's discernibility matrix, J. Wang proposed an algorithm for attribute reduction based on an attribute order, with complexities t = O(m × n^2) and s = O(m × n^2) [16]. In 2004, M. Zhao and J. Wang proposed an attribute reduction algorithm with a tree structure based on an attribute order, with complexities t = O(m^2 × n) and s = O(m × n) [17].
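The sorting-based algorithms above ([13], [15]) share one core step: partitioning the universe into equivalence classes of the indiscernibility relation by sorting object keys, which costs O(n log n) comparisons per attribute set instead of the O(n^2) pairwise checks a discernibility matrix needs. A minimal sketch of that step, under the same assumed table representation as above (not taken from any of the cited papers):

```python
def partition_by_sorting(objects, attrs):
    """Partition the universe into equivalence classes of IND(attrs)
    by sorting object indices on their attribute-value keys; objects
    with identical keys end up adjacent and form one class."""
    order = sorted(range(len(objects)),
                   key=lambda i: tuple(objects[i][a] for a in attrs))
    classes, current, prev_key = [], [], None
    for i in order:
        key = tuple(objects[i][a] for a in attrs)
        if key != prev_key and current:
            classes.append(current)  # key changed: close the class
            current = []
        current.append(i)
        prev_key = key
    if current:
        classes.append(current)
    return classes

objs = [(1, 0), (1, 0), (0, 1), (1, 1)]
print(partition_by_sorting(objs, [0]))     # → [[2], [0, 1, 3]]
print(partition_by_sorting(objs, [0, 1]))  # → [[2], [0, 1], [3]]
```

Running this once per candidate attribute set is what yields the O(m^2 × n × log n) time bounds quoted for the sorting-based reduction algorithms.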