Research on a Classification Model for Large Data Sets Based on GEP and RS
Abstract
Classification is a form of data analysis that extracts, from large amounts of data, a model describing the objects in that data. Because it uses a model learned from known data to predict new data, classification is a supervised learning process. A good classification rule helps us understand a class better and make effective use of the data in it. Classification is one of the most important tasks in data mining: it extracts a classification model by analyzing known data and then uses that model to map each record still to be classified to the specified classification rules. Classification has been widely applied in machine learning, neural networks, and performance prediction.
     In practice, training sets are mostly continuous, noisy, and incomplete, which tends to reduce classification accuracy. To improve accuracy, this thesis first discretizes continuous data with the critical-value equal-width interval discretization method. It then combines rough set (RS) theory, which can handle incomplete, redundant, and missing knowledge and is well suited to classifying knowledge, with the evolutionary strategy of gene expression programming (GEP). The focus is on removing redundant and incomplete data at the data preprocessing layer, for which the thesis proposes an algorithm called Attribute Reduction of Rough Set Based on Gene Expression Programming (ARRS_GEP). Finally, to address the problem that current classification rule extraction produces too many rules, the thesis proposes a new classification model comprising data preparation, data preprocessing, rule extraction, rule testing, and rule evaluation. The main work of this thesis is as follows:
     (1) It systematically reviews classification, gene expression programming, and rough set theory together with their current state of research; gives a detailed account of attribute reduction, the core problem of rough set theory; points out the shortcomings of reduction based on genetic algorithms; and compares genetic algorithms with gene expression programming to identify the differences between these two evolutionary algorithms.
     (2) Building on a theoretical analysis of gene expression programming, it studies how to improve attribute reduction and proposes a GEP-based reduction algorithm, ARRS_GEP. Experiments with different reduction methods verify the effectiveness of ARRS_GEP.
     (3) Many classification algorithms, rough sets among them, require discrete data. To address this, the thesis discretizes continuous features with the critical-value equal-width interval discretization method. It also analyzes the problem of noisy data in classification rule extraction and proposes applying the ARRS_GEP reduction algorithm at the preprocessing layer, using operations such as crossover, mutation, recombination, and insertion sequence transposition, to reduce the condition attributes. A classification algorithm then extracts rules from the reduced data.
     (4) The proposed classification model is validated by predicting the failure of listed companies in a given year. The experiments show that the model reduces the complexity of the classification rules: the extracted rules are simple and involve few attributes. This indicates that the model is effective for knowledge reduction and rule extraction.
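The two preprocessing ideas named in the abstract, equal-width discretization and rough-set attribute reduction, can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the thesis's method: the abstract does not define the "critical-value" variant of the discretization, so plain equal-width binning is shown; it does not give the fitness function used by ARRS_GEP, so the standard rough-set dependency degree is shown as one quantity such a fitness might evaluate for a candidate attribute subset; and the attribute names (`ratio_a`, `ratio_b`) and the toy values are invented for illustration and are not data from the thesis.

```python
from collections import defaultdict


def equal_width_bins(values, k):
    """Map each value to a bin index 0..k-1 using k equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # min(..., k - 1) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def dependency_degree(rows, decisions, attrs):
    """Rough-set dependency degree of the decision on `attrs`.

    Objects with equal values on `attrs` form an equivalence class; a class
    counts toward the positive region only if all of its objects share one
    decision value. The result is |positive region| / |universe|.
    """
    classes = defaultdict(list)
    for row, d in zip(rows, decisions):
        classes[tuple(row[a] for a in attrs)].append(d)
    positive = sum(len(ds) for ds in classes.values() if len(set(ds)) == 1)
    return positive / len(rows)


# Hypothetical toy table (not from the thesis): two continuous condition
# attributes and a binary decision attribute.
raw = {
    "ratio_a": [0.10, 0.15, 0.80, 0.90, 0.50, 0.55],
    "ratio_b": [3.0, 9.0, 3.5, 8.5, 3.2, 9.1],
}
failed = [1, 1, 0, 0, 1, 0]

# Step 1: discretize every continuous column into 3 equal-width bins.
columns = {name: equal_width_bins(vals, 3) for name, vals in raw.items()}
rows = [dict(zip(columns, col)) for col in zip(*columns.values())]

# Step 2: score candidate attribute subsets by dependency degree.
for subset in (["ratio_a", "ratio_b"], ["ratio_a"], ["ratio_b"]):
    print(subset, dependency_degree(rows, failed, subset))
```

In this toy table the full attribute set determines the decision completely, while either attribute alone does not, so neither attribute is redundant. An evolutionary search such as the one the abstract describes would explore many candidate subsets and prefer small ones that keep the dependency degree of the full set.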
