Random Forest Method and Application in Stream Big Data Systems
  • Original title: 流式大数据下随机森林方法及应用
  • Authors: Liu Yingchun (刘迎春); Chen Meiling (陈梅玲)
  • Affiliation: School of Economics and Management, Beihang University
  • Keywords: decision tree; random forest; big data; stream computing; social network; search engine; classifier; pruning; customer rating; distributed system
  • Journal: Journal of Northwestern Polytechnical University (西北工业大学学报)
  • Journal code: XBGD
  • Publication date: 2015-12-15
  • Year: 2015
  • Volume and issue: v.33; No.156
  • Language: Chinese
  • Record ID: XBGD201506033
  • Pages: 184-190 (7 pages)
  • Issue number: 06
  • CN: 61-1070/T
Abstract
Stream computing is an important form of big data computing, yet big data analysis in stream settings remains an open problem with few research results and little practical experience. The random forest method is one of the most widely applied classification algorithms, but in practice it faces not only a huge number of features but also data patterns that change constantly over time. In stream computing scenarios, the real-time, volatile, and unordered nature of the data gradually reduces the accuracy of a random forest that lacks self-renewal and adaptation. To address this problem, the paper analyzes the characteristics of the random forest algorithm and proposes pruning the forest according to the accuracy of its decision trees. To adapt to changes in the data, it uses the concept of the accuracy margin to generate, validate, and add new decision trees, yielding a random forest that updates itself constantly and can be applied in stream big data environments. Experiments on actual data verify the feasibility of the improved method and show higher classification accuracy in a real stream big data scenario. The paper closes by discussing how to further research and improve the random forest method in big data environments.
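The prune-and-replenish idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the one-split "stump" trees, the threshold `min_acc`, the limits `max_trees` and `n_candidates`, the half-and-half train/validation split, and the rule "accept a candidate only if the mean margin does not drop" are all hypothetical choices made for the example. The margin of a sample is taken here as the share of votes for the true class minus the largest share for any other class.

```python
import random
from collections import Counter

def train_stump(X, y, rng):
    """Fit a one-split tree on a bootstrap sample, using one randomly chosen feature."""
    idx = [rng.randrange(len(X)) for _ in X]       # bootstrap sample of row indices
    f = rng.randrange(len(X[0]))                   # random feature choice
    best = None
    for t in sorted({X[i][f] for i in idx}):       # candidate thresholds
        left = Counter(y[i] for i in idx if X[i][f] <= t)
        right = Counter(y[i] for i in idx if X[i][f] > t)
        hits = max(left.values()) + max(right.values(), default=0)
        if best is None or hits > best[0]:
            l_lab = left.most_common(1)[0][0]
            r_lab = right.most_common(1)[0][0] if right else l_lab
            best = (hits, t, l_lab, r_lab)
    _, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab

def accuracy(tree, X, y):
    return sum(tree(x) == label for x, label in zip(X, y)) / len(y)

def mean_margin(trees, X, y):
    """Average of (votes for true class - most votes for another class) / n_trees."""
    total = 0.0
    for x, label in zip(X, y):
        votes = Counter(t(x) for t in trees)
        other = max((c for k, c in votes.items() if k != label), default=0)
        total += (votes[label] - other) / len(trees)
    return total / len(y)

def predict(trees, x):
    """Majority vote of the forest."""
    return Counter(t(x) for t in trees).most_common(1)[0][0]

def update_forest(trees, X, y, rng, min_acc=0.7, max_trees=25, n_candidates=20):
    """One update step on a newly labeled batch (all thresholds are assumptions)."""
    # 1. Prune: drop trees whose accuracy on the new batch fell below min_acc.
    kept = [t for t in trees if accuracy(t, X, y) >= min_acc]
    # 2. Split the batch: first half trains candidates, second half validates them.
    half = len(X) // 2
    Xtr, ytr, Xva, yva = X[:half], y[:half], X[half:], y[half:]
    # 3. Replenish: keep a candidate only if it is accurate enough on the
    #    validation half and does not lower the forest's mean margin there.
    for _ in range(n_candidates):
        if len(kept) >= max_trees:
            break
        cand = train_stump(Xtr, ytr, rng)
        if accuracy(cand, Xva, yva) < min_acc:
            continue
        if not kept or mean_margin(kept + [cand], Xva, yva) >= mean_margin(kept, Xva, yva):
            kept.append(cand)
    return kept

def make_batch(rng, n, flipped):
    """Hypothetical demo data: the label depends on feature 0; drift flips the rule."""
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [int((x[0] > 0.5) != flipped) for x in X]
    return X, y

rng = random.Random(0)
forest = []
for flipped in (False, True):                      # the second batch simulates drift
    X, y = make_batch(rng, 200, flipped)
    forest = update_forest(forest, X, y, rng)
    Xt, yt = make_batch(rng, 200, flipped)
    acc = sum(predict(forest, x) == label for x, label in zip(Xt, yt)) / len(yt)
    print(len(forest), round(acc, 3))
```

In the demo, trees fitted before the drift score poorly on the flipped batch and are pruned, and the forest is refilled from candidates trained on the new data. A real deployment would presumably use full decision trees and tune the thresholds on the stream at hand.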
