Quantity Optimization of Virtual Sample Generation with Two Kinds of Upper Bound Conditions
  • Original title: 具有两类上限条件的虚拟样本生成数量优化
  • Authors: LIN Yue (林越); LIU Tingzhang (刘廷章); WANG Zhehe (王哲河)
  • Affiliations: College of Science, Hainan Tropical Ocean University; College of Automation, Shanghai University
  • Keywords: small sample; machine learning; virtual sample; information entropy; confidence level
  • Journal code: GXSF
  • Journal: Journal of Guangxi Normal University (Natural Science Edition)
  • Publication date: 2019-01-10
  • Publisher: Journal of Guangxi Normal University (Natural Science Edition)
  • Year: 2019
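  • Volume: 37
  • Issue: 01
  • Funding: National Natural Science Foundation of China (61273190); Natural Science Foundation of Hainan Province (117150)
  • Language: Chinese
  • Article ID: GXSF201901016
  • Pages: 146-152
  • Page count: 7
  • CN: 45-1067/N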
Abstract
For small-sample data sets, virtual sample generation (VSG) has been shown to improve the performance of machine learning algorithms, but there is no definite conclusion about the optimal number of virtual samples to generate. This paper first uses information entropy theory to study the optimal number of virtual samples under a given upper bound on the standard variance of the training samples. Second, it takes the noise introduced by the virtual samples into account and establishes a general probability model and analysis method for the optimal number of virtual samples at a given confidence level (0.95). Finally, a small-sample data set is built from the 2016 historical fault-monitoring data of a substation in Huzhou, Zhejiang, and four virtual sample generation experiments are designed. The results show that both rules for the optimal number of virtual samples are effective and that the prediction accuracy of the corresponding machine learning models improves.
