Confidence Interval Method for Classification Usability Evaluation of Data Sets (数据集分类可用性评估的置信区间方法)
  • Title (English): Confidence Interval Method for Classification Usability Evaluation of Data Sets
  • Authors: TAN Xun-tao (谈询滔); GU Yi-yi (顾依依); RUAN Tong (阮彤); YUAN Yu-bo (袁玉波)
  • Affiliation: Department of Computer Science and Engineering, East China University of Science and Technology
  • Keywords: data usability; classification system; interval analysis; information granulation; classification usability
  • Journal code: JSJA
  • Journal: Computer Science
  • Publication date: 2019-01-15
  • Publisher: Computer Science (计算机科学)
  • Year: 2019
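  • Volume: 46
  • Issue: 01
  • Pages: 85-92
  • Page count: 8
  • Funding: National Natural Science Foundation of China (61772201); Science and Technology Commission of Shanghai Municipality (16511101000, 17DZ11011003)
  • Language: Chinese
  • Record ID: JSJA201901013
  • CN: 50-1075/TP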
Abstract
Effectively evaluating the usability of a training data set has long been a difficult problem that hinders the application of intelligent classification systems. For data classification problems in machine learning, this paper proposes a method, based on interval analysis and information granulation, for evaluating the classification usability of a data set, that is, how separable the data set is. The method defines the data set under evaluation as a classification information system, introduces the concept of a classification confidence interval, and performs information granulation through interval analysis. Under this granulation strategy, the paper defines a mathematical model of classification usability and gives methods for computing the classification usability of a single attribute and of the whole data set. Eighteen UCI standard data sets were selected as evaluation objects, and classification usability results are reported for some of them. Three classifiers were then applied to the selected data sets, and analysis of these experimental results demonstrates the effectiveness and feasibility of the evaluation method.
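The abstract does not state the paper's definitions, so the following is only an illustrative sketch of the general idea of interval-based granulation, not the authors' model. It assumes a per-class interval of mean ± k·std for each attribute, and scores an attribute by the fraction of samples that fall inside their own class interval and no other. The function names, the parameter `k`, and both scoring rules are hypothetical.

```python
from statistics import mean, pstdev


def class_intervals(values, labels, k=2.0):
    """Assumed interval per class for one attribute: [mean - k*std, mean + k*std]."""
    by_class = {}
    for v, y in zip(values, labels):
        by_class.setdefault(y, []).append(v)
    return {
        y: (mean(vs) - k * pstdev(vs), mean(vs) + k * pstdev(vs))
        for y, vs in by_class.items()
    }


def attribute_usability(values, labels, k=2.0):
    """Hypothetical score: share of samples covered by their own class interval only."""
    intervals = class_intervals(values, labels, k)
    hits = 0
    for v, y in zip(values, labels):
        covering = [c for c, (lo, hi) in intervals.items() if lo <= v <= hi]
        if covering == [y]:
            hits += 1
    return hits / len(values)


def dataset_usability(rows, labels, k=2.0):
    """Hypothetical aggregate: mean of the per-attribute scores."""
    columns = list(zip(*rows))
    return mean([attribute_usability(col, labels, k) for col in columns])


# Toy data: the first attribute separates the classes, the second does not.
labels = ["a", "a", "a", "b", "b", "b"]
rows = [
    (1.0, 1.0), (1.2, 2.0), (0.8, 3.0),
    (5.0, 1.5), (5.2, 2.5), (4.8, 3.5),
]
per_attribute = [attribute_usability(col, labels) for col in zip(*rows)]
overall = dataset_usability(rows, labels)
```

Under these assumptions, an attribute whose class intervals do not overlap scores high and one whose intervals overlap heavily scores low; the paper's actual confidence-interval and usability definitions may differ.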
