Classification of Multi-Valued Attribute and Multi-Label Data
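Abstract
With the rapid development of computer, network, and database technology, more and more real-world applications involve multi-valued attributes and multi-label data, so classification algorithms for such data have become an active research topic in data mining and machine learning.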
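     Current research focuses mainly on classification algorithms for multi-label data and does not consider multi-valued attributes. Most algorithms also fail to learn the correlations between labels adequately, and in practice labeled samples are scarce and labeling is difficult. Together these pose many new challenges to traditional classification algorithms. The work of this thesis has three parts. (1) Five multi-valued attribute decomposition algorithms are proposed and combined with existing multi-label classification algorithms to build a learning framework for multi-valued, multi-label classification. Experiments compare the decomposition algorithms and confirm that decomposing by value order gives the best learning results. (2) Existing Bayesian network algorithms are improved, and a multi-label learning algorithm is proposed that combines the General Bayesian Network (GBN) and the Multi-net Bayesian Network (MBN). It effectively captures the correlations among multiple labels and considerably improves classification accuracy. (3) To address the scarcity of labeled samples in multi-label data, the shortcomings of algorithms based on label combination are analyzed in depth with reference to practical settings, and a hierarchical model of label combinations is built. Uncertainty-based active learning and confidence-based semi-supervised learning are proposed and applied alternately to select the most informative samples, yielding a hierarchical multi-label classifier. Experiments show that the method greatly improves the effectiveness and robustness of the multi-label classifier.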
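     The results of this thesis provide effective methods for learning the correlations among labels and for multi-label classification with few labeled samples. By incorporating multi-valued attribute decomposition, they also establish a new learning framework for classifying multi-valued, multi-label data.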
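     To make the idea of decomposing a multi-valued attribute concrete, the sketch below splits one attribute whose value is an ordered list into positional single-valued columns. This is only one plausible reading of decomposition by value order; the abstract does not define the five algorithms, so the function name `decompose_by_position`, the column naming scheme, and the use of `None` as padding are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def decompose_by_position(
    values: Sequence[Sequence[str]], name: str
) -> List[Dict[str, Optional[str]]]:
    """Split one multi-valued attribute into positional single-valued columns.

    Hypothetical helper: the k-th value of each record goes to column
    "<name>_k"; records with fewer values are padded with None.
    """
    # The widest record decides how many columns are produced.
    width = max((len(v) for v in values), default=0)
    rows: List[Dict[str, Optional[str]]] = []
    for v in values:
        row: Dict[str, Optional[str]] = {}
        for i in range(width):
            row[f"{name}_{i + 1}"] = v[i] if i < len(v) else None
        rows.append(row)
    return rows


# Three records with 2, 1 and 0 values for the attribute "color".
records = [["red", "blue"], ["green"], []]
table = decompose_by_position(records, "color")
```

     After such a step every column holds at most one value per record, so an ordinary multi-label classifier can, in principle, be trained on the result.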
With the rapid development of computer technology, the internet, and database systems, more and more applications involve multi-valued and multi-labeled datasets. Hence, multi-valued and multi-labeled classification has become an active topic for researchers in data mining and machine learning.
     At present, most existing research addresses multi-labeled classification without considering the multi-valued problem. Meanwhile, the correlations between different labels have not been studied adequately. Moreover, the lack of labeled samples leaves insufficient information to learn from during the training stage. All of these pose new challenges to traditional classifiers. This thesis makes three contributions. Firstly, it puts forward a new learning framework for multi-valued and multi-labeled classification by combining multi-value decomposition with multi-labeled classification algorithms. Five decomposition methods are proposed, and the Rank Order method performs best. Secondly, based on a study of Bayesian networks, this thesis constructs a multi-labeled Bayesian network that combines the General Bayesian Network and the Multi-net Bayesian Network. The proposed algorithm learns the correlations between labels more effectively and considerably improves classification accuracy. Thirdly, to address the lack of labeled samples, active learning and semi-supervised learning are applied alternately on a hierarchical model to train the multi-labeled classifier. Experimental results demonstrate that this algorithm greatly improves the effectiveness and robustness of the classifier.
     This thesis provides an effective way to learn correlations between different labels and to construct a robust classifier with a limited number of labeled samples. By combining multi-valued decomposition with multi-label classification algorithms, it builds a new learning framework for multi-valued and multi-labeled datasets.
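     The alternation between active learning and semi-supervised learning described above can be sketched as a single selection round over an unlabeled pool. This is a minimal illustration, not the thesis's algorithm: the uncertainty score (mean closeness of per-label probabilities to 0.5), the function names `uncertainty` and `split_pool`, and the parameters `query_k` and `confident_below` are all assumptions.

```python
from typing import Dict, List, Tuple


def uncertainty(probs: List[float]) -> float:
    """Mean closeness of per-label probabilities to 0.5 (1.0 = most uncertain)."""
    return sum(1.0 - 2.0 * abs(p - 0.5) for p in probs) / len(probs)


def split_pool(
    pool: Dict[str, List[float]], query_k: int, confident_below: float
) -> Tuple[List[str], List[str]]:
    """Pick samples for one round of the alternating scheme.

    Returns (ids to send to a human annotator, ids to self-label).
    The query_k most uncertain samples are queried (active learning);
    of the rest, those whose uncertainty is at most confident_below are
    labeled with the classifier's own predictions (semi-supervised step).
    """
    ranked = sorted(pool, key=lambda sid: uncertainty(pool[sid]), reverse=True)
    to_query = ranked[:query_k]
    to_self_label = [
        sid for sid in ranked[query_k:] if uncertainty(pool[sid]) <= confident_below
    ]
    return to_query, to_self_label


# Hypothetical predicted probabilities for two labels on three unlabeled samples.
pool = {
    "a": [0.5, 0.5],    # classifier has no idea: best candidate for annotation
    "b": [1.0, 0.0],    # classifier is certain: safe to self-label
    "c": [0.75, 0.25],  # in between: left in the pool this round
}
to_query, to_self_label = split_pool(pool, query_k=1, confident_below=0.25)
```

     In a full loop the classifier would be retrained after each round and the pool rescored, so samples left undecided in one round may be queried or self-labeled later.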
