Research on Feature Selection and Model Optimization Algorithms for Random Forests
Abstract
The rise of ensemble learning has opened a new research direction for the design of classification methods. Random forest is a classifier ensemble method that gradually took shape among the many ensemble approaches; it is widely used in practice and has become a research topic of common interest to researchers in data mining, artificial intelligence, machine learning, and pattern recognition, as well as to practitioners in engineering applications.

Random forests perform well at reducing the generalization error of a classification system and at simplifying classifier design, but the method is not perfect: practical applications still call for further improvements in accuracy and further reductions in generalization error.

Building on an introduction to the state of the art and the algorithmic ideas of ensemble learning and random forests, this thesis analyzes the strengths and weaknesses of random forests, proposes several improvements, and supports them with extensive experiments. The main contributions are as follows:
(1) Based on an analysis of the relationship between the strength and the correlation of a random forest ensemble, a new feature selection algorithm is proposed. To lower the upper bound of the generalization error and improve the overall performance of the forest, and after weighing how strength and correlation affect each other, the chi-square test is used to assess the relevance of each feature, and features are then drawn at random from the feature space in a differentiated way according to the assessment. Experiments confirm that the method keeps all the advantages of the original algorithm while further lowering the error bound of the random forest and improving generalization accuracy (a minimal illustrative sketch follows the abstract).
(2) After examining the relationship between the performance of a single classification tree and that of the ensemble as a whole, the relationships among the trees themselves are analyzed further, and a model selection algorithm based on hierarchical clustering is designed. Trees that satisfy the similarity criterion are repeatedly agglomerated, and a representative tree is then sought from each cluster to take part in the forest ensemble. Similarity measures between trees are proposed and several of them are compared experimentally. The algorithm increases the diversity among the trees, so that the classification accuracy of the forest can be improved with fewer trees (a sketch of this clustering-based selection also follows the abstract).
(3) Having studied feature selection and model selection for random forests, the thesis closes by pointing out several directions that deserve further investigation, which may serve as guidance for future research on random forests.
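To make contribution (1) concrete, the sketch below illustrates the general idea under stated assumptions rather than reproducing the thesis's exact algorithm: every feature is scored once with the chi-square test, and the random feature subspace of each tree is then sampled with probabilities biased toward higher-scoring features. The names chi2_weighted_forest, subspace_size, and forest_predict are illustrative; the code relies on scikit-learn's chi2 and DecisionTreeClassifier and assumes non-negative feature values and integer class labels.

import numpy as np
from sklearn.feature_selection import chi2
from sklearn.tree import DecisionTreeClassifier

def chi2_weighted_forest(X, y, n_trees=50, subspace_size=None, random_state=0):
    """Grow a random-forest-style ensemble whose per-tree feature subspaces
    are sampled with chi-square-derived weights (illustrative sketch)."""
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X.shape
    if subspace_size is None:
        subspace_size = max(1, int(np.sqrt(n_features)))

    # Score every feature once; a higher chi-square score means a stronger
    # observed association between that feature and the class labels.
    scores, _ = chi2(X, y)
    scores = np.nan_to_num(scores) + 1e-12
    probs = scores / scores.sum()          # sampling weights over features

    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n_samples, size=n_samples)   # bootstrap sample
        # Differentiated random feature selection: bias the random subspace
        # toward features the chi-square test judged more relevant.
        cols = rng.choice(n_features, size=subspace_size, replace=False, p=probs)
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1 << 31)))
        tree.fit(X[np.ix_(rows, cols)], y[rows])
        forest.append((tree, cols))
    return forest

def forest_predict(forest, X):
    # Majority vote over the trees (assumes class labels 0, 1, ..., k-1).
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

Biasing the subspace toward relevant features is meant to raise the strength of the individual trees; how much bias can be applied without also raising the correlation between trees is exactly the trade-off the thesis studies.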
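Contribution (2) can be sketched in a similarly hedged way. In the code below (the names select_representative_trees and n_keep are assumptions, and this is not the thesis's exact procedure), the trees of a trained forest are compared by how often their predictions disagree on a held-out set, which is one simple instance of the between-tree similarity measures the thesis compares; the trees are then agglomerated with average-linkage hierarchical clustering, and the most accurate member of each cluster is kept as its representative.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestClassifier

def select_representative_trees(forest, X_val, y_val, n_keep=10):
    """Cluster the trees of a fitted forest by prediction disagreement and
    keep one representative tree per cluster (illustrative sketch)."""
    trees = forest.estimators_
    preds = np.stack([t.predict(X_val) for t in trees])      # (n_trees, n_val)

    # Distance between two trees = fraction of held-out points they disagree on.
    n = len(trees)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = np.mean(preds[i] != preds[j])

    # Average-linkage agglomerative clustering into n_keep clusters.
    labels = fcluster(linkage(squareform(dist, checks=False), method="average"),
                      t=n_keep, criterion="maxclust")

    # From each cluster, keep the member with the best validation accuracy.
    kept = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        accs = [np.mean(preds[m] == y_val) for m in members]
        kept.append(members[int(np.argmax(accs))])
    return [trees[k] for k in kept]

# Hypothetical usage on synthetic data with integer class labels.
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=600, n_features=20, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:400], y[:400])
    subset = select_representative_trees(rf, X[400:], y[400:], n_keep=10)
    print("kept", len(subset), "of", len(rf.estimators_), "trees")

Disagreement on held-out data is only one simple choice of measure; the thesis compares several similarity measures between trees, and the clustering and representative-selection steps apply unchanged to any of them.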
