Research on Classifiers Based on Decision Boundaries from the Perspective of Dividing the Data Space
Abstract
Classifiers are an important technique in machine learning. Classifier research adopts two perspectives: the mapping perspective and the dividing perspective. From the mapping perspective, a classification model is viewed as a mapping from the data space to the label set, and training a classifier is viewed as searching the hypothesis space for the optimal hypothesis. From the dividing perspective, a classification model is viewed as a group of decision boundaries that divide the data space into a number of decision regions, and training a classifier is viewed as dividing the data space to obtain those decision boundaries. The mapping perspective is the mainstream, and many classifier studies have been carried out under it, whereas no systematic study of classifiers from the dividing perspective has yet been reported. This dissertation therefore studies classifiers from the dividing perspective with decision boundaries as the tool, in two respects: constructing a theoretical framework for studying classifiers based on decision boundaries from the perspective of dividing the data space, and improving classifiers on the basis of this framework.
The main research work of this dissertation is as follows:
1) Formal definitions of the decision boundary, the decision region and the probability gradient region are given. Two approaches to obtaining decision boundaries are proposed: a formal method and a sampling method. The Decision Boundary Point Set (DBPS) algorithm, the Decision Boundary Point Set using Grid for 2-D data (DBPSG-2D) algorithm and the Decision Boundary Neuron Set (DBNS) algorithm are proposed to obtain sampling points near decision boundaries (a sampling sketch is given after this list). The Self-Organizing Mapping based Decision Boundary Visualization (SOMDBV) algorithm and the Self-Organizing Mapping based Probability Gradient Regions Visualization (SOMPGRV) algorithm are proposed to visualize decision boundaries and probability gradient regions, respectively.
2) A theoretical framework of three elements and nine factors for classifiers, based on decision boundaries from the perspective of dividing the data space, is proposed. In this framework, the dividing objective, the decision boundary form and the dividing method are the three elements of a classifier. The dividing objective involves three factors: the training accuracy, the characteristics of misclassified instances and the micro-location of decision boundaries. The decision boundary form involves three factors: the dividing capability, the domain knowledge provided and the comprehensibility. The dividing method involves three factors: the information utilized, the dividing pattern and the complexity of the classification model.
3) A characteristic of misclassified instances based on the K nearest Neighbors (KN) type is proposed. According to the label relationship between an instance and its K nearest neighbors, the KN type divides instances into three classes: S-type, DS-type and D-type. On the KN type, the misclassification characteristics of the C4.5 algorithm, the Naive Bayes classifier and the Support Vector Machine (SVM) differ significantly from that of the K Nearest Neighbors (KNN) algorithm. The K nearest Neighbors Combining (KNC) algorithm is proposed to combine the KNN algorithm with C4.5, Naive Bayes or SVM: KNC uses KNN to predict S-type and DS-type instances and uses the partner classifier to predict D-type instances (a sketch of this combination is given after the list).
4) The impact of discretization algorithms on classifiers' decision boundaries is studied. It is proposed that discretization algorithms improve the generalization ability of the Naive Bayes classifier because they raise its Vapnik-Chervonenkis (VC) dimension. Discretization algorithms are also applied to the SVM and the KNN algorithm, and their impact on the VC dimensions of these two classifiers is examined (a discretization sketch follows the list).
5) The Second Division (SD) algorithm, which trains classifiers within the decision regions of the Naive Bayes classifier, is proposed, and existing algorithms for training local classifiers are surveyed. The SD algorithm combines global learning with local learning and can therefore improve the generalization ability of the Naive Bayes classifier (a sketch follows the list). Existing local classifier training algorithms are divided into three types: test selection, divide-and-conquer and training selection. It is proposed that training local classifiers improves the generalization ability of a classifier because it raises the classifier's VC dimension and exploits more of the information in the training data set.
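To make the boundary-sampling idea in contribution 1) concrete, here is a minimal sketch in the spirit of the sampling method, not the dissertation's DBPS, DBPSG-2D or DBNS algorithms: for each instance it picks the nearest instance with a different predicted label and bisects the segment between them until the midpoint straddles the trained classifier's decision boundary. The function name, the classifier and the toy data set are illustrative choices.

```python
# A hedged, minimal boundary-sampling sketch (not the dissertation's DBPS algorithm):
# bisect segments whose endpoints receive different predicted labels.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def boundary_points(clf, X, n_steps=20):
    """Return one point near clf's decision boundary per instance that has an opposite-label partner."""
    y_pred = clf.predict(X)
    points = []
    for i, xi in enumerate(X):
        others = X[y_pred != y_pred[i]]          # instances predicted differently
        if len(others) == 0:
            continue
        xj = others[np.argmin(np.linalg.norm(others - xi, axis=1))]
        a, b = xi.copy(), xj.copy()              # a keeps xi's label, b keeps the other one
        for _ in range(n_steps):                 # bisection shrinks the segment
            mid = (a + b) / 2
            if clf.predict(mid.reshape(1, -1))[0] == y_pred[i]:
                a = mid
            else:
                b = mid
        points.append((a + b) / 2)               # midpoint lies close to the boundary
    return np.array(points)

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(gamma=2.0).fit(X, y)
print(boundary_points(clf, X).shape)
```

Plotting such points over the data gives a direct picture of where a classifier places its boundary, which is the kind of near-boundary evidence a visualization step can then project.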
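The next sketch illustrates one possible reading of the KN type and the KNC combination from contribution 3): the difficulty of a test instance is judged from the label agreement among its K nearest training neighbours, KNN predicts the easy (S-/DS-type) instances, and a second classifier, here a decision tree standing in for C4.5, predicts the hard (D-type) ones. The purity threshold and the helper name are assumptions, not the dissertation's exact definitions.

```python
# A hedged KNC-style combiner; the S/DS/D assignment below is an illustrative proxy.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.tree import DecisionTreeClassifier  # stands in for C4.5

def neighbourhood_purity(X_train, y_train, X, k=5):
    """Fraction of the k nearest training neighbours carrying the neighbourhood's majority label."""
    idx = NearestNeighbors(n_neighbors=k).fit(X_train).kneighbors(X, return_distance=False)
    labels = y_train[idx]                                    # shape (n_samples, k)
    return np.array([np.bincount(row).max() for row in labels]) / k

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

purity = neighbourhood_purity(X_tr, y_tr, X_te, k=5)
d_type = purity <= 0.5                     # impure neighbourhood -> treat as D-type
y_pred = np.where(d_type, tree.predict(X_te), knn.predict(X_te))
print("combined accuracy:", (y_pred == y_te).mean())
```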
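For contribution 4), the following sketch reproduces the discretize-then-Naive-Bayes setting: continuous features are binned and a categorical Naive Bayes model is fitted on the bin codes, compared with Gaussian Naive Bayes on the raw features. The equal-frequency binning used here is only one common choice, not necessarily one of the discretization algorithms examined in the dissertation.

```python
# Discretization before Naive Bayes: a minimal comparison sketch.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_wine(return_X_y=True)

gaussian_nb = GaussianNB()                           # raw continuous features
discretized_nb = make_pipeline(                      # binned features
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
    CategoricalNB(),
)

print("Gaussian NB   :", cross_val_score(gaussian_nb, X, y, cv=10).mean())
print("Discretized NB:", cross_val_score(discretized_nb, X, y, cv=10).mean())
```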
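Finally, a minimal sketch of the second-division idea from contribution 5), as read from the abstract: a global Naive Bayes model first divides the data space into decision regions (one per predicted class), a local classifier is then trained on the training instances falling in each region, and prediction routes each test instance to the local model of its region, falling back to the global prediction where a region is too small or already pure. The class name, thresholds and fallback policy are illustrative, not the dissertation's SD algorithm.

```python
# A hedged sketch of training local classifiers inside a global NB model's decision regions.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

class SecondDivisionSketch:
    def __init__(self, local_template, min_local=10):
        self.local_template = local_template
        self.min_local = min_local

    def fit(self, X, y):
        self.global_nb = GaussianNB().fit(X, y)
        regions = self.global_nb.predict(X)          # region = class predicted by global NB
        self.local_models = {}
        for r in np.unique(regions):
            mask = regions == r
            # Only train a local model where the region is big enough and still mixes classes.
            if mask.sum() >= self.min_local and len(np.unique(y[mask])) > 1:
                self.local_models[r] = clone(self.local_template).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        regions = self.global_nb.predict(X)
        y_pred = regions.copy()                      # fallback: the global NB prediction itself
        for r, model in self.local_models.items():
            mask = regions == r
            if mask.any():
                y_pred[mask] = model.predict(X[mask])
        return y_pred

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
sd = SecondDivisionSketch(DecisionTreeClassifier(max_depth=3, random_state=0)).fit(X_tr, y_tr)
print("global NB       :", (GaussianNB().fit(X_tr, y_tr).predict(X_te) == y_te).mean())
print("NB + local trees:", (sd.predict(X_te) == y_te).mean())
```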
