Research on the Distance Metric Problem for Nominal Attributes and Its Applications
Abstract
Instance-based learning, including nearest-neighbor learning, locally weighted learning, and memory-based reasoning, relies on a good distance metric for its success. It is fair to say that the distance metric is the core of distance-based machine learning algorithms. In addition, distance metrics are widely used in pattern recognition, neural networks, statistics, cognitive psychology, and other fields. Measuring the distance between instances has therefore long been an important problem, and researchers have proposed many distance metrics, such as the Euclidean, Manhattan, Minkowski, Mahalanobis, and Canberra distances. However, these metrics are only suitable for numerical attributes and do not handle nominal attributes.
     Compared with numerical attributes, measuring distances between nominal attributes is a more complicated problem. To give a reasonable estimate of the distance between nominal attribute values, researchers have made considerable effort and proposed a number of metrics, such as the Overlap Metric (OM), the Value Difference Metric (VDM), the Modified Value Difference Metric (MVDM), the Short and Fukunaga Metric (SFM), the Minimum Risk Metric (MRM), the Entropy-Based Metric (EBM), and the Frequency-Based Metric (FBM).
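     To make the contrast concrete, the following minimal Python sketch (purely illustrative, not the thesis code) compares the Euclidean distance, which presupposes numerical attributes, with the Overlap Metric (OM), the simplest metric for nominal attributes.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance: only meaningful for numerical attribute vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def overlap(x, y):
    """Overlap Metric (OM): each nominal attribute contributes 0 on a match, 1 otherwise."""
    return sum(0 if a == b else 1 for a, b in zip(x, y))

print(euclidean([1.0, 2.0], [4.0, 6.0]))             # 5.0
print(overlap(["red", "round"], ["blue", "round"]))  # 1
```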
     In real-world problems, a large number of datasets involve nominal attributes, and distance metrics for nominal attributes are more complicated than those for numerical attributes. This thesis therefore focuses on the distance metric problem for nominal attributes. The main questions considered are the following: 1. How should the attribute independence assumption in distance metrics be understood?
     VDM is a widely used distance metric for nominal attributes, and, according to our analysis, it makes an attribute independence assumption: in VDM the distance between two instances is the sum of per-attribute distances, with no interaction between attributes. In fact, most distance metrics work this way; the simplest nominal metric, OM, as well as the numerical Euclidean and Hamming distances, all measure the distance between two instances as a simple sum of per-attribute distances. We argue that this simple summation is, in essence, an assumption that the attributes are mutually independent. Kasif et al. also pointed out that VDM makes the same attribute independence assumption as the naive Bayes classifier. Despite this unrealistic assumption, naive Bayes shows surprisingly good classification performance, and VDM remains one of the most widely used distance metrics for nominal attributes. The question, then, is how to understand and exploit this independence assumption to improve existing metrics, or to construct new ones, that are simple, comprehensible, and easy to compute.
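     The independence built into VDM is easy to see in code. The sketch below is a hedged illustration of one common simplified form of VDM (value-difference exponent q = 1, probabilities estimated by raw counts); the exact variant used in the thesis may differ.

```python
from collections import defaultdict

def vdm_tables(X, y):
    """Estimate P(c | attribute i takes value v) from training data by counting."""
    counts = defaultdict(lambda: defaultdict(float))   # (i, v) -> class -> count
    totals = defaultdict(float)                        # (i, v) -> count
    classes = set(y)
    for xi, c in zip(X, y):
        for i, v in enumerate(xi):
            counts[(i, v)][c] += 1
            totals[(i, v)] += 1
    return counts, totals, classes

def vdm(a, b, counts, totals, classes, q=1):
    """Simplified VDM: per-attribute differences of class-conditional
    probabilities are summed independently over all attributes."""
    d = 0.0
    for i, (va, vb) in enumerate(zip(a, b)):
        for c in classes:
            pa = counts[(i, va)][c] / totals[(i, va)] if totals[(i, va)] else 0.0
            pb = counts[(i, vb)][c] / totals[(i, vb)] if totals[(i, vb)] else 0.0
            d += abs(pa - pb) ** q
    return d
```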
     2. How can dependencies between attributes be expressed in a distance metric? In real data, attributes are usually dependent on one another, yet most distance metrics, like the naive Bayes classifier, assume attribute independence. Although naive Bayes classifies surprisingly well, its performance suffers to some extent when strong dependencies exist among attributes. Researchers have therefore improved naive Bayes with various techniques; one effective approach is structure extension, whose central idea is to add a limited number of directed edges to the naive Bayes model to express dependencies among attributes, yielding augmented Bayesian network classifiers. Many such classifiers have been proposed. By expressing dependencies with directed edges they relax the independence assumption to some extent and achieve better classification performance than naive Bayes. Since expressing attribute dependencies improves naive Bayes, a natural question is whether, following the augmented Bayesian network classifiers, the same dependencies can be introduced into distance metrics, so as to improve existing metrics or even construct new ones that perform better on data with strong attribute dependencies.
     3. How can the class membership probabilities in probability-based distance metrics be estimated as accurately as possible?
     Among the distance metrics for nominal attributes, some require probability estimation and are called probability-based distance metrics, for example VDM, MVDM, SFM, and MRM. A subset of these, such as the SFM and MRM mentioned above, further require estimates of the class membership probability P(c|x), the probability that instance x belongs to class c. For these metrics to succeed, estimating P(c|x) as accurately as possible is a key problem. Research has shown that fully estimating P(c|x) is equivalent to learning an optimal Bayesian network and is NP-hard. To reduce the computational cost, the existing literature approximates it with the naive Bayes classifier, which harms the performance of the metrics to some extent. Experiments on artificial datasets have shown that, if the class membership probabilities were known exactly, SFM and MRM could outperform VDM. In fact, studies have shown that naive Bayes is a poor class probability estimator (even though it is a good classifier), and improved Bayesian models have been proposed to strengthen its probability estimates. The question is whether these results on class probability estimation can be applied to probability-based distance metrics, improving the accuracy of the class membership probability estimates and thereby the performance of the corresponding metrics.
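     As an illustration of how such probability-based metrics are typically wired together, the sketch below estimates P(c|x) with a minimal hand-rolled naive Bayes and plugs the estimates into one commonly quoted form of MRM, the risk sum over classes of P(c|x)(1 - P(c|y)); both the estimator and the exact MRM formula here are assumptions for illustration, not the thesis implementation.

```python
from collections import Counter, defaultdict

class NominalNB:
    """Minimal naive Bayes for nominal attributes (Laplace smoothing),
    used here only to obtain class membership probabilities P(c | x)."""
    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.prior_ = {c: (sum(1 for t in y if t == c) + 1.0) / (len(y) + len(self.classes_))
                       for c in self.classes_}
        self.counts_ = defaultdict(Counter)   # (attribute index, class) -> value counts
        self.values_ = defaultdict(set)       # attribute index -> observed values
        for xi, c in zip(X, y):
            for i, v in enumerate(xi):
                self.counts_[(i, c)][v] += 1
                self.values_[i].add(v)
        return self

    def predict_proba(self, x):
        scores = {}
        for c in self.classes_:
            s = self.prior_[c]
            for i, v in enumerate(x):
                total = sum(self.counts_[(i, c)].values())
                s *= (self.counts_[(i, c)][v] + 1.0) / (total + len(self.values_[i]))
            scores[c] = s
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

def mrm(px, py):
    """One commonly quoted form of the Minimum Risk Metric:
    the risk of labelling x with the class of its neighbour y."""
    return sum(px[c] * (1.0 - py[c]) for c in px)

# nb = NominalNB().fit(X_train, y_train)
# d = mrm(nb.predict_proba(x1), nb.predict_proba(x2))
```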
     4. How can the curse of dimensionality be overcome?
     This thesis studies distance metrics, and a closely related problem is the curse of dimensionality, which has attracted wide attention. It refers to the fact that, when the data contain many redundant or irrelevant attributes, the performance of learning algorithms deteriorates. For distance metrics the trouble is that, when many irrelevant attributes exist and all attributes are used to compute inter-instance distances, the distances between neighbours are dominated by the irrelevant attributes, so the supposed nearest neighbours may in fact be far apart. One way to overcome this is attribute weighting: attributes that are more relevant to the class variable receive larger weights, suppressing the influence of irrelevant attributes on the distance. A stronger approach is attribute selection, which removes irrelevant attributes from the attribute space altogether. Both problems have been studied extensively in recent years, and a large body of weighting and selection methods already exists; this thesis continues that line of work specifically for nominal attribute distance metrics. For example, OM is widely used because of its simplicity: can attribute weighting improve its performance while preserving that simplicity? For the well-known VDM, which makes the attribute independence assumption, can attribute selection methods suited to that assumption be designed?
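     As a concrete, hedged example of the weighting idea (not the scheme actually adopted in the thesis), the sketch below weights each nominal attribute by its estimated mutual information with the class and uses the weights inside an Overlap-style distance, so that irrelevant attributes contribute little.

```python
import math
from collections import Counter

def mutual_information(column, y):
    """I(A; C) estimated from counts, used as an illustrative relevance weight."""
    n = len(y)
    pa, pc, pac = Counter(column), Counter(y), Counter(zip(column, y))
    mi = 0.0
    for (a, c), nac in pac.items():
        mi += (nac / n) * math.log((nac / n) / ((pa[a] / n) * (pc[c] / n)))
    return mi

def weighted_overlap(x, y_inst, weights):
    """Overlap distance in which every mismatch is scaled by the attribute's weight."""
    return sum(w * (0 if a == b else 1) for w, a, b in zip(weights, x, y_inst))

# One weight per attribute, computed once from the training data:
# weights = [mutual_information([row[i] for row in X_train], y_train)
#            for i in range(len(X_train[0]))]
```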
     As mentioned above, many algorithms in machine learning, pattern recognition, neural networks, statistics, and cognitive psychology involve distance metrics, and their performance depends on the metric used, for example the k-nearest neighbor (KNN) algorithm and its refinements, the distance-weighted k-nearest neighbor (KNNDW) algorithm and the locally weighted naive Bayes (LWNB) algorithm. The research on the preceding questions is expected to yield high-performance distance metrics, so it is particularly important to use these new metrics to improve such distance-related algorithms; the thesis studies this question in depth.
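     The dependence of such algorithms on the metric is visible in the sketch below, a hedged illustration of distance-weighted KNN (KNNDW) with a pluggable distance function `dist`; the weighting scheme 1/(d + eps) is an assumption for illustration, and any of the metrics above could be passed in.

```python
import heapq

def knndw_predict(x, train, k, dist, classes):
    """Distance-weighted KNN: the k nearest neighbours (under the supplied
    distance function) vote with weight 1 / (distance + eps)."""
    eps = 1e-6
    neighbours = heapq.nsmallest(k, ((dist(x, xi), yi) for xi, yi in train))
    votes = {c: 0.0 for c in classes}
    for d, yi in neighbours:
        votes[yi] += 1.0 / (d + eps)
    total = sum(votes.values())
    probs = {c: v / total for c, v in votes.items()}   # class probability estimates
    return max(probs, key=probs.get), probs

# label, probs = knndw_predict(x_query, list(zip(X_train, y_train)), 5, overlap, set(y_train))
```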
     In view of the questions raised above, this thesis takes nominal attribute distance metrics as its research object and studies and improves the existing metrics from several perspectives. The main work is as follows.
     1. The attribute independence assumption in distance metrics is studied.
     Although the attribute independence assumption of the naive Bayes classifier is well known, the corresponding assumption in distance metrics has not received wide attention. Chapter 2 discusses the assumption in the Value Difference Metric (VDM) in detail and shows that it is the same assumption made by naive Bayes. On this basis, taking the Short and Fukunaga Metric (SFM) as a prototype, a Modified Short and Fukunaga Metric (MSFM) is proposed. Experiments show that MSFM performs comparably to VDM and outperforms both SFM and SF2LOG, another modified version of SFM.
     2. Attribute dependencies are introduced into distance metrics.
     Augmented Bayesian network classifiers achieve better performance than naive Bayes by introducing attribute dependencies. Chapter 3 investigates, both theoretically and experimentally, the performance of naive Bayes and several augmented Bayesian network classifiers. These classifiers express dependencies among attributes with directed edges, relaxing the independence assumption and thereby improving on naive Bayes. Inspired by augmented Bayesian network models, the thesis introduces attribute dependencies into distance metrics: augmented Bayesian network classifiers are used to learn the dependencies, and the learned dependencies are used to construct the corresponding metric. Taking VDM as a prototype, a dependence-aware metric, the One Dependence Value Difference Metric (ODVDM), is proposed. Experiments show that ODVDM outperforms VDM on data with strong attribute dependencies.
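     To make the idea tangible, here is a hedged sketch of one plausible reading of such a one-dependence metric: each attribute's class-conditional probabilities are additionally conditioned on the value of a single parent attribute, with the parent assignment (e.g., from a TAN-style structure learner) supplied in `parents`. The exact formulation of ODVDM in the thesis may differ.

```python
from collections import defaultdict

def odvdm_tables(X, y, parents):
    """Count-based estimates of P(c | a_i = v, a_parent(i) = u); parents[i] is the
    index of attribute i's single parent, or None for a root attribute."""
    counts, totals, classes = defaultdict(lambda: defaultdict(float)), defaultdict(float), set(y)
    for xi, c in zip(X, y):
        for i, v in enumerate(xi):
            key = (i, v, None if parents[i] is None else xi[parents[i]])
            counts[key][c] += 1
            totals[key] += 1
    return counts, totals, classes

def odvdm(a, b, counts, totals, classes, parents, q=1):
    """One-dependence variant of VDM: the per-attribute value difference uses
    probabilities conditioned on the parent attribute's value in each instance."""
    d = 0.0
    for i, (va, vb) in enumerate(zip(a, b)):
        ka = (i, va, None if parents[i] is None else a[parents[i]])
        kb = (i, vb, None if parents[i] is None else b[parents[i]])
        for c in classes:
            pa = counts[ka][c] / totals[ka] if totals[ka] else 0.0
            pb = counts[kb][c] / totals[kb] if totals[kb] else 0.0
            d += abs(pa - pb) ** q
    return d
```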
     3. The estimation of class membership probabilities in probability-based distance metrics is improved.
     Some probability-based distance metrics need estimates of the class membership probability P(c|x), and their performance is directly affected by the accuracy of those estimates. Chapter 4 studies the probability-based metrics SFM and MRM. Their performance depends heavily on the accuracy of P(c|x), yet the existing literature typically estimates it with naive Bayes, whose class probability estimates are known to be poor. Researchers have proposed many improved algorithms for class probability estimation; Chapter 4 investigates their class probability estimation performance and uses them to estimate the class membership probabilities in SFM and MRM. Experiments show that accurate class membership probability estimation can greatly improve the performance of SFM and MRM.
     4. Attribute weighting is used to improve distance metrics.
     Attribute weighting is an effective way to overcome the curse of dimensionality. Chapter 5 examines the simplest nominal attribute distance metric, the Overlap Metric (OM), and the simplest metric that handles both nominal and numerical attributes, the Heterogeneous Euclidean-Overlap Metric (HEOM), and improves them by attribute weighting, proposing the Correlation Weighted Heterogeneous Euclidean-Overlap Metric (CWHEOM). In CWHEOM, different weighting techniques are used for classification and regression problems. Experiments on 36 classification datasets and 36 regression datasets show that correlation-based weighting greatly improves the performance of HEOM while preserving the simplicity and comprehensibility of the metric.
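     The sketch below illustrates a weighted HEOM of the kind described: nominal attributes use the overlap difference, numerical attributes a range-normalised absolute difference, and each squared term is scaled by a per-attribute weight. How the weights are obtained (for example, a correlation measure for regression and an association measure for classification) is left abstract, since the thesis's exact correlation functions are not reproduced here.

```python
import math

def weighted_heom(x, y, weights, nominal, ranges):
    """Weighted HEOM: per-attribute differences are squared, scaled by the
    attribute weight, summed, and square-rooted."""
    total = 0.0
    for i, w in enumerate(weights):
        if x[i] is None or y[i] is None:          # missing values get the maximum difference
            di = 1.0
        elif nominal[i]:
            di = 0.0 if x[i] == y[i] else 1.0     # overlap part for nominal attributes
        else:
            di = abs(float(x[i]) - float(y[i])) / ranges[i] if ranges[i] else 0.0
        total += w * di * di
    return math.sqrt(total)
```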
     5. Attribute selection is used to improve distance metrics.
     The previous chapters mainly considered whether applying the proposed metrics to distance-related algorithms improves classification performance. Class probability estimation is, however, also an important task in machine learning and data mining. Chapter 6 takes class probability estimation as the task and studies the class probability estimation performance of KNN and its refinement KNNDW when VDM is used as the distance metric, asking how these methods can be improved. The chapter improves VDM by attribute selection: building on the observation that VDM assumes attribute independence, the attribute selection methods CFS and SBC-CLL are identified as suitable for VDM. Experimental results show that, after selecting attributes for VDM with CFS and SBC-CLL, the class probability estimation performance of KNN and KNNDW improves considerably.
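     For orientation, the following hedged sketch shows the usually stated CFS merit (subsets whose attributes correlate strongly with the class but weakly with each other score highest) driving a greedy forward search; the correlation arrays `corr_cf` and `corr_ff` are assumed inputs, and the actual CFS and SBC-CLL procedures used in the thesis are not reproduced here.

```python
import math

def cfs_merit(subset, corr_cf, corr_ff):
    """CFS merit as usually stated: k * mean(attribute-class correlation) /
    sqrt(k + k*(k-1) * mean(attribute-attribute correlation))."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(corr_cf[i] for i in subset) / k
    r_ff = (sum(corr_ff[i][j] for i in subset for j in subset if i != j)
            / (k * (k - 1))) if k > 1 else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(n_attrs, corr_cf, corr_ff):
    """Greedy forward search: repeatedly add the attribute that most improves the merit."""
    selected, best = [], 0.0
    while True:
        candidate = None
        for i in set(range(n_attrs)) - set(selected):
            m = cfs_merit(selected + [i], corr_cf, corr_ff)
            if m > best:
                best, candidate = m, i
        if candidate is None:
            return selected
        selected.append(candidate)
```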
     6. The distance metrics proposed in the thesis are applied, through distance-related algorithms, to practical problems in geophysics and engineering.
     All experiments in the thesis use a large number of datasets from the UCI repository (http://archive.ics.uci.edu/ml/datasets.html) to investigate extensively the generalization performance of the proposed metrics when applied to distance-related algorithms. In addition, the thesis examines their behaviour on practical geophysical and engineering problems such as porosity prediction, gas emission prediction, rockburst prediction, and slope stability prediction.
     In summary, this thesis relies mainly on Bayesian network models to study the distance metric problem for nominal attributes systematically and in depth. It draws on research on the naive Bayes classifier to study the attribute independence assumption in distance metrics; it uses the way Bayesian networks express attribute dependencies to learn distance metrics, turning the construction of a metric into the problem of learning attribute dependencies; and it surveys existing class probability estimation algorithms and applies them to compute the class membership probabilities in distance metrics, improving their performance, promoting the application of probability-based metrics, and giving distance-related learning algorithms better performance. The research therefore provides a worked example for new approaches to nominal attribute distance metrics and has significant theoretical and practical value. However, because expressing attribute dependencies and estimating class membership probabilities with Bayesian networks are themselves difficult, and several key scientific problems in connecting them to distance metrics remain open, this research still faces considerable challenges.
     The main contributions of the thesis are as follows:
     1. The attribute independence assumption of the naive Bayes classifier has received wide attention, but the same assumption present in distance metrics has not. The thesis studies this assumption in detail and proposes improved distance metrics on that basis.
     2. Attribute dependencies are introduced into distance metrics to construct new metrics that perform better on data with strong dependencies. Bayesian network classifiers are used to learn the dependencies, turning the construction of a distance metric into the problem of learning attribute dependencies.
     3. Existing work on class membership probability estimation is studied in detail, and Bayesian-network-based class probability estimators are used to improve the estimation accuracy in probability-based distance metrics, linking the distance metric problem with Bayesian network learning.
Instance-based learning algorithms, including nearest-neighbor learning, locally weighted learning, and memory-based reasoning, all depend on a good distance metric to be successful. The distance metric is the key to distance-related machine learning algorithms. Besides, distance metrics are also used in many fields, including pattern recognition, neural networks, statistics, and cognitive psychology. Many distance metrics have been presented to measure the difference between two instances, such as the Euclidean, Manhattan, Minkowski, Mahalanobis, and Canberra metrics. However, these distance metrics only work well for numerical attributes and do not handle nominal attributes appropriately.
     Compared with numerical attributes, it is a more sophisticated problem to define an appropriate distance metric for nominal attributes. In order to give appropriate distance metrics for nominal attributes, researchers have presented a number of metrics, such as the Overlap Metric (OM), the Value Difference Metric (VDM), the Modified Value Difference Metric (MVDM), the Short and Fukunaga Metric (SFM), the Minimum Risk Metric (MRM), the Entropy-Based Metric (EBM), and the Frequency-Based Metric (FBM).
     In real applications, most datasets involve nominal attributes, and the tasks involved include classification, class probability estimation, and clustering, all of which are important tasks in the fields of machine learning and data mining. In this thesis, we study the distance metrics for nominal attributes. The primary problems that need to be addressed are:
     1. How to understand the attribute independence assumption?
     The Value Difference Metric (VDM) is a distance metric widely used to measure the difference between nominal attribute values. According to our observation, VDM assumes attribute independence: it defines the distance between two instances as the sum of the value differences across all attributes, and the attributes do not interact. In fact, most distance metrics, such as the simplest distance metric for nominal attributes, the Overlap Metric, and the distance metrics for numerical attributes, the Euclidean and Hamming metrics, all define the distance between two instances as such a sum. Kasif et al. point out that VDM makes the same attribute independence assumption as naive Bayes. Although this assumption rarely holds in the real world, naive Bayes shows surprisingly good classification performance, and VDM is one of the most widely used distance metrics for nominal attributes. How can we understand and apply the attribute independence assumption to improve existing distance metrics or to design new distance metrics that are simple, comprehensible, and easy to compute?
     2. How to express attribute dependence in distance metrics?
     In the real world, attribute independence is unrealistic for most datasets. However, most distance metrics assume attribute independence, just as naive Bayes does. Although naive Bayes shows surprisingly good classification performance, its performance is harmed to some extent when the attributes are strongly dependent. In order to improve the performance of naive Bayes, many techniques have been presented. One of them is structure extension, whose key idea is to express attribute dependence with a limited number of directed edges; the resulting models are called augmented Bayesian network classifiers. In recent years, researchers have presented many such classifiers. They express attribute dependence through edges among the attributes and thus relax the strong independence assumption of naive Bayes, showing better classification performance. Motivated by their success, we try to design more general distance metrics that take the dependence relationships among the attributes into account and that perform better in applications with complex attribute dependencies.
     3. How to accurately estimate the class membership probabilities in probability-based distance metrics?
     Many probability-based distance metrics need to estimate class membership probabilities, such as the Short and Fukunaga Metric (SFM) and the Minimum Risk Metric (MRM). For these probability-based distance metrics to be successful, accurate class membership probability estimation is a key problem. Research has shown that full estimation of the class membership probabilities is an NP-hard problem. In order to reduce the computational complexity, naive Bayes is used to estimate the class membership probabilities in these distance metrics; as a result, their performance suffers because of the inaccurate estimates. In fact, researchers have shown, through experiments on artificial datasets, that SFM and MRM outperform VDM when accurate class membership probabilities are available. However, research has also shown that the class membership probability estimation ability of naive Bayes is poor. In order to improve it, researchers have presented some improved Bayesian network models. So we try to apply class membership probability estimators based on these improved Bayesian network models to probability-based distance metrics and consequently improve the performance of the distance metrics.
     4. How to overcome the curse of dimensionality problem?
     In this thesis, we study the distance metrics for nominal attributes. A problem closely related to distance metrics is the curse of dimensionality, which has been noticed by many researchers. When there are plenty of redundant and/or irrelevant attributes, the performance of learning algorithms deteriorates. For distance metrics, the curse of dimensionality means that, when there are many irrelevant attributes and all attributes are used to compute the distance between two instances, the distance is governed by the irrelevant attributes. As a result, a distance metric depending on all attributes can be misleading.
     An effective approach to overcoming the curse of dimensionality is to weight each attribute differently when measuring the distance between each pair of instances; this approach is widely known as attribute weighting. Another, more drastic, approach is to completely eliminate the least relevant attributes from the attribute space before measuring the distance; this approach is widely known as attribute selection. In recent years, researchers have presented much work on attribute weighting and attribute selection. In this thesis, we focus on attribute weighting and attribute selection for nominal attribute distance metrics. For example, the Overlap Metric is applied widely because of its simplicity; for it, we hope to enhance performance by attribute weighting while keeping the metric simple. For another example, the well-known distance metric VDM assumes attribute independence; for it, we hope to find suitable attribute selection methods based on that assumption.
     As mentioned before, many learning algorithms in the fields of machine learning, pattern recognition, neural networks, statistics, and cognitive psychology are distance-related algorithms, and their performance depends on the distance metrics used. For example, the k-nearest neighbor (KNN) learning algorithm and its variant, the distance-weighted k-nearest neighbor (KNNDW) learning algorithm, and the locally weighted naive Bayes (LWNB) algorithm are all distance-related algorithms. So we will apply the new distance metrics to improve these distance-related algorithms.
     In view of the aforementioned problems, in this thesis we study the distance metrics for nominal attributes and improve them from different perspectives. The major work includes:
     1. The attribute independence assumption in distance metrics is investigated.
     The attribute independence assumption is crucial to the naive Bayes classifier. Although the assumption rarely holds in the real world, the naive Bayes classifier shows surprisingly good performance. In fact, many distance metrics also assume attribute independence, such as the well-known Value Difference Metric (VDM). The Short and Fukunaga Metric (SFM) is another widely used metric, which does not assume attribute independence. In order to improve the performance of SFM, in the 2nd chapter of the thesis we propose a Modified Short and Fukunaga Metric (MSFM) based on the attribute independence assumption. MSFM is surprisingly similar to VDM in terms of expression form and computational complexity. Our experiments show that the performance of MSFM is much better than that of SF2LOG (another improved version of SFM) and SFM, and is competitive with VDM.
     2. Attribute dependence relationships are expressed in distance metrics.
     Augmented Bayesian network classifiers express attribute dependence relationships with a limited number of directed edges and consequently perform better than the naive Bayes classifier. The Value Difference Metric (VDM) is one of the distance metrics widely used to measure the distance between pairs of instances with nominal attribute values only, and it also assumes attribute independence. In the 3rd chapter of the thesis, we investigate in detail the performance of the naive Bayes classifier and the augmented Bayesian network classifiers through experiments and theoretical analysis, and derive an improved Value Difference Metric by relaxing the unrealistic attribute independence assumption, called the One Dependence Value Difference Metric (ODVDM). In ODVDM, the structure learning algorithms for Bayesian network classifiers are used to find the dependence relationships among the attributes. Our experimental results on datasets with strong attribute dependence relationships validate the effectiveness of our distance metric.
     3. The estimation of class membership probabilities in probability-based distance metrics is improved.
     When we apply the probability-based distance metrics SFM and MRM to distance-related learning algorithms, a key problem is how to accurately estimate the class membership probabilities. For simplicity, existing work uses naive Bayes classifiers to estimate the class membership probabilities in SFM and MRM. However, it has been shown that the class membership probability estimation ability of naive Bayes classifiers is poor, so SFM and MRM do not perform well. In the 4th chapter of the thesis, we study the class probability estimation performance of some augmented Bayesian network classifiers and then apply them to estimate the class membership probabilities in SFM and MRM. Our experimental results on a large number of UCI datasets show that using more accurate class probability estimation algorithms can improve the performance of SFM and MRM.
     4. The technique of attribute weighting is used to improve distance metrics.
     Attribute weighting is an attractive way to improve the accuracy of a distance metric. In the 5th chapter of the thesis, we focus on the simplest distance metric for nominal attributes, the Overlap Metric, and use attribute weighting to improve its performance. Among the large number of distance functions, HEOM is the simplest distance metric that handles applications with both continuous and nominal attributes. Further, we present an improved distance metric, the Correlation Weighted Heterogeneous Euclidean-Overlap Metric (CWHEOM). In CWHEOM, for discrete and continuous class problems we apply different correlation functions to estimate the weight values of the attribute variables. The improved distance metric significantly raises the generalization performance of HEOM while keeping the simplicity and comprehensibility of the distance metric. Experimental results on 36 discrete class datasets and 36 continuous class datasets show that our new method achieves higher accuracy than HEOM.
     5. The technique of attribute selection is used to improve distance metrics.
     In the previous chapters, we apply distance metrics to distance-related learning algorithms to deal with classification tasks. In real-world applications, accurate class probability estimation is often required instead of just a classification, and probability-based classifiers also produce class probability estimates. In the 6th chapter of the thesis, we focus on the class probability estimation performance of KNN and KNNDW using VDM. We try to use attribute selection to improve the accuracy of VDM and consequently the class probability estimation performance of KNN and KNNDW. The key question is what kind of attribute selection method is suitable for VDM. According to our observation, VDM assumes attribute independence just as naive Bayes does, so our idea is that attribute selection methods for VDM should select attribute subsets for which the attribute independence assumption holds as far as possible. From this perspective, we propose to use the attribute selection methods CFS and SBC-CLL to select attribute subsets for VDM. Experimental results show that our attribute selection methods significantly improve the class probability estimation performance of KNN and KNNDW using VDM.
     6. We apply the distance metrics presented in the thesis to distance-related learning algorithms to deal with some practical tasks in the fields of geophysics and engineering.
     In every chapter, we run our experiments on many UCI datasets (http://archive.ics.uci.edu/ml/datasets.html). Besides, we take problems in the fields of geophysics and engineering, such as reservoir porosity prediction, gas emission prediction, rockburst prediction, and slope stability prediction, to study the practical value of the new distance metrics presented in the thesis.
     In summary, we thoroughly and systematically study the distance metrics for nominal attributes in this thesis, and we study distance metrics in association with Bayesian networks. The Bayesian network is an excellent learning model, yet it seems unrelated to the distance metric problem. In the thesis, we apply the way Bayesian networks express attribute dependence relationships to the study of distance metrics, transforming the distance measure problem into the structure learning problem of Bayesian network classifiers. We also investigate existing class probability estimators in detail and apply them to improve the estimation of class membership probabilities. Nevertheless, some problems still need to be resolved in order to relate Bayesian networks to distance metrics.
     The main innovations in the thesis include:
     1. As is well known, the naive Bayes classifier assumes attribute independence. In this thesis, we investigate the attribute independence assumption in distance metrics in detail and present improved distance metrics.
     2. We express attribute dependence relationships in distance metrics, and transform the distance measure problem into the structure learning problem of Bayesian network classifiers.
     3. We investigate the existing work on class membership probability estimators and apply the class membership probability estimators based on Bayesian networks to improve the probability estimation in distance metrics.