Research on RBF Neural Network Model and Learning Algorithm Based on Semi-Supervised Multiple-Instance Learning

Original title (Chinese): 基于半监督多示例的径向基函数网络模型及学习算法研究
Author: Yu Wentao (于文韬)
Thesis level: Master's
Discipline: Petroleum Engineering Computing Technology
Keywords: Semi-Supervised Learning; Multiple-Instance Learning; Clustering; Hausdorff Distance; RBF Neural Network; RBF Process Neural Network
Degree year: 2011
Supervisor: Xu Shaohua (许少华)
Discipline code: 081202
Degree-granting institution: Northeast Petroleum University (东北石油大学)
Thesis submission date: 2011-03-14

Abstract

Real-world data sets are diverse and uncertain, and handling massive volumes of such diverse data has become a central task in machine learning. Models and algorithms for semi-supervised multiple-instance learning are therefore likely to become an active research direction in machine learning theory.
     First, for the semi-supervised multiple-instance problem in a non-sequential (non-time-series) sample space, this thesis proposes a training algorithm for a semi-supervised multiple-instance RBF network, built on radial basis function networks and clustering, and analyzes the outlier problem in the sample space. The basic idea is to define a Hausdorff distance that measures the distance between sets, and on that basis to propose a semi-supervised multiple-instance clustering algorithm. The clustering algorithm uses the prior knowledge carried by the labeled multiple-instance samples to label the unlabeled samples, which reveals how the sample space is distributed under the cluster assumption. The Hausdorff distance then serves as the norm in the radial basis kernel function, and an RBF network is trained on the whole sample set in order to improve the network's training capability. Simulation experiments demonstrate the practicality of the algorithm.
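
The ideas in this paragraph can be illustrated with a minimal Python sketch. The abstract does not give the thesis's exact formulas, so everything here is an assumption: the standard max-min Hausdorff distance between two bags, a Gaussian radial basis unit with that distance as its norm, and a nearest-labeled-bag rule (the hypothetical `label_unlabeled`) as a simplified stand-in for the semi-supervised clustering step. The bags and labels are made-up toy data.

```python
import math

def directed_hausdorff(bag_a, bag_b):
    """Largest distance from an instance in bag_a to its nearest instance in bag_b."""
    return max(min(math.dist(a, b) for b in bag_b) for a in bag_a)

def hausdorff(bag_a, bag_b):
    """Symmetric (max-min) Hausdorff distance between two bags of instance vectors."""
    return max(directed_hausdorff(bag_a, bag_b), directed_hausdorff(bag_b, bag_a))

def label_unlabeled(labeled, unlabeled):
    """Give each unlabeled bag the label of its nearest labeled bag.

    Assumption: a simplified stand-in for the semi-supervised clustering step.
    """
    return [min(labeled, key=lambda pair: hausdorff(bag, pair[0]))[1]
            for bag in unlabeled]

def rbf_activation(bag, center_bag, sigma=1.0):
    """Gaussian radial basis unit whose norm is the Hausdorff distance (assumed form)."""
    d = hausdorff(bag, center_bag)
    return math.exp(-(d * d) / (2.0 * sigma ** 2))

# Toy data: each bag is a list of 2-D instance vectors.
bag_x = [(0.0, 0.0), (1.0, 0.0)]
bag_c = [(0.0, 1.0), (4.0, 0.0)]
labeled = [(bag_x, "positive"), ([(9.0, 9.0)], "negative")]
unlabeled = [[(0.5, 0.0)], [(8.0, 9.0), (9.0, 8.0)]]

print(hausdorff(bag_x, bag_c))
print(label_unlabeled(labeled, unlabeled))
print(rbf_activation(bag_x, bag_c))
```

Because the distance is defined on whole bags, the radial basis unit takes a bag as input and a bag as its center, so no per-instance labels are needed.
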
     Second, for the more general case of a sequential (time-series) sample space, the thesis proposes a semi-supervised multiple-instance RBF process neural network algorithm based on RBF process neural networks, time-series clustering, and genetic algorithms. The basic idea is to extend the Hausdorff distance along the time dimension to obtain a generalized time-series Hausdorff distance, which yields a time-series semi-supervised multiple-instance clustering algorithm; an RBF process neural network is then trained on the sample set. During training, the coefficients of the kernel center functions must be adjusted. Because the min{ } function is not differentiable, a genetic algorithm is introduced to carry out this adjustment; the global search capability of the genetic algorithm also reduces the number of iterations needed for training. Simulation experiments demonstrate the effectiveness of the algorithm.
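
To show why a derivative-free search suits an objective containing min{ }, here is a hedged Python sketch. None of it comes from the thesis: the discretized L2 distance between sampled time functions, the polynomial basis in the hypothetical `center_function`, the objective, and the genetic operators (elitism, blend crossover, Gaussian mutation) are all assumptions chosen for brevity.

```python
import math
import random

def l2_distance(f, g, dt):
    """Discretized L2 distance between two time functions sampled on the same grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) * dt)

def center_function(coeffs, times):
    """Hypothetical kernel center c(t) = sum_k coeffs[k] * t**k (polynomial basis)."""
    return [sum(c * t ** k for k, c in enumerate(coeffs)) for t in times]

def objective(coeffs, bags, times, dt):
    """Sum over bags of the distance from the center to the nearest instance.

    The inner min makes this non-differentiable in coeffs.
    """
    center = center_function(coeffs, times)
    return sum(min(l2_distance(f, center, dt) for f in bag) for bag in bags)

def genetic_search(bags, times, dt, n_coeffs=2, pop_size=30, generations=40, seed=0):
    """Tiny elitist genetic algorithm over the center coefficients (illustrative only)."""
    rng = random.Random(seed)

    def fitness(ind):
        return objective(ind, bags, times, dt)

    pop = [[rng.uniform(-2.0, 2.0) for _ in range(n_coeffs)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness)
        history.append(fitness(pop[0]))
        elite = pop[: pop_size // 5]          # best 20% survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)     # pick two distinct parents
            w = rng.random()                  # blend crossover weight
            children.append([w * a + (1.0 - w) * b + rng.gauss(0.0, 0.1)
                             for a, b in zip(p1, p2)])
        pop = elite + children
    best = min(pop, key=fitness)
    history.append(fitness(best))
    return best, history

# Toy data: two bags of time functions sampled on [0, 1].
times = [i * 0.1 for i in range(11)]
dt = 0.1
bags = [
    [[1.0 + 0.5 * t for t in times], [3.0 for _ in times]],
    [[1.1 + 0.5 * t for t in times], [-2.0 * t for t in times]],
]
best, history = genetic_search(bags, times, dt)
print(best, history[0], history[-1])
```

The search only ever evaluates the objective, never its gradient, which is what lets it cope with the min. Elitism keeps the best objective value from getting worse between generations.
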
     Finally, to address the low efficiency that commonly affects neural network training on large sample sets, the thesis proposes a parallel training algorithm for the semi-supervised multiple-instance RBF process neural network using hybrid MPI and OpenMP programming. The algorithm parallelizes the clustering operator and the genetic operators of the network. Comparative experiments were run with training function sample sets of different sizes and with different numbers of compute nodes. The results show that choosing the parallel granularity to suit the network and sample size effectively reduces training time and thereby improves network performance.
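
The role of parallel granularity can be sketched in Python. This is not the thesis's MPI and OpenMP implementation: a standard-library thread pool stands in for the parallel workers, the work is the assignment step of a clustering operator, and the `grain` parameter (bags per task) is a hypothetical knob used to illustrate granularity. The Hausdorff form and the toy data are also assumptions.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def hausdorff(bag_a, bag_b):
    """Symmetric max-min Hausdorff distance between two bags (assumed form)."""
    def directed(p, q):
        return max(min(math.dist(a, b) for b in q) for a in p)
    return max(directed(bag_a, bag_b), directed(bag_b, bag_a))

def nearest_center_chunk(chunk, centers):
    """One task: assign every bag in the chunk to the index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: hausdorff(bag, centers[k]))
            for bag in chunk]

def parallel_assign(bags, centers, grain=2, workers=4):
    """Split the bags into chunks of `grain` bags and process the chunks concurrently."""
    chunks = [bags[i:i + grain] for i in range(0, len(bags), grain)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in chunk order, so labels line up with bags.
        parts = list(pool.map(nearest_center_chunk, chunks, [centers] * len(chunks)))
    return [label for part in parts for label in part]

# Toy data: two cluster centers and five bags.
centers = [[(0.0, 0.0)], [(10.0, 10.0)]]
bags = [
    [(0.0, 1.0), (1.0, 0.0)],
    [(9.0, 9.0)],
    [(1.0, 1.0)],
    [(10.0, 11.0), (11.0, 10.0)],
    [(0.0, 0.0)],
]
print(parallel_assign(bags, centers, grain=2))
```

A small `grain` creates many short tasks and more scheduling overhead, while a large `grain` creates few tasks and may leave workers idle; this is the trade-off behind choosing the granularity to match the problem size.
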
