Convergence of Gradient Learning Algorithms for Recurrent Neural Networks
Abstract
An artificial neural network (ANN), often simply called a "neural network" (NN), is a mathematical or computational model for information processing that simulates the structure of biological neural networks. According to their structure, neural networks fall into two categories: feedforward neural networks (FNNs) and recurrent neural networks (RNNs).
     In an FNN, the output of one layer serves as the input to the next, so information is processed layer by layer in one direction and the network contains no feedback loops. An FNN realizes a mapping from an input vector x to an output vector y, usually called a static mapping, and is suited to time-independent tasks such as character recognition and curve approximation. However, many fields, including nonlinear dynamic system modeling, identification, control, fault diagnosis and time series prediction, involve a mapping between two discrete time series x(t) and y(t), in which the output y(t) depends not only on x(t) but also on x(t-1), x(t-2), …, and on y(t-1), y(t-2), …; such a mapping is called a dynamic mapping. A network dealing with this kind of problem should itself be a dynamic system, which requires introducing a memory mechanism into the network. Through their own temporal operations, RNNs can cope with time-varying inputs and outputs; they realize dynamic mappings and are therefore more suitable than FNNs for problems involving dynamic systems.
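     To make the distinction concrete, the following display is a minimal sketch (the hidden state h(t), the weights W, U, V and the activation σ are illustrative notation, not taken from the thesis): a static mapping has the form y = f(x), whereas a dynamic mapping, and one recurrent realization of it, can be written as

\begin{align*}
  y(t) &= F\bigl(x(t), x(t-1), x(t-2), \dots;\; y(t-1), y(t-2), \dots\bigr),\\
  h(t) &= \sigma\bigl(W\,h(t-1) + U\,x(t)\bigr), \qquad y(t) = V\,h(t),
\end{align*}

where the context state h(t) supplies the memory that a purely feedforward mapping lacks.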
     As with FNNs, simple gradient search algorithms are often used to train RNNs. Because of the recurrent structure, the computation of the gradient is itself recursive, which makes the learning considerably more complicated than for FNNs. One of the key research subjects concerning gradient methods for training recurrent neural networks is their convergence theory. Research on this topic not only helps us to understand the nature and characteristics of the methods, but also provides important guidance for their many practical applications.
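     As a one-neuron illustration of why the gradient is computed recursively (a sketch under assumed notation, not the thesis's derivation), differentiate the state recursion h(t) = σ(w h(t-1) + u x(t)) with respect to the recurrent weight w:

\[
  \frac{\partial h(t)}{\partial w}
    = \sigma'\bigl(w\,h(t-1) + u\,x(t)\bigr)
      \left(h(t-1) + w\,\frac{\partial h(t-1)}{\partial w}\right),
\]

so the sensitivity at time t depends on the sensitivity at time t-1. Real-time recurrent learning propagates this recursion forward in time, while backpropagation through time resolves it by unfolding the network over time.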
     Chapter 1 reviews some background on neural networks.
     Chapter 2 discusses the convergence of a gradient descent learning algorithm for fully recurrent neural networks. We establish the monotonicity of the error function and a convergence theorem, and present numerical experiments that support the theoretical results.
     Chapter 3 considers the deterministic convergence of a gradient learning algorithm for training Elman networks on a finite training sample set. Monotonicity of the error function during the iteration is shown; on this basis, a weak and a strong convergence result are proved, namely that the gradient of the error function tends to zero and that the weight sequence converges to a fixed point, respectively. A numerical experiment is given to support the theoretical findings.
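     The following Python sketch illustrates the setting (it is not the thesis's code; the network is reduced to a single Elman-style recurrent neuron, and the data, learning rate and epoch count are invented for illustration). It trains with the batch gradient, computed through the recursive sensitivities, and records the error so that its monotone decrease (for a sufficiently small learning rate) and the decay of the gradient can be checked numerically.

import numpy as np

rng = np.random.default_rng(0)
T = 20
x = rng.standard_normal(T)                 # finite input sample sequence
d = np.sin(np.arange(T) / 3.0)             # target sequence

w, u = 0.1, 0.1                            # recurrent and input weights
eta = 0.05                                 # learning rate

def batch_error_and_gradient(w, u):
    """One pass through the sample set: error E and its gradient,
    using the recursive sensitivities dh(t)/dw and dh(t)/du."""
    h, dh_dw, dh_du = 0.0, 0.0, 0.0
    E, gw, gu = 0.0, 0.0, 0.0
    for t in range(T):
        h_new = np.tanh(w * h + u * x[t])  # h(t) = tanh(w h(t-1) + u x(t))
        s = 1.0 - h_new ** 2               # derivative of tanh at the net input
        dh_dw = s * (h + w * dh_dw)        # recursive sensitivity in w
        dh_du = s * (x[t] + w * dh_du)     # recursive sensitivity in u
        h = h_new
        e = h - d[t]                       # output error (output = hidden state)
        E += 0.5 * e ** 2
        gw += e * dh_dw
        gu += e * dh_du
    return E, gw, gu

errors = []
for epoch in range(300):
    E, gw, gu = batch_error_and_gradient(w, u)
    errors.append(E)
    w, u = w - eta * gw, u - eta * gu      # batch (offline) weight update

print("error decreased monotonically:", all(a >= b for a, b in zip(errors, errors[1:])))
print("final gradient size:", abs(gw) + abs(gu))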
     Chapter 4 studies how the performance of the algorithm is affected when the recursion in the gradient of the error function for Elman networks is partially cut off, the main aim being to greatly reduce the computational effort. We analyze the convergence of this approximated gradient method and show that, during the learning process, the error function decreases monotonically and the approximated gradient tends to zero.
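     One plausible way to realize this cutting of the recursion, sketched for the single-neuron example above (the thesis's exact approximation may differ), is to drop the feedback term that carries the sensitivity from one time step to the next; each step then uses only locally available quantities, which removes the recursive part of the computational cost:

import numpy as np

def approx_batch_gradient(w, u, x, d):
    """Approximated batch gradient: the recursive feedback terms are dropped."""
    h, E, gw, gu = 0.0, 0.0, 0.0, 0.0
    for t in range(len(x)):
        h_prev = h
        h = np.tanh(w * h_prev + u * x[t])
        s = 1.0 - h ** 2
        dh_dw = s * h_prev                 # full version: s * (h_prev + w * dh_dw)
        dh_du = s * x[t]                   # full version: s * (x[t]  + w * dh_du)
        e = h - d[t]
        E += 0.5 * e ** 2
        gw += e * dh_dw
        gu += e * dh_du
    return E, gw, gu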
     Chapter 5 establishes the equivalence of two gradient methods for training recurrent neural networks. The two classical gradient-based learning algorithms for recurrent neural networks are Real-Time Recurrent Learning (RTRL) and Back-Propagation Through Time (BPTT). For the batch updating scheme, we prove that RTRL and BPTT are equivalent: the weight increments they produce are identical.
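     The following self-contained sketch checks this numerically for a single recurrent neuron (illustrative notation and data, not the thesis's formulation): the batch gradient obtained by propagating sensitivities forward in time (RTRL) coincides, up to floating-point error, with the one obtained by unfolding the network and propagating errors backward in time (BPTT).

import numpy as np

rng = np.random.default_rng(1)
T = 15
x = rng.standard_normal(T)
d = rng.standard_normal(T)
w, u = 0.3, 0.5

# Forward pass, storing all states for the backward (BPTT) pass.
h = np.zeros(T + 1)                        # h[0] is the initial state
for t in range(T):
    h[t + 1] = np.tanh(w * h[t] + u * x[t])
e = h[1:] - d                              # output errors (output = hidden state)

# RTRL: propagate the sensitivities forward in time.
dh_dw = dh_du = 0.0
gw_rtrl = gu_rtrl = 0.0
for t in range(T):
    s = 1.0 - h[t + 1] ** 2
    dh_dw = s * (h[t] + w * dh_dw)
    dh_du = s * (x[t] + w * dh_du)
    gw_rtrl += e[t] * dh_dw
    gu_rtrl += e[t] * dh_du

# BPTT: unfold the network and propagate the errors backward in time.
gw_bptt = gu_bptt = 0.0
delta_next = 0.0                           # dE/d(net input) at the following step
for t in reversed(range(T)):
    dE_dh = e[t] + w * delta_next          # total derivative of E w.r.t. h(t)
    delta = dE_dh * (1.0 - h[t + 1] ** 2)  # dE/d(net input) at step t
    gw_bptt += delta * h[t]
    gu_bptt += delta * x[t]
    delta_next = delta

print(np.isclose(gw_rtrl, gw_bptt), np.isclose(gu_rtrl, gu_bptt))  # both True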
     Chapter 6 presents convergence results for some improved learning algorithms for recurrent neural networks.
