Research on Neural Networks and Distributed Computing Based on Multidimensional Data Analysis
Abstract
Artificial neural networks have attracted wide attention from scientists in many fields because of their massively parallel processing, distributed storage, adaptability, and fault tolerance, and they have been widely applied in biology, electronics, computer science, mathematics, and other areas. With the rapid development of network communication technology and the Internet, distributed computing has become one of the key technologies shaping today's computer technology, and it is increasingly used in modern social and economic development. Both technologies depend on data, and much of that data is multidimensional data stored in data warehouses; both also require data analysis, which involves multidimensional matrices. Studying neural networks and distributed computing based on multidimensional data analysis is therefore of considerable significance, and this research was supported by the National Natural Science Foundation of China.
     The work in this dissertation falls into four parts.
     The first part studies multidimensional data analysis and multidimensional matrices. Because multidimensional data analysis is essential in data warehouses, we introduce the concept of the multidimensional matrix and discuss the operational properties of the cubic matrix, its most widely used form, laying the groundwork for applications in neural networks and distributed computing.
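The abstract does not give the dissertation's definitions of cubic-matrix operations, so the following is only a minimal sketch of the underlying idea: a three-way array used as a data cube, with one aggregation (a roll-up that sums over a chosen dimension). The `roll_up` function and the `sales` data are hypothetical illustrations, not taken from the dissertation.

```python
from typing import List

Cube = List[List[List[float]]]  # indexed as cube[i][j][k]


def roll_up(cube: Cube, axis: int) -> List[List[float]]:
    """Sum a 3-D array over one axis and return the remaining 2-D matrix."""
    n0, n1, n2 = len(cube), len(cube[0]), len(cube[0][0])
    if axis == 0:
        return [[sum(cube[i][j][k] for i in range(n0)) for k in range(n2)]
                for j in range(n1)]
    if axis == 1:
        return [[sum(cube[i][j][k] for j in range(n1)) for k in range(n2)]
                for i in range(n0)]
    if axis == 2:
        return [[sum(cube[i][j][k] for k in range(n2)) for j in range(n1)]
                for i in range(n0)]
    raise ValueError("axis must be 0, 1, or 2")


# Hypothetical cube: 2 regions x 2 products x 3 quarters.
sales: Cube = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]],
]

# Aggregate away the quarter dimension: one total per (region, product).
by_region_product = roll_up(sales, axis=2)
```

Summing over a different axis gives the other two-way views of the same cube, which is the kind of slice-and-aggregate operation that multidimensional analysis in a data warehouse relies on.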
     The second part studies neural networks based on multidimensional data analysis. First, we construct an unsupervised-learning neural network model with convex constraints; its special structure lets it compress and restore data, and after training it can represent the main features of the information. Second, we study a Bayesian neural network: we use the generalized naive Bayes method to handle continuous variables, construct a kernel function from orthogonal polynomials to estimate the density function of the prior distribution, and further investigate the optimality of the kernel estimates of the density function and its derivatives. Third, for research on total factor productivity (TFP), we construct a fork neural network that measures TFP using a stochastic frontier model. Finally, to compute the TFP contribution rate, we construct a semi-supervised heterogeneous neural network whose outputs are made consistent through mutual interaction, and we discuss its structure and algorithm in detail.
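The abstract does not specify which orthogonal polynomials or which kernel the dissertation uses. As a rough illustration of the general idea only, the sketch below assumes orthonormal Legendre polynomials on [-1, 1] and the standard projection kernel K(x, y) = sum_k phi_k(x) phi_k(y), so that the density estimate is the sample average of K(x, X_i). The function names, the degree, and the sample values are all hypothetical.

```python
import math
from typing import List, Sequence


def legendre_orthonormal(x: float, degree: int) -> List[float]:
    """Return phi_0(x)..phi_degree(x), orthonormal Legendre polynomials on [-1, 1]."""
    p = [1.0, x]
    for k in range(1, degree):
        # Bonnet recurrence: (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}
        p.append(((2 * k + 1) * x * p[k] - k * p[k - 1]) / (k + 1))
    # Scale P_k so that the integral of phi_k^2 over [-1, 1] equals 1.
    return [math.sqrt((2 * k + 1) / 2.0) * p[k] for k in range(degree + 1)]


def projection_kernel(x: float, y: float, degree: int) -> float:
    """Kernel built from orthogonal polynomials: sum_k phi_k(x) * phi_k(y)."""
    phi_x = legendre_orthonormal(x, degree)
    phi_y = legendre_orthonormal(y, degree)
    return sum(a * b for a, b in zip(phi_x, phi_y))


def density_estimate(x: float, sample: Sequence[float], degree: int = 4) -> float:
    """Estimate the density at x as the sample mean of the kernel."""
    return sum(projection_kernel(x, xi, degree) for xi in sample) / len(sample)


# Hypothetical observations, assumed already rescaled to [-1, 1].
sample = [-0.6, -0.2, 0.0, 0.1, 0.3, 0.7]
value_at_zero = density_estimate(0.0, sample)
```

An estimate of this form integrates to one over [-1, 1] but can take negative values at some points, and the polynomial degree plays the role that a bandwidth plays in an ordinary kernel estimate.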
     The third part concerns distributed computing based on multidimensional data analysis. First, we improve the partial least squares (PLS) algorithm for the structural equation model (SEM) and construct a deterministic algorithm. Second, we study the multi-group structural equation model: distributed computing is used to calculate the coefficients of each group in the structural equations, a unified model is built using a generalized linear model with convex constraints, and an algorithm for the multi-group SEM is given. Third, we study the multivariate nonparametric regression curve drift model and use distributed computing to forecast sales curves with it. Finally, we study several specific applications of distributed computing: Monte Carlo distributed computation of tables for general distribution functions, distributed computation for modeling the decomposition products of a protein, and bootstrap analysis of the MOSFET lifetime distribution with negative-order moment estimation, together with its distributed computation.
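As a minimal sketch of how a Monte Carlo distribution-function table can be split across workers, the example below has each worker draw its own samples with its own seed and count how many fall at or below each grid point; the counts are then added and normalized. This is an illustration under stated assumptions, not the dissertation's method: threads stand in for remote nodes, the statistic (the sum of two uniform variates) is chosen only because its exact distribution function is known, and `worker_counts` and `distributed_cdf_table` are hypothetical names.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def worker_counts(seed: int, n_samples: int, grid: Sequence[float]) -> List[int]:
    """One worker: simulate the statistic and count samples <= each grid point."""
    rng = random.Random(seed)  # independent generator per worker
    counts = [0] * len(grid)
    for _ in range(n_samples):
        t = rng.random() + rng.random()  # assumed statistic: sum of two uniforms
        for j, g in enumerate(grid):
            if t <= g:
                counts[j] += 1
    return counts


def distributed_cdf_table(grid: Sequence[float], n_workers: int = 4,
                          n_samples: int = 20000) -> List[float]:
    """Combine per-worker counts into estimates of F(g) for each grid point g."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda s: worker_counts(s, n_samples, grid),
                              range(n_workers)))
    total = n_workers * n_samples
    return [sum(p[j] for p in parts) / total for j in range(len(grid))]


grid = [0.5, 1.0, 1.5]
table = distributed_cdf_table(grid)  # exact values are 0.125, 0.5, 0.875
```

Because the workers exchange only small count vectors, the same split would carry over to separate machines; only the way `worker_counts` is dispatched would change.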
     The final part presents an integrated application of neural networks and distributed computing based on multidimensional data analysis: the customer satisfaction index evaluation and analysis system, a large application system developed by our team. The system is built on a data warehouse and .NET technology, adopts the architecture of the unsupervised-learning neural network model with convex constraints, and implements distributed computing based on remote method invocation.