Research on Data Integration Technology for XML-Based Information Management Systems (基于XML的信息管理系统的数据集成技术研究)

English title: Research on the Data Integration Engineering of the Information Management System Based on XML
Author: Zhai Xuemin (翟学敏)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: Particle Swarm Optimization (PSO) ; Ant Colony Optimization (ACO) ; Pheromone ; Path Scatter ; XML Probabilistic Query ; XML Document Set ; XML Key ; Data Cleaning ; Hidden Markov Model
Degree year: 2008
Advisor: Liu Yuan (刘渊)
Discipline code: 081203
Degree-granting institution: Jiangnan University (江南大学)
Thesis submission date: 2008-06-01
Chair of the defense committee: Wang Shitong (王士同)

Abstract

With the rapid development of Web technology and its applications, XML has become an important standard for representing and exchanging information on the Internet. It is widely used in e-commerce, data exchange, scientific data representation, data modeling, and search engines, and its influence reaches every corner of the networked community. Current database development also shows three main characteristics: support for the XML data format, business intelligence, and support for SOA (service-oriented architecture). As large volumes of XML data emerge and are transmitted, the need to manage XML data has grown. How to effectively represent, store, manage, query, and mine XML data or data streams has therefore become an important challenge in the XML database field, with considerable theoretical and practical value. This thesis studies the intelligent management of XML data against this background.
     The thesis investigates the representation, querying, and aggregation of XML data and data streams. Its content and results mainly concern intelligent data cleaning and querying:
     Data cleaning is an effective means of improving data quality and query efficiency. With the growth of the Internet, the importance of intelligent cleaning and querying of XML data has gradually been recognized. To address the cumbersome detection and poor flexibility of earlier XML data cleaning approaches, the thesis builds a metadata model of the XML data cleaning process by combining XML keys appropriately, incorporating particle swarm optimization (PSO), and introducing Bayesian learning and a hidden Markov model (HMM) information extraction strategy. Drawing on ideas for cleaning approximately duplicate records in structured data, it proposes a new method that uses PSO to improve XML data cleaning. The thesis also introduces swarm intelligence algorithms to make XML data querying more intelligent and effective. PSO offers fast, stochastic global search but cannot exploit feedback information. Ant colony optimization (ACO) converges on the optimal path by accumulating and updating pheromone and supports distributed, parallel global search, but pheromone is scarce in the early stage, so it solves slowly at first. Using a heuristic approach and taking the semi-structured nature of XML into account, the thesis applies PSO and ACO, with corresponding improvements, to XML probabilistic queries: PSO quickly generates the pheromone distribution, and ACO then solves precisely. The two algorithms complement each other's strengths, which broadens the scope of data queries and improves convergence efficiency.
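The abstract describes the hybrid only at a high level: PSO produces an initial pheromone distribution, and ACO then refines the solution. The following is a minimal sketch of that division of labor, not the thesis's actual algorithm. It assumes a toy problem in which one branch is chosen at each level of a tree and the cost of a path is the sum of per-level branch costs; the cost table, the parameter values, and all function names (`pso_phase`, `aco_phase`, `hybrid_search`) are hypothetical and chosen for illustration only.

```python
import random

def path_cost(path, costs):
    """Total cost of picking branch path[d] at level d."""
    return sum(costs[d][k] for d, k in enumerate(path))

def decode(position):
    """Map a particle's real-valued preferences to one branch per level (argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in position]

def pso_phase(costs, rng, particles=12, iters=15, w=0.7, c1=1.5, c2=1.5):
    """Coarse global search; returns the best path found by each particle."""
    levels, branches = len(costs), len(costs[0])
    pos = [[[rng.random() for _ in range(branches)] for _ in range(levels)]
           for _ in range(particles)]
    vel = [[[0.0] * branches for _ in range(levels)] for _ in range(particles)]
    pbest = [[row[:] for row in p] for p in pos]
    pbest_cost = [path_cost(decode(p), costs) for p in pos]
    g = min(range(particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = [row[:] for row in pbest[g]], pbest_cost[g]
    for _ in range(iters):
        for i in range(particles):
            for d in range(levels):
                for k in range(branches):
                    r1, r2 = rng.random(), rng.random()
                    # Standard inertia-weight velocity update.
                    vel[i][d][k] = (w * vel[i][d][k]
                                    + c1 * r1 * (pbest[i][d][k] - pos[i][d][k])
                                    + c2 * r2 * (gbest[d][k] - pos[i][d][k]))
                    pos[i][d][k] += vel[i][d][k]
            cost = path_cost(decode(pos[i]), costs)
            if cost < pbest_cost[i]:
                pbest[i], pbest_cost[i] = [row[:] for row in pos[i]], cost
                if cost < gbest_cost:
                    gbest, gbest_cost = [row[:] for row in pos[i]], cost
    return [decode(p) for p in pbest]

def aco_phase(costs, seed_paths, rng, ants=10, iters=30, rho=0.3, boost=1.0):
    """Refine with ants; pheromone starts from the PSO paths instead of uniform."""
    levels, branches = len(costs), len(costs[0])
    tau = [[0.1] * branches for _ in range(levels)]
    for path in seed_paths:
        for d, k in enumerate(path):
            tau[d][k] += boost / len(seed_paths)
    best = min(seed_paths, key=lambda p: path_cost(p, costs))
    best_cost = path_cost(best, costs)
    for _ in range(iters):
        for _ in range(ants):
            # Each ant picks a branch per level with probability proportional to tau.
            path = [rng.choices(range(branches), weights=tau[d])[0]
                    for d in range(levels)]
            cost = path_cost(path, costs)
            if cost < best_cost:
                best, best_cost = path, cost
        for d in range(levels):
            for k in range(branches):
                tau[d][k] = max(1e-6, tau[d][k] * (1.0 - rho))  # evaporation
            tau[d][best[d]] += boost  # reinforce the best path so far
    return best, best_cost

def hybrid_search(costs, seed=0):
    """Run PSO first, then ACO seeded with the PSO results."""
    rng = random.Random(seed)
    seed_paths = pso_phase(costs, rng)
    pso_cost = min(path_cost(p, costs) for p in seed_paths)
    best, best_cost = aco_phase(costs, seed_paths, rng)
    return best, best_cost, pso_cost

if __name__ == "__main__":
    demo_costs = [[4, 2, 7], [3, 9, 1], [5, 6, 2], [8, 1, 4]]  # hypothetical data
    print(hybrid_search(demo_costs, seed=1))
```

Because the ACO phase starts from the best PSO path and only replaces it with a cheaper one, the final cost can never be worse than the PSO result. Seeding the pheromone from the PSO paths is one simple way to address the early pheromone scarcity mentioned above; the thesis may use a different seeding rule.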

