分布式数据库查询优化技术

作者：张旭中
论文级别：硕士
学科专业名称：计算机系统结构
中文关键词：分布式数据库 ; 分布式查询代价 ; 全局查询代价 ; 代价统计
英文关键词：Distributed Database ; distributed query cost ; global query cost ; cost statistic.
学位年度：2003
导师：刘心松
学科代码：081201
学位授予单位：电子科技大学
论文提交日期：2003-03-01

摘要

计算机网络的发展和信息的共享，使得分布式数据库的发展成为必然和热点。人们对数据存储和检索的高可靠性和高速度性，要求越来越高，传统数据库的局限已经暴露得越来越明显。因此分布式数据库便迎合了这一需求。
     在分布式数据库中，由于高可靠性和高速度性是其重要特点，所以对查询执行的要求也就更高。而查询执行中查询优化是执行的关键环节，查询优化在很大程度上决定查询的效率或快慢，因此查询优化技术一直是许多数据库专家学者研究的重要课题。传统的数据库查询优化主要是从查询的底层执行流程和实现技术出发，通过关系代数的手段进行理论上的探讨，而且大量研究集中在查询执行的语法分析阶段。其核心思想是查询编译器利用元数据和关于数据的统计数据来确定哪一个操作序列可能是最快的。例如，从物理查询计划的底层磁盘输入输出到语法分析阶段的语法分析树、用于改进查询计划的代数定律、逻辑查询计划的改进，以及操作代价的估计、基于代价的计划和连接顺序的选择等全过程，都进行了不懈的努力。在这方面的研究已经非常成熟。
     但是，对分布式数据库的查询优化还很不成熟，这不仅因为分布式数据库技术目前发展还不完善，还因为分布式数据库本身的复杂性，它涉及的因素多且变化多端。存在于网络环境的分布式数据库系统，节点之间的通信代价和分布式计算处理，成为不可回避的重要内容。本文讨论的分布式数据库优化仅从上层入手，并假定下层的优化工作已经完善，即在分布式的全局处理层，重点是对分布式查询执行的全局处理策略进行优化，尽可能避免通信代价的开销，并着眼于查询执行的实际代价，从分布式系统中选出一个最优的执行节点。它从查询执行的效果出发，通过统计的方式，不断从最近的查询执行代价学习纠正最近查询执行的统计代价，为查询的全局处理提供参考，以达到优化执行、提高执行效率和速度的目的。
     全文分为六章：第一章对分布式数据库做总体概述，第二章回顾了数据库优化技术的发展，第三章介绍了本文基于的数据库系统模型DPSQL，第四章在分析分布式数据库查询执行的基础上对基于局域网的分布式系统进行了优化研究，第五章是对第四章讨论的优化的实现，第六章对优化系统做了性能上的分析和探讨，最后的结语总结了全文。
Databases are used widely in many fields. Because Centralized Database has many innate disadvantages when applied to Internet, the application of Distributed Database gets more and more popular. With the development of computer networks and information technology, Distributed Database Systems has become one of the research hotspots of computer science.
    However, there are still many challenging problems in this field that attract many researchers. It is well known that the performance of a database relies heavily on the efficiency of query execution. To obtain efficient query execution, optimization is the most important step. Many researches have carried on research on this subject by going deep into the bottom of Query Engine. Many mature technologies on this level have been brought out, such as the relational algebra law, the improved logical query plan, the cost estimation of operation, the selective plan based on cost and order of joint, etc. Although many methods have been tried out, no remarkable result or noteworthy technology has come to reality because of complexity of data decomposition and network effects.
    This article concentrates on how to optimize the global query at an upper level: database-level distribution. Based on statistical methods, the optimizing algorithm try to find a light-loaded server that can process the query with less cost. In fact, it uses the historical records of previous execution. Then, according to some algorithm, the optimizing processor can determine which node among the system is the best to execute this query. The whole system is based on MYSQL, an open source database, which is widely used in Internet application.
    The balance of this paper is organized as following: the first chapter reviews the progress of the Distributed Database. The second chapter discusses the conventional query optimizing technologies. In third chapter, we introduce the prototype of DP-SQL and its major features. The forth chapter discusses the optimization of Distributed Database in details. Then, in the next chapter, the implementation details are discussed. The last chapter analyzes the algorithm performance, and then draws the conclusion.

引文

1． Abraham Silberschatz等著，杨冬青等译，2000，数据库系统概念，北京，机械工业出版社
    2．贾焰，王志英，韩伟红，李霖．分布式数据库技术，2000／07 北京，国防工业出版社，
    3． Hector Garicia-Molina等著，杨冬青等译，2001，数据库系统实现，北京，机械工业出版社
    4． C．J．Date著，孟小峰等译，2000，数据库系统导论，北京，机械工业出版社
    5．沈娟，赵雄芳，对分布式数据库发展方向的分析，计算机工程与科学1994年01期
    6．昌月楼，杨利，分布式数据库技术的现状和发展方向，计算机工程与科学1995年03期
    7．阳国贵，金辉，王怀民，分布式数据库技术回顾，计算机工程与科学1995年03期
    8．毛法尧，分布式数据库系统，小型微型计算机系统1995年08期
    9．王于同，并行数据库性能研究，计算机工程与应用1997年01期
    10．李霖，周兴铭，分布式数据库研究新趋势，计算机工程与科学1997年03期
    11．杨利，昌月楼等编著，2000，并行数据库技术，长沙，国防科技大学出版社
    12．阳富民，冯玉才，吴永英，吴恒山，一种分布式数据库管理系统体系结构，计算机工程与应用1995年02期
    13．周龙骧，1998，分布式数据库系统实现技术，北京，科学出版社
    14．邵佩英，2000，分布式数据库系统及其应用，北京，科学出版社
    15．郑振楣，于戈，郭敏，1998，分布式数据库，北京，科学出版社
    16．萨师煊，王珊，1992，数据库系统概论，北京，高等教育出版社
    17. M. Tamer Ozsu, Patr4ick Valduriez, Principles of Distributed Database Systems, Prentice-Hall, Inc.
    18. Michael Stillger3*, Guy Lohmanl, Volker Markll, Mokhtar Kandit2, LEO- DB2' s LEarning Optimizer, http://www. almaden. ibm. com/software/dm/SMART/
    19. Rakesh igrawal, Jerry Kiernan, Ramakrishnan Srikant, Yirong Xu, Hippocratic Databases, http://www. almaden. ibm. com/software/dm/SMART/
    20. Guy M. Lohman, Sam S. Lightstone, SMART: Making DB2 (More) Autonomic, http://www. almaden. ibm. com/software/dm/SMART/
    21. T. Urban, M. J. Franklin and L. Amsaleg, Costbased Query Scrambling for Initial Delays, SIGMOD, 1998, UFA98
    22. A. N. Swami, K. B. Schiefer, On the Estimation of Join Result Sizes, EDBT 1994, pp. 287-300, SS94


    23. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, T. G. Price, Access Path Selection in a Relational Database Management System, SIGMOD 1979, pp. 23-34
    24. V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita, Improved histograms for selectivity estimation of range predicates, SIGMOD. 1996, pp. 294-305
    25. Y. Poosala and Y. Ioannidis, Selectivity Estimation without the attribute value independence assumption, VLDB, 1997
    26. C. Lynch, Selectivity Estimation and Query Optimization in Large Databases with Highly Skewed Distributions of Column Values, VLDB, 1988
    27. N. Kabra and D. DeWitt, Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans, SIGMOD, 1998
    28. Y. E. Ioannidis and S. Christodoulakis. On the Propagation of Errors in the Size of Join Results, SIGMOD, 1991
    29. DB2 Universal Data Base V7 Administration Guide, IBM Corp., 2000
    30. P. Haas and A. Swami, Sampling-Based Selectivity Estimation for Joins - Using Augmented Frequent Value Statistics, IBM Research Report, 1993
    31. A. Van Gelder, Multiple Join Size Estimation by Virtual Domains, PODS, pp. 180-189.
    32. C. M. Chen and N. Roussopoulos, Adaptive Selectivity Estimation Using Query Feedback, SIGMOD, 1994
    33. R. Ahad, K.V.B. Rao, and D. McLeod, On Estimating the Cardinality of the Projection of a Database Relation, TODS 14(1), pp. 28-40.
    34. A. Aboulnaga and S. Chaudhuri, Self-tuning Histograms: Building Histograms Without Looking at Data, SIGMOD, 1999
    35. Panagos E., Biliris A., Jagadish H. V., Rastogi R., Client-based logging for high performance distributed architectures, Data Engineering, 1996. Proceedings of the Twelfth International Conference on, 1996, Page(s): 344-351
    36. Date, C. J and H. Darwen, A Guid to the SQL Standard, Fourth Edition, Addison-Wesley, Reading MA ,1997.
    37. David Bell and Jane Grimson: Distributed Database Systems. Reading, Mass.:Addison-Wesley(1992)
    38. C. J. Date: "What is a Distributed Database System?" ,in Relational Database Writings 1985-1989, Reading, Mass.:Addison-Wesley(1990)
    39. C. J. Date: "Distributed Database: A Closer Look." in C. J. Date and Hugh Darwen, Relational Database Writings 1989-1991, Reading, Mass.:Addison-Wesley(1986)

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700