Research and Application of Distributed Storage Technology for Massive Data (海量数据分布式存储技术的研究与应用)

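English title: Research on Distributed Storage Technology Based on Mass Data
Author: 李存琛 (Cunchen Li)
Thesis level: Master's
Discipline: Computer Science and Technology
Keywords (Chinese): 海量数据存储 (massive data storage) ; 分布式数据库 (distributed database) ; MPP架构 (MPP architecture) ; 并行处理 (parallel processing)
Keywords (English): Mass data storage ; distributed database system ; MPP architecture ; parallel processing
Degree year: 2013
Supervisor: 杨俊 (Jun Yang)
Discipline code: 0812
Degree-granting institution: 北京邮电大学 (Beijing University of Posts and Telecommunications)
Thesis submission date: 2012-12-25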

Abstract

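In recent years, with the rapid development of information technology, Internet services have kept expanding, user numbers have kept rising, storage space has kept growing, and data volumes have increased at a previously unimaginable rate. Storage capacity, however, tends to be inversely related to storage performance: traditional databases struggle under massive data and expose problems such as low concurrency, poor scalability, and low efficiency. Massive data storage has therefore become a major research focus, and parallel-processing distributed databases based on the MPP (Massively Parallel Processing) architecture are one research direction. This thesis presents an exploratory study of massive data storage technology. The topic is drawn from a National Key Technology Support Program project of the "Eleventh Five-Year Plan", "Research on Key Technologies of a Secure and Trusted Telecom-Grade Operation Support Architecture for Reproductive Health Services". The work mainly addresses the access performance problems caused by the project's ever-growing data volume and provides the project with storage technology offering high concurrency, high availability, and high scalability.
     The research work of this thesis covers the following aspects: (1) Drawing on the theory of massive data storage technology, relational and NoSQL data models, distributed database storage, and MPP-based parallel processing, it summarizes the available schemes for massive data storage and the new technologies they apply. (2) It analyzes the characteristics of massive data storage technology, compares the advantages and disadvantages of the distributed massive data storage technologies commonly used in China and abroad, designs a distributed storage model for massive data, and describes in detail how the SQL parsing module, the data sharding module, the parallel query module, and the result module are implemented. (3) Building on the storage model design and the parallel query and storage techniques, it presents the independently developed "DB Mapping" system, an MPP-based storage architecture that delivers a massive data storage solution with good scalability and the advantages of large-scale parallel processing.
     The main contribution of the thesis is a massive data storage method based on MPP parallel processing. It covers the whole data path from the client's request to data persistence and incorporates the Map/Reduce idea, distributing work to the individual data nodes to achieve high scalability, high availability, and high concurrency. The feasibility of this storage approach was verified by simulation tests on a set of distributed data nodes.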

Abstract (author's English version)

In recent years, with the rapid development of information technology, data on the Internet has been growing at an incredible speed. Internet business, the number of Internet users, and the amount of online storage all continue to increase. However, storage capacity tends to be inversely related to storage performance. The traditional centralized database can hardly cope with such huge amounts of data and fails to meet the growing demands for abundant information and high system performance. Mass data storage has therefore become a key research topic, and the parallel-processing distributed database based on the MPP (Massively Parallel Processing) architecture is one of the related research directions. Based on the project "Research on Key Technologies of a Secure and Trusted Telecom-Grade Operation Support Architecture for Reproductive Health Services", this thesis focuses on mass data storage technology. It aims to provide a storage solution with high concurrency, high availability, and high scalability.
     The study addresses three points: 1. It summarizes mass data storage schemes and the new technologies they apply, based on the theory of massive data storage technology, relational data and the NoSQL data model, distributed database storage, and the MPP-based parallel processing mode. 2. It analyzes the characteristics of mass data storage technology, compares the advantages and disadvantages of the distributed mass data storage technologies commonly used at home and abroad, and designs a distributed storage model for mass data. The system is composed of four modules: the SQL parsing module, the sharding module, the parallel query module, and the result summarizing module. 3. Combining existing distributed database design methods, it presents the independently developed "DB Mapping" storage system based on the MPP architecture, which offers good scalability and the advantages of highly efficient processing.
     The primary contributions of this thesis are as follows. We propose a mass data storage solution based on MPP parallel processing and describe the complete data storage process from the client request to the database. By integrating the MapReduce idea, the system distributes work across the data nodes and satisfies the demands of high scalability, high availability, and high concurrency. The feasibility of this solution was verified by a simulation test.
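
The sharding and parallel query steps named in the abstract can be illustrated with a minimal Python sketch. It is not the DB Mapping implementation, which the abstract does not describe in detail. The names `DataNode`, `Router`, `shard_index`, the `user_id` sharding key, and the hash-modulo placement rule are all assumptions made for illustration. In-memory lists stand in for real database nodes, and the SQL parsing module is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


class DataNode:
    """Hypothetical stand-in for one database node; rows live in a list."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def insert(self, row):
        self.rows.append(row)

    def query(self, predicate):
        # "Map" step: each node filters only the rows of its own shard.
        return [row for row in self.rows if predicate(row)]


def shard_index(key, node_count):
    """Choose a node by hashing the sharding key (assumed hash-modulo rule)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


class Router:
    """Hypothetical front end: routes writes and fans queries out to all nodes."""

    def __init__(self, node_count):
        self.nodes = [DataNode(f"node-{i}") for i in range(node_count)]

    def insert(self, row, key_field="user_id"):
        # Sharding step: the key alone decides which node stores the row.
        index = shard_index(row[key_field], len(self.nodes))
        self.nodes[index].insert(row)

    def parallel_query(self, predicate, sort_key=None):
        # Parallel query step: every node evaluates the predicate concurrently.
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            partials = list(pool.map(lambda node: node.query(predicate), self.nodes))
        # "Reduce" step: merge the partial results and optionally sort them.
        merged = [row for part in partials for row in part]
        if sort_key is not None:
            merged.sort(key=sort_key)
        return merged


if __name__ == "__main__":
    router = Router(node_count=4)
    for i in range(100):
        router.insert({"user_id": i, "age": i % 50})
    hits = router.parallel_query(lambda r: r["age"] >= 40,
                                 sort_key=lambda r: r["user_id"])
    print(len(hits), "matching rows across", len(router.nodes), "nodes")
```

A hash-modulo rule keeps the sketch short, but it would force most rows to move whenever a node is added; a consistent-hashing scheme is the usual alternative when nodes must be added or removed at run time.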

