Research and Implementation of Data Transfer and Caching Technology in a Data Grid Environment
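Abstract
The open Internet environment contains data resources that are enormous in volume, diverse in form, and stored in scattered locations; managing them effectively is a challenging problem. A data grid takes the massive, heterogeneous data resources of a wide-area environment as its processing target and, combined with high-performance computing facilities and large-scale storage devices, provides data storage, data transfer, data access, replica management, and high-performance data processing, giving users an infrastructure for data management and processing.
     Because a data grid is inherently distributed over a wide area, efficient and reliable data transfer across the wide-area network is a prerequisite for data sharing. To address this, we designed and implemented a grid data transfer system that provides parallel transfer, striped transfer, ordinary third-party transfer, indirect third-party transfer, and routed data transfer, and that supports the mainstream transfer protocols FTP, HTTP, and HTTPS. In terms of efficiency, feasibility, stability, reliability, and security, the system meets the transfer requirements of distributed, heterogeneous, massive data in a data grid and improves data-sharing performance.
     In addition, as computer technology has developed, the performance of CPUs and main memory has improved greatly. I/O devices have advanced more slowly, however, so disk performance has gradually become the bottleneck for overall system performance. In memory-intensive and I/O-intensive applications in particular, the large latency of disk access seriously degrades application performance. Data access in a data grid environment may therefore suffer a sharp performance drop because of disk latency. To address this, our research group proposed the RAM Grid (memory grid). Because files of different sizes show different access characteristics in a data grid environment, and in order to further improve the usability of the RAM Grid, we draw on data placement strategies from large-scale network storage systems and propose a file classification caching service based on the RAM Grid. While preserving the fairness and high availability of the RAM Grid, the service caches files in the RAM Grid system by class, extending the RAM Grid's usability. Simulation experiments based on real applications show that file classification caching effectively improves the performance of the existing RAM Grid.
     The grid data transfer module cuts a channel for the underlying data resources that reaches in every direction and lets data flow at high speed, so that data on different nodes of the data grid can be shared effectively. Caching data in the RAM Grid, in turn, effectively improves data access performance. The two thus improve data access performance in the data grid from different angles.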
On the Internet, massive data resources of different types are scattered across distributed locations. Managing this data is a challenging problem because of its high heterogeneity, its decentralization, and the complexity of sharing it. A Data Grid integrates high-performance computing facilities with massive storage equipment and provides many data management functions, such as data storage, data access, data transport, and replica management. It is regarded as a novel infrastructure, offering fairness, self-adaptability, and interactivity, for managing and sharing massive data.
     Because a data grid is inherently distributed over a wide area, developing a data transfer module that supports high-speed, reliable transfer is necessary but difficult. To meet this challenge, we designed a module that provides parallel data transfer, striped data transfer, third-party control of data transfer, reliable skip data transfer, and other functions. The module also supports FTP, HTTP, and HTTPS, the popular network data transfer protocols. The design and implementation emphasize stability, reliability, and security, with the goal of offering fast and reliable data transfer.
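The idea behind parallel transfer can be illustrated with a minimal sketch. It is not the thesis's implementation: it only assumes that a file can be read by byte range, as FTP restart offsets and HTTP range requests allow. The names `split_ranges`, `parallel_fetch`, and `fetch_range` are hypothetical, and an in-memory buffer stands in for the remote server.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, streams):
    """Split [0, size) into up to `streams` contiguous, inclusive byte ranges."""
    if size < 0 or streams < 1:
        raise ValueError("size must be >= 0 and streams must be >= 1")
    base, extra = divmod(size, streams)
    ranges, start = [], 0
    for i in range(streams):
        # The first `extra` ranges carry one additional byte each.
        length = base + (1 if i < extra else 0)
        if length == 0:
            break  # more streams than bytes; the remaining ranges would be empty
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def parallel_fetch(fetch_range, size, streams=4):
    """Fetch all ranges concurrently and reassemble the parts in order.

    `fetch_range(start, end)` is a hypothetical callable that returns the
    bytes in the inclusive range [start, end] from the remote source.
    """
    ranges = split_ranges(size, streams)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        # Executor.map yields results in the order of the input ranges.
        parts = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(parts)


# Stand-in for a remote file: an in-memory buffer instead of a real server.
data = bytes(range(256)) * 10
copy = parallel_fetch(lambda s, e: data[s:e + 1], len(data), streams=4)
```

With a real server, each range would travel over its own connection. Striped transfer differs in that the ranges would be served by different storage hosts rather than by one.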
     With the development of computer technology, CPUs and main memory have improved greatly. The magnetic disk, however, has become the performance bottleneck of the whole computer system because I/O devices have developed relatively slowly. Data-intensive applications with large, random disk accesses, such as web servers and DBMSs, access the disk frequently, which can degrade application performance. RAM Grid was proposed to address this performance issue in data-intensive applications by sharing and utilizing the large memory resources available across the network. The existing RAM Grid greatly improves system performance, but it still has shortcomings, such as load balance. Files of different sizes also show different access characteristics, and we incorporate this observation into our design. Data placement policy for large-scale network storage is an active research topic, and we introduce it into our RAM Grid design. Our idea is to cache different files (pages or data blocks) on designated remote caching nodes. This design benefits the system's load balance, scalability, and performance, and our experimental results show a substantial performance improvement for RAM Grid.
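The classification idea can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the size classes, the node names, or the placement rule, so the thresholds, the `NODE_POOLS` mapping, and the hash-based choice within a pool are all hypothetical.

```python
import hashlib

# Hypothetical size classes: (name, inclusive upper bound in bytes).
SIZE_CLASSES = [
    ("small", 64 * 1024),
    ("medium", 4 * 1024 * 1024),
    ("large", float("inf")),
]

# Hypothetical pools of remote caching nodes, one pool per size class.
NODE_POOLS = {
    "small": ["node-a", "node-b"],
    "medium": ["node-c", "node-d"],
    "large": ["node-e"],
}


def classify(size_bytes):
    """Return the name of the first size class whose bound covers the file."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    for name, upper in SIZE_CLASSES:
        if size_bytes <= upper:
            return name


def pick_node(path, size_bytes):
    """Choose a caching node for a file, deterministically, within its class pool."""
    pool = NODE_POOLS[classify(size_bytes)]
    # A stable digest keeps the choice identical across processes and runs.
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]
```

Keeping each class in its own pool means large files cannot evict many small ones from the same node, which is one way such a scheme could help load balance.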
     The data transfer module is designed for fast and reliable data transfer, which makes data sharing in a data grid environment easier and more effective, especially for large volumes of data. RAM Grid is a good choice for caching data swapped out of the local cache (memory): fetching data blocks from remote caching nodes is preferable to reading them back from disk. Both technologies help improve data access performance.