Research on Optimization Techniques for High-Performance Computing Cluster File Systems

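English title: The Research of Optimization Technologies for the File System of High Performance Computing Cluster
Author: Zhang Yusen (张钰森)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords (Chinese): high-performance computing (高性能计算); distributed storage technology (分布式存储技术); Lustre file system (Lustre文件系统); optimization techniques (优化技术)
Keywords (English): HPC; Distributed Storage Technique; Lustre Filesystem; Optimization Technology
Degree year: 2010
Advisor: Wu Qingbo (吴庆波)
Discipline code: 081202
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Thesis submission date: 2010-11-01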

Abstract

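With the rapid development of high-performance computing (HPC) technology, a growing number of fields use it to solve practical problems in production and research, including numerical weather simulation and forecasting, earthquake prediction, bioinformatics, environmental science, space science, and finance. The level of HPC development has gradually become an important indicator of a country's comprehensive national strength and international competitiveness. When building an HPC system, the performance of the storage system is a major factor affecting its computational performance. Studying and optimizing HPC file systems is therefore of considerable importance.
     This thesis studies in depth the storage principles, storage architecture, and related storage technologies of HPC file systems. On that basis, it analyzes the shortcomings that appear in practical use and optimizes the HPC file system to address them, with the goal of improving the I/O performance of the storage system.
     For storage resource allocation, this thesis introduces an economic model into the HPC file system. The file system is modeled using economic theory, and algorithms built on that model allocate its storage resources. The optimized file system can adjust its storage resource allocation strategy dynamically according to the application scenario, which both simplifies file system tuning and improves system resource utilization.
     For data access control, this thesis proposes a state-aware data access control method. The key to the method is that the client can sense the load state of the whole system and dynamically adjust its request-sending strategy according to that load information. The method can avoid congestion to some extent and keeps the file system operating at its optimal load, so that it delivers its full I/O performance.
     For metadata access control, a distributed metadata storage structure is an effective way to remove the single metadata server bottleneck. This thesis optimizes the design of that structure and, on that basis, optimizes the file system's metadata access strategy. To improve the response speed of the metadata server, metadata operations are appropriately relaxed. The optimized file system better meets the demands that HPC places on the storage system.
     Finally, based on the work above, this thesis designs a prototype system, SA-Lustre, and implements it on the Lustre simulator. Tests of the SA-Lustre prototype show that the optimized file system achieves substantial improvements in I/O performance, concurrent I/O bandwidth, and throughput.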
With the rapid development of HPC (High Performance Computing) technology, more and more fields use it to solve practical problems, including weather forecasting, earthquake prediction, bioinformatics, environmental science, space science, and finance. The level of HPC development has gradually become an important indicator of a country's comprehensive national strength and international competitiveness. When building HPC systems, the performance of the storage system is one of the main factors affecting computational performance. It is therefore necessary to study the storage system of HPC systems and to optimize it.
     In this paper, the storage principles, storage system architecture, and related storage technologies are studied in depth. On this basis, the shortcomings that arise in practical applications are analyzed. In response to these shortcomings, the HPC file system is optimized to improve the I/O performance of the storage system.
     Regarding storage resource allocation strategies, an economic model is introduced into the HPC system. The file system is modeled with the help of economic theory, and algorithms for storage resource allocation are designed on the basis of this model. The optimized file system can adjust its storage resource allocation strategy dynamically according to the scenario. The optimization simplifies file system tuning and also improves the utilization of system resources.
     Regarding data access control, this paper presents a data access control technique based on the state of the system. The key point of this method is that the client can sense the load state of the server and adjust its request-sending strategy accordingly. This method allows the file system to avoid congestion and to operate at the optimal load, so that it can deliver its full I/O performance.
     Regarding metadata access, this paper introduces MDCache (Metadata Cache) into the HPC file system and, on this basis, optimizes the metadata access strategy. In addition, metadata operations are relaxed to reduce the response time of the MDS (Metadata Server). The optimized file system can better meet the growing demands of HPC.
     Finally, the prototype system SA-Lustre is designed based on the optimization techniques above and implemented with the help of the Lustre Simulator. Comparing the test results of SA-Lustre and Lustre shows that I/O performance, concurrent I/O bandwidth, and throughput are greatly improved after the optimization.
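
The state-aware access control idea summarized in the abstract (a client senses server load and adjusts how it sends requests) can be illustrated with a minimal sketch. The abstract does not give the actual algorithm used in SA-Lustre, so the following substitutes a generic additive-increase, multiplicative-decrease (AIMD) rule. The class name, the window limits, and the 0.8 target load are all hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class StateAwareClient:
    """Hypothetical client that sizes its in-flight request window from server load."""
    window: int = 8            # current max requests in flight (assumed starting value)
    min_window: int = 1        # never drop below one outstanding request
    max_window: int = 64       # assumed upper bound on outstanding requests
    target_load: float = 0.8   # assumed "optimal" server load, as a fraction of capacity

    def on_load_report(self, load: float) -> int:
        """Update the window from a reported server load and return the new window."""
        if load > self.target_load:
            # Server is past the target load: back off multiplicatively.
            self.window = max(self.min_window, self.window // 2)
        else:
            # Server has headroom: probe upward by one request.
            self.window = min(self.max_window, self.window + 1)
        return self.window


def simulate(loads):
    """Feed a sequence of load reports to a fresh client and collect the windows."""
    client = StateAwareClient()
    return [client.on_load_report(load) for load in loads]


if __name__ == "__main__":
    print(simulate([0.5, 0.6, 0.95, 0.9, 0.3]))  # prints [9, 10, 5, 2, 3]
```

In this sketch the client widens its window slowly while the server reports headroom and halves it as soon as the reported load exceeds the target, which is one simple way to keep a server near a chosen load level without central coordination.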

