Abstract
This paper builds IHDFS, an intelligent big data storage system, on top of the existing open-source distributed file storage system HDFS. The system proposes and implements four modules: big data de-duplication, big data placement, big data intelligent migration, and big data encoding. Together they form an intelligent distributed file storage system that improves user access efficiency and saves cluster storage space. Experimental results show that the de-duplication module saves storage space effectively; the placement module assigns uploaded files to an appropriate storage layer, which doubles the upload speed; the intelligent migration module raises the hit rate of users' files on the higher storage layers, which improves the efficiency of obtaining data; and the encoding module saves about one third of the cluster's original storage space.