De-Frag: an efficient scheme to improve deduplication performance via reducing data placement de-linearization


Authors: Yujuan Tan (1)
Zhichao Yan (2)
Dan Feng (3)
Xubin He (4)
Qiang Zou (5)
Lei Yang (1)

1. College of Computer Science ; Chongqing University ; Chongqing ; China
2. Department of Computer Science and Engineering ; University of Nebraska-Lincoln ; Lincoln ; NE ; USA
3. School of Computer Science and Technology ; Huazhong University of Science and Technology ; Wuhan ; Hubei ; China
4. Department of Electrical and Computer Engineering ; Virginia Commonwealth University ; Richmond ; VA ; USA
5. School of Computer ; Southwest University ; Chongqing ; China
Keywords: Data deduplication ; Data placement de-linearization ; Spatial locality
Journal: Cluster Computing
Publication year: 2015
Publication date: March 2015
Year: 2015
Volume: 18
Issue: 1
Pages: 79-92
Full-text size: 1,301 KB
Journal category: Computer Science
Journal subjects: Processor Architectures
Operating Systems
Computer Communication Networks
Publisher: Springer Netherlands
ISSN: 1573-7543

Abstract

Data deduplication has become a commodity in large-scale storage systems, especially in data backup and archival systems. However, because it removes redundant data, deduplication de-linearizes data placement and forces the chunks of a single data object to be scattered across multiple separate units. In our preliminary study, we found that this de-linearization compromises the spatial locality that some deduplication approaches rely on to improve data read performance, deduplication throughput, and deduplication efficiency. As a result, deduplication performance suffers significantly and some approaches become less effective. In this paper, we first analyze the negative effect of data placement de-linearization on deduplication performance, and then propose an effective approach called De-Frag to reduce it. The key idea of De-Frag is to write some redundant data to disk rather than remove it. De-Frag quantifies the spatial locality of each chunk group by its spatial locality level (SPL for short) and writes the group's redundant chunks to disk when the SPL value is smaller than a preset threshold, thereby reducing the de-linearization of data placement and enhancing spatial locality. Experimental results driven by real-world datasets show that De-Frag effectively enhances data spatial locality and improves deduplication throughput, deduplication efficiency, and data read performance, at the cost of slightly lower compression ratios.
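
The threshold rule described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the SPL formula, so the definition used here (the largest share of a chunk group that already sits together in one stored container) is an assumption made for illustration only, as are the names `spatial_locality_level`, `chunks_to_write`, the fingerprint-to-container `index`, and the threshold value. It is not the authors' implementation.

```python
from collections import Counter
from typing import Dict, List


def spatial_locality_level(group: List[str], index: Dict[str, int]) -> float:
    """Assumed SPL (not the paper's formula): the largest fraction of the
    group's chunks that are already stored together in a single container."""
    if not group:
        return 0.0
    # Count, per container, how many of the group's chunks it already holds.
    containers = Counter(index[fp] for fp in group if fp in index)
    if not containers:
        return 0.0
    return max(containers.values()) / len(group)


def chunks_to_write(group: List[str], index: Dict[str, int],
                    threshold: float) -> List[str]:
    """Return the fingerprints of the chunks that should be written to disk.

    If the group contains duplicates but its SPL is below the threshold,
    the duplicates are written again with the new chunks so the group stays
    contiguous; otherwise only the new chunks are written."""
    has_duplicates = any(fp in index for fp in group)
    if has_duplicates and spatial_locality_level(group, index) < threshold:
        return list(group)
    return [fp for fp in group if fp not in index]


# Hypothetical index: fingerprint -> container id of the stored copy.
index = {"a": 1, "b": 1, "c": 1, "d": 2}
well_placed = ["a", "b", "c", "x"]   # three of four chunks share container 1
scattered = ["a", "d", "y", "z"]     # duplicates spread over two containers

print(chunks_to_write(well_placed, index, threshold=0.5))
print(chunks_to_write(scattered, index, threshold=0.5))
```

Under this assumed definition, the first group is deduplicated as usual and only its new chunk is written, while the second group falls below the threshold and is written in full, trading some compression for contiguous placement.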
