面向海量数据的副本定位与副本放置技术研究
详细信息    本馆镜像全文|  推荐本文 |  |   获取CNKI官网全文
摘要
随着数字化革命和网络技术的飞速发展,Internet上的数据量呈现出指数增长的趋势,大量的数据密集型应用会产生海量的数据,如何有效存储管理Internet上的海量数据,已经成为研究的热点问题之一。在海量数据管理中采用数据复制技术能够减少访问延迟,提高数据可用性,降低结点失效。在研究海量数据及现有的海量数据管理系统的基础上,对海量数据的特点和海量数据管理系统进行了总结和分析。从副本定位和副本放置两个方面对海量数据管理中的数据复制技术进行了深入研究。
     针对结构化P2P系统中常用的DHT定位技术的重叠网拓扑与底层网络拓扑不一致以及存在网络热点问题等不足之处,提出了基于几何空间划分的层次式DHT定位方法(HLSP)。HLSP采用基于坐标的网络距离预测技术GNP,构建基于地理邻近性的重叠网拓扑;构造基于地理范围的层次式DHT,把数据对象的位置信息存放到网络中具有某种逻辑层次关系的多个结点上;在此基础上对资源进行发布、撤消和查找,采用一种贪婪算法向前转发消息,并对结点加入和离开时导致的局部结点邻居列表变化进行动态维护。模拟实验结果表明,HLSP方法是一种有效的DHT定位方法,基于地理邻近性的重叠网拓扑解决了重叠网拓扑与底层网络拓扑不一致的问题,基于地理范围的层次式DHT能够消除网络热点并且实现高效查找。
     针对满足延迟限制的副本放置问题,提出了基于延迟估算的动态副本放置算法(DLDRP),DLDRP中副本总是被放置在满足副本请求结点延迟限制范围内的结点上。DLDRP首先设计了一个抽象的几何模型,通过该模型来确定候选副本结点所处的区域,从而将副本放置问题转变成在几何空间中求解满足某些条件限制的区域位置的问题;然后根据新副本结点上的负载来源情况动态调整副本放置:在某个结点上新增副本或者合并多个旧副本。模拟实验结果表明,DLDRP能够满足客户端延迟限制,减少数据访问响应时间,实现副本个数最小化,副本分布均匀且结点负载平衡。
With the high-speed development of digital revolution and network technology, the scale of data sets in Internet presents a trend of exponential growth. A large number of data-intensive applications often produce massive data, which makes it become a research hot that how to store and manage massive data effectively in Internet. Data replication can reduce access latency, improve data availability and depress node failure rate in massive data management. In this paper, based on the research of massive data and existing massive data management systems, the features are summarized and analyzed, and replica location and replica placement are studied.
     One of the shortcomings for popular DHT location technologies in structured P2P systems is topological disparity between overlay and underlying networks, another is the problem of network hotspot. A hierarchical DHT location method based on geometrical space partition called HLSP is proposed according to that. HLSP utilizes Global Network Positioning (GNP) to construct overlay networks on the basis of "geographical proximity". HLSP designs hierarchical DHT based on geographical scope so that location information of data objects can be distributed on several nodes which meet some logical hierarchy in network. HLSP offers several operations to data objects, such as object publish, withdraw and query operation. The actual packet forwarding in HLSP is based on a greedy algorithm guided by a destination coordinate stamped in the packet header. HLSP supports dynamic maintenance of local node-neighbor lists altered with node joining or leaving. The simulation results show that HLSP is an effective DHT location method, and overlay topology based on geographical proximity can resolve topological disparity between overlay and underlying networks, and hierarchical DHT based on geographical scope can avoid network hotspots and achieve high efficient searches.
     When it comes to replica placement problem under latency constraints, a dynamic latency driven replica placement algorithm called DLDRP is proposed, in which replicas are always placed on the nodes that are under latency constraints. In DLDRP, it first gives an abstract geometric model, following which the regions of candidate replicas are determined. In this way, the problem is transformed into how to get the region that meets some restrictive conditions in geometric space. Afterwards, DLDRP adjusts the placements of replicas dynamically according to the load source of new replica, that is, a new replica is added in a certain node or several existing replicas are merged. The simulation results show that DLDRP can meet the delay bound of client effectively, reduce access latency, minimize the number of replicas and achieve a uniform distribution of replicas and load balance of nodes.
引文
[1]. J Gray, P Helland, P O'Neil, et al. The dangers of replication and a solution. ACM SIGMOD Record, June 1996,25(2): 173 - 182.
    [2]. A L Chervenak, I Foster, C Kesselman, et al. Data management and transfer in high performance computational grid environments. Parallel Computing Journal, 2002, 28(5): 749-771.
    [3]. J. Kubiatowicz, D. Bindel, Y. Chen, et al. OceanStore: an architecture for global-scale persistent storage. ACM SIGARCH Computer Architecture News, 2000,28(5): 190-201.
    [4]. Ben Segal. Grid Computing: The European Data Project. In: IEEE Nuclear Science Symposium and Medical Imaging Conference, Lyon, 15-20 October 2000.
    [5]. Frank Dabek, M. Frans Kaashoek, David Karger, et al. Wide-area cooperative storage with CFS. ACM SOSP 2001, Banff, October 2001.
    [6]. A. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. Proc. of. ACM SOSP'01, Banff, Canada, Oct. 2001.
    [7]. W. H. Bell, D. G. Cameron, L. Capozza, et al. Simulation of Dynamic Grid Replication Strategies in OptorSim.Int. Journal of High Performance Computing Applications, 2003,17(4): 46 - 57.
    [8]. M. Carman, F. Zini, L. Serafini, et al. Towards an Economy-Based Optimisation of File Access and Replication on a Data Grid. In Int.Workshop on Agent based Cluster and Grid Computing at CCGrid 2002, Berlin, Germany,May 2002, IEEE-CS Press, pages 340 - 345.
    [9]. DOCT Technical Report Task C3: Data Replication. San Diego Supercomputer Center.
    [10]. Moore, R., A. Rajasekar. Common Consistency Requirements for Data Grids, Digital Libraries, and Persistent Archives. Grid Procotol Architecture Research Group draft, Global Grid Forum, April 2003.
    
    [11]. PPDG, The Particle Physics Data Grid. http://www.cacr.caltech.edu/ppdg/, 1999.
    [12]. ASCI, Accelarated Strategic Computing Initiative. http://www.llnl.gov/asci/, 1999.
    [13]. IPG, Information Power Grid, A NASA Project, http://www.ipg.nasa.gov/, 2000.
    [14]. GriPhyN, The Grid Physics Network. http://www.griphyn.Org/proj-descl.0.html, 2000.
    [15]. Richard W.Watson and Robert A. Coyne. The parallel I/O architecture of the High-Performance Storage System (HPSS). In IEEE MSS Symposium, 1995.
    
    [16]. B. Tierney, W. Johnston, J. Lee, et al. An overview of the distributed parallel storage server (DPSS). Available at http:// www-didc.lbl.gov/ DPSS/ Overview/ DPSS.handout.fm.html.
    [17]. B. Fuller and I. Richer. The MAGIC Project: From Vision to Reality. IEEE Network, May, 1996, Vol. 10, no. 3. http://www.magic.net/.
    [18]. Gnutella Protocol Specification, version 0.4. http://www.clip2.com/GnutellaProtocol04.pdf, 2001.
    [19]. I. Clarke, O.Sandberg, B.Wiley, and T. Hong. Freenet: A Distributed Anonymous Information Storage and Retrieval System. Proc. of ICSI Workshop on Design Issues in Anonymity and Unobservability, Berkeley, California, June 2000.
    [20]. FastTrack Product Description. Http://www.fasttrack.nu/index_int.html, 2002
    [21]. S. Ratnasamy, S. Shenker, and I. Stoica. Routing Algorithms for DHTs: Some Open Questions. Proc. of First International Workshop on Peer-to-Peer Systems (TPTPS'02), Cambridge, USA, 2002.
    [22]. Hari Balakrishnan, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica. Looking up Data in P2P Systems. Communications of the ACM, Vol. 46, No. 2, pp. 43 - 48,2003.
    [23]. Ben Y. Zhao, Ling Huang, Jeremy Stribling, et al. Tapestry: A Resilient Global-scale Overlay for Service Deployment. IEEE Journal on Selected Areas in Communications (JSAC), Vol. 22, No. 1, pp. 41 - 53,2004.
    [24]. A. Rowstron and P. Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-scale Peer-to-Peer Systems. Proc. of IFIP/ACM Middleware'2001, pp. 329 - 350, Heidelberg, Germany, 2001.
    [25]. Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, et al. Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications. IEEE/ACM Transactions on Networking, Vol. 11, No. 1, pp. 17 - 32, February 2003.
    [26]. S. Ratnasamy, P. Francis, M. Handley, et al. A Scalable Content-Addressable Network. Proc. of ACM SIGCOMM' 2001, New York, USA, pp. 149 - 160, 2001.
    [27]. A. Rowstron and P. Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-scale Peer-to-Peer Systems. Proc. of IFIP/ACM Middleware'2001, pp. 329 - 350, Heidelberg, Germany, 2001.
    [28]. Barbosa, V C . An Introduction to Distributed Algorithms. The MIT Press, 1996.
    [29]. Pitoura E, Bhargava B. Maintaining Consistency of Data in Mobile Distributed Environments. Proc. of IEEE ICDCS'95 ,1995 , 404 - 413.
    [30]. Bong-Jun Ko and Dan Rubenstein. Distributed, self-stabilizing placement of replicated resources in emerging networks. In: 11th IEEE International Conference on Network Protocols (ICNP), 2003.
    [31]. Bong-Jun Ko and Dan Rubenstein. A greedy approach to replicated content placement using graph coloring. In SPIE ITCom Conference on Scalability and Traffic Control in IP Networks II, July 2002.
    [32]. Y. Chen, R. H. Katz, and J. D. Kubiatowicz. Dynamic replica placement for scalable content delivery. Proc. of IPTPS, March 2002.
    [33]. Y. Chen, R. H. Katz, and J. D. Kubiatowicz. Scan: A dynamic, scalable, and efficient content distribution network. Proc.of the International Conference on Pervasive Computing (Pervasive 2002), (Zurich, Switzerland), August 2002.
    [34]. M.R. Korupolu, and M. Dahlin. Coordinated Placement and Replacement for Large-Scale Distributed Caches. Proc.of IEEE Workshop on Internet Applications. pages 62-71, July 1999.
    [35]. M.R. Korupolu, G. Plaxton, and R. Rajaraman. Placement Algorithms for Hierarchical Cooperative Caching. Journal of Algorithms,38(l):260 - 302, January 2001.
    [36]. K. Ranganathan, A. Iamnitchi, and I. Foster. Improving Data Availability through Dynamic Model-Driven Replication in Large Peer-to-Peer Communities. Proc.of Global and Peer-to-Peer Computing on Large Scale Distributed Systems Workshop, Berlin, Germany, May 2002.
    [37]. P. Francis, S. Jamin, V. Paxson, et al. An architecture for a global Internet host distance estimation service. Proc.of IEEE INFOCOM '99, New York, NY, Mar. 1999.
    [38]. P. Francis, S. Jamin, et al. IDMaps: A Global Internet Host Distance Estimation Service. IEEE/ACM ToN, October 2001.
    [39]. T.S. Eugene Ng and H. Zhang. Global Network Positioning: A New Approach to Network Distance Prediction. CCR, 2001.
    [40]. T.S. Eugene Ng and H. Zhang. Predicting Internet Network Distance with Coordinates-Based Approaches. In 21st IEEE INFOCOM, June 2002.
    [41]. M. Szymaniak, G. Pierre, and M. van Steen. Scalable cooperative latency estimation, In Tenth International Conference on Parallel and Distributed Systems (ICPADS), Newport Beach, CA, USA, July 2004.
    [42]. J.A. Nelder and R. Mead. A simplex method for function minimization, Computer Journal, vol. 7, pp. 308 - 313,1965.
    [43]. Yinzhe Yu, Sanghwan Lee, Zhi-Li Zhang. Leopard: A Locality-Aware Peer-To-Peer System with No Hot Spot (Extended Abstract). In the 24th Annual Conference of IEEE Communications Society (INFOCOM), Miami, Florida, March 2005.
    
    
    [44]. Global Nework Positioning. http://www-2.cs.cmu.edu/eugeneng/research/gnp/.
    [45]. M. karlsson, C. Karamanolis, M .Mahalingam. A Framework for Evaluation Replica Placement Algorithms. Technical Report HPL-2002, HP Laboratories, July 2002.
    [46]. W. H. Bell, D. G. Cameron, R. Carvajal-Schiaffmo, et al. Evaluation of an Economy-Based File Replication Strategy for a Data Grid. Proc.of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003 (CCGrid 2003), Tokyo, Japan, May 2003.
    
    [47]. M. Carman, F. Zini, L. Serafini, et al. Towards an Economy-Based Optimisation of File Access and Replication on a Data Grid, Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'02).
    
    [48]. C. Ernemann, V. Hamscher, R. Yahyapour. Economic Scheduling in Grid Computing. In Job Scheduling Strategies for Parallel Processing, Edinburgh, Scotland, July 2002. Springer. LNCS 2537.
    
    [49]. C. A. Waldspurger, T. Hogg, B. A. Huberman, et al. A Distributed Computational Economy. IEEE Transactions on Software Engineering, 18(2), 1992.
    
    [50]. N. Nisan, S. London, O. Regev, and N. Camiel. Globally Distributed Computation over the Internet - the POPCORN Project. Proc.of the Int. Conference on Distributed Computing Systems (ICDCS'98), Amsterdam, The Netherlands, May 1998. IEEE CS-Press.

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700