Research on Data Warehouse Architecture Design and Cache Management Strategies
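Abstract
In recent years, research on and application of business intelligence and data warehouse technology have drawn wide attention from researchers, developers, and users, and the field has become one of the fastest-growing new technologies in computer applications. As informatization in China advances, domestic demand for business intelligence and data warehouse solutions is becoming increasingly urgent.
     Many leading data warehouse vendors have emerged worldwide, but their commercial products are expensive and therefore unsuitable for most small and medium-sized enterprises, public institutions, and government agencies in China, and their closed source code hinders research. Meanwhile, open source projects in the data warehouse field have developed rapidly, and excellent open source products have appeared for ETL, OLAP, and data mining. This thesis therefore studies the architecture design of a data warehouse system built on open source products.
     A data warehouse differs from a traditional database in its analytical processing and very large capacity, so building a high-performance data warehouse system has become an active research topic in the field. Performance optimization of a data warehouse system involves schema design, parallel processing, cache management, and other aspects; this thesis focuses on cache management strategies.
     The thesis first introduces the concept of the data warehouse and its related technologies, and discusses the state of development of commercial and open source products in the field. It then proposes an open source data warehouse architecture based on a multi-tier J2EE architecture: the data layer is the relational database MySQL, data acquisition is built on CloverETL, the OLAP engine on Mondrian, the OLAP front end on JPivot, and metadata management on the Eclipse plug-in Mondrian Schema Editor Plugin. Because the solution does not require EJB support, Tomcat is used as the J2EE server.
     The thesis analyzes the source code of the open source tools Mondrian, JPivot, and CloverETL, examines general cache management strategies, and focuses on the characteristics of cache management in data warehouse systems. It implements a cache management strategy based on the LRU replacement algorithm and proposes an improved scheme based on a prefetching (read-ahead) algorithm.
     The Hangzhou labor market data warehouse platform designed on this architecture is running successfully, providing decision support and report queries for the leadership and staff at all levels of the Hangzhou Employment Service Bureau.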
In recent years, research on and applications of Business Intelligence and Data Warehouse technology have attracted growing attention from researchers, developers, and users, and the field has become one of the most rapidly developing new technologies in computer applications. With the progress of informatization in China, domestic demand for Data Warehouse solutions has risen sharply.

Several leading providers of Data Warehouse solutions have emerged. However, these commercial solutions are expensive, so they are generally not a good choice for small and medium-sized enterprises or government departments. Furthermore, research on commercial products is difficult because their source code is not open. Meanwhile, open source projects in the data warehouse field have developed rapidly, and successful open source products have appeared for ETL, OLAP, and Data Mining. This paper therefore studies the construction of a Data Warehouse based on open source solutions.

A Data Warehouse system differs from an ordinary database system in its analytical capability and huge capacity, so how to build a high-performance Data Warehouse system has become an active research topic. Many factors affect the performance of a Data Warehouse system, such as schema design, parallel processing, and cache management. This paper focuses on cache management strategies.

The paper first introduces the concept of the data warehouse and its related technologies, and then discusses the status of commercial and open source products in the field. It then presents an open source data warehouse solution based on a multi-layer J2EE architecture. The data layer uses MySQL, the ETL system is built on CloverETL, the OLAP engine on Mondrian, the OLAP presentation layer on JPivot, and the metadata management system on the Eclipse plug-in Mondrian Schema Editor Plugin. Because the solution does not use EJB technology, Tomcat is selected as the J2EE server.

Further, the paper gives a source code analysis of the core components, namely Mondrian, JPivot, and CloverETL. It then analyzes common cache management strategies, with an emphasis on those of Data Warehouse systems. Next, it presents and implements a cache management strategy based on the LRU replacement algorithm. Finally, it gives an improved solution that combines a pre-paging (prefetch) scheduling algorithm with an improved LRU replacement algorithm. The last part introduces the implementation of the data warehouse system for the Hangzhou labor market.
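
The abstract names an LRU replacement strategy and a prefetch-based improvement without giving details, so the following is only a minimal sketch of the general technique, not the thesis's implementation. It assumes a generic in-memory cache keyed by some query or segment identifier. The class `LruCache`, its methods, and the `loader` callback are hypothetical names introduced for illustration; they are not part of Mondrian or of the thesis. The `prefetch` method simply loads caller-supplied keys ahead of use. How the thesis predicts which keys to load is not stated in the abstract.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache: evicts the least recently used entry when full. */
class LruCache<K, V> {
    private final LinkedHashMap<K, V> entries;

    LruCache(final int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after each insertion; drop the LRU entry once over capacity.
                return size() > capacity;
            }
        };
    }

    /** Returns the cached value, loading and caching it on a miss. */
    synchronized V get(K key, Function<K, V> loader) {
        V value = entries.get(key); // a hit also marks the entry as most recently used
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    /** Prefetch (assumed policy): load the given keys before they are requested. */
    synchronized void prefetch(List<K> keys, Function<K, V> loader) {
        for (K key : keys) {
            if (!entries.containsKey(key)) {
                entries.put(key, loader.apply(key));
            }
        }
    }

    /** Membership test that does not change the recency order. */
    synchronized boolean contains(K key) {
        return entries.containsKey(key);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A caller would create the cache with a fixed capacity, for example `new LruCache<String, Integer>(2)`, and pass a loader that runs the expensive query on a miss. One design consequence worth noting: in this sketch prefetched entries compete with demand-loaded entries for the same capacity, so an aggressive prefetch can evict data that is still in use.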
