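Design and Implementation of Massive Web Log Analysis System Based on Hadoop/Hive

Original title: 基于Hadoop/Hive的海量web日志处理系统的设计与实现
Author: Liu Yongzeng (刘永增)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: web log; cloud computing; Hadoop; Hive
Degree year: 2011
Advisor: Zhang Xiaojing (张晓景)
Discipline code: 081203
Degree-granting institution: Dalian University of Technology (大连理工大学)
Submission date: 2011-11-01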

Abstract

Web log processing has long been an active research topic. With the rapid development of Internet technology, the volume of information generated on the network keeps growing, and web log processing faces new problems. A data center not only produces web log data on a very large scale; the logs generated by its individual web servers also differ in format. How to store and efficiently process the massive, heterogeneous web logs produced by a data center is the main subject of this thesis.
     Hadoop is a popular framework for large-scale data processing. It runs on multiple platforms and offers good robustness and scalability. Hadoop implements the MapReduce programming model, so its users can write task-specific MapReduce programs to carry out their work.
     MapReduce programs sit at a relatively low level, and users must write a large amount of code for each specific task. Hive is an open-source data warehouse tool built on Hadoop. It introduces some concepts from traditional databases and supports a SQL-like language, so users familiar with traditional database development can develop quickly and with significantly less code.
     This thesis studies both tools in depth, including the concepts and technologies associated with each. The study also covers their practical use: how to configure a Hadoop/Hive environment, how to maintain a cluster built from them, and how to develop on top of them, for example how to write MapReduce programs and how to process data with the language that Hive provides.
     Based on this study, the thesis designs and implements a web log processing system based on Hadoop/Hive. The system is logically divided into four functional modules. The log collection module synchronizes the log data generated by the front-end web sites in the data center to a log collection site, then runs background scripts to import the data into tables that have already been created. The query analysis module is responsible for preprocessing the web logs, receiving query requests from users, and returning query results. The storage and processing module performs the actual storage of data, including the original log data, the cleaned log data, and various other temporary data, and executes the MapReduce programs produced by query translation. In the result output module, a client language is chosen to communicate with Hive, implement the statistics code, and finally present the query results as web pages. The system draws on both Hadoop's strength in processing massive data and Hive's strength in simplifying application development. Comparative tests show that the system has a clear advantage in large-scale data processing and is of high practical value.
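To make the cleaning and counting flow described in the abstract concrete, the following is a minimal sketch in plain Python of the map, shuffle, and reduce steps applied to web log lines. It is an illustration only, not the thesis's implementation and not the Hadoop API: the Common Log Format pattern, the sample lines, and the names `map_line`, `shuffle`, `reduce_counts`, and `count_hits` are all assumptions introduced here.

```python
import re
from collections import defaultdict

# Assumed input format: Common Log Format. The abstract does not say
# which log formats the data center's web servers actually produce.
CLF = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)$'
)


def map_line(line):
    """Map step: emit (url, 1) for a well-formed line, nothing otherwise.

    Dropping lines that do not match stands in for log cleaning.
    """
    match = CLF.match(line.strip())
    if match:
        yield match.group(4), 1


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def count_hits(lines):
    """Run map, shuffle, and reduce in-process over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample data, including one malformed line.
SAMPLE = [
    '10.0.0.1 - - [01/Nov/2011:10:00:00 +0800] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Nov/2011:10:00:01 +0800] "GET /about.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Nov/2011:10:00:02 +0800] "GET /index.html HTTP/1.1" 304 -',
    'corrupted line that the cleaning step drops',
]

if __name__ == "__main__":
    print(count_hits(SAMPLE))
```

Even this small count needs a parser, a mapper, a grouping step, and a reducer. In a SQL-like language the same aggregation is a single grouped count over a log table, which is the reduction in code that the abstract attributes to Hive.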
