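Research and Implementation of a Friend Recommendation System Based on MapReduce

English title: Research and Implementation of Recommendation System Based on MapReduce
Author: Yang Ting (杨婷)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords (translated from Chinese): Cloud Computing ; Big Data ; Graph Algorithms ; Social Networks ; Hadoop ; MapReduce
Keywords (English): Cloud Computing ; Big Data ; Graph Algorithms ; Social Networking Service ; Hadoop ; MapReduce
Degree year: 2013
Advisor: Shang Yanlei (商彦磊)
Discipline code: 0812
Degree-granting institution: Beijing University of Posts and Telecommunications (北京邮电大学)
Thesis submission date: 2013-01-03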

Abstract

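With the rise of Web 2.0, video sites, social networking sites, and microblogs have come into wide use, and users generate large amounts of data as they browse. With data sets this large, information overload has become a problem for many systems. Finding the genuinely useful information in massive data saves users time and gives them a better online experience.
     Web data mining is already widely applied. In e-commerce, users' purchase and browsing data are mined for buying preferences and trends. On social networking sites, user profiles, posts, and comments are analyzed to extract valuable information and provide better service. The relationships among social network users can also be abstracted into a social graph, which is then analyzed to uncover latent patterns. Against this background, this thesis proposes a friend recommendation system that applies large-scale data processing algorithms built on cloud computing technology, and designs and implements the system on the Hadoop platform.
     The friend recommendation system discussed in this thesis consists of three parts: data acquisition, data processing, and recommendation strategy. The data acquisition module collects the user data the system requires, such as the ids of social network users, the ids of their friends, and the ids of the users they follow, and stores it in HDFS. The data processing module uses parallel algorithms to process massive data in the cloud computing environment: Dijkstra's algorithm computes the distance from the target user (the user receiving recommendations) to other users, and the PageRank algorithm computes the influence of every user in the social network. The recommendation strategy module makes recommendations from the data produced by the data processing module: it ranks the friends of the target user's friends by influence and recommends them in that order.
     With this system, a social networking site can recommend potential friends to its users, increasing user activity and stickiness; users can meet new friends, expand their contacts, and increase their influence. Although the system uses Twitter data as its example, any data that meets the format requirements can be processed at large scale with it. The system is designed on the Hadoop platform and implements the recommendation algorithm with the MapReduce computing framework, so it can handle massive data sets.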

Abstract (English)

With the development of Web 2.0, video sharing, social networking services, and microblogs have become popular applications. While surfing the Internet, users leave behind a large amount of data. With data sets this large, information overload has become a problem that many users face. Finding useful information in massive data therefore not only saves users time but also gives them a better Internet experience.
     Web data mining has a wide range of uses. In e-commerce, users' shopping data can be mined for buying preferences and buying trends. In social networking services, potential value can be extracted by analyzing users' information and microblog comments. Relationships in a social network can be abstracted into a graph composed of persons and relations, and analyzing that graph can reveal underlying patterns. In this context, we propose a recommendation system that uses large-scale data processing algorithms in a cloud computing environment; the acquisition and processing of data are designed and implemented on the Hadoop platform.
     The recommendation system discussed in this thesis is composed of three parts: data acquisition, data processing, and strategy recommendation. The data acquisition module captures the user data that the system requires, such as social network users' ids, their friends' ids, and their followers' ids; this information is processed and stored in HDFS. The data processing module uses large-scale data processing algorithms to process data in the cloud computing environment: the distance between the target user (the user receiving recommendations) and other users is calculated with Dijkstra's algorithm, and the PageRank algorithm is used to calculate the influence of users in the social network. The strategy recommendation module uses the results of the data processing module to make recommendations, with user influence chosen as the factor for sorting the friends of the target user's friends.
     Based on this system, a social networking service can recommend to users strangers whom they may want to add as friends, which keeps users active and spending more time on the site. Users can meet new friends through the system, increase their influence, and expand their contacts. The experiments use Twitter data as an example, but the system can be used for other large-scale data processing as long as the data meets the format requirements. Because the system is based on the Hadoop platform, it has good scalability and can handle big data.
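     The three-stage pipeline described in the abstract can be illustrated with a minimal in-memory sketch. It is not the thesis implementation, which runs on Hadoop; the `map_reduce` helper, the toy `FOLLOWS` graph, the function names, the damping factor of 0.85, and the iteration count are all assumptions made for illustration. Every edge is assumed to have weight 1, so the Dijkstra step reduces to an iterative relaxation in which each round is one map/reduce pass, and "friends of friends" is taken to mean users at distance exactly 2 from the target user.

```python
from collections import defaultdict

# Hypothetical toy follow graph (user -> users they follow); not thesis data.
FOLLOWS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a", "e"],
    "d": ["c"],
    "e": ["c"],
}
INF = float("inf")


def map_reduce(records, mapper, reducer):
    """Simulate one map/shuffle/reduce round in memory (stand-in for a Hadoop job)."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {key: reducer(key, values) for key, values in groups.items()}


def shortest_distances(graph, source):
    """Unit-weight shortest paths from source, one relaxation round per job."""
    state = {u: (0 if u == source else INF, nbrs) for u, nbrs in graph.items()}

    def mapper(user, value):
        dist, nbrs = value
        yield user, ("node", dist, nbrs)        # carry the adjacency list forward
        if dist < INF:
            for v in nbrs:
                yield v, ("dist", dist + 1, None)  # tentative distance via user

    def reducer(user, values):
        nbrs = next((n for tag, _, n in values if tag == "node"), [])
        return min(d for _, d, _ in values), nbrs

    while True:
        new_state = map_reduce(state.items(), mapper, reducer)
        if new_state == state:                  # no distance improved: converged
            return {u: d for u, (d, _) in state.items()}
        state = new_state


def pagerank(graph, damping=0.85, iterations=30):
    """Basic PageRank; assumes every user follows at least one other user."""
    n = len(graph)
    ranks = {u: 1.0 / n for u in graph}

    def mapper(user, rank):
        yield user, 0.0                         # keep users that have no in-links
        for v in graph[user]:
            yield v, rank / len(graph[user])    # split rank over out-links

    def reducer(user, contributions):
        return (1 - damping) / n + damping * sum(contributions)

    for _ in range(iterations):
        ranks = map_reduce(ranks.items(), mapper, reducer)
    return ranks


def recommend(graph, user, top_k=3):
    """Rank the user's friends of friends (distance 2) by PageRank influence."""
    dist = shortest_distances(graph, user)
    ranks = pagerank(graph)
    candidates = [u for u, d in dist.items() if d == 2]
    return sorted(candidates, key=lambda u: ranks[u], reverse=True)[:top_k]


print(recommend(FOLLOWS, "a"))
```

     In this sketch the distance step and the PageRank step are independent of each other, so on a real cluster they could run as separate job chains whose outputs are joined only in the final ranking step.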
