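Improvement and Implementation of a Hybrid Collaborative Filtering Algorithm Based on Spark

English title (as published): New improvement and implementation of hybrid collaborative filtering algorithm based on Spark platform
Authors: Wang Yuanlong; Sun Weizhen; Xiang Yong
Affiliations: Dept. of Computer Science & Technology, College of Information Engineering, Capital Normal University; Dept. of Computer Science & Technology, Tsinghua University
Keywords: ensemble learning; collaborative filtering; sparsity; scalability; Spark Streaming; incremental model; classification
Keywords (as published in English): integrated learning; collaborative filtering; sparsity; extensibility; Spark streaming; incremental model; classification
Journal (Chinese title code): JSYJ
Journal (English title): Application Research of Computers
Publication date: 2018-02-09 11:16
Publisher: Application Research of Computers (计算机应用研究)
Year: 2019
Volume/issue: v.36; No.329
Funding: Science and Technology Program of the Beijing Municipal Commission of Education (KM201310028014)
Language: Chinese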
Record ID: JSYJ201903043
Page count: 6
Issue: 03
CN: 51-1196/TP
Pages: 222-227

Abstract

Traditional collaborative filtering suffers from rating sparsity, poor scalability, and weak personalization. To address these problems, this paper introduces the idea of algorithm integration and proposes an improved hybrid collaborative filtering algorithm on the Spark platform. Following the Stacking approach to ensemble learning, it combines multiple weak recommenders by linear weighting into a single, more comprehensive recommender. The algorithm builds on nearest-neighbor collaborative filtering and refines the neighbor-similarity calculation using classification, popularity, and favorable-rating degree, with the aim of making the similarity more reasonable and lowering its computational cost; this alleviates rating sparsity to some extent. The algorithm is also tightly integrated with the Spark distributed computing platform: using Spark Streaming and distributed storage, the authors design and implement an incremental iterative model of the recommendation algorithm that addresses the scalability and real-time limitations of collaborative filtering. The experiments use the UCI public datasets MovieLens and Netflix movie ratings. The results show that the improved algorithm performs well in personalization, accuracy, and scalability, improving on previous algorithms of the same type to varying degrees, and that it offers a feasible algorithm-integration scheme for recommender system applications.
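
The linear weighted combination of weak recommenders mentioned in the abstract can be illustrated with a small sketch. The record does not give the paper's actual recommenders or its weight-learning procedure, so everything below is an assumption made for illustration: the weak recommenders are simple user-mean and item-mean predictors, the weights are chosen by a grid search on held-out ratings (a Stacking-style step), and the names `mean_by`, `blend`, `fit_weights` and the toy ratings are hypothetical. The sketch runs in plain Python and does not use Spark.

```python
from itertools import product
from math import sqrt
from typing import Callable, Dict, List, Sequence, Tuple

Rating = Tuple[str, str, float]            # (user, item, score)
Recommender = Callable[[str, str], float]  # predicts a score for (user, item)


def mean_by(ratings: Sequence[Rating], key: int, default: float) -> Recommender:
    """Hypothetical weak recommender: mean score per user (key=0) or per item (key=1)."""
    groups: Dict[str, List[float]] = {}
    for r in ratings:
        groups.setdefault(r[key], []).append(r[2])
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    # Fall back to `default` for users or items unseen in the training ratings.
    return lambda user, item: means.get((user, item)[key], default)


def blend(recommenders: Sequence[Recommender], weights: Sequence[float]) -> Recommender:
    """Linear weighted combination of weak recommenders."""
    return lambda user, item: sum(w * r(user, item) for w, r in zip(weights, recommenders))


def rmse(recommender: Recommender, ratings: Sequence[Rating]) -> float:
    """Root-mean-square error of a recommender on a set of ratings."""
    return sqrt(sum((recommender(u, i) - s) ** 2 for u, i, s in ratings) / len(ratings))


def fit_weights(recommenders: Sequence[Recommender],
                holdout: Sequence[Rating], steps: int = 10) -> List[float]:
    """Assumed weight-fitting step: grid search over non-negative weights summing to 1."""
    best_w: List[float] = []
    best_err = float("inf")
    for combo in product(range(steps + 1), repeat=len(recommenders)):
        if sum(combo) != steps:  # keep only weight vectors that sum to 1
            continue
        w = [c / steps for c in combo]
        err = rmse(blend(recommenders, w), holdout)
        if err < best_err:
            best_w, best_err = w, err
    return best_w


# Toy ratings, invented for this example only.
train = [("u1", "m1", 5.0), ("u1", "m2", 4.0), ("u2", "m1", 3.0),
         ("u2", "m3", 1.0), ("u3", "m2", 4.0)]
holdout = [("u1", "m3", 3.0), ("u3", "m1", 4.0)]

global_mean = sum(s for _, _, s in train) / len(train)
weak = [mean_by(train, 0, global_mean), mean_by(train, 1, global_mean)]
weights = fit_weights(weak, holdout)
strong = blend(weak, weights)
print(weights, rmse(strong, holdout))
```

Because the grid includes the weight vectors that select a single recommender, the blended predictor is never worse on the held-out ratings than the best individual one. On real data the weights would be fitted on a proper validation split rather than on a handful of toy ratings.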
