Research on Parallelization of the Trigram N-gram Algorithm Based on MapReduce

Original title: 基于MapReduce的三元N-gram算法的并行化研究
Authors: Gong Yonggang (龚永罡); Tian Runlin (田润琳); Lian Xiaoqin (廉小亲); Xia Tian (夏天)
Author affiliations: School of Computer and Information Engineering, Beijing Technology and Business University; School of Information Resource Management, Renmin University of China
Keywords: Chinese text error detection; trigram N-gram algorithm; MapReduce computing model; parallel algorithm; Hadoop cluster; corpus
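Journal code: DZJY
Journal: Application of Electronic Technique
Institutions: School of Computer and Information Engineering, Beijing Technology and Business University; School of Information Resource Management, Renmin University of China
Publication date: 2019-05-06
Publisher: 电子技术应用 (Application of Electronic Technique)
Year: 2019
Issue: v.45; No.491
Funding: National Key Research and Development Program of China (2017YFC0820100)
Language: Chinese
Page: DZJY201905018
Number of pages: 5
CN: 05
ISSN: 11-2305/TN
Classification number: 76-79+83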

Abstract

Training on a large-scale corpus is an important foundation for automatic error detection in Chinese text with the trigram N-gram algorithm. New media platforms produce up to one million articles per day that must be processed, and building the trigram N-gram language model lexicon on a single node becomes a computational bottleneck. Building on a detailed study of the trigram N-gram algorithm, this paper proposes a parallel trigram N-gram algorithm based on the MapReduce computing model. In this model, the computation is distributed evenly across m nodes. The Map function counts, within its local portion of the data, how often each character or word occurs together with the two characters or words that precede it. The Reduce function merges the counts produced by the Map phase into global statistics. Experimental results show that the MapReduce-based parallel trigram N-gram algorithm running on a Hadoop cluster has good computational performance and scalability: for a training corpus of 12 billion characters per day, the rate at which the algorithm produces training results in the cluster environment is close to linear.
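
The Map and Reduce roles described in the abstract can be illustrated with a minimal in-process sketch. It is not the authors' Hadoop implementation: it assumes the text has already been segmented into tokens, it simulates the m nodes with plain Python lists, and the names `map_split`, `reduce_counts`, and `count_trigrams` are hypothetical.

```python
from collections import Counter


def map_split(sentences):
    """Map step (sketch): count, within one data split, how often each
    token occurs together with the two tokens that precede it."""
    local = Counter()
    for tokens in sentences:
        for i in range(2, len(tokens)):
            local[(tokens[i - 2], tokens[i - 1], tokens[i])] += 1
    return local


def reduce_counts(partials):
    """Reduce step (sketch): merge the local counts into global counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total


def count_trigrams(sentences, m):
    """Distribute sentences round-robin over m simulated nodes,
    run the Map step on each split, then merge with the Reduce step."""
    splits = [sentences[i::m] for i in range(m)]
    return reduce_counts(map_split(split) for split in splits)


# Toy, pre-segmented corpus (illustrative data only).
corpus = [
    ["we", "love", "beijing", "duck"],
    ["we", "love", "beijing", "opera"],
    ["they", "love", "beijing"],
]
counts = count_trigrams(corpus, m=2)
print(counts[("we", "love", "beijing")])
```

Because the merge is a plain sum over keys, the global counts do not depend on how the sentences are divided among nodes. That property is what allows the work to be spread over more nodes without changing the result.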
