Research on New Word Recognition Based on Rules and the N-Gram Algorithm

Original title: 基于规则和N-Gram算法的新词识别研究
Authors: JIANG Ruxia (姜如霞); HUANG Shuiyuan (黄水源); DUAN Longzhen (段隆振); LUO Lijuan (罗丽娟)
Affiliation: School of Information Engineering, Nanchang University
Keywords: new word recognition; N-Gram algorithm; word formation rule; Chinese word segmentation; fragment library; recall rate
Journal: Modern Electronics Technique (现代电子技术)
Journal code: XDDJ
Publication date: 2019-02-15
Publisher: Modern Electronics Technique
Year: 2019
Issue: v.42; No.531
Funding: National Natural Science Foundation of China (61070139); National Natural Science Foundation of China (81460769)
Language: Chinese
File ID: XDDJ201904040
Page count: 5
Issue number: 04
CN: 61-1224/TN
Pages: 174-178

Abstract

Current word segmentation tools produce many single-character fragments, so the segmented text can diverge considerably from the original meaning. In addition, because new words are formed with a high degree of freedom, current segmentation methods cannot effectively recognize new words that appear on the web. Building on the ICTCLAS2016 segmentation system, the authors use rules derived from the structure of new words to construct a fragment library, extract candidate strings from that library with Bi-gram and Tri-gram patterns, and then extend and filter the candidates using left and right adjacent entropy. Together these steps form a new word recognition method based on rules and the N-Gram algorithm. Experimental results show that the method improves the accuracy, recall rate, and F-value of word segmentation, and that it effectively constructs the candidate new word set and improves Chinese word segmentation.
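The pipeline described in the abstract (n-gram candidate extraction followed by adjacent-entropy filtering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sample data, and the thresholds `min_count` and `min_entropy` are all hypothetical, and the rule-based construction of the fragment library from ICTCLAS2016 output is assumed to have happened already.

```python
import math
from collections import Counter


def ngram_candidates(fragment_runs, sizes=(2, 3)):
    """Count strings formed by joining 2 (Bi-gram) or 3 (Tri-gram) adjacent fragments.

    fragment_runs: list of lists, each a run of consecutive fragments
    (assumed to come from a fragment library built beforehand).
    """
    counts = Counter()
    for run in fragment_runs:
        for n in sizes:
            for i in range(len(run) - n + 1):
                counts["".join(run[i:i + n])] += 1
    return counts


def entropy(neighbor_counts):
    """Shannon entropy (natural log) of a neighbor-character distribution."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())


def left_right_entropy(texts, candidate):
    """Entropy of the characters seen immediately left and right of candidate."""
    left, right = Counter(), Counter()
    for text in texts:
        start = text.find(candidate)
        while start != -1:
            if start > 0:
                left[text[start - 1]] += 1
            end = start + len(candidate)
            if end < len(text):
                right[text[end]] += 1
            start = text.find(candidate, start + 1)
    return entropy(left), entropy(right)


def filter_candidates(fragment_runs, texts, min_count=2, min_entropy=0.5):
    """Keep candidates that are frequent and have varied context on both sides.

    min_count and min_entropy are illustrative thresholds, not values from the paper.
    """
    kept = []
    for cand, count in ngram_candidates(fragment_runs).items():
        if count < min_count:
            continue
        left_h, right_h = left_right_entropy(texts, cand)
        if min(left_h, right_h) >= min_entropy:
            kept.append(cand)
    return kept
```

A candidate whose left or right neighbors are always the same character has low entropy on that side, which suggests it is part of a longer string rather than a complete word; that is why the sketch filters on the smaller of the two entropies.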

References

[1] HUO Shuai, ZHANG Min, LIU Yiqun, et al. New words discovery in microblog content [J]. Pattern recognition and artificial intelligence, 2014, 27(2): 141-145.
[2] LIN Zifang, JIANG Xiufeng. A new method for Chinese new word identification based on inner pattern of word [J]. Computer and modernization, 2010(11): 162-164.
[3] ZHOU Chao, YAN Xin, YU Zhengtao, et al. Weibo new word recognition combining frequency characteristic and accessor variety [J]. Journal of Shandong University (Natural science), 2015, 50(3): 6-10.
[4] MILLER D R H, LEEK T, SCHWARTZ R M. BBN at TREC7: using hidden Markov models for information retrieval [C]// Proceedings of the 7th Text Retrieval Conference. [S.l.: s.n.], 2008: 80-89.
[5] MANNING C D, SCHÜTZE H. Foundations of statistical natural language processing [M]. YUAN Chunfa, LI Qingzhong, WANG Jun, et al, translation. Beijing: Publishing House of Electronics Industry, 2005.
[6] HARB B, CHELBA C, DEAN J, et al. Back-off language model compression [C]// Proceedings of 10th Annual Conference of the International Speech Communication Association. Brighton: [s.n.], 2014: 352-355.
[7] LAN Chong. Research on Chinese word segmentation based on statistical rules [D]. Xi'an: Xidian University, 2011.
[8] YAO Rongpeng, XU Guoyan, SONG Jian. Micro-blog new word discovery method based on improved mutual information and branch entropy [J]. Journal of computer applications, 2016, 36(10): 2772-2776.
[9] ZHOU Shuangshuang, XU Jin'an, CHEN Yufeng, et al. New words detection method for microblog text based on integrating of rules and statistics [J]. Journal of computer applications, 2017, 37(4): 1044-1050.
[10] ZHANG Haijun, LI Yong, YAN Qiqi. Method of new Chinese words identification from large scale network corpora [J]. Computer engineering and applications, 2015, 51(5): 208-213.
[11] DU Liping, LI Xiaoge, YU Gen, et al. New word detection based on an improved PMI algorithm for enhancing segmentation system [J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2016, 52(1): 35-40.
[12] XING Enjun, ZHAO Fuqiang. A novel approach for Chinese new word identification based on contextual word frequency-contextual word count [J]. Computer applications and software, 2016, 33(6): 64-67.
[13] HUANG Xuan, LI Rongfeng. Discovery method of new words in blog contents [J]. Modern electronics technique, 2013, 36(2): 144-146.
