Abstract
New word detection, a fundamental task in natural language processing, is an indispensable step in the computational study of ancient Chinese literature. This paper presents a new word detection method for ancient Chinese corpora, called the AP-LSTM-CRF algorithm, which proceeds in three steps. First, a parallelized improved Apriori algorithm, implemented on the Apache Spark distributed computing framework, efficiently generates a candidate word set from a large-scale raw corpus. Second, a segmentation probability model combining a recurrent neural network with a conditional random field segments the sentences of the test documents, producing sequences of segmentation probabilities. Third, filtering rules that incorporate these segmentation probabilities remove noise words from the candidate set, leaving the genuine new words. Experimental results show that the method effectively detects new words in large-scale ancient Chinese corpora: on the Song Poetry and History of the Song Dynasty datasets it achieves F1 scores of 89.68% and 81.13%, improvements of 8.66% and 2.21% over existing methods, respectively.
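The first step of the pipeline, Apriori-style candidate generation, can be sketched serially in a few lines. This is only a minimal illustration of the downward-closure idea the improved Apriori algorithm relies on, not the paper's parallelized Spark implementation; `max_len` and `min_count` are illustrative parameters, not the paper's settings.

```python
from collections import Counter

def apriori_candidates(text, max_len=4, min_count=2):
    """Serial sketch of Apriori-style candidate generation: an n-gram
    can be frequent only if both of its (n-1)-gram substrings are
    frequent (the Apriori downward-closure property)."""
    # Level 1: frequent single characters.
    frequent = {1: {c for c, n in Counter(text).items() if n >= min_count}}
    candidates = set()
    for n in range(2, max_len + 1):
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        level = set()
        for gram, cnt in counts.items():
            if cnt < min_count:
                continue
            # Apriori pruning: both (n-1)-substrings must be frequent.
            if gram[:-1] in frequent[n - 1] and gram[1:] in frequent[n - 1]:
                level.add(gram)
        frequent[n] = level
        candidates |= level
    return candidates
```

For example, `apriori_candidates("明月几时有明月")` yields `{"明月"}`: the bigram occurs twice and both of its characters are themselves frequent, while every longer n-gram occurs only once and is pruned.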
References
[1] Huang Changning, Zhao Hai. Chinese word segmentation: A decade review[J]. Journal of Chinese Information Processing, 2007, 21(3): 8-19.
[2] Ke Deng, et al. On the unsupervised analysis of domain-specific Chinese texts[J]. Proceedings of the National Academy of Sciences of the USA, 2016, 113(22): 6154-6159.
[3] Chen Ao, Sun Mao-Song. Domain-specific new words detection in Chinese[C]//Proceedings of the 6th Joint Conference on Lexical and Computational Semantics, 2017: 44-53.
[4] Huo Shuai, et al. New word detection method based on microblog content[J]. Pattern Recognition and Artificial Intelligence, 2014, 27(2): 141-145.
[5] Zhou Shuangshuang, et al. New word detection method for microblogs combining rules and statistics[J]. Journal of Computer Applications, 2017, 37(4): 1044-1050.
[6] Lei Yiming, Liu Yong, Huo Hua. New word detection method for network language based on microblog corpus[J]. Computer Engineering and Design, 2017, 38(3): 789-794.
[7] Du Liping, Li Xiaoge, Yu Gen. Improving a Chinese word segmentation system with new word detection based on an improved mutual information algorithm[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2016, 52(1): 35-40.
[8] Chen Fei, et al. Open-domain new word detection based on conditional random fields[J]. Journal of Software, 2013, 24(5): 1051-1060.
[9] Yang Yang, Liu Longfei, Wei Xianhui. Sentiment new word detection method based on word embeddings[J]. Journal of Shandong University (Natural Science), 2014, 49(11): 51-58.
[10] Wan Qi, et al. Improving sentiment expression extraction in Chinese microblogs with new word detection[J]. Journal of University of Science and Technology of China, 2017, 47(1): 63-69.
[11] Xie Tao, Wu Bin, Wang Bai. New word detection in ancient Chinese literature[C]//Proceedings of the Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data, 2017: 260-275.
(1)https://pypi.org/project/jieba/
(1)https://code.google.com/archive/p/word2vec/
(1) For the experimental results of the segmentation probability model of Xie Tao et al. [11], see their paper; they are not repeated here. Table 5 is compared against the F1 value of that segmentation probability model. Their model judges whether a boundary should be placed in the middle of a four-character input. Its network structure is a character embedding layer, a Bi-LSTM layer, and a fully connected layer with a single output, followed by a sigmoid function that maps the output into [0,1]. The final output represents the probability of segmenting the middle of the four-character input: a value greater than 0.5 means segment, and a value less than 0.5 means do not segment.
(2)https://github.com/NLPchina/ansj_seg
(3)http://ictclas.nlpir.org/
(4)https://nlp.stanford.edu/software/segmenter.shtml
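The sliding-window decision rule of the segmentation probability model described in note (1) can be sketched as follows. Here `score_fn` is a hypothetical stand-in for the character-embedding + Bi-LSTM + fully-connected network, which is not reproduced; only the windowing and the sigmoid thresholding follow the description in the note.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def segmentation_probs(sentence, score_fn):
    """Slide a four-character window over the sentence and return, for
    each window, the probability that a boundary falls at its middle.
    `score_fn` maps a four-character string to a real-valued logit."""
    probs = []
    for i in range(len(sentence) - 3):
        window = sentence[i:i + 4]
        probs.append(sigmoid(score_fn(window)))
    return probs

def decisions(probs, threshold=0.5):
    # > 0.5 means segment; otherwise do not segment.
    return [p > threshold for p in probs]
```

A sentence of length n yields n - 3 windows and hence n - 3 segmentation probabilities, which is the probability sequence the filtering rules in step three consume.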