The Comparative Study of Different Tagging Sets on Entity Extraction of Classical Books

Original title (Chinese): 不同词性标记集在典籍实体抽取上的差异性探究
Authors: Yuan Yue (袁悦); Wang Dongbo (王东波); Huang Shuiqing (黄水清); Li Bin (李斌)
Keywords: Digital Humanities; Ancient Chinese Character Information Processing; Parts of Speech Tagging; Named Entity Extraction
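Chinese journal title (code): XDTQ
English journal title: Data Analysis and Knowledge Discovery
Affiliations: College of Information Science and Technology, Nanjing Agricultural University; Research Center for Correlation of Domain Knowledge, Nanjing Agricultural University; School of Chinese Language and Literature, Nanjing Normal University
Publication date: 2019-03-25
Publisher: Data Analysis and Knowledge Discovery (数据分析与知识发现)
Year: 2019
Issue: v.3; No.27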
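Funding: One of the research outputs of the National Social Science Fund of China Major Project "Construction of a Classics Knowledge Base and Humanities Computing Research Based on the 《汉学引得丛刊》 (Sinological Index Series)" (Grant No. 15ZDB127) and the National Natural Science Foundation of China General Project "Construction of a Syntax-Level Chinese-English Parallel Corpus and Humanities Computing Research Based on Indexes of Classics" (Grant No. 71673143)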
Language: Chinese
Page: XDTQ201903006
Page count: 9
CN：03
ISSN：10-1478/G2
Classification number: 61-69

Abstract

[Objective] In the context of digital humanities, and in order to mine knowledge from ancient classics more deeply and precisely, this paper uses comparative experiments to examine how different part-of-speech (POS) tag sets affect entity extraction from classical texts. [Methods] The training and test corpora consist of "Zuo Zhuan" and "Guo Yu", which were tagged automatically by machine and then manually verified. Taking the Pre-Qin POS tag set of Nanjing Normal University as the primary set, supplemented by the POS tag sets of Peking University, the Institute of Computing Technology of the Chinese Academy of Sciences, and the Ministry of Education, three new tag sets of different sizes were formed. Conditional random fields with added feature templates were used to compare the entity extraction results of the three tag sets on the same corpus. [Results] In comparative experiments on the Pre-Qin classics "Zuo Zhuan" and "Guo Yu" with the three tag sets of different sizes, the three models achieved entity extraction F values of 82.53%, 83.42%, and 84.07%, respectively. [Limitations] Feature selection needs further improvement, and the training results leave room for improvement. [Conclusions] The results help with the extraction of named entities from Pre-Qin literature, and the constructed POS tag sets are suitable for POS tagging of ancient Chinese.
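
The abstract reports F values but this record does not state the evaluation protocol. The F value is presumably the usual harmonic mean of precision and recall; the following is a minimal sketch under that assumption, scoring exact-match entity spans over BIO labels. The BIO scheme, the entity type names, and the function names `extract_spans` and `f_value` are illustrative assumptions, not taken from the paper.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def f_value(gold_tags, pred_tags):
    """Entity-level precision, recall and F value (exact span and type match)."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy example: two gold entities, one of them predicted.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
precision, recall, f = f_value(gold, pred)
```

Scoring whole spans rather than individual tags means a partially recognized entity counts as an error, which is the stricter and more common convention for named entity extraction.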

References

[1] Liu Kaiying. Chinese Text Automatic Segmentation and Tagging[M]. Beijing: The Commercial Press, 2000. (in Chinese)
[2] Miao Duoqian, Wei Zhihua. The Principle and Application of Chinese Text Information Processing[M]. Beijing: Tsinghua University Press, 2007. (in Chinese)
[3] Niu Xiuping. The Research of Part-of-Speech Tagging Based on Hidden Markov Model[D]. Taiyuan: Taiyuan University of Technology, 2013. (in Chinese)
[4] Jiang Jianhong, Zhao Songzheng, Luo Mei. Analysis and Application of Chinese Word Segmentation Model Which Consist of Dictionary and Statistics Method[J]. Computer Engineering and Design, 2012, 33(1): 387-391. (in Chinese)
[5] Wang Jialing. Middle Ancient Chinese Word Segmentation Based on “Han Books”[D]. Nanjing: Nanjing Normal University, 2014. (in Chinese)
[6] Shi Min, Li Bin, Chen Xiaohe. CRF Based Research on a Unified Approach to Word Segmentation and POS Tagging for Pre-Qin Chinese[J]. Journal of Chinese Information Processing, 2010, 24(2): 39-45. (in Chinese)
[7] Lau Kamtang, Song Yan, Xia Fei. The Construction of a Segmented and Part-of-Speech Tagged Archaic Chinese Corpus: A Case Study on Huainanzi[J]. Journal of Chinese Information Processing, 2013, 27(6): 6-15, 81. (in Chinese)
[8] Qian Zhiyong, Zhou Jianzhong, Tong Guoping, et al. Research on Automatic Word Segmentation and POS Tagging for “Chu Ci” Based on HMM[J]. Library and Information Service, 2014, 58(4): 105-110. (in Chinese)
[9] Jiang Wei, Guan Yi, Wang Xiaolong. Conditional Random Fields Based POS Tagging[J]. Computer Engineering and Application, 2006(21): 13-16, 42. (in Chinese)
[10] Zhang Yingjie, Li Bin, Chen Jiajun, et al. A Study in Dictionary-Based All-word Word Sense Disambiguation for Pre-Qin Chinese[J]. Journal of Chinese Information Processing, 2012, 26(3): 65-71, 103. (in Chinese)
[11] Turney P D. Learning Algorithms for Keyphrase Extraction[J]. Information Retrieval, 2000, 2(4): 303-336.
[12] Frank E, Paynter G W, Witten I H, et al. Domain-Specific Keyphrase Extraction[C]//Proceedings of the 16th International Joint Conference on Artificial Intelligence. 1999: 668-673.
[13] Mihalcea R, Tarau P. TextRank: Bringing Order into Texts[C]//Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. ACL, 2004: 404-411.
[14] Niu Ping, Huang Degen. TF-IDF and Rules Based Automatic Extraction of Chinese Keywords[J]. Journal of Chinese Computer Systems, 2016, 37(4): 711-715. (in Chinese)
[15] Xu Wenhai, Wen Youkui. A Chinese Keyword Extraction Algorithm Based on TFIDF Method[J]. Information Studies: Theory & Application, 2008, 31(2): 298-302. (in Chinese)
[16] Li Peng, Wang Bin, Shi Zhiwei, et al. Tag-TextRank: A Webpage Keyword Extraction Method Based on Tags[J]. Journal of Computer Research and Development, 2012, 49(11): 2344-2351. (in Chinese)
[17] Xie Wei, Shen Yi, Ma Yongzheng. Recommendation System for Paper Reviewing Based on Graph Computing[J]. Application Research of Computers, 2016, 33(3): 798-801. (in Chinese)
[18] Pu Mei, Zhou Feng, Zhou Jingjing, et al. Topic Sentence Extraction of Key News Events Based on Weighted TextRank[J]. Computer Engineering, 2017, 43(8): 219-224. (in Chinese)
[19] Ning Jianfei, Liu Jiangzhen. Using Word2Vec with TextRank to Extract Keywords[J]. New Technology of Library and Information Service, 2016(6): 20-27. (in Chinese)
[20] Xia Tian. Study on Keyword Extraction Using Word Position Weighted TextRank[J]. New Technology of Library and Information Service, 2013(9): 30-34. (in Chinese)
[21] Wei Yun, Sun Xianpeng. Fusion of Statistics and TextRank for Keyphrase Extraction in Biomedical Literature[J]. Computer Applications and Software, 2017, 34(6): 27-30. (in Chinese)
[22] Wen Rui. The Research of Chinese Named Entity Recognition and Its Relation Extraction[D]. Suzhou: Soochow University, 2005. (in Chinese)
[23] Lafferty J D, McCallum A, Pereira F C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data[C]//Proceedings of the 18th International Conference on Machine Learning. 2001: 282-289.
[24] Pearl J. Bayes and Markov Networks: A Comparison of Two Graphical Representations of Probabilistic Knowledge[D]. Los Angeles, California, USA: University of California, 1986.
[25] Wang Dongbo, Huang Shuiqing, He Lin. Research of Automatic Part-of-speech Tagging for Pre-Qin Literature Based on Multi-Feature Knowledge[J]. Library and Information Service, 2017, 61(12): 64-70. (in Chinese)
