Research on Bilingual Lexicon Extraction Based on Word Vectors and Comparable Corpora

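English title: Bilingual lexicon extraction based on word vector and comparable corpus
Authors: 柳路芳; 李波; 陈鹏; 周凌寒; 王兵
English authors: LIU Lu-fang; LI Bo; CHEN Peng; ZHOU Ling-han; WANG Bing
English affiliations: School of Computer Science, Central China Normal University; Beijing GEOWAY Software Co., Ltd.
Keywords: bilingual lexicon; word vector; inter-word relations; comparable corpus
English keywords: bilingual lexicon; word vector; words' correlation; comparable corpus
Chinese journal name: JSJK
English journal name: Computer Engineering & Science
Institutions: School of Computer Science, Central China Normal University (华中师范大学计算机学院); Beijing GEOWAY Software Co., Ltd. (北京吉威时代软件股份有限公司)
Publication date: 2018-02-15
Publisher: Computer Engineering & Science (计算机工程与科学)
Year: 2018
Volume and issue: v.40; No.278
Funding: Twelfth Five-Year Plan Project of the State Language Commission (YB125-132); Fundamental Research Funds for the Central Universities (CCNU15A05062, CCNU17GF0005, CCNU16A06015)
Language: Chinese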
File name: JSJK201802026
Page count: 6
Issue number: 02
CN: 43-1258/TP
Pages: 182-187

Abstract

Bilingual lexicons are an important resource for natural language processing applications such as cross-language information retrieval and machine translation. Existing algorithms for extracting bilingual lexicons from comparable corpora are not yet mature: their extraction quality leaves room for improvement, and most studies concentrate on domain-specific technical terms. To address this shortcoming, this paper proposes a bilingual lexicon extraction algorithm based on word vectors and comparable corpora. It first states the basic assumption of the algorithm and the related research methods, then describes the concrete steps for extracting a bilingual lexicon from a comparable corpus using a word-relation matrix built on word vectors. Finally, the method is compared with the classical vector space model, and experiments analyze how context window size, seed dictionary size, word frequency, and other factors affect the extraction quality of the two models. The experiments show that the proposed algorithm clearly outperforms the method based on the vector space model, with the largest accuracy gain on high-frequency words.
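
The abstract does not spell out how the word-relation matrix is built, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each word is described by its cosine similarities to the seed-dictionary words of its own language, and that a source word is matched to the target word whose relation vector is most similar. The function names, the toy English-French vocabulary, and the two-dimensional embeddings are all hypothetical; in practice the embeddings would be trained separately on each side of the comparable corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relation_vector(word, embeddings, seed_words):
    """Describe `word` by its similarity to each seed word of the same language.

    Stacking these vectors for all words gives one row per word and one
    column per seed entry (the assumed form of the word-relation matrix).
    """
    return [cosine(embeddings[word], embeddings[seed]) for seed in seed_words]


def extract_lexicon(src_emb, tgt_emb, seed_pairs, top_k=1):
    """Rank target-language candidates for every non-seed source word."""
    src_seeds = [s for s, _ in seed_pairs]
    tgt_seeds = [t for _, t in seed_pairs]
    src_seed_set, tgt_seed_set = set(src_seeds), set(tgt_seeds)

    # Relation vectors of both languages share the same seed-pair ordering,
    # so they are comparable even though the embedding spaces differ.
    candidates = [w for w in tgt_emb if w not in tgt_seed_set]
    tgt_rel = {w: relation_vector(w, tgt_emb, tgt_seeds) for w in candidates}

    lexicon = {}
    for word in src_emb:
        if word in src_seed_set:
            continue
        src_rel = relation_vector(word, src_emb, src_seeds)
        ranked = sorted(
            candidates,
            key=lambda t: cosine(src_rel, tgt_rel[t]),
            reverse=True,
        )
        lexicon[word] = ranked[:top_k]
    return lexicon


# Hypothetical toy data: the target space is the source space rotated by
# 90 degrees, so raw vectors are not comparable across languages.
SRC_EMB = {
    "water": [1.0, 0.0],
    "fire": [0.0, 1.0],
    "river": [0.9, 0.1],
    "flame": [0.1, 0.9],
}
TGT_EMB = {
    "eau": [0.0, 1.0],
    "feu": [-1.0, 0.0],
    "riviere": [-0.1, 0.9],
    "flamme": [-0.9, 0.1],
}
SEED_PAIRS = [("water", "eau"), ("fire", "feu")]

if __name__ == "__main__":
    print(extract_lexicon(SRC_EMB, TGT_EMB, SEED_PAIRS))
```

On this toy data the sketch pairs "river" with "riviere" and "flame" with "flamme". Because cosine similarities to seed words are unchanged by a rotation of the embedding space, the relation vectors line up even though the two embedding spaces do not. A real comparable corpus would give noisier similarities and many more candidates per word.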

