Machine translation algorithm of low-resource languages based on semi-supervised learning (基于半监督学习的小语种机器翻译算法)
  • English title: Machine translation algorithm of low-resource languages based on semi-supervised learning
  • Authors: LU Wenjie (陆雯洁); TAN Ruxin (谭儒昕); LIU Gongshen (刘功申); SUN Huanrong (孙环荣)
  • Affiliations: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; Shanghai Jiao Tong University-Shanghai Songheng Information Content Analysis Joint Lab
  • Keywords: semi-supervised learning; low-resource language; machine translation
  • Journal code: XDZK
  • Journal: Journal of Xiamen University (Natural Science) (厦门大学学报(自然科学版))
  • Publication date: 2019-03-28
  • Year: 2019
  • Issue: v.58; No.269
  • Funding: National Natural Science Foundation of China (61772337, 61472248)
  • Language: Chinese
  • Page: XDZK201902010
  • Page count: 9
  • CN: 02
  • ISSN: 35-1070/N
  • Classification number: 58-66
Abstract
Neural machine translation has developed rapidly in recent years. However, because it requires large-scale parallel corpora, it often performs poorly on low-resource languages. Building on an analysis of the encoder-decoder framework and the attention mechanism, and following the idea of dual learning, this paper proposes a semi-supervised neural network model for low-resource language translation. The model uses a relatively large monolingual corpus together with a small parallel corpus. Experimental results show that when the parallel data are insufficient to train an ordinary neural network model, the semi-supervised model can still achieve reasonably good results. However, the model places very high demands on the amount of monolingual data, which must reach a certain order of magnitude before good performance is achieved.
