摘要
过去二十多年,统计机器翻译取得了很大的成功;但是它还远不能满足人们的需求,仍然需要进一步的发展和改善。在当前形势下,从数学模型的角度来看,统计机器翻译的一个发展趋势是从少特征、小模型向多特征、大模型的过渡,以及从线性模型向非线性模型的演变。沿着这一发展趋势,本文从目前最主流的对数线性翻译模型出发,以判别式训练作为主要线索,主要研究了如下四个方面的内容:
(1)对于含有少数特征的对数线性模型,现有最成功的判别式训练算法MERT存在不稳定的问题:由于k-best翻译列表在每次优化步中都会发生变化,定义在其上的优化目标函数也随之变化,从而引起优化权重的“震荡”现象,导致MERT的不稳定。本文在设计判别式训练的优化目标时,采用极端保守更新的思想来抑制优化权重的“震荡”,提出了基于极端保守更新的最小期望错误率训练方法。该方法采用基于梯度投影的学习算法来实现,因而实现上比MERT更简单。实验表明,该训练算法的性能优于MERT。
(2)对于含有大规模稀疏特征的翻译模型,现有可扩展的训练方法虽然在训练效率上勉强可以用于训练这样的模型,但由于遭遇严重的特征稀疏性,翻译性能不佳。针对特征稀疏性,本文研究了两个实用的应对技术——扩大开发集和L1正则化;不过由于其他一些原因,这两个技术并不足以解决特征稀疏问题。为此,本文提出了一个基于OSCAR的自动特征分组训练方法;为了有效地学习特征的分组结构,本文还提出了一个在线学习方法。实验结果表明,该训练方法取得了比现有方法更好的性能。
(3)基于对数线性模型的所有现有训练方法均存在如下两个不足:首先,它们的性能严重依赖于开发集的选择,而适合测试任务的开发集往往很难获得,采用不合适的开发集进行训练容易导致测试性能很差;其次,这些训练方法都是针对给定的开发集训练出一个权重,而单一权重不能保证所有测试句子翻译结果的一致性。为了解决这两个问题,本文提出了一个局部训练方法,与现有方法明显不同,它为每个测试句子单独训练一个权重。局部训练方法的一个瓶颈是训练效率,本文提出了一个增量式的训练方法来克服这个瓶颈。需要强调的是,从测试时的决策函数来看,局部训练方法对应于一个非线性翻译模型。
(4)基于对数线性的翻译模型在建模翻译现象时存在如下两个局限:它严格要求特征同模型函数之间的线性关系,容易导致建模不充分;并且不能对其中的表面特征进行进一步的抽象和解释。采用神经网络对翻译进行建模是缓解上述问题的一个潜在途径:一方面,神经网络可以突破线性限制,能够逼近任意的模型函数,因而建模更充分;另一方面,它通过引入隐含单元,可以对输入的表面特征进行抽象和解释。不过,如果将翻译的建模同解码联合起来考虑,经典的神经网络由于其固有特性会遭遇严重的解码效率问题。为了解决这个问题,本文提出了一个变种的神经网络——可加型神经网络——来对翻译进行建模,同时为基于可加型神经网络的翻译模型提出了一个有效的训练方法。
Over the last two decades, statistical machine translation (SMT) has achieved great success; nevertheless, it is still far from meeting users' needs and requires further development and improvement. In the current situation, from the viewpoint of mathematical models, one promising direction for SMT is the transition from a few features and small models to many features and large models, and the evolution from linear models to nonlinear models. Following this direction, this paper starts from the log-linear translation model, currently the most popular model for SMT, and, with discriminative training as its main thread, investigates the following four topics.
(1) For the log-linear model with a few features, the most successful tuning method, MERT, suffers from instability. Since the k-best translation list changes at each optimization step, the optimization objective defined over that list changes as well; this causes the optimized weights to oscillate and makes MERT unstable. This paper employs the idea of ultraconservative updates when designing the optimization objective, and proposes a new tuning method called minimum expected error rate training based on ultraconservative update. The method is implemented with a projected-gradient learning algorithm and is therefore simpler to implement than MERT. Experiments show that its performance is better than that of MERT.
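To make the idea concrete, here is a minimal, hypothetical sketch (not the thesis implementation): one gradient step on a softmax-based expected error over a k-best list, with an ultraconservative penalty ‖w − w_prev‖² that keeps the updated weights close to the previous ones. All function names and hyperparameters are illustrative.

```python
import math

def expected_error(weights, kbest, gamma=1.0):
    """Expected error of a k-best list under a softmax over model scores.
    kbest: list of (feature_vector, error) pairs for one source sentence."""
    scores = [gamma * sum(w * h for w, h in zip(weights, f)) for f, _ in kbest]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return sum(p * err for p, (_, err) in zip(probs, kbest)), probs

def ultraconservative_step(weights, prev, kbest, lam=0.5, gamma=1.0, lr=0.1):
    """One gradient step on expected error + lam * ||w - prev||^2.
    The penalty term damps the oscillation of the optimized weights."""
    exp_err, probs = expected_error(weights, kbest, gamma)
    grad = [0.0] * len(weights)
    for p, (f, err) in zip(probs, kbest):
        coef = gamma * p * (err - exp_err)  # d(expected error)/d(score)
        for j, h in enumerate(f):
            grad[j] += coef * h
    for j in range(len(grad)):
        grad[j] += 2.0 * lam * (weights[j] - prev[j])  # ultraconservative term
    return [w - lr * g for w, g in zip(weights, grad)]
```

Repeated steps lower the expected error while the penalty keeps the weights near their previous values, which is the conservatism the method relies on.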
(2) For the log-linear model with a large set of sparse features, although existing scalable tuning methods are efficient enough to tune such a model, their performance is limited by severe feature sparsity. This paper examines two practical remedies, namely enlarging the tuning set and L1 regularization, and shows that, for other reasons, these two techniques are not sufficient. It therefore proposes a novel tuning method based on automatic feature grouping to relieve feature sparsity, together with an online learning method that learns the feature group structure efficiently. Experiments show that this tuning method outperforms the existing tuning methods.
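As an illustration of why an OSCAR-type regularizer groups features, the following hypothetical sketch evaluates the OSCAR penalty and the L1 soft-thresholding (proximal) step mentioned above. For a fixed L1 norm, the pairwise-max term is smallest when weights share the same magnitude, which is what ties correlated sparse features into groups. Names and constants are illustrative, not the thesis code.

```python
import math

def oscar_penalty(w, lam1=1.0, lam2=1.0):
    """OSCAR regularizer: an L1 term plus a pairwise max over absolute weights.
    Since max(|a|, |b|) >= (|a| + |b|) / 2 with equality iff |a| == |b|,
    equal magnitudes minimize the pairwise term, encouraging feature groups."""
    a = [abs(x) for x in w]
    l1 = sum(a)
    pair = sum(max(a[j], a[k])
               for j in range(len(a)) for k in range(j + 1, len(a)))
    return lam1 * l1 + lam2 * pair

def l1_prox(w, step):
    """Soft-thresholding: the proximal operator of the L1 norm,
    the basic building block of online L1-regularized tuning."""
    return [math.copysign(max(abs(x) - step, 0.0), x) for x in w]
```

For example, `[1.0, 1.0]` and `[1.8, 0.2]` have the same L1 norm, but the first incurs a smaller OSCAR penalty because its magnitudes are tied.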
(3) Existing tuning methods for the log-linear model suffer from two shortcomings. First, their performance depends heavily on the choice of the development set, yet a development set suited to the test task is often unavailable and hard to create, so tuning on an ill-suited development set can lead to poor test performance. Second, they optimize a single weight vector on a given development set, and this single weight cannot guarantee consistent results at the sentence level. To overcome these shortcomings, this paper proposes a local training method that, unlike existing methods, tunes a separate weight vector for each test sentence. The bottleneck of local training is its efficiency, so this paper also proposes an efficient incremental training method. Note that, viewed through its decision function at test time, the local training method behaves like a nonlinear model.
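The retrieval step behind per-sentence tuning can be sketched as follows: for each test sentence, select the most similar development sentences and tune a sentence-specific weight vector on that subset. This is a minimal sketch assuming a simple word-overlap (Jaccard) similarity; the similarity measure and all names here are illustrative, and the thesis itself addresses the efficiency of the subsequent tuning with an incremental method.

```python
def jaccard(a, b):
    """Word-overlap similarity between two whitespace-tokenized sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_neighbors(test_sentence, dev_sentences, k=2):
    """Pick the k development sentences most similar to the test sentence;
    a sentence-specific weight vector would then be tuned on this subset."""
    ranked = sorted(dev_sentences,
                    key=lambda s: jaccard(test_sentence, s),
                    reverse=True)
    return ranked[:k]
```

Because the tuned weights vary with the test input, the overall decision function is no longer linear in the features, which is the sense in which local training yields a nonlinear model.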
(4) When modeling the translation phenomenon, the log-linear model has two limitations: its features are strictly required to be linearly related to the model function, which may lead to inadequate modeling; and it cannot further abstract or interpret its surface features. Modeling translation with neural networks is a potential way to alleviate these problems. On the one hand, neural networks go beyond the linear limitation and can approximate arbitrary continuous functions, so their modeling is more adequate; on the other hand, by introducing hidden units, they can abstract and interpret the surface input features. However, when modeling and decoding are considered together, classical neural networks run into a severe decoding-efficiency problem due to their inherent characteristics. To address this, this paper proposes a variant neural network called the Additive Neural Network for machine translation, and investigates an efficient method for its discriminative training.
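The property that keeps decoding efficient is additivity: the nonlinearity is applied to each translation rule's features separately, and the per-rule scores are summed, so partial derivation scores can be cached and reused in dynamic-programming decoding (unlike a classical network applied to the whole derivation at once). The following is a hypothetical sketch with a single tanh hidden layer per rule; all names and dimensions are illustrative.

```python
import math

def rule_score(h, M, b, w):
    """Nonlinear score of one rule's feature vector h:
    w . tanh(M h + b), i.e. a one-hidden-layer network per rule."""
    hidden = [math.tanh(sum(m * x for m, x in zip(row, h)) + bi)
              for row, bi in zip(M, b)]
    return sum(wi * hi for wi, hi in zip(w, hidden))

def derivation_score(rules, M, b, w):
    """Additive network: nonlinearity per rule, scores summed across rules.
    Additivity lets a DP decoder score partial derivations incrementally."""
    return sum(rule_score(h, M, b, w) for h in rules)
```

A classical network would instead apply the nonlinearity to the aggregated features of a whole derivation, breaking this decomposition and forcing the decoder to re-score every hypothesis from scratch.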