摘要
过去二十多年,统计机器翻译取得了很大的成功;但是它还远不能满足人们的需求,仍然需要进一步的发展和改善。在当前形势下,从数学模型的角度来看,统计机器翻译的一个发展趋势是从少特征、小模型向多特征、大模型的过渡,以及从线性模型向非线性模型的演变。沿着这一发展趋势,本文从目前最主流的对数线性翻译模型出发,以判别式训练作为主要线索,主要研究了如下四个方面的内容:
(1)对于含有少数特征的对数线性模型,现有最成功的判别式训练算法MERT存在不稳定的问题:由于k-best翻译列表在每次优化步中都会发生变化,定义在其上的优化目标函数也随之变化,从而引起优化权重的“震荡”现象,导致MERT的不稳定。本文在设计判别式训练的优化目标时,采用极端保守更新的思想来抑制优化权重的“震荡”,提出了基于极端保守更新的最小期望错误率训练方法。该方法采用基于梯度投影的学习算法来实现,因而实现上比MERT更简单。实验表明,该训练算法的性能优于MERT。
(2)对于含有大规模稀疏特征的翻译模型,现有可扩展的训练方法虽然在训练效率上勉强可以用于训练这样的模型,但由于遭遇严重的特征稀疏性,翻译性能不佳。针对特征稀疏性,本文研究了两个实用的应对技术——扩大开发集和L1正则化;不过由于其他一些原因,这两个技术并不足以解决特征稀疏问题。为此,本文提出了一个基于OSCAR的自动特征分组训练方法;为了有效地学习特征的分组结构,本文还提出了一个在线学习方法。实验结果表明,该训练方法取得了比现有方法更好的性能。
(3)基于对数线性模型的所有现有训练方法均存在如下两个不足:首先,它们的性能严重依赖于开发集的选择,而适合测试任务的开发集往往很难获得,采用不合适的开发集进行训练容易导致测试性能很差;其次,这些训练方法都是针对给定的开发集训练出一个权重,而单一权重不能保证所有测试句子翻译结果的一致性。为了解决这两个问题,本文提出了一个局部训练方法,与现有方法明显不同,它为每个测试句子单独训练一个权重。局部训练方法的一个瓶颈是训练效率,本文提出了一个增量式的训练方法来克服这个瓶颈。需要强调的是,从测试时的决策函数来看,局部训练方法对应于一个非线性翻译模型。
(4)基于对数线性的翻译模型在建模翻译现象时存在如下两个局限:它严格要求特征同模型函数之间的线性关系,容易导致建模不充分;并且不能对其中的表面特征进行进一步的抽象和解释。采用神经网络对翻译进行建模是缓解上述问题的一个潜在途径:一方面,神经网络可以突破线性限制,能够逼近任意的模型函数,因而建模更充分;另一方面,它通过引入隐含单元,可以对输入的表面特征进行抽象和解释。不过,如果将翻译的建模同解码联合起来考虑,经典的神经网络由于其固有特性会遭遇严重的解码效率问题。为了解决这个问题,本文提出了一个变种的神经网络——可加型神经网络——来对翻译进行建模,同时为基于可加型神经网络的翻译模型提出了一个有效的训练方法。
Over the last two decades, statistical machine translation (SMT) has achieved great success; nevertheless, it is still far from meeting users' needs and requires further development and improvement. In the current situation, from the viewpoint of mathematical models, one promising direction for SMT is the transition from a few features and small models to many features and large models, and the evolution from linear models to nonlinear models. Following this direction, this paper starts from the log-linear translation model, currently the most popular model for SMT, and, with discriminative training as its main thread, investigates the following four topics.
(1) For the log-linear model with a few features, the most successful tuning method, MERT, suffers from instability. Since the k-best translation list changes at each optimization step, the optimization objective defined over that list changes as well; this causes the optimized weights to oscillate and makes MERT unstable. This paper employs the idea of ultraconservative updates when designing the optimization objective, and proposes a new tuning method called minimum expected error rate training based on ultraconservative update. The method is implemented with a projected-gradient learning algorithm and is therefore simpler to implement than MERT. Experiments show that its performance is better than that of MERT.
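To make the idea concrete, here is a minimal, hypothetical sketch (not the thesis implementation): one gradient step on a softmax-based expected error over a k-best list, with an ultraconservative penalty ‖w − w_prev‖² that keeps the updated weights close to the previous ones. All function names and hyperparameters are illustrative.

```python
import math

def expected_error(weights, kbest, gamma=1.0):
    """Expected error of a k-best list under a softmax over model scores.
    kbest: list of (feature_vector, error) pairs for one source sentence."""
    scores = [gamma * sum(w * h for w, h in zip(weights, f)) for f, _ in kbest]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return sum(p * err for p, (_, err) in zip(probs, kbest)), probs

def ultraconservative_step(weights, prev, kbest, lam=0.5, gamma=1.0, lr=0.1):
    """One gradient step on expected error + lam * ||w - prev||^2.
    The penalty term damps the oscillation of the optimized weights."""
    exp_err, probs = expected_error(weights, kbest, gamma)
    grad = [0.0] * len(weights)
    for p, (f, err) in zip(probs, kbest):
        coef = gamma * p * (err - exp_err)  # d(expected error)/d(score)
        for j, h in enumerate(f):
            grad[j] += coef * h
    for j in range(len(grad)):
        grad[j] += 2.0 * lam * (weights[j] - prev[j])  # ultraconservative term
    return [w - lr * g for w, g in zip(weights, grad)]
```

Repeated steps lower the expected error while the penalty keeps the weights near their previous values, which is the conservatism the method relies on.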
(2) For the log-linear model with a large set of sparse features, although existing scalable tuning methods are efficient enough to tune such a model, their performance is limited by severe feature sparsity. This paper examines two practical remedies, namely enlarging the tuning set and L1 regularization, and shows that, for other reasons, these two techniques are not sufficient. It therefore proposes a novel tuning method based on automatic feature grouping to relieve feature sparsity, together with an online learning method that learns the feature group structure efficiently. Experiments show that this tuning method outperforms the existing tuning methods.
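As an illustration of why an OSCAR-type regularizer groups features, the following hypothetical sketch evaluates the OSCAR penalty and the L1 soft-thresholding (proximal) step mentioned above. For a fixed L1 norm, the pairwise-max term is smallest when weights share the same magnitude, which is what ties correlated sparse features into groups. Names and constants are illustrative, not the thesis code.

```python
import math

def oscar_penalty(w, lam1=1.0, lam2=1.0):
    """OSCAR regularizer: an L1 term plus a pairwise max over absolute weights.
    Since max(|a|, |b|) >= (|a| + |b|) / 2 with equality iff |a| == |b|,
    equal magnitudes minimize the pairwise term, encouraging feature groups."""
    a = [abs(x) for x in w]
    l1 = sum(a)
    pair = sum(max(a[j], a[k])
               for j in range(len(a)) for k in range(j + 1, len(a)))
    return lam1 * l1 + lam2 * pair

def l1_prox(w, step):
    """Soft-thresholding: the proximal operator of the L1 norm,
    the basic building block of online L1-regularized tuning."""
    return [math.copysign(max(abs(x) - step, 0.0), x) for x in w]
```

For example, `[1.0, 1.0]` and `[1.8, 0.2]` have the same L1 norm, but the first incurs a smaller OSCAR penalty because its magnitudes are tied.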
(3) Existing tuning methods for the log-linear model suffer from two shortcomings. First, their performance depends heavily on the choice of the development set, yet a development set suited to the test task is often unavailable and hard to create, so tuning on an ill-suited development set can lead to poor test performance. Second, they optimize a single weight vector on a given development set, and this single weight cannot guarantee consistent results at the sentence level. To overcome these shortcomings, this paper proposes a local training method that, unlike existing methods, tunes a separate weight vector for each test sentence. The bottleneck of local training is its efficiency, so this paper also proposes an efficient incremental training method. Note that, viewed through its decision function at test time, the local training method behaves like a nonlinear model.
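The retrieval step behind per-sentence tuning can be sketched as follows: for each test sentence, select the most similar development sentences and tune a sentence-specific weight vector on that subset. This is a minimal sketch assuming a simple word-overlap (Jaccard) similarity; the similarity measure and all names here are illustrative, and the thesis itself addresses the efficiency of the subsequent tuning with an incremental method.

```python
def jaccard(a, b):
    """Word-overlap similarity between two whitespace-tokenized sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_neighbors(test_sentence, dev_sentences, k=2):
    """Pick the k development sentences most similar to the test sentence;
    a sentence-specific weight vector would then be tuned on this subset."""
    ranked = sorted(dev_sentences,
                    key=lambda s: jaccard(test_sentence, s),
                    reverse=True)
    return ranked[:k]
```

Because the tuned weights vary with the test input, the overall decision function is no longer linear in the features, which is the sense in which local training yields a nonlinear model.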
(4) When modeling the translation phenomenon, the log-linear model has two limitations: its features are strictly required to be linearly related to the model function, which may lead to inadequate modeling; and it cannot further abstract or interpret its surface features. Modeling translation with neural networks is a potential way to alleviate these problems. On the one hand, neural networks go beyond the linear limitation and can approximate arbitrary continuous functions, so their modeling is more adequate; on the other hand, by introducing hidden units, they can abstract and interpret the surface input features. However, when modeling and decoding are considered together, classical neural networks run into a severe decoding-efficiency problem due to their inherent characteristics. To address this, this paper proposes a variant neural network called the Additive Neural Network for machine translation, and investigates an efficient method for its discriminative training.
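The property that keeps decoding efficient is additivity: the nonlinearity is applied to each translation rule's features separately, and the per-rule scores are summed, so partial derivation scores can be cached and reused in dynamic-programming decoding (unlike a classical network applied to the whole derivation at once). The following is a hypothetical sketch with a single tanh hidden layer per rule; all names and dimensions are illustrative.

```python
import math

def rule_score(h, M, b, w):
    """Nonlinear score of one rule's feature vector h:
    w . tanh(M h + b), i.e. a one-hidden-layer network per rule."""
    hidden = [math.tanh(sum(m * x for m, x in zip(row, h)) + bi)
              for row, bi in zip(M, b)]
    return sum(wi * hi for wi, hi in zip(w, hidden))

def derivation_score(rules, M, b, w):
    """Additive network: nonlinearity per rule, scores summed across rules.
    Additivity lets a DP decoder score partial derivations incrementally."""
    return sum(rule_score(h, M, b, w) for h in rules)
```

A classical network would instead apply the nonlinearity to the aggregated features of a whole derivation, breaking this decomposition and forcing the decoder to re-score every hypothesis from scratch.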