The Research and Applications of Chinese Lexical Analysis (中文词法分析的研究及其应用)

Author: Sun Xiao (孙晓)
Thesis level: Doctoral
Discipline: Computer Application Technology
Keywords: Chinese Information Processing; Chinese Lexical Analysis; Conditional Random Fields; Super Function; Machine Translation
Degree year: 2010
Advisors: Gao Qingshi (高庆狮); Huang Degen (黄德根)
Discipline code: 081203
Degree-granting institution: Dalian University of Technology
Submission date: 2009-09-29

Abstract

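In machine translation and other natural language processing tasks, identifying and handling words is a critical foundational step for Asian languages such as Chinese and Japanese. The problems involved remain unsolved, which limits the accuracy and efficiency of machine translation and other natural language processing tasks. Beyond word segmentation, Chinese lexical analysis also covers part-of-speech (POS) tagging and the recognition and POS tagging of unknown words (or new words); these foundational steps are where improving the performance and accuracy of Chinese lexical analysis is hardest.
     First, to address these problems, a new word-lattice-based Chinese lexical analysis method that fuses word-level and character-level information is proposed. The method uses the system lexicon to build a word lattice containing all candidate segmentation and POS-tagging paths. Candidate unknown words and their POS tags are recognized at the same time and added to the lattice, which lowers the computational complexity of unknown word recognition. A word-based conditional random field (CRF) model, combined with global feature templates defined over the whole input path, then selects the final segmentation and POS-tagging result from the lattice. The word-based CRF decodes faster than a character-based CRF and reduces the effects of label bias and length bias; tests on open and closed corpora such as SIGHAN-6 gave satisfactory results. For comparison, the character-based Chinese word segmentation model was also studied further: several external dictionaries and corresponding features were introduced, which further improved its segmentation accuracy. In addition, to meet the need for efficient Chinese lexical analysis, an integrated method based on the longest and second-longest matching algorithm is proposed; because it encodes and decodes with a hidden Markov model, it trains and analyzes quickly.
     Second, for the recognition and tagging of unknown words in Chinese lexical analysis, a hidden-state semi-Markov conditional random field model (Hidden semi-CRF) is proposed, which recognizes unknown words and their POS tags simultaneously. The Hidden semi-CRF combines the strengths of the latent-dynamic conditional random field (LDCRF) and the semi-Markov conditional random field (semi-CRF), with lower computational cost and higher recognition accuracy than the semi-CRF. The unknown words and POS tags it recognizes are added to the word lattice and take part in overall path selection, which improves the overall accuracy of lexical analysis.
     Finally, the results of Chinese lexical analysis are applied directly to a Super Function-based Chinese-Japanese machine translation system, and the original Super Function is extended in three ways: it is divided into sentence-oriented and phrase-oriented Super Functions; the range of variables in a Super Function is widened; and an efficient matching algorithm for finding similar Super Functions is proposed. The extensions reduce the number of entries in the Super Function library, speed up the retrieval of matching Super Functions, and improve translation accuracy and quality.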
Words are the smallest meaningful units that can be used independently, and lexical analysis is the basic step for syntactic tagging, semantic tagging and other deep corpus processing. Most natural language processing systems, such as machine translation, speech synthesis, information extraction and document retrieval, treat the word as the basic processing unit, so correct lexical analysis is of great significance. In machine translation and other natural language processing tasks, the identification of words has been, and still is, problematic in Chinese and other Asian languages such as Japanese. Since written Chinese does not use blank spaces to indicate word boundaries, segmenting Chinese text (Chinese word segmentation) becomes an essential task for Chinese language processing. In Chinese lexical analysis, besides word segmentation, we also need to assign part-of-speech (POS) tags to the words and detect unknown words.
     First, we propose a pragmatic Chinese lexical analyzer that integrates word-level and character-level information, based on the conditional random fields (CRFs) model. The word lattice, which represents all candidate outputs, is built using the system lexicon. A linear-chain CRF selects the final token sequence from the word lattice using rich and flexible predefined features. This pragmatic method based on hybrid CRF models offers a solution to long-standing problems in corpus-based or statistical, word-based or character-based Chinese lexical analysis. For comparison, we also extend the character-based Chinese lexical analysis: several extended dictionaries are added to the system and corresponding features are introduced. We entered this model in the SIGHAN-6 bakeoff and obtained satisfactory results. To meet the demand for efficiency, we build an integrated Chinese lexical analyzer based on the maximum matching and second-maximum matching algorithm, which is encoded and decoded using an HMM. The integrated model therefore has higher training and testing speed.
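The lattice-then-select idea can be illustrated with a minimal sketch. It is not the thesis implementation: the toy lexicon, the per-word costs (a stand-in for the word-based CRF's feature scores, which the abstract does not detail), the `unk` tag and the function names `build_lattice` and `best_path` are all assumptions made for illustration.

```python
import math

# Hypothetical toy lexicon: word -> (POS tag, cost). Lower cost is better.
# The costs stand in for model scores; they are not from the thesis.
LEXICON = {
    "研究": ("v", 2.0),
    "研究生": ("n", 3.0),
    "生命": ("n", 2.0),
    "生": ("v", 4.0),
    "命": ("n", 4.0),
    "起源": ("n", 2.5),
}
UNKNOWN_COST = 10.0  # assumed penalty for a character not covered by the lexicon


def build_lattice(text, lexicon, max_len=4):
    """Return edges[i]: all (end, word, tag, cost) candidates starting at position i."""
    edges = [[] for _ in text]
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            word = text[i:j]
            if word in lexicon:
                tag, cost = lexicon[word]
                edges[i].append((j, word, tag, cost))
        if not edges[i]:
            # No lexicon word starts here: fall back to a single unknown character.
            edges[i].append((i + 1, text[i], "unk", UNKNOWN_COST))
    return edges


def best_path(text, edges):
    """Pick the minimum-cost path through the lattice by dynamic programming."""
    n = len(text)
    best = [math.inf] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        if best[i] == math.inf:
            continue  # position i is not reachable by any candidate path
        for end, word, tag, cost in edges[i]:
            if best[i] + cost < best[end]:
                best[end] = best[i] + cost
                back[end] = (i, word, tag)
    # Trace back from the end of the sentence to recover (word, tag) pairs.
    path = []
    pos = n
    while pos > 0:
        start, word, tag = back[pos]
        path.append((word, tag))
        pos = start
    return path[::-1]


text = "研究生命起源"
print(best_path(text, build_lattice(text, LEXICON)))
```

Here the lattice holds both 研究 + 生命 and 研究生 + 命 as competing paths, and the path score decides between them. In the thesis the scoring role is played by a CRF over the lattice rather than by fixed per-word costs.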
     Second, for unknown words in real-world text, we propose a hidden semi-CRF model, which combines the strengths of the Latent-Dynamic CRF (LDCRF) and the semi-CRF. The proposed hidden semi-CRF, which incorporates character-level and word-level features, is invoked when no matching word can be found in the lexicon, and it detects unknown words and their POS tags simultaneously.
     Third, based on the results from the pragmatic Chinese lexical analyzer, we build an extended Super Function-based Chinese-Japanese machine translation system. We extend the original Super Function in three ways: first, the Super Function is divided into Super Functions for sentences and Super Functions for phrases; second, the scope of the variables is extended; and third, a matching algorithm for Super Functions is proposed. With the extended Super Function, fewer Super Functions are stored in the database while the precision of Chinese-Japanese machine translation is preserved.
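As a rough illustration of template-style translation with variable slots, the sketch below pairs a source pattern with a target pattern and fills the variable through a bilingual dictionary. Everything in it is an assumption: the `SuperFunction` class, the exact-match `match` function, the example pattern and the one-entry dictionary are hypothetical, and the thesis's own matching algorithm searches for similar Super Functions in a way the abstract does not describe.

```python
from dataclasses import dataclass


@dataclass
class SuperFunction:
    """Hypothetical pattern pair: None in `source` marks a variable slot;
    an int in `target` refers to the variable slot with that index."""
    source: list
    target: list


def match(sf, tokens):
    """Return the variable fillers if tokens fit the source pattern, else None."""
    if len(tokens) != len(sf.source):
        return None
    fillers = []
    for pattern_item, token in zip(sf.source, tokens):
        if pattern_item is None:
            fillers.append(token)      # variable slot: accept any token
        elif pattern_item != token:
            return None                # constant part must match exactly
    return fillers


def translate(tokens, functions, dictionary):
    """Apply the first matching pattern; return None when nothing matches."""
    for sf in functions:
        fillers = match(sf, tokens)
        if fillers is None:
            continue
        out = []
        for item in sf.target:
            if isinstance(item, int):
                word = fillers[item]
                out.append(dictionary.get(word, word))  # leave unknown words as-is
            else:
                out.append(item)
        return "".join(out)
    return None


# Assumed example data: the input is already segmented by a lexical analyzer.
functions = [SuperFunction(source=["我", "喜欢", None],
                           target=["私は", 0, "が好きです"])]
dictionary = {"苹果": "りんご"}
print(translate(["我", "喜欢", "苹果"], functions, dictionary))
```

The sketch takes a segmented token list as input, which is why segmentation quality feeds directly into this kind of pattern matching: a wrong word boundary would stop the constant parts from lining up.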

