Identification and Extraction of Phrase-Level Paraphrases
Abstract
A paraphrase is an alternative expression of the same meaning, and paraphrase research is of great significance in many natural language processing applications. The main task of this thesis is the acquisition of phrase-level paraphrase resources. The purpose of this work is to supply more resources to a paraphrase generation model based on statistical machine translation, and thereby improve the quality of paraphrase generation.
     The phrasal paraphrase extraction method in this thesis consists of two steps: acquiring paraphrase phrase candidates and verifying the candidates. Candidate acquisition is based on comparable news; the main advantage of this approach is that comparable news articles are abundant on the Internet, so a paraphrase phrase bank of considerable scale can be built. Extracting candidates from comparable news involves crawling a news corpus, identifying comparable news articles by the similarity of their content and the interval between their publication times, extracting comparable sentences from the comparable news, and finally extracting paraphrase phrases from the comparable sentences. Candidate verification uses binary classification, the key issue being the design of the classification features. The features used here are mainly statistics computed over the paraphrase corpus: a word alignment feature based on the χ² test, a word alignment feature based on mutual information, and a part-of-speech template alignment feature based on the χ² test. The first two are word-level statistical features; the last is a statistical feature that uses part-of-speech information as its template. In addition, we use several simple phrase string similarity features, such as word length ratio, word overlap rate, and edit distance.
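As a minimal sketch of how such a χ²-based word association score might be computed from cooccurrence counts over paired sentences (the function and the counts below are illustrative assumptions, not the thesis's exact implementation):

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square association score for a word pair from a 2x2
    contingency table over paired sentences:
      a: both words occur        b: only the first word occurs
      c: only the second occurs  d: neither occurs
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical counts: a strongly associated pair vs. a weakly associated one.
strong = chi_square(a=40, b=5, c=5, d=950)
weak = chi_square(a=2, b=43, c=43, d=912)
```

A word pair that frequently cooccurs across comparable sentence pairs scores much higher than one whose occurrences are nearly independent, which is what makes the score usable as an alignment feature.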
     The experimental results show that the comparable-news-based method can acquire large-scale paraphrase phrases, and a feature comparison demonstrates that every feature class contributes to classification precision, with the χ²-based word alignment feature contributing the most. The comparable-news-based method acquired 2,961,739 paraphrase phrase candidate pairs with a precision of 21.47%. Classifying these 2,961,739 candidate pairs with the four feature classes, we finally extracted 595,619 paraphrase phrase pairs with a precision of 59.3%, an improvement of 37.83 percentage points.
Paraphrases are alternative ways to convey the same information, and they are important in many natural language processing (NLP) applications. This thesis focuses on the extraction of phrasal paraphrase corpora. The significance of this study is to provide more resources for paraphrase generation models based on statistical machine translation (SMT), in order to improve paraphrase generation performance.
     In this thesis, the phrasal paraphrase extraction method consists of two steps: candidate extraction and paraphrase identification. The candidate extraction method is based on comparable news. The main advantage of this approach is that comparable news articles are abundant on the Internet, so large-scale paraphrase phrases can be extracted. Candidate extraction comprises four steps: crawling a news corpus, obtaining comparable news based on content similarity and publication time interval, extracting comparable sentences from the comparable news, and further extracting paraphrase phrases from the comparable sentences. Paraphrase identification is based on binary classification and focuses on the design of classification features. This thesis mainly uses statistical features computed over the paraphrase corpus, including a word alignment feature based on χ², a word alignment feature based on mutual information, and a part-of-speech template alignment feature based on χ². The first two features are lexical, and the last is used to capture syntactic paraphrases. In addition, we use a few phrase string similarity features, such as word length ratio, word overlap rate, and word edit distance.
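The phrase string similarity features named above can be sketched as follows (assumed formulations: the overlap rate is taken here as a Jaccard ratio and the edit distance is word-level Levenshtein; the thesis may define them slightly differently):

```python
def length_ratio(p1: list[str], p2: list[str]) -> float:
    """Ratio of the shorter phrase length to the longer, in words."""
    if not p1 or not p2:
        return 0.0
    return min(len(p1), len(p2)) / max(len(p1), len(p2))

def overlap_rate(p1: list[str], p2: list[str]) -> float:
    """Fraction of distinct words shared between the two phrases (Jaccard)."""
    s1, s2 = set(p1), set(p2)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

def edit_distance(p1: list[str], p2: list[str]) -> int:
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(p1), len(p2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p1[i - 1] == p2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # delete
                           dp[i][j - 1] + 1,      # insert
                           dp[i - 1][j - 1] + cost)  # substitute/match
    return dp[m][n]

# An illustrative candidate pair.
a = "took part in the game".split()
b = "participated in the game".split()
```

These cheap surface features complement the corpus statistics: a candidate pair with high overlap and low edit distance is more likely to be a true paraphrase, but the statistical alignment features are needed to catch pairs with little surface similarity.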
     The experimental results show that the comparable-news-based candidate extraction method is able to obtain large-scale paraphrase phrases. The feature evaluation shows that each type of feature helps to increase classification performance, especially the χ²-based word alignment feature. Using the comparable-news-based method, we extracted 2,961,739 paraphrase phrase candidate pairs with a precision of 21.47%. After further classification-based identification, we finally obtained 595,619 paraphrase phrase pairs with a precision of 59.3%, an improvement of 37.83 percentage points.
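To make the reported gain concrete, the 37.83% figure is the absolute difference in precision between the raw candidate set and the classifier-filtered set, not a relative improvement (a quick arithmetic check):

```python
candidate_precision = 21.47  # precision of the 2,961,739 raw candidate pairs (%)
final_precision = 59.3       # precision of the 595,619 filtered pairs (%)

# The reported improvement is in absolute percentage points.
improvement = round(final_precision - candidate_precision, 2)
```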
