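Research on Time Expression Recognition and Normalization

English title: Research on Temporal Information Recognition and Normalization
Author: Pan Yuequn (潘越群)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: Time expression recognition; Time expression normalization; Information extraction; Conditional Random Field
Degree year: 2008
Supervisor: Qin Bing (秦兵)
Discipline code: 081201
Degree-granting institution: Harbin Institute of Technology
Submission date: 2008-06-01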

Abstract

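In natural language, time is an important carrier of meaning. Temporal information marks how things change, and people follow the whole course of an event by learning when it begins, develops, and ends. Temporal information recognition is widely used in information extraction, question answering, summarization, and topic detection and tracking.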
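     This thesis briefly reviews the state of research and the main methods for temporal information recognition, and introduces the TIMEX2 annotation standard. It applies a rule-based method and statistical methods to Chinese time expression recognition, and explores the recognition and normalization of English time expressions.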
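     For rule-based Chinese time expression recognition, the thesis follows the syntactic criteria that define the extent of a time expression and identifies expressions through dependency parsing. Integrating an error-driven method into the dependency-based approach improves the results substantially, and the final result exceeds 76%.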
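As a rough illustration of how a dependency parse can delimit the extent of a time expression, the sketch below takes the subtree under a temporal head word and reports the token span it covers. This is a minimal sketch under stated assumptions, not the thesis's rule set: the function name `subtree_span`, the hand-written parse, and the choice of head word are all hypothetical, and the error-driven correction step is not shown.

```python
from typing import List, Tuple

def subtree_span(heads: List[int], root: int) -> Tuple[int, int]:
    """Return (first, last) token indices covered by the subtree at `root`.

    heads[i] is the index of token i's head, or -1 for the sentence root.
    Taking min/max assumes the subtree is contiguous (a projective parse).
    """
    members = {root}
    changed = True
    while changed:  # keep adding tokens whose head is already in the subtree
        changed = False
        for i, h in enumerate(heads):
            if h in members and i not in members:
                members.add(i)
                changed = True
    return min(members), max(members)

# Hypothetical, hand-written parse of "他 去年 夏天 毕业" (he graduated last summer):
# 去年 -> 夏天, 夏天 -> 毕业, 他 -> 毕业, and 毕业 is the sentence root.
tokens = ["他", "去年", "夏天", "毕业"]
heads = [3, 2, 3, -1]

first, last = subtree_span(heads, 2)  # index 2 is the temporal head 夏天
extent = "".join(tokens[first:last + 1])
```

Here the modifier 去年 is pulled into the extent because it depends on the temporal head, which is the kind of decision a syntactic extent criterion has to make.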
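     For statistical Chinese time expression recognition, the thesis applies SVM, CRF, and an improved CRF method in turn. It reports this as the first application of CRF to Chinese time expression recognition, selects a set of effective features, and extends them with feature templates. Evaluated with the standard ACE scoring tool, the final recognition result exceeds 90%. The evaluation shows that the statistical methods outperform the rule-based method, that CRF outperforms SVM among the statistical methods, and that the improved CRF method raises recognition efficiency without degrading recognition quality.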
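Sequence-labeling recognizers such as CRF are usually trained on per-token labels and features expanded from templates. The sketch below shows one common setup, assuming character-level BIO labels and a small window of context offsets. The label names, the `TEMPLATES` list, and the helper functions are illustrative assumptions; the abstract does not specify the features the thesis actually used.

```python
from typing import Dict, List, Tuple

def bio_labels(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Label each character B-TIME, I-TIME, or O from [start, end) spans."""
    labels = ["O"] * len(text)
    for start, end in spans:
        labels[start] = "B-TIME"
        for i in range(start + 1, end):
            labels[i] = "I-TIME"
    return labels

# Hypothetical templates: each tuple lists relative offsets whose characters
# are combined into a single feature (unigrams and adjacent bigrams).
TEMPLATES = [(-1,), (0,), (1,), (-1, 0), (0, 1)]

def char_at(text: str, i: int) -> str:
    """Character at position i, or a padding symbol outside the sentence."""
    return text[i] if 0 <= i < len(text) else "<PAD>"

def expand_features(text: str, pos: int) -> Dict[str, str]:
    """Apply every template at position `pos` to produce named features."""
    feats = {}
    for offsets in TEMPLATES:
        name = "c[" + ",".join(str(o) for o in offsets) + "]"
        feats[name] = "/".join(char_at(text, pos + o) for o in offsets)
    return feats

text = "他2008年6月毕业"
labels = bio_labels(text, [(1, 8)])  # the span covers "2008年6月"
feats = expand_features(text, 5)     # features for the character 年
```

Adding a tuple to `TEMPLATES` enlarges the feature set without touching the extraction code, which is the usual appeal of template-based feature expansion.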
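     For English time expression recognition and normalization, an SVM first recognizes time expressions and assigns each one to a class, and class-specific rules then normalize the expressions. Introducing a statistical method into normalization gives better results than a purely rule-based approach and reduces the effort of writing rules.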
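The classify-then-normalize idea can be sketched as a two-stage function: a classifier picks a class for each expression, and a rule written for that class produces an ISO 8601 style value of the kind TIMEX2 uses. In this minimal sketch the `classify` function is a pattern-based placeholder standing in for a trained SVM, and the three classes, the regular expressions, and the reference-date handling are assumptions made for illustration only.

```python
import re
from datetime import date, timedelta
from typing import Optional

MONTHS = {m: i for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}
DAY_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}
UNITS = {"day": "D", "week": "W", "month": "M", "year": "Y"}

ABSOLUTE = re.compile(r"^([a-z]+) (\d{1,2}), (\d{4})$")
DURATION = re.compile(r"^(\d+) (day|week|month|year)s?$")

def classify(expr: str) -> str:
    """Placeholder for a trained classifier: assigns a class by pattern."""
    e = expr.lower().strip()
    if e in DAY_OFFSETS:
        return "RELATIVE_DAY"
    m = ABSOLUTE.match(e)
    if m and m.group(1) in MONTHS:
        return "ABSOLUTE_DATE"
    if DURATION.match(e):
        return "DURATION"
    return "UNKNOWN"

def normalize(expr: str, reference: date) -> Optional[str]:
    """Apply the rule for the expression's class; None if no rule applies."""
    e = expr.lower().strip()
    kind = classify(e)
    if kind == "RELATIVE_DAY":  # resolve against the document date
        return (reference + timedelta(days=DAY_OFFSETS[e])).isoformat()
    if kind == "ABSOLUTE_DATE":
        month, day, year = ABSOLUTE.match(e).groups()
        return date(int(year), MONTHS[month], int(day)).isoformat()
    if kind == "DURATION":      # ISO 8601 duration, e.g. P3D
        count, unit = DURATION.match(e).groups()
        return f"P{count}{UNITS[unit]}"
    return None

doc_date = date(2008, 6, 1)
values = [normalize(s, doc_date) for s in ("June 1, 2008", "yesterday", "3 days")]
```

Because each rule only sees expressions of one class, the rules stay short; the classifier absorbs the variation that a pure rule system would otherwise need many more patterns to cover.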
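     In summary, the thesis explores Chinese time expression recognition and English time expression recognition and normalization, and obtains good results and useful conclusions.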

