Research on Typical Event Extraction Technology in the Field of Music (音乐领域典型事件抽取技术的研究)

Author: Song Fan (宋凡)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: Event Extraction; Event Type Recognition; Event Argument Recognition; Pattern Matching; Maximum Entropy
Degree year: 2009
Advisor: Qin Bing (秦兵)
Discipline code: 081203
Degree-granting institution: Harbin Institute of Technology

Abstract

Event extraction is an important research direction in information extraction. It presents events of interest that are expressed in natural language in a structured form (for example, who did what, where, and when), and it is widely applied in automatic summarization, question answering, and information retrieval. This thesis focuses on event extraction in the music domain and studies two representative event types in depth: concert events and album events.
     Drawing on the concepts of the event extraction task in the ACE evaluation and on its experience in corpus construction, we define in detail the two event types of interest in the music domain and build a corpus. The thesis describes the source of the annotated material, the annotation process, the annotation specification, and the storage format.
     We adopt different strategies for the two key techniques of event extraction, event type recognition and event argument recognition. Event type recognition is simplified to a filtering method that combines keywords with trigger words.
     For event argument recognition, the central question is how to pick out the event arguments from the many entities in the text. We propose two methods, one based on pattern matching and one based on maximum entropy. Building on three event representation models from previous work, and taking into account the characteristics of Chinese and of the syntactic parser we use, we propose a pattern matching method over a simplified dependency tree. The maximum entropy method treats argument recognition as a classification problem: every entity that occurs is taken as a candidate argument, described from several angles by features such as context, neighboring entities, and syntactic structure, and assigned to one of two classes by a maximum entropy classifier. To exploit the strengths of both, the pattern matching method and the maximum entropy method are connected in a cascade to form the final argument recognition solution. On the corpus built for this thesis, the average F-measure reaches 83.84% for event type recognition, 76.41% for event argument recognition, and 67.31% for overall event recognition.
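
The approach summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The keyword and trigger lists, the `Candidate` fields, the dependency-path patterns, and the `score` function standing in for a trained maximum entropy classifier are all hypothetical. The cascade order (patterns first, classifier for the remaining candidates) and the use of the balanced F-measure are likewise assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass

# Hypothetical lexicons; the abstract does not list the actual keywords or triggers.
KEYWORDS = {"concert": {"concert", "tour"}, "album": {"album", "record"}}
TRIGGERS = {"concert": {"hold", "perform"}, "album": {"release", "launch"}}


def event_types(tokens):
    """Keep an event type only if the sentence has both a keyword and a trigger."""
    words = set(tokens)
    return [t for t in KEYWORDS if words & KEYWORDS[t] and words & TRIGGERS[t]]


@dataclass(frozen=True)
class Candidate:
    text: str         # entity mention
    entity_type: str  # e.g. PERSON, TIME, LOCATION
    dep_path: str     # simplified dependency path from trigger to entity


# Hypothetical patterns: (event type, entity type, dependency path).
# The path labels are illustrative only.
PATTERNS = {
    ("concert", "PERSON", "SBV"),
    ("concert", "LOCATION", "ADV>POB"),
}


def score(event_type, cand):
    """Stand-in for a trained maximum entropy classifier: P(candidate is an argument)."""
    return 0.8 if cand.entity_type == "TIME" else 0.2


def recognize_arguments(event_type, candidates, threshold=0.5):
    """Cascade: accept pattern matches first, then ask the binary classifier."""
    accepted = []
    for cand in candidates:
        if (event_type, cand.entity_type, cand.dep_path) in PATTERNS:
            accepted.append(cand.text)
        elif score(event_type, cand) >= threshold:
            accepted.append(cand.text)
    return accepted


def f_measure(predicted, gold):
    """Balanced F-measure over sets of predicted and gold items."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


tokens = ["Jay", "Chou", "will", "hold", "a", "concert", "in", "Beijing"]
candidates = [
    Candidate("Jay Chou", "PERSON", "SBV"),
    Candidate("Beijing", "LOCATION", "ADV>POB"),
    Candidate("May 1", "TIME", "ADV"),
    Candidate("Sony", "ORG", "ATT"),
]
types = event_types(tokens)
arguments = recognize_arguments("concert", candidates)
```

In this toy run the sentence passes the filter for the concert type only. The first two candidates are accepted by patterns, the third by the stand-in classifier, and the fourth is rejected. A real system would replace `score` with a classifier trained on the annotated corpus.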

