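Research on Automatic Detection Techniques for Chinese Events on the Web

English title: Technique Research of Web Chinese Event Automatic Detection
Author: Liu Song (刘嵩)
Degree level: Master's
Discipline: Signal and Information Processing
Chinese keywords: 事件抽取 (event extraction) ; 触发词 (trigger) ; K-means聚类 (K-means clustering) ; 时间表达式 (time expression) ; 概念相似度 (concept similarity) ; 话题检测 (topic detection)
English keywords: Event Extraction ; Trigger ; K-means Clustering Algorithm ; Time Expression ; Concept Similarity ; Topic Detection
Degree year: 2010
Advisor: Li Bicheng (李弼程)
Discipline code: 081002
Degree-granting institution: PLA Information Engineering University (解放军信息工程大学)
Thesis submission date: 2010-04-15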

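Abstract

Amid the rapid development of modern communication and Internet technologies, a current research focus in intelligent information processing is how to use events as the organizing thread: analyzing the elements that make up an event, extracting the event and describing it precisely, and thereby collecting information of interest from massive Internet data quickly and accurately. This thesis studies automatic detection of Chinese events on the web, covering automatic Chinese event annotation, time information extraction, automatic event extraction, and automatic topic detection based on event extraction. Its results fall into three areas:
     (1) Time information in Chinese event extraction is studied in detail, and a time information extraction method based on user-defined rules is proposed. To address the narrow extraction target of traditional time information extraction, the method classifies the time information that appears in text in detail and makes the scope of extraction explicit. Then, following the patterns in which times occur in text, it uses regular expressions to define separate extraction rules for different kinds of time expression, which yields time information extraction driven by user-defined rules. Experimental results show that the new method achieves higher precision and recall than traditional methods and is an effective approach to time information extraction.
     (2) Chinese event extraction is studied. To address the restriction of traditional methods to predefined event types, a trigger-guided self-similarity clustering method for event extraction is proposed. The method departs from the traditional practice of classifying with words as instances: it brings clustering into event type identification and applies the K-means algorithm to event extraction. Guided by event triggers, it uses a max-min self-similarity strategy to let the value of K in K-means converge automatically, which optimizes the clustering and completes event type identification. Finally, it describes the event arguments in detail based on the named entities in the text and their positions, which removes the dependence of event extraction on type templates and completes Chinese event extraction. Experimental results show that the new method outperforms traditional methods in both precision and recall and offers a new approach to Chinese event extraction.
     (3) The application of event extraction to topic detection is studied. In place of the cosine of the angle between vectors that traditional topic detection uses to compute text similarity, a topic detection method based on concept similarity is proposed. The method first analyzes the samples to be detected and the topic set, extracts the event arguments and their descriptive information, and builds a vector space model of the text. It then uses HowNet (知网) knowledge to compute concept similarity, word similarity, and text-unit similarity, which together complete the concept similarity computation. Finally, topics are detected automatically by comparing similarities. Experimental results show that, compared with traditional topic detection methods, the new method produces well-defined topics with lower miss and false-alarm rates and is an effective method for automatic topic detection.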
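
Contribution (1) describes classifying time expressions and writing one regular-expression rule per class. The abstract does not list the actual classes or rules, so the following is only a minimal Python sketch of that idea under stated assumptions: the four categories (`date`, `clock`, `relative`, `duration`), the patterns, and the helper name `extract_times` are all hypothetical, and overlapping matches are resolved by keeping the earliest and longest one.

```python
import re

# Building blocks (illustrative assumptions, not the thesis's rule set).
NUM = r"[0-9零〇一二三四五六七八九十两]+"      # Arabic or Chinese numerals
YEAR = r"[0-9]{4}年"                          # four-digit year
PERIOD = r"(?:凌晨|早上|上午|中午|下午|傍晚|晚上)"  # part-of-day prefix

# One rule per hypothetical category of time expression.
RULES = [
    ("date", re.compile(rf"(?:{YEAR})?{NUM}月{NUM}[日号]|{YEAR}(?:{NUM}月)?")),
    ("clock", re.compile(rf"{PERIOD}?{NUM}[点时](?:{NUM}分|半)?")),
    ("relative", re.compile(r"前天|昨天|今天|明天|后天|去年|今年|明年|上周|本周|下周")),
    ("duration", re.compile(rf"{NUM}(?:个小时|小时|分钟|天|个星期|星期|周|个月|年)")),
]


def extract_times(text):
    """Return (category, matched text) pairs in order of appearance."""
    candidates = []
    for priority, (label, pattern) in enumerate(RULES):
        for m in pattern.finditer(text):
            length = m.end() - m.start()
            # Sort key: earliest start, then longest match, then rule order.
            candidates.append((m.start(), -length, priority, label, m.group()))
    candidates.sort()

    results, last_end = [], 0
    for start, neg_length, _priority, label, value in candidates:
        if start >= last_end:          # skip anything overlapping a kept match
            results.append((label, value))
            last_end = start - neg_length
    return results


if __name__ == "__main__":
    sample = "2010年4月15日上午9点30分，会议持续了3个小时，昨天的报道称下周将公布结果。"
    for label, value in extract_times(sample):
        print(label, value)
```

Keeping the rules in a list makes it easy to add or tighten a category without touching the matching loop. Patterns this loose will produce false positives on real text (for example, 一点 used as "a little"), so a practical rule set would need more context checks.
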
With the rapid development of communication and Internet technologies, event-based collection of public information from the Internet has become an important research area in intelligent information processing. How to detect and describe events, and how to collect information of interest from massive web data quickly and accurately on that basis, is a pressing problem for researchers. This thesis discusses techniques for automatic detection of Chinese events on the web, including automatic Chinese event annotation, time information extraction, automatic event extraction, and event-based web topic detection. The major contributions are as follows:
     (1) A method for time information extraction based on user-defined rules is presented. To address the single extraction target of traditional time extraction methods, the time expressions in text are classified precisely and the scope of extraction is defined. Different rules are then formulated for different kinds of time expression, achieving time information extraction with user-defined rules. Experimental results show that the precision and recall of the new method are superior to those of traditional methods.
     (2) A self-similarity clustering method for event extraction, guided by triggers, is proposed. First, instead of the traditional approach of classifying events by feature words, clustering is adopted to determine event types, using the K-means clustering algorithm. Second, guided by triggers, a min-max clustering strategy is adopted so that K in the K-means algorithm converges on its own, which optimizes the clustering and completes event classification. Third, event arguments are described based on named entities and their positions in the text, which removes the dependence on event type templates and completes Chinese event extraction. Experimental results show that the new method outperforms traditional event extraction methods in precision and recall and provides a new approach to Chinese event extraction.
     (3) A method for automatic topic detection is put forward that is based on document concept similarity instead of the feature-word similarity on a vector space model used in traditional topic detection methods. First, the samples and the topic set are analyzed, event arguments are extracted, and a document vector space model is constructed. Second, concept similarity, word similarity, and text similarity are calculated based on HowNet. Finally, topic detection is carried out based on document concept similarity. Experimental results show that the new method is more effective than traditional methods in the precision and recall of topic detection.
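
Contribution (2) applies K-means to event type identification and lets K converge under trigger guidance with a min-max self-similarity strategy. The abstract does not give the actual criterion, so the sketch below is one possible reading, not the thesis's algorithm. It assumes that each trigger word supplies one initial centroid, that sentences are already encoded as numeric feature vectors, and that two clusters are merged whenever the largest similarity between centroids is at least the smallest similarity between any point and its own centroid. The names `kmeans` and `cluster_events` are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(points, seeds, iterations=20):
    """Plain K-means with cosine similarity; K is the number of seeds."""
    centroids = [list(s) for s in seeds]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            best = max(range(len(centroids)), key=lambda i: cosine(p, centroids[i]))
            clusters[best].append(p)
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:       # assignments are stable
            break
        centroids = updated
    return centroids, clusters


def cluster_events(points, trigger_seeds):
    """Shrink K from len(trigger_seeds) until clusters are well separated."""
    seeds = [list(s) for s in trigger_seeds]
    while True:
        centroids, clusters = kmeans(points, seeds)
        kept = [(c, members) for c, members in zip(centroids, clusters) if members]
        centroids = [c for c, _ in kept]
        clusters = [members for _, members in kept]
        if len(centroids) < 2:
            return centroids, clusters
        # Smallest similarity between any point and its own centroid.
        min_within = min(cosine(p, c) for c, members in kept for p in members)
        # Largest similarity between two different centroids.
        max_between, i, j = max(
            (cosine(centroids[i], centroids[j]), i, j)
            for i in range(len(centroids))
            for j in range(i + 1, len(centroids))
        )
        if max_between < min_within:   # assumed stopping rule
            return centroids, clusters
        merged = mean([centroids[i], centroids[j]])
        seeds = [c for k, c in enumerate(centroids) if k not in (i, j)] + [merged]
```

With three seeds where two point in nearly the same direction (say, near-synonymous triggers), the loop merges those two and settles on two clusters. Each pass removes at least one centroid, so the loop always terminates.
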

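Contribution (3) replaces cosine similarity over feature words with concept similarity computed from HowNet. The sketch below illustrates why that matters, under stated assumptions: the small `TOY_WORD_SIMILARITY` table with made-up scores stands in for HowNet, the sememe-level concept computation is collapsed into that table, and text similarity is taken as the symmetric average of each word's best match. The function names and the 0.5 threshold are hypothetical.

```python
import math
from collections import Counter

# Toy stand-in for HowNet-derived word similarity (values are invented).
TOY_WORD_SIMILARITY = {
    frozenset(("earthquake", "quake")): 0.9,
    frozenset(("rescue", "relief")): 0.8,
    frozenset(("troops", "army")): 0.85,
}


def word_similarity(a, b):
    if a == b:
        return 1.0
    return TOY_WORD_SIMILARITY.get(frozenset((a, b)), 0.0)


def concept_similarity(doc_a, doc_b):
    """Average best-match word similarity, taken in both directions."""
    if not doc_a or not doc_b:
        return 0.0

    def directed(src, dst):
        return sum(max(word_similarity(w, v) for v in dst) for w in src) / len(src)

    return (directed(doc_a, doc_b) + directed(doc_b, doc_a)) / 2


def cosine_similarity(doc_a, doc_b):
    """Baseline: cosine over raw term counts, blind to synonyms."""
    ca, cb = Counter(doc_a), Counter(doc_b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(
        sum(c * c for c in cb.values())
    )
    return dot / norm if norm else 0.0


def detect_topic(story, topics, threshold=0.5):
    """Return the best-matching topic name, or None for a new topic."""
    best_name, best_score = None, 0.0
    for name, words in topics.items():
        score = concept_similarity(story, words)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


if __name__ == "__main__":
    topics = {
        "quake_relief": ["quake", "relief", "army"],
        "election": ["vote", "candidate", "ballot"],
    }
    story = ["earthquake", "rescue", "troops"]
    print(cosine_similarity(story, topics["quake_relief"]))
    print(concept_similarity(story, topics["quake_relief"]))
    print(detect_topic(story, topics))
```

The story and the `quake_relief` topic share no surface words, so the cosine baseline scores them as unrelated, while the concept-based score links them through near-synonyms. That is the behavior the abstract attributes to the concept-similarity approach.
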
