Research on SVM-Based Topic Tracking Methods for Emergency-Event News
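Abstract
The growth of the mobile Internet has brought people into an era of extreme information abundance. At the same time, the rapid and disorderly expansion of online information makes it increasingly difficult to discover and manage the information that is valuable. Because emergency events are random and uncertain, the information available to decision-makers may be incomplete or out of date, and its accuracy and validity are hard to guarantee as it is fed back and processed, which leads to information distortion. How to obtain the relevant reports, and information on how an emergency event develops and evolves, comprehensively and accurately is therefore a problem that needs to be solved.
     Topic detection automatically identifies the latest news topics in a stream of news stories and organizes the stories by topic in a timely way; topic tracking follows a specific news topic. Applying topic detection and tracking can therefore manage and organize news information effectively and meet people's particular needs for news information. This thesis tracks the follow-up reports of emergency events: given topics of interest specified in advance by the user, it filters large volumes of information in real time and produces the ongoing development of the relevant topics, so that the full picture of an event can be grasped.
     This thesis represents emergency-event news documents with a multi-vector space model built from several sub-vectors. After analyzing common text classification algorithms, it implements a topic tracking system based on the SVM classification algorithm. To address the drift of the topic itself during tracking, it proposes an improved topic tracking system that detects and models the novel information contained in pseudo-relevance feedback during tracking and, on that basis, uses the multi-vector space model to adjust the topic space dynamically, so as to follow topic drift and reduce the miss rate.
     The main work of this thesis is as follows:
     1. The downloaded and processed emergency-event news corpus is analyzed. Words are used as candidate features, and the feature words are divided into five categories (person names, time expressions, place names, organization names, and content), forming five sub-vectors; a news document is represented by the five-sub-vector space model. Time similarity and location similarity are computed from the distance between report times and from a degree-of-association measure, respectively, and the position of a feature word is taken into account when computing its weight. Finally, the information in emergency-event texts is divided into two classes, objective information and subjective information, which lays a theoretical foundation for further research.
     2. For story link detection, multi-vector model construction is combined with an SVM-based classification algorithm, and good results are obtained.
     3. To address topic drift during tracking, a topic tracking system is built using an improved method based on core and novel parts.
     4. An experimental system that performs story link detection and topic tracking is designed; it identifies the follow-up reports of a given topic well.
     Finally, 10 topics comprising 260 reports were selected from the collected and processed emergency-event news corpus for comparative testing, to verify the feasibility and effectiveness of the proposed method. The experimental results show that the proposed method improves the efficiency of the emergency-event topic tracking system to some extent.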
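     The five-sub-vector representation described in item 1 can be illustrated with a minimal sketch. It is not the thesis's implementation: the names `SUBVECTORS`, `build_story`, `cosine`, and `pair_features` are hypothetical, raw term frequency stands in for the position-aware weighting, and plain cosine similarity is used for all five sub-vectors, whereas the abstract states that time and location similarity use a report-time distance and a degree-of-association measure. The resulting five similarities form the kind of feature vector that a classifier such as an SVM could take as input for a pair of stories.

```python
import math
from collections import Counter

# Sub-vector names follow the five categories in the abstract (names are assumptions).
SUBVECTORS = ("person", "time", "place", "organization", "content")


def build_story(tagged_terms):
    """Group (term, category) pairs into one term-frequency Counter per sub-vector."""
    story = {name: Counter() for name in SUBVECTORS}
    for term, category in tagged_terms:
        story[category][term] += 1
    return story


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors; 0.0 if either is empty."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_features(story_a, story_b):
    """Return one similarity per sub-vector: a 5-dimensional feature vector."""
    return [cosine(story_a[name], story_b[name]) for name in SUBVECTORS]


# Illustrative data only; the terms and their categories are invented for this example.
topic_story = build_story([
    ("Zhang", "person"),
    ("Beijing", "place"),
    ("earthquake", "content"),
    ("rescue", "content"),
])
new_story = build_story([
    ("Beijing", "place"),
    ("earthquake", "content"),
    ("aid", "content"),
])
features = pair_features(topic_story, new_story)
print(dict(zip(SUBVECTORS, features)))
```

     Keeping one similarity per category, rather than a single overall score, lets the classifier learn how much weight each kind of named entity deserves.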
The mobile wireless Internet has ushered in an era of very rich information. However, the rapid and disorderly expansion of network information makes discovering and managing valuable information difficult. Because emergency events are random and uncertain, decision-makers may not have comprehensive information available. During information feedback and processing, accuracy and validity cannot be guaranteed, resulting in information distortion. How to obtain comprehensive and accurate reports of emergency events and of their evolution is a problem that needs to be addressed now.
     Topic detection identifies new topics in a stream of news stories and organizes the stories by topic. Topic tracking follows given topics and retrieves the relevant stories from the news stream. Applying topic detection and tracking techniques therefore manages the information effectively. We track the follow-up stories of an emergency event based on the topics people are interested in, which lets people know the latest evolution of the event.
     We build a multi-vector space model for emergency events. After analyzing text classification algorithms, we apply the SVM classification algorithm to topic tracking. To find and track topic shift in the topic tracking task, this paper proposes an improved topic tracking system, which detects the novel information in topic tracking feedback and modifies the topic model based on the VSM, in order to track topic shift effectively.
     The main work of this paper is as follows:
     (1) By analyzing the processed corpus, we divide the information in emergency-event texts into two types: objective information and subjective information. We use terms as candidate features and divide the feature words into five categories (person names, times, place names, organization names, and content), forming five sub-vectors; the five-sub-vector space model represents the document information. The position of a word is given special consideration in the weight calculation.
     (2) Link detection is based on the combination of the multi-vector model and the SVM classification algorithm, and achieves good results.
     (3) To handle topic shift in the topic tracking task, we build a topic tracking system based on an improved model of core and novel parts.
     (4) We designed an experimental system that performs topic link detection and topic tracking; it can track the follow-up stories of emergency news effectively. Finally, we test on 10 topics, about 260 stories, from the emergency news corpus. The results show that the method improves the efficiency of tracking emergency events to a certain extent.
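     The idea in item (3), a topic made of a core part and a novel part that is updated from feedback, can be sketched as follows. This is a simplified illustration under stated assumptions and not the system built in the paper: the class `TopicModel`, its `threshold` and `novel_weight` parameters, and the sample terms are all hypothetical; a single term vector replaces the five sub-vectors; and a cosine threshold stands in for the SVM decision.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors; 0.0 if either is empty."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TopicModel:
    """Topic with a fixed core part and an adaptive novel part (hypothetical design)."""

    def __init__(self, seed_terms, threshold=0.3, novel_weight=0.3):
        self.core = Counter(seed_terms)   # terms from the initial on-topic stories
        self.novel = Counter()            # terms absorbed later from feedback
        self.threshold = threshold        # assumed decision threshold
        self.novel_weight = novel_weight  # assumed mixing weight for the novel part

    def score(self, story_terms):
        """Weighted mix of similarity to the core part and to the novel part."""
        story = Counter(story_terms)
        core_sim = cosine(self.core, story)
        novel_sim = cosine(self.novel, story)
        return (1.0 - self.novel_weight) * core_sim + self.novel_weight * novel_sim

    def track(self, story_terms):
        """Judge a story; if on-topic, absorb its terms that the core does not contain."""
        on_topic = self.score(story_terms) >= self.threshold
        if on_topic:
            # Pseudo-relevance feedback: the system trusts its own decision,
            # no human relevance label is involved.
            for term in story_terms:
                if term not in self.core:
                    self.novel[term] += 1
        return on_topic


# Illustrative data only: the topic drifts from the quake itself to its aftermath.
seed = ["earthquake", "sichuan", "rescue"]
early_story = ["earthquake", "sichuan", "aftershock", "epidemic"]
later_story = ["earthquake", "epidemic", "aftershock", "prevention"]
unrelated_story = ["football", "match", "goal"]

adaptive = TopicModel(seed)
adaptive_results = [adaptive.track(s) for s in (early_story, later_story, unrelated_story)]

static = TopicModel(seed)  # never sees the early story, so its novel part stays empty
static_result = static.track(later_story)

print(adaptive_results, static_result)
```

     In this toy run the later story shares only one term with the core, so a model without feedback misses it, while the model that absorbed novel terms from the early story still accepts it. Because the feedback is unsupervised, a wrongly accepted story would also pollute the novel part, which is why the threshold and the mixing weight need careful tuning.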
