Research on Web News Topic Detection and Tracking

Original title (Chinese): Web新闻话题检测与追踪技术研究
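Author: Luo Cheng (罗成)
Degree level: Master's
Discipline: Signal and Information Processing
Keywords (Chinese): 话题检测与追踪; Web信息采集; 向量空间模型; 命名实体; 话题重心向量; K近邻
Keywords (English): Topic Detection and Tracking; Web Crawler; Vector Space Model; Named Entities; Topic Centroid Vector; K-Nearest Neighbor
Degree year: 2007
Advisor: Li Bicheng (李弼程)
Discipline code: 081002
Degree-granting institution: PLA Information Engineering University (解放军信息工程大学)
Submission date: 2007-04-15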

Abstract

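Topic detection and tracking (TDT) is an intelligent information-acquisition technology that detects newly emerging topics and tracks how they develop. It gathers and organizes scattered information so that a reader can see, as a whole, all the details of a topic and the relationships among the events within it; it therefore has considerable theoretical and practical value in both military and civilian applications. This thesis studies news topic detection and tracking, focusing on web page crawling, web page noise cleaning, news topic detection, and hot topic tracking, and reports results in the following four areas.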
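     First, a web crawler is designed and implemented to meet the requirements that downstream processing places on page collection. During crawling it analyzes the Robots protocol, determines the page type, and extracts the publication time of news pages, extending the functionality of a conventional web crawler. Experiments show that the crawler collects web page information automatically, provides sufficient support for subsequent applications, and generalizes well.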
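The Robots protocol check mentioned above can be illustrated with a minimal sketch. It uses Python's standard `urllib.robotparser` rather than the thesis's own crawler code, which is not shown in this record; the helper name `allowed_urls`, the robots.txt text, the user agent string, and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt text lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and candidate URLs, for illustration only.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""
CANDIDATES = [
    "http://news.example.com/2007/04/story.html",
    "http://news.example.com/private/draft.html",
]

print(allowed_urls(ROBOTS_TXT, "demo-crawler", CANDIDATES))
```

A real crawler would fetch each site's robots.txt once and cache the parsed rules before queueing that site's pages.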
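     Second, starting from how the text content of a web page is represented and from an analysis of the noise inside a page, a web page noise-cleaning method based on the vector space model (VSM) is proposed. The algorithm divides the page content into content blocks according to tags, selects the page's topic content block, and judges each remaining block by comparing its content similarity to the topic block under the vector space model. Experimental results show that the new method outperforms traditional cleaning methods in both the accuracy and the completeness of noise removal.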
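The block-similarity idea can be sketched as follows, under stated assumptions. The abstract does not say how the topic block is chosen or which term weighting is used, so this sketch assumes the block with the most words is the topic block, uses raw term frequencies, and keeps any block whose cosine similarity to the topic block reaches a hypothetical threshold of 0.1. The blocks are assumed to be already split out of the HTML, and Chinese text would need word segmentation first.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector of a text block."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def clean_blocks(blocks, threshold=0.1):
    """Keep the topic block and every block similar enough to it."""
    vectors = [tf_vector(block) for block in blocks]
    # Assumption: the block with the most words is the topic content block.
    topic = max(range(len(blocks)), key=lambda i: sum(vectors[i].values()))
    return [
        block
        for i, block in enumerate(blocks)
        if i == topic or cosine(vectors[topic], vectors[i]) >= threshold
    ]


# Hypothetical page, already split into blocks by tag.
BLOCKS = [
    "Home | Sports | Finance | Login",
    "The earthquake struck the coastal city on Monday and rescue teams reached the city within hours",
    "Rescue teams said the earthquake damaged roads in the city",
    "Copyright 2007 All rights reserved",
]

for kept in clean_blocks(BLOCKS):
    print(kept)
```

Navigation and copyright blocks share no terms with the topic block here, so they fall below the threshold, while the related paragraph is kept.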
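     Third, to address the problem that the dynamic development of events can cause subsequent stories to be misjudged during topic detection, a topic detection method based on an adaptive topic centroid is proposed. The method uses named entities as the feature terms that represent the topic centroid, and it builds an overall topic detector for subsequent stories by combining the initial topic centroid with the centroid obtained after each dynamic update. Experimental results show that the method effectively lowers the miss rate and the false alarm rate, improving topic detection performance.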
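One way to picture a detector that combines the initial centroid with each updated centroid is sketched below. It is an illustration under assumptions, not the thesis's algorithm: named entities are assumed to be extracted already, the combination rule is assumed to be the maximum cosine similarity over all stored centroids, and the class name `TopicDetector` and the threshold of 0.3 are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse entity vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TopicDetector:
    """Keeps the initial centroid plus one centroid per accepted story."""

    def __init__(self, seed_entities, threshold=0.3):
        self.totals = Counter(seed_entities)       # summed entity counts so far
        self.n_stories = 1
        self.centroids = [Counter(seed_entities)]  # initial centroid is never discarded
        self.threshold = threshold

    def score(self, entities):
        # Assumed combination rule: best match against any stored centroid.
        story = Counter(entities)
        return max(cosine(story, centroid) for centroid in self.centroids)

    def observe(self, entities):
        """Return True and adapt the centroid if the story is on topic."""
        if self.score(entities) < self.threshold:
            return False
        self.totals.update(entities)
        self.n_stories += 1
        mean = Counter({t: c / self.n_stories for t, c in self.totals.items()})
        self.centroids.append(mean)
        return True


# Hypothetical named entities for a drifting topic.
detector = TopicDetector(["Japan", "Tokyo", "Sato"])
print(detector.observe(["Japan", "Tokyo", "Osaka", "Red Cross"]))  # shares entities with the seed
print(detector.observe(["Osaka", "Red Cross", "Kobe"]))            # matches only the updated centroid
print(detector.observe(["Paris", "France"]))                       # unrelated story
```

The second story shares no entity with the initial centroid, so it is only accepted because the updated centroid is also consulted, which is the kind of event drift the paragraph describes.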
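     Finally, to address the sparseness of positive training examples, an improved KNN topic tracking method is proposed. The method modifies the traditional KNN classifier and applies it to topic tracking, reducing the influence of densely distributed negative training examples; it also adds a time-window strategy to the tracking process, which lowers the computational complexity. Experimental results show that the method overcomes the sparseness of the training set, improves the efficiency of topic tracking, and keeps tracking robust.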
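The abstract does not give the form of the KNN modification, so the sketch below substitutes a simple averaged variant as an assumption: the mean similarity to the nearest positive examples minus the mean similarity to the nearest negative examples, so that a handful of positives is not outvoted by many negatives. The time window is likewise assumed to mean that only training stories within `window` days of the incoming story are compared. The names `track_score` and `bow`, the parameter values, and the data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def bow(text):
    """Bag-of-words vector for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def track_score(story, positives, negatives, k_pos=2, k_neg=2, window=30):
    """Score a (day, vector) story; a positive score suggests it is on topic."""
    day, vec = story

    def top_mean(examples, k):
        # Time window: skip training stories too far away in time.
        sims = sorted(
            (cosine(vec, v) for d, v in examples if abs(day - d) <= window),
            reverse=True,
        )[:k]
        return sum(sims) / len(sims) if sims else 0.0

    return top_mean(positives, k_pos) - top_mean(negatives, k_neg)


# Hypothetical training set: few positives, more negatives.
POSITIVES = [
    (1, bow("earthquake hits coastal city")),
    (3, bow("rescue teams search city after earthquake")),
]
NEGATIVES = [
    (2, bow("stock market closes higher")),
    (2, bow("football final tonight")),
    (4, bow("city council votes on budget")),
    (200, bow("earthquake drill held at school")),
]

on_topic = (5, bow("earthquake rescue continues in city"))
off_topic = (5, bow("market rally lifts stock prices"))
print(track_score(on_topic, POSITIVES, NEGATIVES) > 0)
print(track_score(off_topic, POSITIVES, NEGATIVES) > 0)
```

Because each side is averaged over its own nearest neighbors, adding more negative examples does not by itself push the score down, and the window keeps the number of similarity computations bounded as the story stream grows.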
Topic Detection and Tracking (TDT) is an event-based information organization task that detects the appearance of new topics and tracks their reappearance and evolution. Its purpose is to organize information efficiently and to help people find what they want easily. It is of theoretical and practical value in military and other fields. This dissertation studies the models, algorithms, and applications of several key research topics in TDT, including web crawling, web noise cleaning, and news topic detection and tracking. The major contributions of this dissertation are as follows:
     First, this dissertation designs and implements a general web crawler to meet the demands of the subsequent TDT stages; the crawler analyzes the Robots protocol, classifies the web page type, and parses the news time. Experiments show that the web crawler generalizes well, downloads web pages automatically, and provides sufficient support for subsequent information applications.
     Second, combining knowledge of the noisy information embedded in web pages with the way web content is represented, a new VSM-based algorithm for web noise cleaning is presented. The approach divides the web content into blocks according to HTML tags, picks out the topic content block, and identifies web noise by comparing the similarity between the topic content and the remaining blocks. Experiments show that this algorithm outperforms traditional methods in both the completeness and the accuracy of web cleaning.
     Third, a topic detection method based on an adaptive centroid vector is proposed to overcome a shortcoming of current adaptive methods. The new method introduces named entities to represent a topic and combines the initial topic centroid vector with every modified centroid vector for topic detection. Experiments show that the new algorithm lowers the probability of miss and false alarm errors and improves the performance of the topic detection system.
     Finally, considering the sparseness of positive examples, a modified KNN-based topic tracking method is introduced. The new method modifies the traditional KNN classifier for topic tracking and lessens the side effect of densely populated negative examples. Furthermore, a time window is imposed to reduce the complexity of topic tracking. Experiments show that the improved algorithm overcomes the sparseness of the training set and enhances the stability of topic tracking.
