Research on the Detection Platform of Sensitive Topic in Internet-Mediated Public Sentiment

Chinese title: 网络舆情敏感话题发现平台的研究
Author: Feng Ying (冯颖)
Degree level: Master's
Discipline: Communication and Information Systems
Keywords: Internet-mediated public sentiment; sensitive topic; Chinese word segmentation; Cascaded Hidden Markov Model (CHMM); Full Second-order Hidden Markov Model (FHMM2); Latent Semantic Indexing (LSI)
Degree year: 2009
Advisor: Meng Siyi (孟嗣仪)
Discipline code: 081001
Degree-granting institution: Beijing Jiaotong University
Submission date: 2009-06-01

Abstract

The Internet is an important communication channel, and the information it stores and transmits, especially on sensitive topics, strongly influences how public opinion forms and spreads. The potential security threat is correspondingly hard to overstate. Proactive detection of sensitive topics has therefore become an urgent and important research problem. This thesis systematically studies the key techniques of network information analysis and processing that underlie a platform for detecting sensitive topics in Internet-mediated public sentiment: segmenting preprocessed network text into words, storing the results in structured form, and detecting sensitive topics on that basis.
The thesis designs and implements a detection platform that matches word segmentation results against a sensitive-word lexicon. For segmenting Chinese text, the system implements a Chinese lexical analysis method based on the Cascaded Hidden Markov Model (CHMM), which integrates word segmentation, segmentation disambiguation, unknown word recognition, and part-of-speech tagging into a single framework. The sensitive-word lexicon is managed with a singly linked list and serialization, which preserves the integrity and portability of the lexicon. For topic detection, the system takes a reverse-lookup approach: the tokens of each processed topic are looked up in the sensitive-word lexicon, and the topic is flagged as sensitive when they match, which improves detection efficiency.
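The lookup described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes the text has already been segmented into tokens, stores the lexicon in a Python set rather than a linked list, and uses `pickle` for serialization. The names `save_lexicon`, `load_lexicon`, and `find_sensitive_tokens`, and the sample words, are hypothetical.

```python
import io
import pickle


def save_lexicon(lexicon, stream):
    """Serialize the sensitive-word lexicon to a binary stream."""
    pickle.dump(sorted(lexicon), stream)


def load_lexicon(stream):
    """Restore the lexicon as a set for constant-time membership tests."""
    return set(pickle.load(stream))


def find_sensitive_tokens(tokens, lexicon):
    """Return the tokens found in the lexicon, in order, without repeats."""
    seen = set()
    hits = []
    for token in tokens:
        if token in lexicon and token not in seen:
            seen.add(token)
            hits.append(token)
    return hits


# Hypothetical lexicon, round-tripped through serialization (assumed data).
lexicon = {"riot", "leak"}
buffer = io.BytesIO()
save_lexicon(lexicon, buffer)
buffer.seek(0)
restored = load_lexicon(buffer)

# An already-segmented topic; segmentation itself is out of scope here.
tokens = ["data", "leak", "reported", "after", "leak", "rumor"]
hits = find_sensitive_tokens(tokens, restored)
is_sensitive = bool(hits)
```

Each token costs one set lookup, so the work grows with the length of the topic rather than with the size of the lexicon.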
Building on this work, the thesis explores three ways to improve the platform's performance. First, experiments comparing the segmentation precision and recall of the Full Second-order Hidden Markov Model (FHMM2) with those of the Hidden Markov Model (HMM) indicate that FHMM2 has a clear advantage in statistical performance and precision. Second, to improve on existing segmentation dictionaries, it proposes a segmentation dictionary based on a four-character hash mechanism. Third, for semantics-based detection, it proposes a method for identifying sensitive words and evaluating sensitivity that combines keywords with Latent Semantic Indexing (LSI).
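Segmentation precision and recall, the measures used in the comparison above, are conventionally computed by comparing word spans in the system output with those in a gold-standard segmentation. The sketch below assumes that convention. The abstract does not spell out the scoring procedure, and the function names and sample sentence are illustrative.

```python
def word_spans(words):
    """Map a segmentation to the set of (start, end) character spans of its words."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def precision_recall(predicted, gold):
    """Score a predicted segmentation against a gold one over the same text."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("segmentations must cover the same text")
    pred_spans = word_spans(predicted)
    gold_spans = word_spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall


# Illustrative sentence: "研究生命起源" (to study the origin of life).
gold = ["研究", "生命", "起源"]
predicted = ["研究生", "命", "起源"]
precision, recall = precision_recall(predicted, gold)
```

In this example only 起源 occupies the same span in both segmentations, so precision and recall are both 1/3.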
Based on this work, the thesis designs and implements the detection platform for sensitive topics in Internet-mediated public sentiment. Laboratory testing and a trial run on the campus network show that the system runs stably and performs well.

