Research on Chinese-English Cross-Language Topic Discovery Based on the ICE-LDA Model

English title: Analysis and Research on Cross Language Topic Discovery in Chinese and English
Authors: CHEN Xingshu; LUO Liang; WANG Haizhou; WANG Wenxian; GAO Yue
Author affiliations (English): Cybersecurity Research Inst., Sichuan Univ.; College of Computer Sci., Sichuan Univ.
Keywords: topic discovery; cross English-Chinese text; ICE-LDA model; TF-IDF feature extraction; co-occurrence topics
English keywords: topic model; cross language; ICE-LDA model; TF-IDF feature word extraction; co-occurrence topic
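Chinese journal name: SCLH
English journal name: Advanced Engineering Sciences
Institutions: Cybersecurity Research Institute, Sichuan University; College of Computer Science, Sichuan University
Publication date: 2017-03-20
Publisher: Advanced Engineering Sciences (工程科学与技术)
Year: 2017
Issue: v.49
Funding: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Sichuan University Young Teachers Start-up Fund (2015SCU11079)
Language: Chinese
Page: SCLH201702012
Page count: 7
CN: 02
ISSN: 51-1773/TB
Classification number: 103-109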

Abstract

With the Internet developing rapidly against a backdrop of globalization, mining cross-language network data has become a major focus of public opinion analysis in China and abroad. Detecting hot topics in Chinese and English online content effectively and in real time is crucial for understanding public opinion and how it develops. Online news, an important component of online public opinion, has become a major source of fast, convenient information as Internet access has spread. First, this paper collects Chinese and English online news as its data source and proposes ICE-LDA, an improved model built on LDA, to discover co-occurrence topics across English and Chinese online content. Topics produced by the model are vectorized, and pairs of topics are compared using JS distance and the cosine similarity of their topic-text distributions. Second, from the mixed Chinese-English news data gathered by the crawler, a comparable parallel corpus and a non-comparable corpus are constructed for topic modeling; during modeling, the TF-IDF algorithm extracts feature words from each document to remove meaningless noise words and improve the topic feature representation. Finally, two different topic vectorization methods are used to model cross-language co-occurrence topic discovery. Experiments on a real dataset built with the crawler designed in this paper show that the improved topic model can discover cross-language co-occurrence topics in the comparable corpus without requiring prior topic pairs, and can also discover co-occurrence topics when the corpora are unbalanced.
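
The abstract names two measures for comparing topic vectors: JS distance and cosine similarity. The following is a minimal sketch of how such a comparison might look, not the authors' implementation. It assumes each topic is already a probability vector over a shared vocabulary (or over the same set of documents), uses the base-2 Jensen-Shannon divergence as the "JS distance", and the function names (`js_divergence`, `cosine_similarity`, `best_match`) are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), bounded in [0, 1]."""
    # The midpoint m is positive wherever p or q is, so KL is well defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_match(topic, candidates):
    """Index of the candidate topic with the smallest JS divergence."""
    return min(range(len(candidates)),
               key=lambda i: js_divergence(topic, candidates[i]))


# Toy vectors (made up for illustration): one Chinese-side topic and
# two English-side topics over the same four dimensions.
zh_topic = [0.5, 0.3, 0.2, 0.0]
en_topics = [
    [0.1, 0.1, 0.1, 0.7],
    [0.45, 0.35, 0.15, 0.05],
]
match = best_match(zh_topic, en_topics)
```

A lower JS divergence or a higher cosine similarity suggests two topics describe the same event; any threshold for declaring a co-occurrence topic would have to be chosen empirically.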

