Abstract
Comparable corpora are an important foundational language resource, and mining comparable texts online is an effective way to build them at large scale. Choosing suitable source websites and an effective comparability measure simplifies the online mining process. With the English edition of the Global Times (globaltimes.cn) and ifeng.com as the English and Chinese news sources, respectively, an online system for building a Chinese-English comparable news corpus was designed. Test results show that the system generates comparable news pairs continuously and stably.