Multilingual news extraction via stopword language model scoring
  • Author: Yu-Chieh Wu
  • Keywords: Information extraction; Content identification; Text mining; Corpus construction; Knowledge acquisition
  • Journal: Journal of Intelligent Information Systems
  • Publication year: 2017
  • Publication date: February 2017
  • Volume: 48
  • Issue: 1
  • Pages: 191-213
  • Journal category: Computer Science
  • Journal subjects: Information Storage and Retrieval; Data Structures, Cryptology and Information Theory; Artificial Intelligence (incl. Robotics); IT in Business; Document Preparation and Text Processing
  • Publisher: Springer US
  • ISSN: 1573-7675
Abstract
Web news provides a quick and convenient means to create large document collections. The creation of a web news corpus has typically required the construction of a set of HTML parsing rules to identify content text. In general, these parsing rules are written manually and must be tailored to each web page layout. We address this issue and propose a news content recognition algorithm that is language and layout independent. Our method first scans a given HTML document and roughly localizes a set of candidate news areas. Next, we apply a designed scoring function to rank the candidates and select the best content. To validate this approach, we evaluate the system's performance using 1092 items of multilingual web news data covering 17 global regions and 11 distinct languages. We compare our method with nine published content extraction systems using standard settings. The results of this empirical study show that our method outperforms the second-best approach (Boilerpipe) by 6.04% and 10.79% with regard to the relative micro and macro F-measures, respectively. We also apply our system to monitor online RSS news distribution. It collected 0.4 million news articles from 200 RSS channels in 20 days. A sample quality test on this collection shows that our method achieved 93% extraction accuracy for large news streams.
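
The two-step idea in the abstract (localize candidate areas, then rank them with a scoring function) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not give the actual candidate localization or the stopword language model, so the block-level tag set, the tiny English-only stopword list, and the raw stopword count used as a score are all assumptions made for illustration. The names BlockCollector, stopword_score, and best_block are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: a tiny English-only stopword list for illustration.
# A multilingual system would need stopword statistics per language.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is",
             "that", "for", "on", "with", "as", "it", "was", "by"}

# Assumption: these tags delimit candidate areas; script/style text is ignored.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collect the text between block-level tag boundaries as candidates."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0

    def _flush(self):
        # Normalize whitespace and store the buffered text, if any.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def stopword_score(text):
    """Score a candidate by the number of stopword tokens it contains."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for w in words if w in STOPWORDS)


def best_block(html):
    """Return the candidate block with the highest stopword score."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return max(parser.blocks, key=stopword_score, default="")
```

The intuition behind this stand-in is that running prose is dense in function words, while navigation bars, link lists, and copyright lines contain few of them, so a stopword-based score tends to favor the article body. A raw count also favors longer blocks, which is a deliberate simplification here rather than a claim about the published scoring function.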
