Discovering rare categories from graph streams
  • Authors: Dawei Zhou; Arun Karthikeyan; Kangyang Wang; Nan Cao…
  • Keywords: Rare category detection; Time-evolving graph; Incremental learning
  • Journal: Data Mining and Knowledge Discovery
  • Publication year: 2017
  • Publication date: March 2017
  • Volume: 31
  • Issue: 2
  • Pages: 400-423
  • Journal category: Computer Science
  • Journal subjects: Data Mining and Knowledge Discovery; Artificial Intelligence (incl. Robotics); Information Storage and Retrieval; Statistics for Engineering, Physics, Computer Science, Chemistry and Earth Sciences
  • Publisher: Springer US
  • ISSN: 1573-756X
Abstract
Massive graph streams are now produced by many real-world applications, such as financial fraud detection, sensor networks, and wireless networks. Despite the high volume of data, usually only a small percentage of the nodes in these time-evolving graphs are of interest. Rare category detection (RCD) is an important topic in data mining that focuses on identifying the initial examples from the rare classes in imbalanced data sets. However, most existing RCD techniques are designed for static data sets and are therefore not suitable for time-evolving data. In this paper, we introduce a novel setting of RCD on time-evolving graphs. To address this problem, we propose two incremental algorithms, SIRD and BIRD, which build on existing density-based techniques for RCD. These algorithms exploit the time-evolving nature of the data by dynamically updating the detection models, enabling “time-flexible” RCD. Moreover, to handle cases where the exact priors of the minority classes are not available, we propose a modified version of BIRD named BIRD-LI. We also identify a critical task in RCD named query distribution, which aims to allocate the limited labeling budget among multiple time steps so that the initial examples from the rare classes are detected as early as possible with the minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data sets.
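
To make the query distribution task more concrete, the following is a minimal sketch of one possible way to split a fixed labeling budget across time steps. It is not one of the strategies from the paper, which the abstract does not describe. The function name `allocate_budget` and the per-step weights (for example, some measure of how much the graph changed at each step) are hypothetical, and the proportional split with largest-remainder rounding is an assumption chosen for illustration.

```python
from typing import List


def allocate_budget(total_budget: int, step_weights: List[float]) -> List[int]:
    """Split an integer labeling budget across time steps in proportion to weights.

    step_weights is a hypothetical per-time-step score (an assumption, not
    taken from the paper); a larger weight means more label queries there.
    """
    if total_budget < 0:
        raise ValueError("total_budget must be non-negative")
    if not step_weights or any(w < 0 for w in step_weights):
        raise ValueError("step_weights must be non-empty and non-negative")

    weight_sum = sum(step_weights)
    if weight_sum == 0:
        # No information about any step: fall back to a uniform split.
        step_weights = [1.0] * len(step_weights)
        weight_sum = float(len(step_weights))

    # Ideal (fractional) share of the budget for each time step.
    exact = [total_budget * w / weight_sum for w in step_weights]

    # Round down first, then hand out the leftover queries one at a time
    # to the steps with the largest fractional remainders.
    queries = [int(x) for x in exact]
    leftover = total_budget - sum(queries)
    by_remainder = sorted(
        range(len(exact)),
        key=lambda i: exact[i] - queries[i],
        reverse=True,
    )
    for i in by_remainder[:leftover]:
        queries[i] += 1
    return queries
```

Calling `allocate_budget(10, [1.0, 1.0, 2.0])` returns a list of three per-step query counts that sum to 10, with the third step receiving the largest share. A strategy that aims to detect rare examples as early as possible would presumably weight early or rapidly changing time steps more heavily, but how to choose those weights is left open here.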
