Mining massive data streams.

Details

Author: Hulten, Geoffrey
Degree: Doctor
Year: 2005
Advisor: Domingos, Pedro
Institution: University of Washington
Major: Computer Science
ISBN: 9780542176296
CBH: 3178083
Country: USA
Language: English
FileSize: 8924581
Pages: 168

Abstract

Many organizations today have more than very large databases; they have databases that grow without limit at a rate of several million records per day. Mining these continuous data streams brings unique opportunities, but also new challenges. In this thesis we develop a method that can semi-automatically enhance a wide class of existing learning algorithms so that they can learn from such high-speed data streams in real time. In particular, our method can be applied to essentially any induction algorithm based on discrete search. After applying our method the algorithm: learns from data streams in an incremental, any-time fashion; runs in time independent of the amount of data seen, while making decisions that are essentially identical to those that would be made from infinite data; uses a constant amount of RAM no matter how much data it sees; and adjusts its learned models in a very fine-grained manner as the data generating process changes over time. We evaluate our method by using it to produce a series of learning algorithms (for decision trees, Bayesian network structure, and clustering), which are all capable of learning from high-speed data streams. We evaluate these learners with extensive studies on synthetic data sets, and by applying them to a collection of massive real-world mining tasks.
