An evolutionary machine learning framework for big data sequence mining.

Details

Author: Kamath, Uday Krishna
Degree: Ph.D.
Year: 2014
Institution: George Mason University
Department: Information Technology
ISBN: 9781303807275
CBH: 3615022
Country: USA
Language: English
FileSize: 3940675
Pages: 179

Abstract

Sequence classification is an important problem in many real-world applications. Unlike other machine learning data, sequence data contains no "explicit" features or signals that can help traditional machine learning algorithms learn and predict from the data. Sequence data exhibits inter-relationships among its elements that are important in understanding and predicting future sequences. However, finding these relationships has been proven to be an NP-hard problem. Defining features through naive enumeration of element combinations or "brute force" iterative approaches often results in poor predictions. Some algorithms that perform well in prediction lack transparency, i.e., the discriminating features generated by these methods are not easily identifiable. In addition, the size of sequence-based datasets presents practical challenges to most learning algorithms. Most sequence-based datasets contain millions or even billions of instances, for example, the genome-wide sequences of organisms in bioinformatics. At these sizes, classic learning algorithms often become prohibitively expensive, making scalability an important issue. Therefore, there is a need for an approach that can find features/signals in complex sequences, offer meaningful discriminators, produce good predictions, and scale well in time and space. This dissertation addresses the above issues by designing a comprehensive approach in the form of the Evolutionary Machine Learner (EML) framework. This framework can be employed on sequence-based datasets to generate explicit, human-recognizable features while solving the scalability issue. The EML framework includes a novel EA-based feature generation (EFG) algorithm for automatic feature construction.
The power and usefulness of the EFG algorithm are demonstrated by modeling four complex sequencing problems in bioinformatics and generating meaningful, human-understandable features with accuracy comparable to or better than that of state-of-the-art algorithms. The EFG algorithm is also validated on time series classification problems, showing the generic nature of the algorithm in finding the important discriminating patterns that assist in modeling sequence-based data. The EML framework addresses the scalability issue by means of a novel parallel, scalable machine learning algorithm (PSBML) based on spatially structured evolutionary algorithms. PSBML is validated on real-world "big data" classification problems for various properties of meta-learning, scalability, and noise resilience using well-known benchmark datasets. The PSBML algorithm is also proven theoretically to be a large margin classifier with linear scalability in training time and space, giving it a unique distinction among existing large-scale learning algorithms. Finally, the EML framework is validated on a large genome-wide bioinformatics classification problem and a large time series problem, showing that the combined algorithms achieve higher predictive performance, training-time speedup, and the ability to produce human-understandable discriminating signals as features.
