Data mining is the computational process of discovering patterns in large data sets, using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. In recent years it has achieved significant results in applications such as banking and communications. Clustering and classification are both key technologies of data mining, and dozens of algorithms are available for any specific data mining task. In practice, however, data mining techniques are severely limited by two main challenges. The first is the lack of rules or knowledge to help practitioners choose proper algorithms for an ongoing data mining project. The second is the difficulty of testing the reliability and operating efficiency of the selected model.
     To address these problems, this research focuses on model selection for data mining, specifically for clustering and classification:
     First, a theoretical framework of model selection for data mining is established. The framework decomposes the model selection problem into a task space, an algorithm space, an evaluation criteria space and an evaluation strategy. The task space contains the description of the ongoing data mining project. The algorithm space defines the algorithms available for the task. The evaluation criteria space contains the metrics used to evaluate the performance of the models. The evaluation strategy describes which multiple criteria decision making (MCDM) methods are used for model ranking and final selection, and how they are applied.
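     The way these four spaces fit together can be illustrated with a small sketch. It is not the thesis's implementation: it assumes TOPSIS as the MCDM method (the abstract does not say which methods are used), and the algorithm names, criteria, scores and weights are hypothetical values chosen only for illustration.

```python
import math

def topsis(matrix, weights, benefit):
    """Score alternatives (rows) over criteria (columns) with TOPSIS.

    benefit[j] is True when a larger value of criterion j is better.
    Returns one closeness score per row in [0, 1]; higher is better.
    """
    n_crit = len(weights)
    # Vector-normalize each criterion column, then apply its weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
             for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)]
                for row in matrix]
    # Ideal and anti-ideal points, taken per criterion.
    cols = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)   # distance to the ideal point
        d_neg = math.dist(row, anti)    # distance to the anti-ideal point
        total = d_pos + d_neg
        scores.append(d_neg / total if total else 0.5)
    return scores

# Algorithm space (hypothetical names).
algorithms = ["A", "B", "C"]
# Evaluation criteria space: accuracy, AUC, training time in seconds
# (hypothetical measurements for one task).
matrix = [
    [0.90, 0.85, 12.0],
    [0.80, 0.80, 3.0],
    [0.60, 0.55, 15.0],
]
weights = [0.4, 0.4, 0.2]          # assumed criterion weights
benefit = [True, True, False]      # training time is a cost criterion

# Evaluation strategy: rank the algorithms by TOPSIS closeness.
scores = topsis(matrix, weights, benefit)
ranking = [name for _, name in sorted(zip(scores, algorithms), reverse=True)]
print(ranking)
```

     In this sketch, algorithm C is worst on every criterion, so it coincides with the anti-ideal point and is ranked last. The relative order of A and B depends on the assumed weights.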
     Second, five representative clustering algorithms are selected from the partition-based, hierarchical, density-based and model-based families of clustering algorithms. An evaluation space is then constructed with 11 performance metrics drawn from external, internal and relative clustering evaluation criteria. After a comprehensive comparative analysis using the MCDM methods, a mechanism for automatic algorithm selection for a specific data mining task is built.
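     As an example of what an external clustering criterion measures, the sketch below computes the Rand index, which compares a clustering against known reference labels. The abstract does not list its 11 metrics, so the choice of the Rand index here is an assumption made for illustration only.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which two labelings agree.

    A pair agrees when both labelings put the two points in the same
    group, or both put them in different groups.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0  # zero or one point: nothing to disagree on
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Same partition with renamed cluster ids: perfect agreement.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that cuts across the reference groups.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

     Because the index is defined on pairs of points, it does not depend on how the cluster ids are named, which is why the first comparison gives full agreement.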
     Third, 12 typical algorithms are selected as the algorithm space from decision tree, function-based, Bayesian, lazy learning and association rule-based classification algorithms. The model selection problem is examined for binary classification and ensemble learning in the context of software defect prediction. The ranking results yield two main achievements. On the one hand, empirical knowledge is acquired for the software defect prediction task. On the other hand, the ranking results are used to guide the design of feature transformations and of new algorithms for specific tasks.
     Fourth, guided by the classification model selection results, a density-based oversampling algorithm and an ensemble feature selection method for imbalanced learning are proposed. The experimental results indicate that the proposed algorithms are effective for imbalanced learning problems.
     Finally, an open-source MCDM software package named DSOLVER is developed for data mining model selection and general decision making problems. DSOLVER provides 14 MCDM algorithms, together with modules for data visualization, standardization, decision analysis, sensitivity analysis, results comparison and group decision making.
