Outliers are data points that deviate markedly from the rest of a dataset and are generated by a mechanism different from that of the regular data. Some outliers are caused by errors during data formation; such outliers contribute nothing to the description of the dataset and should be removed during data preprocessing. Other outliers, however, may carry very important information. In many fields, such as credit card fraud, communications, misappropriation, and network intrusion, outliers are the main target of the analysis; in scientific fields such as disease diagnosis and astronomical observation, abnormal objects can offer new perspectives and lead to new theories or new applications. Outlier mining aims to discover outliers in datasets for analysis and processing by applying statistics, machine learning, intelligent computing, visualization, and other techniques.
     Because outliers may contain important knowledge, outlier mining has a wide range of applications, and related research has both academic and practical significance. However, with increasingly complex, large-scale, and high-dimensional datasets, identifying and handling abnormal behavior quickly and effectively remains a challenging problem.
     Clustering structure is common in datasets produced by a definite generating mechanism, and different clusters differ distinctly in their characteristics. We focus on finding abnormalities in datasets with clustering methods and on analyzing and explaining the outlying behavior of the outliers. The main contributions are as follows.
     (1) The basic theory and classic algorithms of spectral clustering are analyzed and studied thoroughly. Spectral methods make clustering on complex datasets feasible. A modified algorithm is proposed in which an adaptive, density-related neighborhood size parameter is introduced to calculate the similarity between objects more accurately. The algorithm also selects the optimal number of clusters automatically by calculating dynamic validity indexes under different numbers of clusters. The stable clusters obtained with this algorithm are a precondition for effective outlier detection.
     (2) Spectral clustering is applied to outlier mining, and a spectral clustering based process for analyzing the structure of an unknown dataset and detecting outliers is proposed. It first clusters the dataset with a spectral clustering algorithm, then calculates the outlying factor of the objects in “small” clusters, and finally decides from these values whether the objects are outliers.
     (3) The basic theory and related algorithms of the cloud model are analyzed and studied, and cloud model theory is applied to outlier mining. By combining the concept of the membership of a cloud droplet in a cloud, from cloud model theory, with the concept of the outlierness of an outlier in a dataset, from outlier mining theory, a cloud model based algorithm for detecting outliers and outlying behavior subspaces is proposed.
     (4) Current research in outlier mining concentrates mainly on searching for outliers, while research on the causes of outliers and on the analysis and explanation of outlying behavior is limited. To address this, an algorithm for searching outlying behavior subspaces and key outlying behavior subspaces is proposed. The concept of the outlying paraphrase space of a given outlier is discussed, and on that basis the concepts of “strong paraphrase space” and “weak paraphrase space” for a class of outliers are proposed. Finally, a simple algorithm for generating the “strong outlying paraphrase space” and “weak outlying paraphrase space” of outliers is proposed.
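
     To make the idea behind contributions (3) and (4) more concrete, the following is a minimal Python sketch, not the algorithm proposed in this work. It assumes a one-dimensional normal cloud per attribute, estimates the expectation Ex and entropy En with the commonly cited backward cloud generator (mean and scaled mean absolute deviation), ignores the hyper-entropy He, and takes outlierness to be one minus the membership degree. The function names, the toy data, and the threshold of 0.9 are all hypothetical choices made for illustration.

```python
import math


def backward_cloud(values):
    """Estimate (Ex, En) of a 1-D normal cloud from samples.

    Assumption: the backward cloud generator without certainty degrees,
    Ex = mean, En = sqrt(pi / 2) * mean absolute deviation.
    """
    n = len(values)
    ex = sum(values) / n
    en = math.sqrt(math.pi / 2) * sum(abs(v - ex) for v in values) / n
    return ex, en


def membership(x, ex, en):
    """Certainty degree of x in the cloud (Ex, En); He is ignored here."""
    if en == 0:
        return 1.0 if x == ex else 0.0
    return math.exp(-((x - ex) ** 2) / (2 * en ** 2))


def outlierness_by_attribute(data):
    """Return, for every object, 1 - membership in each attribute's cloud."""
    clouds = [backward_cloud(column) for column in zip(*data)]
    return [
        [1.0 - membership(x, ex, en) for x, (ex, en) in zip(row, clouds)]
        for row in data
    ]


def outlying_subspace(scores, threshold=0.9):
    """Indices of attributes whose outlierness exceeds a hypothetical threshold."""
    return [j for j, s in enumerate(scores) if s > threshold]


# Toy data: the last object is unusual only in the second attribute.
data = [
    (1.0, 10.0), (1.1, 10.2), (0.9, 9.8), (1.0, 10.1),
    (1.2, 9.9), (0.8, 10.0), (1.0, 30.0),
]
scores = outlierness_by_attribute(data)
for row, s in zip(data, scores):
    print(row, outlying_subspace(s))
```

     In this sketch the set of attributes returned for an object plays the role of a crude outlying behavior subspace: it names the dimensions in which the object has low membership. A fuller treatment would evaluate combinations of attributes and not only single ones.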
