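Data Preprocessing and Pattern Mining in Multiple Data Sources

Original title (Chinese): 多源环境中数据预处理与模式挖掘的研究
Author: Lin Yaojin (林耀进)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: Multiple Data Sources; Quality Assessment; Label Propagation; Pattern Mining
Degree year: 2014
Advisors: Hu Xuegang (胡学钢); Wu Xindong (吴信东)
Discipline code: 081203
Degree-granting institution: Hefei University of Technology (合肥工业大学)
Thesis submission date: 2014-04-01
Defense committee chair: Cao Jie (曹杰)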

Abstract

With the rapid development of database, network, and other information technologies, the data generated in many practical applications, such as sensor networks, commercial transactions, and social media analysis, are increasingly massive, multi-source, and heterogeneous. These data contain plenty of useful information and valuable knowledge. However, multiple data sources are heterogeneous, autonomous, complex, and inconsistent, which poses great challenges for traditional data mining techniques. Knowledge discovery from multiple data sources, including label propagation, source quality assessment, and pattern mining, is therefore of significant research and practical value. The main contributions of this dissertation are as follows.
     1) Because the structures of different data sources are inconsistent, it is difficult to merge multiple data sources into a single centralized database for learning. By making full use of the class label information in labeled sources and the internal structure information of unlabeled sources, we present two label propagation methods, global consensus and local consensus, which assign class labels to the samples in unlabeled sources. On this basis, we construct a multiple-data-source ensemble learning model and verify the proposed methods in terms of classification accuracy, robustness, and scalability. Experimental results also show that local consensus outperforms global consensus when there are many unlabeled sources.
     2) When learning from multiple data sources, some sources may be irrelevant or redundant, so it is worthwhile to select a set of good sources that can improve learning performance. Starting from the significance of each source and the redundancy between sources, we present a source quality assessment and selection algorithm based on max-significance-min-redundancy, in which significance is the degree to which a source contributes to classification and redundancy is the degree of information overlap among different sources. We then select the top p% of sources to construct the multiple-data-source ensemble learner. Experimental results show that the metric effectively selects sources related to the target mining task.
     3) Every time a customer interacts with a business, there is an opportunity to gain strategic knowledge. As sales grow, stores accumulate large amounts of time-related transactional data, which contain a wealth of information about customers and their purchasing patterns. We divide the transactional data by sale period into multiple time-stamped databases, which together form multiple correlated databases, and present an efficient algorithm for mining patterns in them, with stable patterns as the representative case. First, we define stable items by means of two constraints, minsupp and varivalue. We then measure the similarity between stable items using grey relational analysis and propose a hierarchical grey clustering method for mining stable patterns composed of stable items. Experimental results verify the proposed algorithm in terms of pattern validity, time efficiency, and scalability.
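
The third contribution can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the abstract names the constraints minsupp and varivalue but does not define them, so the rule used here (mean support at least minsupp, population standard deviation of supports at most varivalue) is an assumption, as are the function names `stable_items` and `grey_relational_grade`, the sample data, and the distinguishing coefficient `rho = 0.5`. The similarity function follows Deng's standard grey relational grade, simplified to a single pair of sequences.

```python
from statistics import mean, pstdev


def stable_items(supports, minsupp, varivalue):
    """Keep items whose support is high and steady across the databases.

    supports: dict mapping item -> list of supports, one value per
    time-stamped database.
    ASSUMED rule (not taken from the dissertation): an item is stable when
    its mean support is >= minsupp and the population standard deviation
    of its supports is <= varivalue.
    """
    return {
        item: series
        for item, series in supports.items()
        if mean(series) >= minsupp and pstdev(series) <= varivalue
    }


def grey_relational_grade(reference, other, rho=0.5):
    """Grey relational grade between two equal-length support sequences.

    Simplification: the minimum and maximum differences are taken over
    this single pair only, not over a whole family of comparison sequences.
    """
    deltas = [abs(r - o) for r, o in zip(reference, other)]
    d_min, d_max = min(deltas), max(deltas)
    if d_max == 0:
        return 1.0  # identical sequences are maximally related
    # Average the grey relational coefficients over all time stamps.
    return mean((d_min + rho * d_max) / (d + rho * d_max) for d in deltas)


if __name__ == "__main__":
    # Hypothetical supports of three items in three time-stamped databases.
    supports = {
        "milk": [0.30, 0.32, 0.31],
        "toys": [0.05, 0.40, 0.10],
        "bread": [0.29, 0.33, 0.30],
    }
    stable = stable_items(supports, minsupp=0.2, varivalue=0.05)
    print(sorted(stable))
    print(grey_relational_grade(stable["milk"], stable["bread"]))
```

The pairwise grades computed this way could serve as the similarity matrix for an agglomerative clustering step, which is one plausible reading of the hierarchical grey clustering described above.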

