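Research on Mapping Techniques for Heterogeneous Data (异构数据映射技术研究)

English title: Research on Mapping of Heterogeneous Data Integration
Author: 缪嘉嘉
Thesis level: Doctoral
Discipline: Computer Science and Technology
Keywords (translated from Chinese): data integration ; heterogeneous data integration ; schema mapping ; instance matching ; broken mapping detection ; multi-strategy learning ; combination algorithm
English keywords: data integration ; heterogeneous data integration ; schema mapping ; instance matching ; broken mapping detecting ; multi-strategy learning
Degree year: 2008
Supervisor: 吴泉源
Discipline code: 081202
Degree-granting institution: National University of Defense Technology (国防科学技术大学)
Submission date: 2008-10-01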

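Abstract

Data integration is the foundation of information integration. As demands for the comprehensive use of information keep growing, the integration of large-scale heterogeneous data has become a research focus in the field of information integration. The key to heterogeneous data integration is to use mapping techniques to establish consistency among heterogeneous data, including the consistency of data attributes or schemas and the consistency of data entities or tuple instances. This dissertation studies the mapping and matching techniques that establish schema and data consistency in large-scale data integration. It uses machine learning, natural language processing, and fuzzy theory to develop and improve existing methods for schema mapping, instance mapping, and broken-mapping detection. It also extends the heterogeneous data integration platform StarEAI and verifies the effectiveness of the proposed methods and techniques in real applications. The main work of this dissertation is as follows:
     1. For consistency at the schema level, this dissertation proposes MSMA, an instance-based multi-strategy schema mapping method. First, for the case where instance data are well structured, it designs learners based on instance structure (data format, constraint, mean value, and Bayesian learners) from the feature information of a large number of samples, and produces predictive classification models; machine learning is then used to extract the feature information of the data to be matched and to perform schema mapping. It further improves the combination algorithm by taking labels as the combiner's input, which effectively reduces the computational complexity of the combination algorithm. Experimental results show that MSMA reaches a recall of up to 89% and a precision of 93%, and that when schema information is missing its accuracy is 7% higher than that of LSD, a well-known existing mapping method.
     2. For consistency at the data level, this dissertation proposes HIMA, a tuple-instance matching method based on cluster analysis. In terms of framework, HIMA uses a clustering algorithm, which is more efficient than one-to-one (pairwise) matching algorithms. Within the clustering algorithm, a string similarity measure based on conditional probability distributions is used to compute the distance between elements, which effectively improves matching accuracy. In addition, because instance descriptions are lengthy in some applications, this dissertation proposes keyword extraction based on the maximum entropy model to remove uninformative content. Experimental results show that HIMA, using the conditional-probability distance measure and the keyword extraction algorithm, achieves an accuracy of 83%, which is 6% higher than distance-based and token-based algorithms.
     3. For mapping breakage at run time, this dissertation proposes BSMD, a broken-mapping detection method based on fuzzy aggregation operators. It studies the various cases that arise when fusing the results of learners such as the value, trend, and layout learners, and adds a disjunction-weighted fuzzy aggregation operator to improve fusion precision. When fusing the results of training on artificial data and real data, it introduces a variable-weight method, so that the fused result reflects not only the preference over the relative importance of the factors but also the preference over how balanced the factor states are. Experimental results show that BSMD achieves an average accuracy of 85%, which is 7% higher than the existing Maveric method.
     4. Building on the above research, the heterogeneous data integration platform StarEAI, a National 863 Program achievement of our institute, was extended with automatic schema mapping, tuple-instance matching, and run-time broken-mapping detection. The extended platform has been applied successfully in a network monitoring data integration project and in military projects.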
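The label-based combination described in item 1 can be illustrated with a minimal sketch. It is not the MSMA implementation: the three base learners, their rules, the weights, and the function names (`format_learner`, `length_learner`, `keyword_learner`, `combine_labels`, `map_column`) are all hypothetical and chosen only to show a combiner that consumes one predicted label per learner rather than a full score vector.

```python
from collections import defaultdict

# Hypothetical base learners: each maps a column of instance values to a
# predicted target-attribute label. The rules are illustrative assumptions.
def format_learner(values):
    # Surface format: strings made of digits and hyphens look like phones.
    if all(v.replace("-", "").isdigit() for v in values):
        return "phone"
    return "name"

def length_learner(values):
    # Mean string length as a crude numeric feature.
    mean_len = sum(len(v) for v in values) / len(values)
    return "phone" if mean_len >= 7 else "name"

def keyword_learner(values):
    # Marker character typical of e-mail addresses.
    return "email" if all("@" in v for v in values) else "name"

LEARNERS = {
    "format": format_learner,
    "length": length_learner,
    "keyword": keyword_learner,
}

# Assumed learner weights; in practice they would be learned from training data.
WEIGHTS = {"format": 0.5, "length": 0.3, "keyword": 0.2}

def combine_labels(predictions, weights):
    """Weighted vote over the labels predicted by the base learners."""
    totals = defaultdict(float)
    for learner_name, label in predictions.items():
        totals[label] += weights[learner_name]
    return max(totals, key=totals.get)

def map_column(values):
    """Predict the target attribute for one source column of instance values."""
    predictions = {name: learner(values) for name, learner in LEARNERS.items()}
    return combine_labels(predictions, WEIGHTS)

print(map_column(["555-1234", "555-9876"]))
print(map_column(["Alice", "Bob"]))
```

Because the combiner sees only one label per learner, its cost grows with the number of learners rather than with the number of learners times the number of candidate labels, which is the kind of saving the item attributes to using labels as the combiner's input.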
Data integration is the basis of information integration. As the demand for making use of information keeps growing, large-scale heterogeneous data integration has become a hot topic in information integration research. Mapping technology is the key to establishing consistency among heterogeneous data, including consistency at the schema level and consistency at the instance level. This dissertation studies the mapping and matching technologies that maintain consistency among heterogeneous data. By introducing machine learning, natural language processing, and fuzzy theory, we improve the schema mapping and instance matching approaches and optimize the broken-mapping detection algorithm. In practice, we extend the heterogeneous data integration platform StarEAI and verify our approaches in real-world applications. This dissertation makes four contributions:
     First, to address consistency at the schema level, we propose an Instance-based Multi-Strategy Schema Matching Approach (MSMA). Schema mapping uses schema information and other descriptions, along with the characteristics of instances, to identify the relations between different schemas. Existing approaches are either rule-based or machine-learning-based; that is, they build the decision model either manually or automatically. The machine-learning-based approach is more adaptable. A single learner decides whether a relationship holds from one type of available information, whereas a multi-strategy approach considers several types of information. Consequently, the multi-strategy approach makes fuller use of the available information and can improve mapping accuracy. MSMA designs a number of learners to capture the information in instances and improves the multi-strategy combination. The experimental results show that the precision of MSMA is up to 89% and its recall is up to 93%. When schema information is lacking, MSMA is more precise than the existing approach.
     Second, for consistency at the instance level, we propose a Holistic Data Instance Matching Approach (HIMA). Heterogeneous instances are differing descriptions of the same entity in different data sources, and instance matching reconciles them. We measure the similarity of instances with a string distance algorithm; the conditional-probability-based algorithm improves the accuracy of the whole approach. In terms of framework, traditional methods take only two input data sources and perform pairwise matching, whereas HIMA uses a clustering algorithm that can handle large-scale data sources holistically. In addition, we use a keyword extraction method based on the maximum entropy model to remove useless information. The experimental results show that the keyword extraction algorithm achieves 70% precision and that the conditional-probability-based algorithm is more precise than the token-based algorithm. HIMA achieves 83% accuracy.
     Third, to detect broken mappings at run time, we put forward a Fuzzy-based Broken Schema Mapping Detecting Approach (BSMD). In a dynamic distributed environment, data sources tend to undergo changes that invalidate the mappings. Monitoring for such changes continuously is extremely labor-intensive and is a key bottleneck to the widespread deployment of data integration systems. As in the Maveric system, the kernel of BSMD is a set of computationally inexpensive modules called sensors, which capture salient characteristics of data sources. We develop two improvements: Disjunction-Weighted Average Operators are used to calculate the score that indicates whether a mapping is broken, and Change Weight Operators are introduced to combine artificial data with real data in the training phase. Experiments over real-world sources demonstrate the effectiveness of our fuzzy-based approach over existing solutions, as well as the utility of our improvements.
     Finally, based on the above studies, we extend the heterogeneous data integration platform StarEAI, which is the outcome of an 863 project, with three modules: the automatic schema mapping module, the instance matching module, and the broken-mapping detection module. The resulting StarEAI+ system has been successfully deployed in armed forces and network monitoring projects.
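The contrast between pairwise matching and clustering drawn for HIMA can be sketched as follows. This is an illustration under stated assumptions, not the HIMA algorithm: the similarity function is a stand-in built on Python's `difflib.SequenceMatcher` (the conditional-probability measure is not specified in the abstract), and the greedy one-pass clustering, the `cluster_records` name, and the 0.8 threshold are hypothetical choices.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in string similarity in [0, 1]; NOT the conditional-probability
    # measure named in the abstract, which is not specified there.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_records(records, threshold=0.8):
    """Greedy one-pass clustering of instance descriptions.

    Each record joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new
    cluster. Records from any number of sources can be passed together.
    """
    clusters = []
    for record in records:
        for cluster in clusters:
            if similarity(record, cluster[0]) >= threshold:
                cluster.append(record)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([record])
    return clusters

records = [
    "IBM Corporation",
    "IBM Corporation.",
    "Microsoft Corp",
    "ibm corporation",
]
for cluster in cluster_records(records):
    print(cluster)
```

Each record is compared only with cluster representatives, so the number of comparisons grows with the number of clusters instead of with every pair of records, and the input is not restricted to two sources at a time.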

