Research on Entity Matching Methods and Their Application in Credit Reference Systems
Abstract
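An entity is an individual or organization capable of economic activity in the course of social and economic life; in a credit reference system it may denote an individual, a household, an enterprise, or an enterprise group. Entity matching determines whether credit records that differ syntactically describe the same entity semantically. A credit reference system is a credit-file information system covering every entity in the country that is capable of economic activity: it collects credit information scattered across different sectors of society, then consolidates and publishes that information organized by credit entity, thereby building a credit file for each such entity. The credit reference system is the infrastructure of the social credit system, and as the market economy develops it plays an increasingly important role in social and economic life.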
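     Entity matching is the technical foundation for building a unified national credit reference system. Because credit records from different data sources use different identifying primary keys, and because of data-entry errors and differences in format and spelling, achieving the system's functional goals requires a large volume of fuzzy entity matching over credit records. Entity matching in the credit reference system can be divided into three levels: field-level matching, record-level matching, and complex-structure-level matching. In addition, the system poses technical difficulties of its own: the volume of data to be matched is large, and the data sources differ widely, span a broad range, and keep expanding.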
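     Taking entity matching in the credit reference system as its subject, and following the approach of learning a suitable matching function from the data characteristics of each data source, this thesis carries out research in the following areas: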
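     (1) Adaptive field matching is studied, and an adaptive string similarity method based on associated tokens is proposed. Using a set of associated-token operations, the algorithm formally defines homophone similarity, extracts data characteristics such as word frequency and associated-operation frequency from different data sources, and trains a support vector machine to obtain matching classification and similarity functions adapted to those characteristics (word frequency, association type, and so on). Experimental validation and comparative analysis show that the algorithm adapts well to the data quality and association types of different data sources.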
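     (2) Efficient matching of entity credit records that have identifying fields is studied, and a joint grouping model is designed. To match large volumes of entity credit records efficiently, indexing and grouping features are extracted through grouping operators, and the concept of an overall grouping expression in disjunctive form and disjunctive normal form is introduced. Multiple grouping operators are applied jointly to group entity records, which reduces the number of comparisons and improves the efficiency of credit record matching. Finally, a set-cover method is used to find, while preserving matching accuracy, the optimal overall grouping expression for the characteristics of each data source. Experiments show that the method achieves high matching efficiency.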
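     (3) Matching of entity records from multiple data sources without identifying fields is studied. Two methods are designed: a semi-supervised entity matching method based on active learning, and an unsupervised automatic entity matching method based on iterative SVM. The former applies the idea of active learning. It first uses clustering queues to build several matching-function learners that form a learning committee; then, using a matching-entropy formula, the committee actively selects from the candidate training samples the entity record pairs most useful for learning the matching function, so that the identifying fields and the matching function for record pairs are learned autonomously. The latter exploits the SVM's property of maximizing the distance between the separating hyperplane and the support vectors to learn the identifying fields and matching function of a new data source automatically. It first selects an initial training set automatically with the nearest-neighbor method, then trains the SVM iteratively using margin maximization, so that the separating hyperplane gradually approaches the boundary between matching and non-matching entity pairs and the matching function is learned automatically. Experiments analyze the strengths and limiting conditions of both the active-learning method and the iterative SVM method.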
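     (4) Matching of record-cluster entities with complex data structures is studied. Based on the particular data structure of record-cluster entities, a formal mathematical model of record-cluster entity matching is built using weighted bipartite graph theory. To make record-cluster entity matching efficient, an upper-and-lower-bound matching algorithm is designed: by quickly deriving upper and lower bounds for the entity matching threshold, it reduces the number of maximum-weight matching computations over an entity's sub-records. Experiments on data confirm that the proposed model and method effectively improve both the precision and the efficiency of record-cluster entity matching.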
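     (5) Matching of XML semi-structured entities with complex data structures is studied. By computing the weights of different types of attribute nodes within their parent nodes in the XML text and setting a similarity threshold for matching entities, XML transformation rules and an entity matching function are obtained and used to match XML entities. Experimental data show that the method achieves good matching and classification efficiency.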
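The joint grouping idea in item (2) can be illustrated with a minimal sketch. It is not the thesis's model: it assumes the overall grouping expression is a plain disjunction of grouping operators (two records become a candidate pair if any one operator puts them in the same group), and it omits the set-cover step that selects the operators. The function `candidate_pairs`, the field names, and the sample records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, grouping_ops):
    """Return index pairs (i, j), i < j, that share a group under at least
    one grouping operator (a disjunction of operators)."""
    pairs = set()
    for op in grouping_ops:
        groups = defaultdict(list)
        for idx, rec in enumerate(records):
            key = op(rec)
            if key:  # records with a missing key are not grouped by this operator
                groups[key].append(idx)
        for members in groups.values():
            pairs.update(combinations(members, 2))
    return pairs


# Hypothetical credit records; field names are illustrative only.
records = [
    {"name": "Zhang Wei", "id_no": "110101198001010011", "phone": "13800000001"},
    {"name": "Zhang Wei", "id_no": "", "phone": "13800000001"},
    {"name": "Li Na", "id_no": "110101198001010011", "phone": ""},
    {"name": "Wang Fang", "id_no": "310101197505050022", "phone": "13900000002"},
]

# Two hypothetical grouping operators: group by ID number, group by phone.
ops = [lambda r: r["id_no"], lambda r: r["phone"]]

pairs = candidate_pairs(records, ops)
total_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {total_pairs}")
```

Only the candidate pairs are passed on to the more expensive field-level comparison; pairs that share no group under any operator are never compared, which is where the saving comes from.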
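     This thesis builds on the nationally centralized, unified enterprise and personal credit reference systems constructed under the People's Bank of China. It summarizes the technical bottlenecks that their entity matching faces, analyzes the shortcomings of current methods, and distills concrete research problems from them. Most of the entity matching methods proposed here are already in use in the personal and enterprise credit reference systems, where they address the difficulties of entity matching across multiple data sources, massive data volumes, and complex structures, with results in production largely consistent with the experimental results. The enterprise credit reference system currently collects and matches credit information from 882 institutions in 15 major categories, including credit, settlement accounts, social security contributions, and environmental violations. The personal credit reference system collects and matches credit information from 702 institutions in 11 major categories, including credit, housing provident fund contributions, pension insurance, and telecom arrears. The construction goal of a comprehensive, unified system for consolidating entity credit information has largely been achieved.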
Entities are individuals or organizations capable of economic activity. In the credit reference system, an entity may be an individual, a family, an enterprise, an enterprise group, and so on. Entity matching determines whether entities described with different syntax have the same semantics. The main function of the credit reference system is to collect credit data scattered across different departments of society, and then to classify and release the credit information by credit entity. The goal of the system is to build a credit information management system that covers every entity in the country capable of economic activity.
     Entity matching is the technical basis of the credit reference system, which requires a large number of fuzzy matching operations on credit entities. The reasons are as follows: first, the primary keys of credit entities differ across information sources; second, the credit data contain various problems, such as input errors, spelling errors, and format differences. Entity matching in the credit reference system can be classified into three levels: field-level matching, record-level matching, and complex-structure entity matching. Furthermore, the credit reference system must resolve difficult technical problems such as the huge volume of data to be matched and the great differences among data sources.
     This thesis studies entity matching in the credit reference system. It proposes solutions and algorithms for entity matching based on learning the appropriate matching function from the characteristics of each data source. The main contents are as follows.
     (1) The problem of adaptive field matching is studied, and an adaptive string similarity calculation method based on associated tokens is proposed. Using sets of associated-token operators, the proposed algorithm formally defines homophone similarity and extracts data characteristics such as word frequency and associated-operator frequency from different data sources. A support vector machine is then trained to compute matching classification and similarity functions adapted to data characteristics such as word frequency and association type. Experiments and comparative analysis show that the method adapts well to different data quality and association types.
     (2) The problem of efficiently matching massive entity data is studied, and a joint grouping model is designed. Indexing and grouping characteristics are abstracted through grouping operators, and the concepts of overall grouping expressions in disjunctive form and disjunctive normal form are introduced. Several grouping operators can be applied jointly to the same data source to group the entity records, reducing the number of record comparisons during entity matching. Then, by solving a set-cover problem, the best overall grouping expression for each data source is computed while preserving the accuracy of the matching operation. Experiments show that the method improves the efficiency of the matching operation.
     (3) The problem of matching multi-source records without identifying fields is studied. This thesis proposes a semi-supervised entity matching method based on active learning and an unsupervised automatic matching method based on iterative SVM learning. The active-learning method constructs multiple matching-function learners and assembles them into a learning committee. During learning, the committee itself chooses from the candidate training samples those that yield the most information for learning the entity matching function. The iterative SVM method exploits the SVM's maximization of the distance between the support vectors and the separating hyperplane. It proceeds in two stages: the first uses the nearest-neighbor method to select the initial training samples automatically; the second uses the SVM's margin-maximization property to train the SVM automatically over successive iterations. Experiments are used to analyze the merits and limiting conditions of both the active-learning method and the iterative SVM method.
     (4) The problem of matching record-cluster entities is studied. Based on the special data structure of record-cluster entities, a formal record-cluster entity matching model is set up using weighted bipartite graph theory. An upper-and-lower-bound matching algorithm for record-cluster entities is designed. By quickly deriving upper and lower bounds for the entity matching threshold, it reduces the number of maximum-weight matching computations over an entity's sub-records. Experiments on data confirm that the proposed matching model and method effectively raise the precision and efficiency of record-cluster entity matching.
     (5) The problem of matching XML semi-structured entities is studied. By computing the weights of different types of attribute nodes within their parent node in the XML text, setting a similarity threshold for matching entities, and deriving XML transformation rules and entity matching functions, the XML entity matching operation is carried out. The experimental results show that the method has good matching efficiency.
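The bound-based pruning described in item (4) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's algorithm: sub-record similarities are assumed to be non-negative weights in a matrix `w`, the upper bound is the smaller of the row-maximum and column-maximum sums, the lower bound is a greedy matching, and the exact maximum-weight matching is found by brute force (suitable only for very small clusters). All function names are hypothetical.

```python
from itertools import permutations


def upper_bound(w):
    """Sum of best edges per row and per column (conflicts ignored);
    the smaller sum is still >= the maximum-weight matching."""
    rows = sum(max(row) for row in w)
    cols = sum(max(col) for col in zip(*w))
    return min(rows, cols)


def lower_bound(w):
    """Greedy matching: repeatedly take the heaviest edge whose endpoints
    are both free. It is a valid matching, so it is <= the maximum."""
    edges = sorted(
        ((w[i][j], i, j) for i in range(len(w)) for j in range(len(w[0]))),
        reverse=True,
    )
    used_rows, used_cols, total = set(), set(), 0.0
    for weight, i, j in edges:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            total += weight
    return total


def exact_max_matching(w):
    """Brute-force maximum-weight bipartite matching (tiny inputs only)."""
    if len(w) > len(w[0]):  # make rows the smaller side
        w = [list(col) for col in zip(*w)]
    n_rows, n_cols = len(w), len(w[0])
    return max(
        sum(w[i][p[i]] for i in range(n_rows))
        for p in permutations(range(n_cols), n_rows)
    )


def clusters_match(w, threshold):
    """Decide match/non-match, computing the exact matching only when
    the bounds do not already settle the comparison with the threshold."""
    if upper_bound(w) < threshold:
        return False, "upper"
    if lower_bound(w) >= threshold:
        return True, "lower"
    return exact_max_matching(w) >= threshold, "exact"


print(clusters_match([[0.9, 0.1], [0.2, 0.8]], 1.5))  # settled by the lower bound
print(clusters_match([[0.1, 0.2], [0.2, 0.1]], 1.5))  # settled by the upper bound
print(clusters_match([[0.9, 0.8], [0.8, 0.1]], 1.5))  # needs the exact matching
```

In practice the brute-force step would be replaced by a polynomial-time assignment solver; the point of the sketch is only that the two cheap bounds decide most pairs before that solver is called.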
     Based on the credit reference system constructed by the People's Bank of China, this thesis summarizes the technical bottlenecks of entity matching in the credit reference system. Concrete research problems are formulated after analyzing the weaknesses of present methods. Most of the algorithms and solutions proposed here have been applied in the enterprise and individual credit reference systems to resolve the problems of entity matching over multi-source data, massive data, and complicated structures. At present, the enterprise credit reference system collects and matches 15 types of credit information from 882 agencies, including financial credit, settlement accounts, social security, and environmental violations. The individual credit reference system collects and matches 11 types of credit information from 702 agencies, including financial credit, housing provident fund, endowment insurance, and telecommunications arrears. In short, a credit reference system that collects credit information for all entities has been realized.
