Semantic-Based Metadata Retrieval System for Scientific Literature

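English title: Semantic Based Scientific Literature Metadata Retrieval System
Author: Chu Fan (褚帆)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 科技文献 ; 元数据检索 ; 语义关联 ; 语义推理 ; 重复记录清理
English keywords: Scientific Literature ; Metadata Retrieval ; Semantic Association ; Semantic Reasoning ; Duplicate Records Cleansing
Degree year: 2007
Advisor: Zou Deqing (邹德清)
Discipline code: 081202
Degree-granting institution: Huazhong University of Science and Technology
Thesis submission date: 2007-06-01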

Abstract

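Because it lacks semantic information, traditional metadata retrieval struggles to describe accurately the intrinsic characteristics of scientific literature metadata. Metadata imported from heterogeneous data sources is inconsistent and redundant, and information based on semantic associations is hard to obtain from it, so retrieval results are prone to semantic deviation. The metadata also contains many duplicate records, which then appear in the retrieval results; duplicate record cleansing is therefore necessary to improve retrieval quality.
     To reduce the shortcomings of relying purely on databases and statistical retrieval methods for domain resources, the metadata retrieval of SemreX, a semantic-based platform for sharing scientific literature, draws on semantic ideas and proposes a metadata retrieval model for scientific literature. First, it uses a method for recognizing English names and Chinese pinyin names, together with a Chinese pinyin segmentation algorithm, to establish the various associations among metadata. Second, it provides a metadata retrieval portal and applies semantic reasoning rules, an author association algorithm, and three semantic association retrieval methods (association by concept, by instance, and by semantic relation, where semantic relations are further divided into concept-concept, concept-instance, and instance-instance subtypes) to realize metadata retrieval based on semantic association, so that the results are more accurate and richer and match users' intuitive semantic needs. Third, it cleanses duplicate records in the retrieval results, improving on weaknesses of the algorithms used in each step of duplicate record cleansing. In duplicate record detection, a field matching algorithm based on edit distance is adopted to suit the characteristics of field values; record matching uses an optimized algorithm that exploits valid weights and length filtering; and for clustering duplicate records at the database level, the basic sorted neighborhood method is improved to address two of its shortcomings.
     Built on the metadata retrieval framework, the SemreX metadata retrieval system uses semantic association retrieval and related techniques, combined with metadata duplicate record cleansing, to achieve efficient retrieval of scientific literature metadata.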
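As a rough illustration of the duplicate-detection step, the sketch below shows one way edit-distance field matching, a length filter, and weighted record matching can fit together. It is not the thesis's algorithm: the function names, the 0.8 threshold, the field weights, and the reading of "valid weights" as "weights of fields that are non-empty in both records" are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str, threshold: float = 0.8) -> float:
    """Similarity in [0, 1]; returns 0.0 early when the length filter
    proves the pair cannot reach `threshold` (hypothetical default)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    # Length filter: the edit distance is at least the length difference,
    # so the similarity can be at most 1 - difference / longest.
    if 1 - abs(len(a) - len(b)) / longest < threshold:
        return 0.0
    return 1 - edit_distance(a, b) / longest


def record_similarity(r1: dict, r2: dict, weights: dict,
                      threshold: float = 0.8) -> float:
    """Weighted mean of field similarities. Assumption: only fields that
    are non-empty in both records contribute, and weights are renormalised."""
    total = 0.0
    weight_sum = 0.0
    for field, weight in weights.items():
        v1, v2 = r1.get(field, ""), r2.get(field, "")
        if not v1 or not v2:
            continue  # skip fields missing on either side
        total += weight * field_similarity(v1, v2, threshold)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

Two records would then be treated as duplicates when `record_similarity` exceeds a cut-off chosen for the dataset. The length filter matters because it skips the quadratic edit-distance computation for pairs whose lengths alone rule out a match.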
For resource retrieval, traditional statistical strategies use keyword-based algorithms efficiently, but because they lack semantic information, both queries and results are often misinterpreted. Meanwhile, data from heterogeneous sources may have various quality problems, and the retrieval results contain many duplicate records. A cleansing process is therefore needed to improve data quality.
     To overcome the disadvantages mentioned above, we apply semantic ideas and describe a metadata retrieval model for scientific literature. For semantic retrieval, we provide a semantic search portal and use semantic reasoning rules to improve search results. We also propose semantic search over metadata by concept, instance, and relationship; relationships are further divided into three types: between concepts, between instances, and between a concept and an instance. We summarize and describe the theories, methods, evaluation standards, and basic workflow of data cleansing. Our research emphasis is on the techniques and algorithms of duplicate record cleansing, for which we propose improved algorithms. We introduce the basic concepts and workflow of duplicate record cleansing and describe the main techniques and algorithms of each step in detail. For each step, we give improved algorithms that address the limitations of the original ones. These mainly include the following. For sorting the dataset, we present an improved method based on a sort key. For duplicate record detection, we propose a field matching algorithm and an abbreviation discovery algorithm based on edit distance. For record matching, we present an optimized method that uses valid weight values and length filtering to reduce the running time of the original algorithm. For clustering duplicate records at the database level, we amend two limitations of the traditional sorted neighborhood method and give an improved sorted neighborhood method.
     Finally, based on the metadata management model framework and the preceding work on duplicate record cleansing, we apply these semantic retrieval strategies to the SemreX system.
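For orientation, the basic sorted neighborhood method that the abstract takes as its starting point can be sketched as follows: build a sort key per record, sort, and compare each record only with its neighbors inside a sliding window. This is a minimal sketch of the basic method only; the abstract does not describe the two limitations or the improvements, so they are not reproduced here. The key construction, the window size, and the `same_title` predicate are hypothetical choices for this example.

```python
def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def sort_key(record: dict) -> str:
    """Hypothetical key: first 8 normalized title characters plus year."""
    return normalize(record["title"])[:8] + str(record.get("year", ""))


def same_title(r1: dict, r2: dict) -> bool:
    """Toy duplicate test; a real system would use fuzzy record matching."""
    return normalize(r1["title"]) == normalize(r2["title"])


def sorted_neighborhood(records, is_duplicate, window=3):
    """Return clusters of record indices found by the basic method."""
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    parent = list(range(len(records)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare each record with the next (window - 1) records in sort order.
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if is_duplicate(records[i], records[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

With n records and window size w, the method makes roughly n·(w − 1) comparisons instead of the n·(n − 1)/2 needed for all pairs, at the cost of missing duplicates whose keys sort far apart.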

