Automatic Knowledge Extraction and Knowledge Consolidation for Chinese Natural Language Web Documents
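Abstract
Automatically extracting factual knowledge that matches a domain ontology from Web documents not only makes it possible to build knowledge-based services, but also supplies the semantic data needed to realize the Semantic Web. The characteristics of the Chinese language make automatic knowledge extraction from Chinese natural language Web documents very difficult. This thesis studies methods for automatic knowledge extraction and knowledge consolidation for Chinese natural language Web documents. The main research content includes: (1) an analysis and summary of the state of research on automatic knowledge extraction and knowledge consolidation and of the open problems; (2) a systematic domain ontology definition method that uses aggregated knowledge concepts to describe N-ary relations and stresses that ontology concepts must be given the necessary property restrictions; (3) an automatic knowledge extraction method for Chinese natural language Web documents. For the three steps of automatic knowledge extraction (recognition of knowledge triple elements, composition of knowledge triples, and cleaning of knowledge triples), the thesis proposes an ontology theme-based property recognition method, an ontology property restriction-based triple element recognition method, a heuristic rule-based triple composition method, a syntactic analysis-based triple composition method, and an ontology property restriction-based knowledge cleaning method. Compared with existing methods, this knowledge extraction method can extract knowledge automatically from Chinese natural language Web documents without relying on large-scale linguistic knowledge bases or synonym tables, can handle complex N-ary relations in documents, suits Chinese natural language Web documents with general content, and has good portability; (4) an ontology property restriction-based knowledge consolidation method that can identify equivalent instances, redundant knowledge and contradictory knowledge while the domain ontology is being instantiated, which ensures the consistency of the knowledge in the knowledge base; (5) an analysis of the problems of traditional search engines, and the design and implementation of a semantic-based intelligent search engine system, CRAB, which offers users semantic knowledge retrieval and generates an illustrated report that directly contains the query results. The research in this thesis on automatic knowledge extraction from Chinese natural language Web documents, knowledge consolidation and semantic-based intelligent search engines has theoretical significance and practical value, and enriches the study of automatic knowledge extraction from Chinese natural language Web documents.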
The Web is the largest and richest information repository available today. However, most information on the Web carries only layout-related syntactic markup and is readable only by humans. Computers cannot search and use this information automatically and efficiently on behalf of people. The Semantic Web is an extension of the Web that attaches semantic metadata to Web information so that computers can "understand" and process it automatically. One of the biggest challenges in realizing the Semantic Web is the availability of semantic content. This challenge can be addressed by adding semantic annotations to information that already exists on the Web and by generating new information together with its semantic annotations.
     An automatic knowledge extraction method recognizes and extracts from Web documents the factual knowledge that matches an ontology. This factual knowledge can be used to implement knowledge-based services, such as a semantic-based intelligent search engine that gives users convenient and accurate information retrieval, and it also provides the semantic content needed to realize the Semantic Web.
     Most existing knowledge extraction methods deal only with English Web documents. With the rapid growth in the number of Chinese Web users and Chinese Web resources, research on automatic knowledge extraction from Chinese Web documents has good prospects. However, the characteristics of Chinese make it very difficult to analyze and understand Chinese natural language documents efficiently, and existing knowledge extraction methods for English cannot be applied directly to Chinese. Developing a method that extracts knowledge automatically from Chinese natural language documents is therefore both challenging and meaningful.
     Building on an analysis of related research and existing methods, this thesis studies domain ontology definition, automatic knowledge extraction from Chinese natural language Web documents, knowledge consolidation, and semantic-based intelligent search engines. The main results are as follows:
     (1) The thesis introduces and analyzes the state of the art in knowledge extraction and knowledge consolidation. It classifies existing knowledge extraction methods by the type of document they target and by their degree of automation, analyzes the characteristics specific to Chinese and the difficulties they cause in analyzing and understanding Chinese text, and summarizes the open problems in knowledge extraction and knowledge consolidation.
     (2) The thesis presents a domain ontology definition method that can describe N-ary relations. A thorough analysis of the content of Chinese natural language Web documents shows that they contain not only simple factual knowledge about binary relations between two entities, or between an entity and a value, but also a large amount of complex factual knowledge about N-ary relations among multiple entities and values. Existing ontology definition methods, however, offer no systematic way to define this kind of knowledge, and existing knowledge extraction methods do not extract it. To solve this problem, the thesis presents a systematic domain ontology definition method that uses Aggregated Knowledge Concepts to encapsulate N-ary relations and stresses that ontology concepts should be assigned appropriate property restrictions. This method characterizes domain knowledge comprehensively and also gives strong support to automatic knowledge extraction and knowledge consolidation: during extraction it helps to recognize properties and instances and to check the validity and completeness of knowledge, and during consolidation it helps to remove contradictions and redundancy and to merge knowledge.
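One way to picture how an aggregated concept can encapsulate an N-ary relation is the following minimal Python sketch, which reifies a single relation as an intermediate instance described by ordinary binary triples. The concept name `Employment`, the property names and the sample values are hypothetical illustrations and are not taken from the thesis's ontology.

```python
from itertools import count

_ids = count(1)  # source of fresh identifiers for aggregated instances


def aggregate(concept, **properties):
    """Reify an N-ary relation as one instance of an aggregated concept.

    Returns the new instance id and the binary triples that describe it.
    """
    instance = f"{concept.lower()}_{next(_ids)}"
    triples = [(instance, "rdf:type", concept)]
    triples += [(instance, prop, value) for prop, value in properties.items()]
    return instance, triples


# Hypothetical 4-ary fact: a person held a position at an organization
# starting in a given year.
instance, triples = aggregate(
    "Employment",
    employee="person_42",
    employer="org_7",
    position="professor",
    startYear=2005,
)
```

Each participant of the relation becomes one triple on the aggregated instance, so the whole fact stays grouped under a single subject.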
     (3) The thesis presents an automatic knowledge extraction method for Chinese natural language Web documents. The extraction process consists of three steps: recognition of knowledge triple elements, composition of knowledge triples, and knowledge cleaning.
     A review of existing methods for recognizing triple elements shows that most of them either depend on large-scale linguistic databases or synonym tables, or can recognize only elements that correspond directly to words in the text. However, existing general-purpose Chinese linguistic databases cannot interpret domain-specific words accurately, and building large-scale linguistic databases or synonym tables is too labor-intensive and time-consuming to be realistic. In addition, the elements of the knowledge implied by a document may not correspond literally to any word in it. To solve these problems, the thesis presents an ontology theme-based property recognition method and an ontology property restriction-based triple element recognition method. Compared with existing methods, these methods have two main advantages. First, they need no large-scale linguistic databases or synonym tables. Second, they can infer implied elements from the elements that appear explicitly in the content and from the domain ontology. The ontology theme-based method suits content with an obvious descriptive theme, and the ontology property restriction-based method suits general Chinese natural language Web documents.
     An analysis of knowledge triple composition for Chinese natural language Web documents shows that it is very difficult to group the recognized ontology resources into triples that represent the meaning of the document correctly. The thesis presents a heuristic rule-based knowledge triple composition method and a syntactic analysis-based knowledge triple composition method. The syntactic analysis-based method looks for useful syntactic relations among words, using the syntactic structure of the sentence and the dependency relations between words. It also uses heuristic rules to handle omitted sentence components and reference resolution. Experiments show that this method achieves higher precision than the heuristic rule-based method and is suitable for general Chinese natural language documents.
     Because the triple element recognition and triple composition methods are imperfect and Web information is complex, the factual knowledge initially extracted from Web documents may be invalid or incomplete. The thesis presents an ontology property restriction-based knowledge cleaning method. This method detects and deletes invalid and incomplete factual knowledge that does not conform to the domain ontology, which protects the quality of the knowledge in the knowledge base (KB) and of the services built on it.
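The cleaning step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Restriction` class, the `PERSON` restriction table and the sample instances are hypothetical, and the thesis may use richer restrictions (for example cardinality or value ranges) than the type and required checks shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Restriction:
    """Hypothetical property restriction: allowed value type, required or not."""
    value_type: type
    required: bool = False


# Hypothetical restrictions for a "Person" concept.
PERSON = {
    "name": Restriction(str, required=True),
    "birthYear": Restriction(int),
}


def clean(instances, restrictions):
    """Keep only instances whose properties satisfy every restriction."""
    kept = []
    for inst in instances:
        # Properties the concept does not define make the knowledge invalid.
        unknown = set(inst) - set(restrictions)
        # Missing required properties make the knowledge incomplete.
        missing = [p for p, r in restrictions.items()
                   if r.required and p not in inst]
        # Values of the wrong type make the knowledge invalid.
        mistyped = [p for p, v in inst.items()
                    if p in restrictions
                    and not isinstance(v, restrictions[p].value_type)]
        if not (unknown or missing or mistyped):
            kept.append(inst)
    return kept


extracted = [
    {"name": "Li Ming", "birthYear": 1980},    # valid
    {"birthYear": 1975},                        # incomplete: no name
    {"name": "Wang Fang", "birthYear": "?"},   # invalid: wrong value type
]
valid = clean(extracted, PERSON)
```

Only the first sample instance survives, since the other two violate a required or a type restriction.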
     Experiments show that this automatic knowledge extraction method works well on Chinese natural language Web documents even without the support of large-scale linguistic databases or synonym tables, and that it can handle the complex aggregated knowledge about N-ary relations in the documents. Precision, recall and the F1 measure are 87.26%, 58.82% and 70.27% respectively, which is better than related work. More importantly, the method is portable and can be applied to different domains as long as the corresponding domain ontology is provided.
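The reported F1 value follows from the reported precision and recall under the standard definition of F1 as their harmonic mean. The short Python check below assumes that definition; the function name `f1` is introduced here only for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


score = f1(87.26, 58.82)
print(round(score, 2))  # prints 70.27
```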
     (4) The thesis studies knowledge consolidation methods. Knowledge consolidation comprises identifying and unifying equivalent instances, and recognizing and handling redundant and contradictory knowledge. The thesis presents an ontology property restriction-based knowledge consolidation method. This method determines the key property set of each concept from the domain ontology and identifies equivalent instances by comparing the values of all the key properties. This way of recognizing equivalent instances is simple and intuitive and suits general domain ontologies. The consolidation method also defines redundant and contradictory knowledge, provides methods to recognize and process them, and can detect semantically redundant or contradictory knowledge on the basis of the equivalent instances. It thereby keeps the KB consistent after new factual knowledge is merged in.
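A minimal sketch of consolidation by key properties might look like the following. It rests on assumptions that are not in the thesis: the key property set `KEY_PROPERTIES` is hypothetical, every property is treated as single-valued (so a differing value counts as a contradiction), and contradictory knowledge is simply rejected rather than resolved.

```python
# Hypothetical key property set for a "Person" concept.
KEY_PROPERTIES = ("name", "birthYear")


def key_of(instance):
    """Tuple of key property values used to decide instance equivalence."""
    return tuple(instance.get(p) for p in KEY_PROPERTIES)


def consolidate(kb, new_instance):
    """Merge new_instance into kb and return (status, conflicting_properties).

    status is 'added', 'merged', 'redundant' or 'contradictory'.
    """
    new_key = key_of(new_instance)
    for existing in kb:
        # Instances are equivalent only if all key values are known and equal.
        if None in new_key or key_of(existing) != new_key:
            continue
        conflicts = [p for p, v in new_instance.items()
                     if p in existing and existing[p] != v]
        if conflicts:
            return "contradictory", conflicts  # KB is left unchanged
        fresh = {p: v for p, v in new_instance.items() if p not in existing}
        if not fresh:
            return "redundant", []
        existing.update(fresh)
        return "merged", []
    kb.append(dict(new_instance))
    return "added", []


kb = [{"name": "Li Ming", "birthYear": 1980, "city": "Beijing"}]
r1 = consolidate(kb, {"name": "Li Ming", "birthYear": 1980, "city": "Beijing"})
r2 = consolidate(kb, {"name": "Li Ming", "birthYear": 1980, "city": "Shanghai"})
r3 = consolidate(kb, {"name": "Li Ming", "birthYear": 1980, "employer": "org_7"})
r4 = consolidate(kb, {"name": "Wang Fang", "birthYear": 1975})
```

Here `r1` is redundant, `r2` is contradictory on `city`, `r3` is merged into the existing instance, and `r4` is added as a new instance.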
     (5) The thesis designs and develops CRAB, a semantic-based intelligent search engine system. An analysis of the main shortcomings of traditional search engines shows that keyword-based querying and results presented as a list of Web page links cannot satisfy users' need to query information accurately and conveniently. CRAB is built on the automatic knowledge extraction and knowledge consolidation methods presented in the thesis. Compared with a traditional search engine, the system automatically extracts the factual knowledge that matches the domain ontology from domain-related Chinese Web documents and merges it into the domain ontology KB; lets users enter query requests in a form close to natural language; searches the KB for factual knowledge that is semantically related to the user's query; and generates a report that directly contains the query result and combines related text and graphs. The system lets users obtain thorough, direct, correct and visual information conveniently. The success of the system also demonstrates the effectiveness of the underlying methods.
     The research results of this thesis, covering automatic knowledge extraction from Chinese natural language Web documents, knowledge consolidation and the semantic-based intelligent search engine, contribute to related research in both theoretical and technological respects.
引文
[1]. FREITAG D. Information Extraction from HTML: Application of a General Learning Approach[C]. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, Wiconsin, USA, 1998: 517-523.
    [2]. MUSLEA I, MINTON S, KNOBLOCK C. A Hierarchical Approach to Wrapper Induction[C]. Proceedings of the Third International Conference on Autonomous Agents, Seattle, Washington, USA, 1999: 190-197.
    [3].林亚平,刘云中,周顺先,等.基于最大熵的隐马尔可夫模型文本信息抽取[J].电子学报, 2005, 33(2): 236-240.
    [4]. BERNERS-LEE T, FISCHETTI M, DERTOUZOS M L. Weaving the Web: the original design and ultimate destiny of the World Wide Web by its inventor[M]. San Francisco: Harper Audio, 1999.
    [5]. BERNERS-LEE T, HENDLER J, LASSILA O. The semantic web[J]. Scientific American, 2001, 284: 34-43.
    [6]. FENSEL D, HENDLER J, LIEBERMAN H, et al. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential[M]. Cambridge, MA: MIT Press, 2003.
    [7]. BENJAMINS V R, CONTRERAS J, CORCHO O, et al. Six challenges for the Semantic Web[C]. Proceedings of the Semantic Web workshop held at KR-2002, 2002.
    [8]. REEVE L, HAN H. Survey of semantic annotation platforms[C]. Proceedings of the 2005 ACM symposium on Applied Computing, Santa Fe, New Mexico, USA, 2005: 1634-1638.
    [9]. TENIER S, NAPOLI A, POLANCO X, et al. Knowledge extraction from webpages[C/OL]. Proceedings of Fifth International Workshop on Knowledge Markup and Semantic Annotation (SemAnnot 2005), Galway, Ireland, 2005[2008-09-28]. http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-185/semAnnot05-11.pdf.
    [10]. KIYAVITSKAYA N, ZENI N, MICH L, et al. Text Mining Through Semi Automatic Semantic Annotation[C]. Proceedings of the Sixth International Conference on ractical Aspects of Knowledge Management, Vienna, Austria, 2006: 143-154.
    [11]. KAMBHATLA N. Combing Lexical, Syntactic, and Semantic Features with Maximum Entropy Models for Extracting Relations[C]. Proceedings of the Forty-Second Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 2004: 178-181.
    [12]. ALANI H, KIM S, MILLARD D, et al. Automatic ontology-based knowledge extraction from web documents[J]. IEEE Intelligent Systems, 2003, 18(1): 14-21.
    [13]. WITBROCK M, PANTON K, REED S L, et al. Automated OWL Annotation Assisted by a Large Knowledge Base[C/OL]. Proceedings of the Fourth International Workshop on Knowledge Markup and Semantic Annotation located at the Third International Semantic Web Conference, Hiroshima, Japan, 2004 [2008-09-29]. http://ftp.informatik.rwth-aachen.de/Publi- cations/CEUR-WS/Vol-184/semAnnot04-08.pdf.
    [14]. JAVA A, NIRENBURG S, MCSHANE M, et al. Using a Natural Language Understanding System to Generate Semantic Web Content[J]. International Journal on Semantic Web and Information Systems, 2007, 3(4): 50-74.
    [15].中国互联网络信息中心.中国互联网络发展状况统计报告[R/OL]. 2008-07[2008-10-06]. http://www.cnnic.cn/up- loadfiles/pdf/2008/7/23/170516.pdf.
    [16].中国互联网络信息中心.中国互联网络发展状况统计报告[R/OL]. 2008-01[2008-10-06]. http://www.cnnic.net.cn/ uploadfiles/pdf/2008/1/17/104156.pdf.
    [17]. MILLER G A, BECKWITH R, FELLBAUM C, et al. Introduction to wordnet: An on-line lexical database[J]. Journal of Lexicography, 1990, 3(4): 235-312.
    [18]. DONG Zhendong, DONG Qiang. HowNet[OL]. 2000[2008-10-06]. http://www.keenage.com/zhiwang/e_zhiwang.html.
    [19].荆涛,左万利,孙吉贵,等.中文网页语义标注:由句子到RDF表示[J].计算机研究与发展, 2008, 45(7): 1221-1231.
    [20]. TANG Jie, LI Juanzi, LU Hongjun, et al. iASA: learning to annotate the semantic web[J]. Journal on Data Semantic, 2005, 4: 110-145.
    [21]. TANG Jie, HONG Mingcai, LI Juanzi, et al. Tree-structured Conditional Random Fields for semantic annotation[C]. Proceedings of the Fifth International Conference of Semantic Web, Athens, GA, USA, 2006: 640-653.
    [22]. PéREZ A G, CORCHO O. Ontology Languages for the Semantic Web[J]. IEEE Intelligent Systems, 2002, 17(1): 54-60.
    [23]. MANOLA F, MILLER E. RDF Primer[OL]. W3C Recommendation, 2004[2008-09-20]. http://www.w3.org/TR/rdf-primer/.
    [24]. LASSILA O, SWICH R R. Resource Description Framework(RDF) Model and Syntax Specification[OL]. W3C Recommendation, 1999[2008-09-29]. http://www.w3.org/TR/REC-rdf-syntax/.
    [25]. DEAN M, SCHREIBER G. OWL Web Ontology Language Reference[OL]. W3C Recommendation, 2004[2008-09-29]. http://www.w3.org/TR/owl-ref/.
    [26]. BOUQUET P, STOERMER H, GIACOMUZZI D. Okkam: Enabling a Web of Entities[C/OL]. Proceedings of the WWW2007 Workshop on Entity-Centric Approaches to Information and Knowledge Management on the Web. Banff, Canada. 2007[2008-09-29]. http://CEUR-WS.org/Vol249/submission_150.pdf.
    [27]. GRUBER T R. A translation approach to portable ontologies[J]. Knowledge Acquisition, 1993, 5(2): 199-220.
    [28]. BORST W. Construction of Engineering Ontologies[D]. University of Twente, Enschede, 1997.
    [29]. STUDER R, BENJAMINS V R, FENSEL D. Knowledge engineering: principles and methods[J]. Data and knowledge engineering, 1998, 25(102): 161-197.
    [30]. GENESERETH M R, FIKES R E. Knowledge Interchange Format, Version 3.0 Reference Manul[R]. Computer Science Department, Stanford University, 3.0 edition, 1992.
    [31]. MOTTA E. An overview of the OCML modeling language[C]. Proceedings of the Eighth Workshop on Knowledge Engineering: Methods & Languages (KEML98), Karlsruhe, Germany, 1998: 21-22.
    [32]. FARINAS L, HERZIG A. Interference logic= conditional logic+frame axiom[J]. International Journal of Intelligent Systems, 1994, 9(1): 119-130.
    [33]. MACGREGOR R, BATES R. The Loom knowledge representation language[R]. Technical Report ISI/RS-87-188, University of Southern California, Information Science Institute, Marina del Rey, CA, USA, 1987.
    [34]. BRICKLEY D, GUHA R V. RDF Vocabulary Description Language 1.0: RDF Schema[OL]. W3C Recommendation, 2004[2008-09-29]. http://www.w3.org/TR/rdf-schema/.
    [35]. HORROCKS I, PATEL-SCHNEIDER P F, HARMELEN F V. From SHIQ and RDF to OWL: The making of a web ontology language[J]. Journal of Web Semantics, 2003, 1(1): 7-26.
    [36]. ATZENI P, MECCA G. Cut & paste[C]. Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, ACM Press, 1997: 144-153.
    [37]. HAMMER J, GARCIA-MOLINA H, CHO J, et al. Extracting semistructured information from the Web[C]. Proceedings of the Workshop on Management of Semistructured Data, 1997: 18-25.
    [38]. BAUMGARTNER R, FLESCA S, GOTTLOB G. Visual web information extraction with lixto[C]. Proceedings of the Twenty-Seventh International Conference on Very Large Data Bases, 2001: 119-128.
    [39]. KUSHMERICK N, WELD D S, DOORENBOS R. Wrapper Induction for Information Extraction[C]. Proceedings of International Joint Conference on Artificial Intelligence, Nagoya, 1997.
    [40]. MUSLEA I, MINTON S, KNOBLOCK C. STALKER: Learning Extraction Rules for Semi-Structured, Web-Based Information Sources[C]. Proceedings of Workshop on AI and Information Integration, in Conjunction with the Fifteenth National Conference on Artificial Intelligence, Madison, Wisconsin, USA, 1998: 74-81.
    [41]. FREITAG D, KUSHMERICK N. Boosted wrapper induction[C]. Proceedings of the Seventeenth National Conference on Artificial Intelligence and the Twelfth Conference on Innovative Applications of Artificial Intelligence, Austin Texas, USA, 2000: 577-583.
    [42]. COHEN W, HURST M, JENSEN L. A flexible learning system for wrapping tables and lists in html documents[C]. Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, USA, 2002: 232-241.
    [43]. KOSALA R, BLOCKEEL H, BRUYNOOGHE M, et al. Information extraction from structured documents using k-testable tree automaton inference[J]. Data & Knowledge Engineering, 2006, 58 (2): 129-158.
    [44]. COMON H, DAUCHET M, GILLERON R, et al. Tree Automata Techniques and Applications[M/OL]. 1999[2008-09-29]. https://gforge.inria.fr/frs/?group_id=426.
    [45]. TENIER S, TOUSSAINT Y, NAPOLI A, et al. Instantiation of relations for semantic annotation[C]. Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong, China, 2006: 463-472.
    [46]. CHANG Chiahui, LUI ShaoChen. IEPAD: information extraction based on pattern discovery[C]. Proceedings of the Tenth international conference on World Wide Web, Hong Kong, 2001: 681-688.
    [47]. LIU Bing, GROSSMAN R, ZHAI Yanhong. Mining data records from web pages[C]. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining, Washington, D.C., USA, 2003: 601-606.
    [48]. CRESCENZI V, MECCA G, MERIALDO P. ROADRUNNER: towards automatic data extraction from large web sites[C]. Proceedings of the 2001 International VLDB Conference, Roma, Italy, 2001: 109-118.
    [49]. LO L,NG V T-Y, NG P, et al. Automatic Template Detection for Structured Web Pages[C]. Proceedings of the Tenth International Conference on Computer Supported Cooperative Work in Design, Nanjing, China, 2006: 1-6.
    [50]. YANG Shaohua, LIN Hailüe, HAN Yanbo. Automatic data extraction from template-generated Web pages[J]. Journal of Software, 2008, 19(2): 209-223.
    [51]. ZHAI Yanhong, LIU Bing. Structured data extraction from the web based on partial tree alignment[J]. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(12): 1614-1628.
    [52].áLVAREZ M, PAN A, RAPOSO J, et al. Extracting lists of data records from semi-structured web pages[J]. Data & Knowledge Engineering, 2008, 64(2): 491-509.
    [53]. HONG Mingcai, TANG Jie, LI Juanzi. Semantic Annotation Using Horizontal and Vertical Contexts[C]. Proceedings of the First Asian Semantic Web Conference, Beijing, China, 2006: 58-64.
    [54]. FLESCA S, MANCO G, MASCIARI E, et al. Exploiting structural similarity for effective Web information extraction [J]. Data & Knowledge Engineering, 2007, 60: 222-234.
    [55].王海涛,曹存根,高颖.基于领域本体的半结构化文本知识自动获取方法的设计和实现[J].计算机学报, 2005, 28(12): 2010-2018.
    [56].王琦,唐世渭,杨冬青,等.基于DOM的网页主题信息自动提取[J].计算机研究与发展, 2004, 41(10): 1786-1792.
    [57]. CAO Cungen, FENG Qiangze, GAO Ying, et al. Progress in the development of national knowledge infrastructure[J]. Journal of Computer Science & Technology, 2002, 17(5): 523-534.
    [58]. KOGUT P, HOLMES W. AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages[C]. Proceedings of the First International Conference on Knowledge Capture Workshop on Knowledge Markup and Semantic Annotation, Victoria, B.C., Canada, 2001.
    [59]. Lockheed Martin Corporation. AeroText. [2008-09-29]. http://www.lockheedmartin.com/products/AeroText/index.html.
    [60]. DARPA Agent Markup Language[OL]. 2005[2008-09-29]. http://www.daml.org.
    [61]. KHELIF K, DIENG R. Ontology-Based Semantic Annotations for Biochip Domain[C]. Proceedings of Engineering Knowledge in the Age of the Semantic Web, Whittlebury Hall, UK, 2004: 483-484.
    [62]. CUNNINGHAM H, MAYNARD D, BONTCHEVA K, et al. GATE: a framework and graphical development environment for robust NLP tools and applications[C]. Proceedings of the Fortieth Anniversary Meeting of the Association for Computational Linguistics, Philadelphia, PA, 2002: 168-175.
    [63]. BOURIGAULT D, FABRE C. Approche linguistique pour l'analyse syntaxique de corpus[J]. Cahiers de grammaire, 2000, 25: 131-151.
    [64]. ZELENKO D, AONE C, RICHARDELLA A. Kernel methods for relation extraction[J]. Journal of Machine Learning Research, 2003, 3: 1083-1106.
    [65]. CULOTTA A, SORENSEN J. Dependency tree kernels for relation extraction[C]. Proceedings of the Forty-Second Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, 2004: 423-429.
    [66]. SKOUNAKIS M, CRAVEN M, RAY S. Hierarchical Hidden Markov Models Information Extraction[C]. Proceedings of the Eighteenth International joint Conference Artificial Intelligence, Acapulco, Mexico, 2003: 427-433.
    [67].周顺先,林亚平,王耀南,等.基于二阶隐马尔可夫模型的文本信息抽取[J].电子学报, 2007, 35(11): 2226-2231.
    [68]. FREITAG D, McCALLUM A. Information extraction with HMM structures learned by stochastic optimization[C]. Proceedings of the Seventeenth National Conference on Artificial Intelligence, Austin, Texas, USA, 2000: 584-589.
    [69]. LI Jianming, ZHANG Lei, YU Yong. Learning to Generate Semantic Annotation for Domain Specific Sentences. Proceedings of the First International Conference on Knowledge Capture Workshop on Knowledge Markup and Semantic Annotation, Victoria, B.C., Canada, 2001.
    [70]. SLEATOR D, TEMPERLEY D. Parsing English with a Link Grammar[C]. Proceedings of the Third InternationalWorkshop on Parsing Technologies, 1993.
    [71]. GILDEA D, JURAFSKY D. Automatic labeling of semantic roles[J]. Computational Linguistics, 2002, 28(3): 245-288.
    [72]. CIRAVEGNA F, CHAPMAN S, DING L A, et al. Learning to harvest information for the Semantic Web[C]. Proceedings of the First European Semantic Web Symposium (ESWC 2004), Heraklion, Crete, Greece, 2004: 312-326.
    [73]. ETZIONI O, CAFARELLA M J, DOWNEY D, et al. Unsupervised named-entity extraction from the Web: an experimental study[J]. Artificial Intelligence, 2005, 165(1): 91-134.
    [74]. CIMIANO P, HANDSCHUH S, STAAB S. Towards the Self-Annotating Web[C]. Proceedings of the Thirteenth International World Wide Web Conference, New York, NY, USA, 2004: 462-471.
    [75]. CIMIANO P, LADWIG G, STAAB S. Gimme’the context: Context-driven automatic semantic annotation with CPANKOW[C]. Proceedings of the Fourteenth International WWW Conference, Chiba, Japan, 2005: 332-341.
    [76]. DILL S, GIBSON N, GRUHL D, et al. Semtag and seeker: Bootstrapping the semantic web via automated semantic annotation[C]. Proceedings of the Twelfth International World Wide Web Conference, Bugdpest, Hungary, 2003: 178-186.
    [77]. DILL S, EIRON N, GIBSON D, et al. A Case for Automated Large-Scale Semantic Annotation[J]. Journal of Web Semantics, 2003, 1(1): 115-132.
    [78]. RILOFF E, SHEPHERD J. A corpus-based approach for building semantic lexicons[C]. Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Providence, RI, 1997: 117-124.
    [79]. BANKO M, CAFARELLA M J, SODERLAND S, et al. Open Information Extraction from the Web[C]. Proceedings of the International Joint Conference on Artificial Intelligence, Hyderabad, India, 2007: 2670-2676.
    [80]. KLEIN D, MANNING C D. Accurate unlexicalized parsing[C]. Proceedings of the Forty-First Association for Computational Linguistics, Sapporo, Japan, 2003: 423-430.
    [81]. DOWNEY D, ETZIONI O, SODERLAND S. A Probabilistic Model of Redundancy in Information Extraction[C]. Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, 2005: 1034-1041.
    [82].李维刚,刘挺,李生.基于网络挖掘的实体关系元组自动获取.电子学报, 2007, 35(11): 2111-2116.
    [83]. SANFILIPPO A, TRATZ S, GREGORY M, et al. Ontological Annotation with WordNet[C]. Proceedings of the Fifth International Workshop on Knowledge Markup and Semantic Annotation, Galway, Ireland, 2005[2008-09-29]. http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-185/semAnnot05-03.pdf.
    [84]. BUITELAAR P, DECLERCK T. Linguistic Annotation for the Semantic Web[J]. Annotation for the Semantic Web, Frontiers in Artificial Intelligence and Application Series, Amsterdam, the Netherlands: IOS Press, 2003, 96.
    [85]. ALANI H, KIM S, MILLARD D, et al. Automatic Extraction of Knowledge from Web Documents[C/OL]. Proceedings of Workshop on Human Language Technology for the Semantic Web and Web Services at the Second International Semantic Web Conference, Sanibel Island, Florida, USA, 2003[2008-09-29]. http://www.gate.ac.uk/conferences/iswc2003/proceedings/alani.pdf.
    [86]. SEKINE S, GRISHMAN R. A corpus-based probabilistic grammar with only two nonterminals[C]. Proceedings of the Fourth International Workshop on Parsing Technology, ACL/SIGPARSE, Prague, 1995: 216-223.
    [87]. POPOV B, KIRYAKOV A, KIRILOV A, et al. KIM-semantic annotation platform[C]. Proceedings of the Second International Semantic Web Conference, Florida, USA, 2003: 834-849.
    [88]. KIRYAKOV A, POPOV B, TERZIEV I, et al. Semantic annotation, indexing, and retrieval[J]. Web Semantics: Science, Services and Agents on the World Wide Web, 2004, 2(1): 49-79.
    [89]. MICHELSON M, KNOBLOCK C A. Semantic Annotation of Unstructured and Ungrammatical Text[C]. Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, 2005: 1091-1098.
    [90]. MICHELSON M, KNOBLOCK C A. An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look[C]. Proceedings of the First IJCAI Workshop on Analytics for Noisy Unstructured Text Data, Hyderabad, India, 2007: 123-130.
    [91]. COHEN W, SARAWAGI S. Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods[C]. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, 2004: 89-98.
    [92].姚天顺,朱靖波,张俐,等.自然语言理解—一种让机器懂得人类语言的研究[M]. 2版.北京:清华大学出版社, 2002.
    [93].朱德熙.语法答问[M].商务印书馆, 1999.
    [94].许嘉璐.现状和设想-试论中文信息处理与现代汉语研究[J].中文信息学报, 2001, 15(2): 1-8.
    [95].黎锦熙.新著国文语法[M].北京:商务印书馆, 1992.
    [96].邢福义.汉语语法三百问[M].北京:商务印书馆, 2002.
    [97].侯敏.计算语言学与汉语自动分析[M].北京:北京广播学院出版社, 1999.
    [98].黄昌宁,赵海.中文分词十年回顾[J].中文信息学报, 2007, 21(3): 8-19.
    [99]. ZHAO Hai, KIT C. Exploiting Unlabeled Text with Different Unsupervised Segmentation Criteria for Chinese Word Segmentation[J]. Research in Computing Science, 2008, 33: 93-104.
    [100]. ZHAO Hai, HUANG ChangNing, LI Mu. An improved Chinese word segmentation system wish conditional random field[C]. Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, 2006: 108-117.
    [101]. XUE NianWen, SHEN Libin. Chinese word segmentation as LMR tagging[C]. Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan, 2003: 176-179.
    [102]. LOW J K, HWEE T N, GUO Wenyuan. A maximum entropy approach to Chinese words Segmentation[C]. Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, 2005: 161-164.
    [103]. TSENG H, CHANG P, et al. A conditional random field word segmenter for SIGHAN Bakeoff 2005[C]. Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, 2005: 168-171.
    [104]. SPROAT R W, SHIH C L, GALE W, et al. A stochastic finite-state word-segmentation algorithm for Chinese[J]. Computational Linguistics, 1996, 22: 377-404.
    [105].赵海,揭春雨.基于有效子串标注的中文分词[J].中文信息学报, 2007, 21(5): 9-13.
    [106]. PENG F, SCHUURMANS D. Self-supervised Chinese word segmentation[C]. Proceedings of the Fourth International Symposium on Intelligent Data Analysis, Lisbon, Portugal, 2001: 238-247.
    [107].孙茂松,肖明,邹嘉彦.基于无指导学习策略的无词表条件下的汉语自动分词[J].计算机学报, 2004, 27(6): 736-742.
    [108].张华平,刘群.基于角色标注的中国人名自动识别研究[J].计算机学报, 2004, 27(1): 85-91.
    [109].王振华,孔祥龙,陆汝占,等.结合决策树方法的中文姓名识别.中文信息学报, 2004, 18(6): 10-15.
    [110].李丽双,黄德根,陈春荣,等. SVM与规则相结合的中文地名自动识别[J].中文信息学报, 2006, 20(5): 51-57.
    [111].周俊生,戴新宇,尹存燕,等.基于层叠条件随机场模型的中文机构名自动识别[J].电子学报, 2006, 34(5): 804-809.
    [112]. WANG Houfeng, SHI Wuguang. A simple rule-based approach to organization name recognition in Chinese text[C]. Proceedings of the Fifth CICLing, Heidelberg, German, 2005, LNCS 3406: 769-772.
    [113].王宁,葛瑞芳,苑春法,等.中文金融新闻中公司名的识别[J].中文信息学报, 2002, 16(2): 1-6.
    [114]. WU Youzheng, ZHAO Jun, XU Bo, et al. Chinese Named Entity Recognition Based on Multiple Features[C]. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, 2005: 427-434.
    [115]. GAO Jianfeng, LI Mu, HUANG Changning, et al. Chinese word segmentation and named entity recognition: A pragmatic approach [J]. Computational Linguistics: 2005, 31(4):531-574
    [116]. ZHANG Suxiang, WANG Xiaojie, WEN Juan, et al. A Probabilistic Feature Based Maximum Entropy Model for Chinese Named Entity Recognition[C]. Proceedings of the Twenty-First International Conference on the Computer Processing of Oriental Languages, Singapore, 2006: 189-196.
    [117]. JIANG Wei, GUAN Yi, WANG Xiaolong. A pragmatic Chinese word segmentation system[C]. Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, Australia, 2006: 189-192
    [118].俞鸿魁,张华平,刘群,等.基于层叠隐马尔可夫模型的中文命名实体识别[J].通信学报, 2006, 27(2): 87-94.
    [119].张玥杰,徐智婷,薛向阳.融合多特征的最大熵汉语命名实体识别模型[J].计算机研究与发展, 2008, 45(6): 1004-1010.
    [120]. ZHAO Hai, KIT C. Unsupervised Segmentation Helps Supervised Learning of Character Tagging for Word Segmentation and Named Entity Recognition[C]. Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing, Hyderabad, India, 2008: 106-111.
    [121].车万翔,刘挺,李生.实体关系自动抽取[J].中文信息学报, 2005, 19(2): 1-6.
    [122].张素香.信息抽取中关键技术的研究[D].北京邮电大学信息工程学院, 2007.
    [123].刘克彬,李芳,刘磊,等.基于核函数中文关系自动抽取系统的实现[J].计算机研究与发展, 2007, 44(8): 1406-1411.
    [124]. CANCEDDA N, GAUSSIER E, GOUTTE C, et al. Word-sequence kernels[J]. Journal of Machine Learning Research, 2003, 3: 1059-1082.
    [125].钟义信.面向智能研究的全信息理论—纪念Shannon信息论50周年[J].北京邮电大学学报, 1998, 21(4).
    [126]. JI Donghong. Semantic annotation of Chinese phrases using recursive-graph[C]. Proceedings of the Second Workshop on Chinese Language Processing: held in conjunction with the Thirty-Eighth Annual Meeting of the Association for Computational Linguistics, Hong Kong, 2000: 101-108.
    [127]. XUE Nianwen, PALMER M. Automatic Semantic Role Labeling for Chinese Verbs[C]. Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, 2005: 1160-1165.
    [128]. XUE Nianwen. Annotating the predicate-argument structure of Chinese nominalizations[C]. Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, 2006: 1328-1387.
    [129]. XUE Nianwen. Labeling Chinese Predicates with Semantic roles[J]. Computational Linguistics, 2008, 34(2): 225-255.
    [130]. LAI Y S, WANG R J. Towards automatic knowledge acquisition from text based on ontology-centric knowledge representation and acquisition[C]. Proceedings of the Second International Conference on Knowledge Capture, Sanibel Island, Florida, USA, 2003.
    [131]. AHO A V, ULLMAN J D. The Theory of Parsing, Translation, and Compiling[M]. Prentice Hall, Englewood Cliffs, N.J., 1972.
    [132]. CHEN K J, HUANG C R. Information-based Case Grammar[C]. Proceedings of the Thirteenth International Conference on Computational Linguistics, University of Helsinki, Finland, 1990: 54-59.
    [133]. LAI Y S, WANG R J, HSU W K. A DAML+OIL-compliant Chinese lexical ontology[C]. Proceedings of the Nineteenth International Conference on Computational Linguistics, Taipei, Taiwan, Morristown, NJ, USA: Association for Computational Linguistics, 2002: 1238-1242.
    [134]. FELLEGI I P, SUNTER A B. A theory for record linkage[J]. Journal of the American Statistical Association, 1969, 64(328): 1183-1210.
    [135]. DING Li, FININ T. Characterizing the Semantic Web on the Web[C]. Proceedings of the Fifth International Semantic Web Conference, Athens, GA, USA, 2006, LNCS 4273: 242-257.
    [136]. ELMAGARMID A K, IPEIROTIS P G, VERYKIOS V S. Duplicate Record Detection: A Survey[J]. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1-16.
    [137]. DONG Xin, HALEVY A, MADHAVAN J. Reference Reconciliation in Complex Information Spaces[C]. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2005: 85-96.
    [138]. KALASHNIKOV D, MEHROTRA S. Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph[J]. ACM Transactions on Database Systems, 2006, 31(2): 716-767.
    [139]. SINGLA P, DOMINGOS P. Object Identification with Attribute-Mediated Dependences[C]. Proceedings of the Ninth European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, 2005, LNCS 3721: 297-308.
    [140]. BHATTACHARYA I, GETOOR L. Entity Resolution in Graph Data[R]. Technical Report CS-TR-4758, University of Maryland, 2005.
    [141]. BENJELLOUN O, GARCIA-MOLINA H, MENESTRINA D, et al. Swoosh: A Generic Approach to Entity Resolution[R]. Stanford InfoLab, 2006.
    [142]. GARCIA-MOLINA H. Pair-wise entity resolution: overview and challenges[C]. Proceedings of the 2006 ACM CIKM International Conference on Information and Knowledge Management, Arlington, Virginia, USA, 2006: 1-1.
    [143]. MONGE A E, ELKAN C. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records[C]. Proceedings of the SIGMOD 1997 Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson Arizona, 1997: 23-29.
    [144]. COHEN W W. Data Integration Using Similarity Joins and a Word-based Information Representation Language[J]. ACM Transactions on Information Systems, 2000, 18(3): 288-321.
    [145]. Linking Open Data Project[EB/OL]. 2008[2008-10-06]. http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData.
    [146]. CLARK K G. SPARQL Protocol for RDF[EB/OL]. W3C Working Draft, 2005[2008-10-06]. http://www.w3.org/TR/2005/WD-rdf-sparql-protocol-20050527/.
    [147]. JAFFRI A, GLASER H, MILLARD I. URI Identity Management for Semantic Web Data Integration and Linkage[C]. Proceedings of the Third International Workshop on Scalable Semantic Web Knowledge Base Systems, Algarve, Portugal, 2007, LNCS 4806: 1125-1134.
    [148]. BOUQUET P, STOERMER H, NIEDEREE C, et al. Entity Name System: The Backbone of an Open and Scalable Web of Data[C]. Proceedings of the IEEE International Conference on Semantic Computing, Santa Clara, CA, USA, 2008: 554-561.
    [149]. GUHA R V, McCOOL R. Tap: A Semantic Web Platform[J]. Computer Networks, 2003, 42(5): 557-577.
    [150]. HOGAN A, HARTH A, DECKER S. Performing Object Consolidation on the Semantic Web Data Graph[C/OL]. Proceedings of the WWW2007 Workshop on Entity-Centric Approaches to Information and Knowledge Management on the Web, Banff, Canada, 2007[2008-09-29]. http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-249/submission_135.pdf.
    [151]. BRICKLEY D. Rdfweb notebook: aggregation strategies[OL]. 2002[2008-09-29]. http://rdfweb.org/2001/01/design/smush.html.
    [152].谢能付.基于语义Web技术的知识融合和同步方法研究[D].北京:中国科学院计算技术研究所, 2006.
    [153]. SAIS F, PERNELLE N, ROUSSET M C. L2R: a logical method for reference reconciliation[C]. Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2007: 329-334.
    [154]. HORROCKS I, PATEL-SCHNEIDER P F, BOLEY H, et al. SWRL: A Semantic Web Rule Language Combining OWL and RuleML[EB/OL]. 2004[2008-10-06]. http://www.w3.org/Submission/2004/SUBM-SWRL-20040521/.
    [155]. DeMICHIEL L G. Resolving database incompatibility: an approach to performing relational operations over mismatched domains[J]. IEEE Transactions on Knowledge and Data Engineering, 1989, 1(4): 485-493.
    [156]. DUNEMANN O, GEIST I, JESSE R, et al. A database-supported workbench for information fusion: INFUSE[C]. Proceedings of the Eighth International Conference on Extending Database Technology, LNCS, Springer-Verlag, 2002: 756-758.
    [157]. GALHARDAS H, FLORESCU D, SIMON E, et al. Declarative data cleaning: language, model, and algorithms[C]. Proceedings of the Twenty-Seventh International Conference on Very Large Data Bases, 2001: 371-380.
    [158]. MOTRO A. Multiplex: A Formal Model for Multidatabases and Its Implementation[C]. Proceedings of the Fourth International Workshop on Next Generation Information Technologies and Systems, Zikhron-Yaakov, Israel, 1999, LNCS 1649: 138-158.
    [159]. MOTRO A, ANOKHIN P. Fusionplex: resolution of data inconsistencies in the integration of heterogeneous information sources[J]. Information Fusion, 2006, 7: 176-196.
    [160].余传明.基于本体的语义信息系统研究—理论分析与系统实现[D].武汉大学情报学, 2005.
    [161]. CHOMSKY N. Topics in the theory of generative grammar[M]. The Hague: Mouton, 1978.
    [162]. WOODS W A. Augmented Transition Networks for Natural Language Analysis[R]. Report No. CS-I, Aiken Computation Laboratory, Harvard University, 1969.
    [163]. KAPLAN R, BRESNAN J. Lexical-Functional Grammar: A Formal System for Grammatical Representation[M]. The Mental Representation of Grammatical Relations, MIT Press, 1982.
    [164]. KAY M. Functional Unification Grammar: A Formalism for Machine Translation[C]. Proceedings of the Tenth International Conference on Computational Linguistics and the Twenty-Second Annual Meeting of the Association for Computational Linguistics, Stanford University, California, USA, 1984: 75-78.
    [165]. TESNIÈRE L. Éléments de syntaxe structurale[M]. Paris: Klincksieck, 1959.
    [166]. KROCH A, JOSHI A K. Linguistic relevance of tree adjoining grammars[R]. Technical Report MS-CIS-85-18, Department of Computer and Information Science, University of Pennsylvania, 1985.
    [167]. The Stanford Natural Language Processing Group. The Stanford Parser: A statistical parser[OL]. 2007[2008-09-29]. http://nlp.stanford.edu/software/lex-parser.shtml.
    [168]. CHAUDHURI S, DAYAL U. An Overview of Data Warehousing and OLAP Technology[J]. ACM SIGMOD Record, 1997, 26(1): 65-74.
    [169]. JARKE M, LENZERINI M, VASSILIOU Y, et al. Fundamentals of Data Warehouses[M]. Springer, 2000.
    [170]. GALHARDAS H, FLORESCU D, SHASHA D, et al. AJAX: An Extensible Data Cleaning Tool[C]. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, 2000: 590-590.
    [171]. RAMAN V, HELLERSTEIN J M. Potter’s Wheel: An Interactive Framework for Data Cleaning[C]. Proceedings of the Twenty-Seventh International Conference on Very Large Data Bases, Roma, Italy, 2001: 381-390.
    [172].中国互联网络信息中心. 2007年中国搜索引擎市场调查报告[R/OL]. 2007[2008-10-06]. http://www.cnnic.cn/html/Dir/2007/09/26/4815.htm.
    [173]. KRAINES S, GUO W, KEMPER B, et al. EKOSS: A Knowledge-User Centered Approach to Knowledge Sharing, Discovery and Integration on the Semantic Web[C]. Proceedings of the Fifth International Semantic Web Conference, Athens, GA, USA, 2006, LNCS 4273: 833-846.
    [174]. ZHANG Yi, VASCONCELOS W, SLEEMAN D. OntoSearch: An Ontology Search Engine[C]. Proceedings of the Twenty-Fourth SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, Cambridge, UK, 2004.
    [175]. ABOUELHODA M I, OHLEBUSCH E, KURTZ S. Optimal exact string matching based on suffix arrays[C]. Proceedings of the Ninth International Symposium on String Processing and Information Retrieval, Lisbon, Portugal, 2002, LNCS 2476: 31-34.
    [176]. MANBER U, MYERS G. Suffix arrays: a new method for on-line search[J]. SIAM Journal on Computing, 1993, 22(5): 935-948.
    [177]. SIRIN E, PARSIA B, GRAU B C, et al. Pellet: A Practical OWL-DL Reasoner[J]. Web Semantics: Science, Services and Agents on the World Wide Web, 2007, 5(2): 51-53.
    [178]. Jena2 Database Interface-Release Notes[OL]. [2008-10-06]. http://jena.sourceforge.net/DB/index.html.