Research on Entity Search and Entity Resolution Methods
Abstract
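Quickly and accurately finding entities (such as person names, organizations, products, and drugs) and their related information in unstructured and semi-structured data has become critical to many applications, including information retrieval, recommender systems, and social networks. Research in recent years shows that entity-related searches account for a large share of Internet queries, and that this share keeps rising. Compared with single characters or fixed-length phrases, entities describe the semantic features of a text more accurately and thus help users grasp its core content quickly. However, as Internet data keeps growing, information retrieval becomes increasingly difficult, and the non-uniqueness (ambiguity) of entities in particular has become a pervasive problem. First, many different entities share exactly the same name. For example, more than 290,000 people in China are named "Zhang Wei"; when an entity name is entered as a query, the top 100 pages returned by a search engine often refer to several different objects that share that name. Second, the same entity often appears in different data sources under multiple forms (that is, aliases). For example, "the People's Republic of China" is often called "China" or "P.R.C.", and Liu Xiang was once hailed as the "Asian Flying Man". In the pharmaceutical industry, the problems of "one drug, many names" and "one name, many drugs" are also severe, and the non-unique matching of drug names is a major obstacle to correct medication. These two problems are entity name disambiguation and entity alias identification. Their solution processes are opposite in direction yet closely related, and they are the two most important problems in entity search and entity resolution. This thesis surveys work on entity search extensively and analyzes the characteristics of data from different sources, including the surface web, social networks, and enterprise intranets. It then proposes effective solutions for entity name disambiguation and for entity alias discovery. In addition, based on the proposed name disambiguation solution, we developed a people search system, and we extend the proposed alias discovery solution so that it applies to dynamic data environments. Throughout this work we focus on analyzing unstructured text, using natural language processing methods to explore the structural and content features of words, entities, and sentences, and using data mining algorithms to link this information, in order to solve the problems encountered in entity search and entity resolution. The main contributions of this thesis are as follows:
     1. A survey of entity search. We introduce the problems encountered in entity search and the techniques used to address them, and briefly describe existing person-name search systems, related problems in person-name search, and future research directions.
     2. Entity name disambiguation. Taking person-name disambiguation as the example, we use natural language processing tools to extract named entities from the unstructured documents returned by a search engine, treat the extracted entities as person tags, and build a graph over these entity tags. Finally, each of the different people who share a name is assigned a set of entity tags that describes that person uniquely. In addition, the person-name search system we developed submits a given person name as the query to an existing search engine (namely Google, Yahoo, or Bing) and applies the proposed method to disambiguate the returned results, so that users can clearly see the key entity information of each person bearing the queried name.
     3. Entity alias discovery. We study two cases separately: aliases that are string-similar to the entity and aliases that are not. In the first case, we first extract alias candidates based on character similarity and then build an entity-relationship graph to select aliases. The case where an alias has essentially no character similarity with the original entity is more challenging; here we propose a method based on entity subset partitioning to filter alias candidates, and then use an active-learning classification method to determine the final aliases of the given entity. Overall, the alias discovery methods in this thesis explore the relationships among entities in a given dataset, apply an initial filtering step to extract alias candidates for a given entity, then use unsupervised or supervised methods to measure the relatedness between the entity and each candidate, and finally output a list of aliases for every given entity.
     4. Dynamic entity alias discovery. As new data is added to a given dataset, the entity-relationship graph built on that dataset must be updated accordingly (insertion, deletion, and modification of vertices and edges), and earlier static solutions no longer suit such a dynamic environment. We therefore propose a path search method based on an entity index to support updates of the dynamic graph, and apply this dynamic scheme to incremental entity alias discovery.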
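The tag-graph idea in contribution 2 can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes the named entities of each returned document have already been extracted, links two tags when they co-occur in a document, and uses connected components as a stand-in for the clustering step. The function names `build_tag_graph` and `cluster_tags` and the `min_cooccurrence` threshold are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_tag_graph(doc_tags):
    """Count, for every pair of tags, how many documents mention both."""
    weights = defaultdict(int)
    for tags in doc_tags:
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    return weights


def cluster_tags(doc_tags, min_cooccurrence=1):
    """Group tags into connected components of the thresholded graph.

    Each component is read as the tag set of one person (an assumption
    of this sketch, not a claim about the thesis's clustering method).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Register every tag so isolated tags still form their own cluster.
    for tags in doc_tags:
        for tag in tags:
            find(tag)
    # Merge tags joined by a sufficiently strong co-occurrence edge.
    for (a, b), weight in build_tag_graph(doc_tags).items():
        if weight >= min_cooccurrence:
            union(a, b)

    clusters = defaultdict(set)
    for tag in list(parent):
        clusters[find(tag)].add(tag)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical tags extracted from four pages returned for one person name.
pages = [
    ["Tsinghua University", "Beijing", "database"],
    ["Tsinghua University", "database"],
    ["Shanghai", "football"],
    ["football", "Shanghai Shenhua"],
]
print(cluster_tags(pages))
```

With the default threshold the four pages fall into two tag clusters, one per presumed person. Raising `min_cooccurrence` makes the grouping stricter, at the cost of splitting weakly connected tags into singleton clusters.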
Quickly and accurately searching for entities (e.g., person names, organizations, locations, products, and drugs) in unstructured or semi-structured data is increasingly important in a wide range of applications, such as information retrieval, recommender systems, and social network mining. Studies in recent years show that entity search accounts for a large share of Internet queries, and that this share has been rising. Compared with words and n-grams, entities describe context features more precisely, which helps users quickly grasp the key points of a document. However, with the continuing growth of Internet data, entity search becomes more challenging, especially because of entity ambiguity. First, many different entities may have exactly the same name. For example, more than 290,000 people in China are named "Zhang Wei"; given an entity name as a query, the top 100 results of a search engine may refer to several different entities that share that name. Second, a single entity is often mentioned in a variety of forms (i.e., aliases). For example, "the People's Republic of China" is well known as "China" or "P.R.C.", and Liu Xiang has the nickname "Asian Flying Man". In the pharmaceutical industry, several drugs may share one name and one drug may go by several names, which is a serious problem that endangers correct medication.
     Entity name disambiguation and entity alias discovery are opposite in direction yet closely related, and they are regarded as the two most important problems in entity search and entity resolution. This thesis surveys previous research on entity search and analyzes the characteristics of data from different sources, including the surface web, social networks, and enterprise intranets. We then propose effective solutions for entity name disambiguation and entity alias discovery, respectively. In addition, based on the disambiguation solution, we develop a people search system, GRAPE, and we extend the alias discovery solution to adapt it to dynamic environments. The main contributions of this thesis are as follows:
     1. A survey of entity search. We present the problems and solutions in entity search and describe some existing entity search systems. We also summarize open issues and future research directions for people search systems.
     2. Entity name disambiguation. Given a person name as the query, we obtain unstructured documents returned by existing search engines (i.e., Google, Yahoo, or Bing). We then use a natural language processing tool to extract eight types of named entities from the documents as tags. Based on the extracted tags, an entity-relationship graph is built, and the tags are grouped into several clusters, each of which uniquely describes one person. Additionally, a practical people search system, GRAPE, is deployed based on the proposed solution; it presents a cluster of tags for each of the different persons sharing the same name.
     3. Entity alias discovery. We design a string matching method to extract a few alias candidates for each given entity. By exploring the entity relationships in both structured and unstructured data, we build an entity-relationship graph and then measure the graph-based connectivity between a given entity and each of its alias candidates. Finally, each given entity is assigned a list of aliases. Moreover, to handle aliases that have no string similarity with the original entity, we present a subset-based method to choose alias candidates and then obtain a few aliases for each given entity through prediction by a logistic regression classifier.
     4. Dynamic entity alias discovery. As the data corpus is updated, the corresponding entity-relationship graph keeps changing, and previous solutions based on static datasets are no longer applicable. In this thesis, we propose an entity-index strategy for path searching in a dynamic graph and apply this strategy to the real application of incremental entity alias discovery.
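Contributions 3 and 4 can be illustrated together with a rough sketch under stated assumptions. The abstract does not specify the string matching method or the structure of the entity index, so this example substitutes `difflib.SequenceMatcher` from the Python standard library for candidate extraction and a plain adjacency index with bounded breadth-first search for path searching. The names `alias_candidates`, `EntityGraph`, `min_ratio`, and `max_hops` are hypothetical.

```python
from collections import defaultdict, deque
from difflib import SequenceMatcher


def alias_candidates(entity, names, min_ratio=0.6):
    """Return names whose character-level similarity to `entity` passes a threshold."""
    scored = []
    for name in names:
        if name == entity:
            continue
        ratio = SequenceMatcher(None, entity.lower(), name.lower()).ratio()
        if ratio >= min_ratio:
            scored.append((ratio, name))
    # Most similar candidates first.
    return [name for _, name in sorted(scored, reverse=True)]


class EntityGraph:
    """Adjacency index keyed by entity, updatable one edge at a time."""

    def __init__(self):
        self.neighbors = defaultdict(set)

    def add_edge(self, a, b):
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)

    def remove_edge(self, a, b):
        self.neighbors[a].discard(b)
        self.neighbors[b].discard(a)

    def hops(self, source, target, max_hops=3):
        """Bounded breadth-first search: hop count, or None if out of reach."""
        if source == target:
            return 0
        seen = {source}
        frontier = deque([(source, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == max_hops:
                continue  # do not expand beyond the hop budget
            for nxt in self.neighbors.get(node, ()):
                if nxt == target:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, dist + 1))
        return None


# Hypothetical usage: keep candidates that are also close in the graph.
graph = EntityGraph()
graph.add_edge("Zhang Wei", "Tsinghua University")
graph.add_edge("Tsinghua University", "zhang wei")
for candidate in alias_candidates("Zhang Wei", ["zhang wei", "Li Na"], 0.9):
    print(candidate, graph.hops("Zhang Wei", candidate))
```

Because edges are inserted and removed directly in the index, a later query sees the updated graph without rebuilding it, which is the property an incremental setting needs. A real system would likely score connectivity with something richer than hop count.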
引文
8 1www.miv.t. u-tokyo.ac.jp/danushka/aliasdata.zip
    9 2http://www.autonlab.org/autonweb/15962.html?branch=1&language=2
    [1]Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.:Automatic linkage of vital records. Science 130(3381), pages 954-959,1959.
    [2]Hern'andez, M.A., Stolfo, S.J.:The merge/purge problem for large databases. In Proceedings of ACM SIGMOD, pages 127-138,1995.
    [3]R. Nuray-Turan, D. V. Kalashnikov, and S. Mehrotra. Self-tuning in graph-based reference disambiguation. In Proc. of the 12th International Conference on Database Systems for Advanced Applications (DASFAA),2007
    [4]Fan, W., Jia, X., Li, J., and Ma, S. Reasoning about Record Matching Rules. PVLDB, 2(1):407-418,2009
    [5]Sarawagi, S., Bhamidipaty, A.:Interactive deduplication using active learning. In Proceedings of ACM SIGKDD,2002.
    [6]Dong, X., Halevy, A.Y., Madhavan, J.:Reference reconciliation in complex information spaces. In Proceedings of ACM SIGMOD,2005. Tejada, S., Knoblock, C.A., Minton, S.:Learning object identification rules for information integration. Information Systems Journal 26(8), pages 635-656,2001.
    [7]T. Kirsten, L. Kolb, M. Hartung, A. Gross, H. Kpcke, and E. Rahm. Data partitioning for parallel entity matching. Computing Research Repository,2010.
    [8]Lars Kolb, Hanna Kopcke, Andreas Thor, and Erhard Rahm. Learning-based entity resolution with MapReduce. In Proceedings of the third international workshop on Cloud data management (CloudDB), pages 1-6.2011
    [9]Surajit C., Kris G., Venkatesh G., and Rajeev M.. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data (SIGMOD), pages 313-324,2003.
    [10]Tianfang Y., and Hans U. A Novel Machine Learning Approach for the Identification of Named Entity Relations. Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in NLP, Pages 1-8,2009
    [11]Gjergji K., Shady E., and Gerhard W.2009. MING:mining informative entity relationship subgraphs. In Proceedings of the 18th ACM conference on Information and knowledge management (CIKM), pages 1653-1656,2009.
    [12]Gae-won Y., Seung-won H., Zaiqing N., and Ji-Rong W.2011. SocialSearch: enhancing entity search with social network matching. In Proceedings of the 14th International Conference on Extending Database Technology (EDBT/ICDT),2011.
    [13]S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In EMNLP Conference, pages 708-716,2007.
    [14]R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In European Chapter of the Assocation for Computational Linguistics (EACL).2006.
    [15]S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of Wikipedia entities in web text. In Proceedings of KDD, pages 457-466,2009.
    [16]A. Bagga and B. Baldwin, Entity-based cross-document coreferencing using the vector space model. In Proceedings of COLING-ACL, pages 79-85,1998.
    [17]Guha, R. and Garg, A.2004. Disambiguating people in search. In Proceedings of WWW, 2004.
    [18]X. Fan, J. Wang, B. Lv, L. Zhou, and W. Hu, Ghost: An effective graph-based framework for name distinction, In Proceedings of CIKM, pages.1449-1450,2008.
    [19]R. Bunescu and M. Pasca, Using encyclopedic knowledge for named entity disambiguation. In Proceedings of EACL, pages.9-16,2006.
    [20]D. V. Kalashnikov, R. Nuray-Turan, and S. Mehrotra. Towards breaking the quality curse: a web-querying approach to web people search. In Proceedings of International ACM SIGIR Conference, pages.27-34,2008.
    [21]Bollegala, D., Honma, T., Matsuo, Y., and Ishizuka, M. Automatically extracting personal name aliases from the web. Proceedings of NLP, pages 77-88,2008.
    [22]Holzer, R., Malin, B., and Sweeney, L. Email alias detection using social network analysis. In Proceedings of LinkKDD, pages 52-57,2005.
    [23]Chaudhuri, S., Ganti, V., and Xin, D. Exploiting web search to generate synonyms for entities. In Proceedings of WWW, pages 151-160,2009.
    [24]Surajit, C., venkatesh, G., and Dong, X. Mining document collections to facilitate accurate approximate entity matching. Proceedings Of VLDB Endow 2,1:395-406, 2009.
    [25]Sapena, E., padro, L., and turmo, J. Alias assignment in information extraction. Procesamiento del Lenguaje Natural 69,39:105-112,2007.
    [26]Benjelloun, O., garcia-molina, H., menestrina, D., SU, Q., whang, S., and widom, J. Swoosh:a generic approach to entity resolution. VLDB J.18,1:255-276,2009
    [27]Brizan, D. G., and Tansel, A. U. A survey of entity resolution and record linkage methodologies. Communications of LIMA 6,3,2006.
    [28]S., Xiong, Y., Yao, C., Zheng, L., and Liu, W. Acronym extraction and disambiguation in large-scale organizational web pages. In Proceeding of CIKM, pages 1693-1696,2009.
    [29]Zahariev, m. A (Acronyms). Ph.d. thesis, School of Computing Science, Simon Fraser University,2004.
    [30]G. Hu, J. Liu, H. Li, Y. Cao, J.-Y. Nie, and J. Gao. A Supervised Learning Approach to Entity Search. In Information Retrieval Technology, Lecture Notes in Computer Science, pages 54-66,2006.
    [31]T. Cheng and K. C.-C. Chang, Entity search engine: Towards agile best-effort information integration over the web. In proceedings of CIDR,2007.
    [32]Tao Cheng, Xifeng Yan, and Kevin Chen-Chuan Chang. EntityRank:searching entities directly and holistically. In Proceedings of the 33rd international conference on Very large data bases (VLDB), pages 387-398,2007.
    [33]Henning Rode from Document to Entity Retrieval-Improving Precision and Performance of Focused Text Search. Dutch Research School for Information and Knowledge Systems. Ph.D thesis.2008.
    [34]S. Endrullis, A. Thor, and E. Rahm, Evaluation of query generators for entity search engines, in Workshop on Using Search Engine Technology for Information Management (USETIM),2009.
    [35]Soumen C., Devshree S., and Ganesh R. Web-scale entity-relation search architecture. In Proceedings of the 20th international conference companion on World Wide Web (WWW), pages 21-22,2011.
    [36]Stefan E., Andreas T., and Erhard R., Entity Search Strategies for Mashup Applications. In proceedings of ICDE,2012
    [37]Krisztian Balog, Marc Bron, and Maarten De Rijke.2011. Query modeling for entity search based on terms, categories, and examples. ACM Trans. Inf. Syst.29,4,31 pages, 2011.
    [38]Westerveld, T., de Vries, A. and de Jong, F., Generative probabilistic models. Multimedia Retrieval, Data-Centric Systems and Applications. Springer Berlin Heidelberg, pages 177-198,2007.
    [39]Sofia J. Athenikos and Xia Lin. Multifaceted EntityFactRelation Retrieval via Semantic Search Interface based on Domain Knowledge Extraction. WSDM,2012.
    [40]Cody Kwok, Oren Etzioni, and Daniel S. Weld. Scaling question answering to the Web. In Proceedings of the Tenth International World Wide Web Conference (WWW),2010.
    [41]E. Brill, S. Dumais, M. Banko. An analysis of the askMSR question-answering system. In Proceedings of the EMNLP, pages 257-264,2002.
    [42]Jimmy Lin and Boris Katz. Question answering from the Web using knowledge annotation and knowledge mining techniques. In Proceedings of the Twelfth International Conference on Information and Knowledge Management (CIKM).2003.
    [43]E. Agichtein, S. Lawrence, and L. Gravano. Learning search engine specific query transformations for question answering. In Proceedings of the 10th International World Wide Web Conference (WWW), pages 169-178,2011.
    [44]Liu, K., Meng, W., Qiu, J., Yu, C., Wu, V. R. V. Z., and Y. Allinonenews:development and evaluation of a large-scale news metasearch engine. In Proceedings of the ACM SIGMOD international conference (SIGMOD),2007.
    [45]Carrot2. SE. http://project.carrot2.org/.
    [46]Wan, X., Gao, J., Li, M., and Ding, B.2005. Person resolution in person search results: Web-hawk. In Proceedings of 14th ACM International Conference on Information and Knowledge Management (CIKM),2005.
    [47]S. Kotsiantis, D. Kanellopoulos and P. Pintelas, Data Preprocessing for Supervised Leaning, International Journal of Computer Science,2006.
    [48]N Tyagi,A. Solanki and S. Tyagi, An algorithmic approach to data preprocessing in web usage mining, International journal oflnformation technology and knowledge management, Volume 2, No.2, pages 279-283,2010.
    [49]Davis, Jonathan Jeremy and Clark, Andrew J. Data preprocessing for anomaly based network intrusion detection: a review. Computers & Security,30(6-7), pages 353-375, 2011.
    [50]Song, R., Liu, H., Wen, J.-R., and W.-Y, M.2004. Learning block importance models for web pages. In Proceedings of the 13th international conference on World Wide Web (WWW),2004.
    [51]Laender, A., Ribeiro-Neto, B., Silva, A., and Teixeira, J. A brief survey of web data extraction tools. SIGMOD Record 31,2 (June), pages 84-93.2002.
    [52]Kushmerick, N. Gleaning the web. IEEE Intelligent System 14,2, pages 20-22,1999.
    [53]Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A., Shaked, T., Soderland, S., Weld, D., and Yates, A. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence 165,1, pages 91-134,2005..
    [54]Tang, J., Hong, M., Zhang, D., Liang, B., and Li, J. Information extraction:Method-ologies and applications. In the book of Emerging Technologies of Text Mining: Techniques and Applications, pages 1-33,2007.
    [55]Soderland, S. Learning information extraction rules for semi-structured and free text. Machine Learning 34,1-3, pages 233-272,1999.
    [56]Balog, K., Azzopardi, L., and de Rijke, M.2006. Formal models for expert fnding in enterprise corpora. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR),2006.
    [57]William W. Cohen, Matthew Hurst, and Lee S. Jensen. A flexible learning system for wrapping tables and lists in HTML documents. In Proceedings of the 11th international conference on World Wide Web (WWW). Pages 232-241,2002。
    [58]Kushmerick, N., Weld, D., and Doorenbos, R. Wrapper induction for information extraction. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pages 729-737,1997.
    [59]Hu, G., Liu, J., Li, Y. C. H., Nie, J.-Y., and Gao, J. A supervised learning approach to entity search. In Proceedings of Asia Information Retrieval Symposium (AIRS), pages 54-66,2006.
    [60]Popescu, O., FBK-irst, Magnini, T. B., FBK-irst, and Trento. Irst-bp:Web people search using name entities. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007).
    [61]Extractor, E. Website, http://search.iiit.ac.in/-xtract.
    [62]Zeng, H., Qicai, H., Zheng, C., and et al. Learning to cluster web search results. In Proceedings of the 27th Annual International ACM SIGIR Conference (SIGIR), pages 210-217,2004.
    [63]Bekkerman, R. and McCallum, A. Disambiguating web appearances of people in a social network. In Proceedings of the International World Wide Web Conference (WWW),2005.
    [64]Bing, L. and Wee, C. C. Searching people on the web according to their interests. In Proceedings of the 11th international conference on World Wide Web(WWW),2002.
    [65]Elmacioglu, E., Tan, Y. F., Yan, S., Kan, M.-Y., and Lee, D. Psnus:Web people name disambiguation by simple clustering with rich features. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval),2007.
    [66]Iria, J., Xia, L., and Zhang, Z. Wit: Web people search disambiguation using random walks. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval, pages 480-483,2007.
    [67]Guha, R. and Garg, A. Disambiguating people in search. In Proceedings of the 13th International World Wide Web Conference (WWW),2004.
    [68]Hong, Y., On, B.-W., and Lee, D. System support for name authority control problem in digital libraries:Opendblp approach. European Conf. on Digital Libraries (ECDL), pages 134-144.2004.
    [69]Hui, H., Hongyuan, Z., and Lee, G. Name disambiguation in authorcitations using a k-way spectral clustering method. In Proceedings of ACM/IEEE Joint Conference on Digital Libraries (JCDL),2005.
    [70]Hui, H., Lee, G. C., Hongyuan, Z., Cheng, L., and Kostas, T. Two supervised learning approaches for name disambiguation in author citations. In Proceedings of ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 296-305,2004.
    [71]On, B.-W., Lee, D., Kang, J., and Mitra, P. Comparative study of name disambiguation problem using a scalable blocking-based framework. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 344-353,2005.
    [72]Bollegara, D., Matsuo, Y., and Ishizuka, M. Disambiguating personal names on the web using automatically extracted key phrases. In Proceedings of the biennial European Conference on Artificial Intelligence (ECAI),2006.
    [73]Li, X., Morie, P., and Roth, D. Semantic integration in text: From ambiguous names to identifiable entities. AI Magazine Spring, pages 45-68,2005.
    [74]Fleischman, M. and Hovy, E.2004. Multi-document person name resolution. In Proceedings of the Workshop on Reference Resolution and its Applications:ACL2004.
    [75]DeRose, P., Shen, W., Chen, F., Lee, Y., Burdick, D., Doan, A., and Ramakrishnan, R. 2007. Dblife: A community information management platform for the database research community. In Proceedings of CIDR 2007.
    [76]Vu, Q., Masada, T., Takasu, A., and Adachi, J. Disambiguation of people in web search using a knowledge base? In proceedings of the IEEE International Conference on Research Innovation and Vision for the Future,2007.
    [77]Niu, C., Li, W., and Srihari. R. K. Weakly supervised learning for cross-document person name disambiguation supported by information extraction. ACL Results Rand and Evaluation.2004.
    [78]Ravin, Y. and Kazi, Z. Is hillary rodham clinton the president? Disambiguating names across documents. In Proceedings of the CL 1999 Workshop on Conference and its Applications (ACL),1999.
    [79]Mann, G. S. and Yarowsky, D.2003. Unsupervised personal name disambiguation. In Proceedings of Conference on Computational Natural Language Learning(CoNLL2003).
    [80]SemEval.2007. http://nlp.cs.swarthmore.edu/semeval/tasks/taskl3/summary.shtml.
    [81]Balog, K., Azzopardi, L., and de Rijke, M.2007. Uva: Language modeling techniques for web people search. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval), pages 468-471,2007.
    [82]Artiles, J., Gonzalo, J., and Sekine, S. The semeval-2007 weps evaluation:Establishing a benchmark for web people search task. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval),2007.
    [83]Ellman, J. and Emery, G. Nn-weps:Web person search using co-present names and lexicalchains. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval),2007.
    [84]Elmacioglu, E., Tan, Y. F., Yan, S., Kan, M.-Y., and Lee, D. Psnus:Web people name disambiguation by simple clustering with rich features. In Proceedings of the 4th InternationalWorkshop on Semantic Evaluations (SemEval),2007.
    [85]del Valle-Agudo, D., de Pablo-Sanchez, C., and Vicente-Diez, M. T.2007. Uc3m13: disambiguation of person names based on the composition of simple bags of typedterms. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval), pages 362-365,2007.
    [86]Iria, J., Xia, L., and Zhang, Z. Wit: Web people search disambiguation using random walks. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval), pages 480-483,2007.
    [87]Saggion, H. Shef: Semantic tagging and summarization techniques applied to cross-document coreference. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval), pages 292-295,2007.
    [88]Popescu, O., FBK-irst, Magnini, T. B., FBK-irst, and Trento. Irst-bp:Web people search using name entities. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval),2007.
    [89]Heyl, A. and Neumann, G. Dfki2:An information extraction based approach to people disambiguation. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval),2007.
    [90]Blume, P. K. M.2007. Fico:Web person disambiguation via weighted similarity of entity contexts. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval),2007.
    [91]Lefever, E., Hoste, V., and Fayruzov, T. Aug: A combined classification and clustering approach for web people disambiguation, In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval),2007.
    [92]Okumura, K. S. M. Titpi:Web people search task using semi-supervised clustering approach. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval), pages 318-321,2007.
    [93]Kozareva, Z., Vazquez, S., and Montoyo, A.2007. Ua-zsa: Web page clustering on the basis of name disambiguation. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval), pages 338-341,2007.
    [94]Elmacioglu, E., Tan, Y. F., Yan, S., Kan, M.-Y., and Lee, D. Psnus:Web people name disambiguation by simple clustering with rich features. In Proceedings of the 4th InternationalWorkshop on Semantic Evaluations (SemEval),2007.
    [95]Han, J. and Kamber, M.2000. Data Mining:Concepts and Techniques. Morgan Kaufmann.
    [96]Zeng, H., Qicai, H., and Zheng, C. Learning to cluster web search results. In Proceedings of the 27th Annual International ACM SIGIR Conference (SIGIR), pages 210-217,2004.
    [97]Joachims, T. Evaluating retrieval performance using clickthrough data. In Proceedings of the SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, 2002.
    [98]Joachims, T.2002. Optimizing search engines using clickthrough data. In proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD),2002.
    [99]Xu, J., Cao, Y., Li, H., and Zhao, M.2005. Ranking definitions with supervised learning methods. Special interest tracks and posters of the 14th international conference on World Wide Web(WWW),2005
    [100]Freund, Y., Iyer, R., Schapire, R., and Singer, Y. An effiient boosting algorithm for combining preferences. Journal of Machine Learning Research,2003.
    [101]Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender,G. Learning to rank using gradient descent. In proceedings of the 22nd International Conference on Machine Learning,2005.
    [102]Qin, T., Zhang, X., Wang, D., Liu, T., Lai, W., and Li, H.. Ranking with multiple hyperplanes. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR),2007.
    [103]Zheng, Z., Zha, H., Chen, K., and Sun, G. A regression framework for learning ranking functions using relative relevance judgments. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR),2007.
    [104]Maconald, C. and Ounis, I. A belief network model for expert search,in studies in theory of information retrieval,. In Proceedings of 1st conference on Theory of Information Retrieval (ICTIR),2007.
    [105]Whang, Steven Euijong and Garcia-Molina, Hector Disinformation Techniques for Entity Resolution. Technical Report. Stanford InfoLab.
    [106]Becerra-Fernandez, I. Facilitating the online seach of experts at nasa using expert seeker people-finder. In Proceedings of the 3rd International Conference on Practical Aspects of Knowledge Management (PAKM),2000.
    [107]Bing, L. and Wee, C. C. Searching people on the web according to their interests. In Proceedings of the 11th international conference on World Wide Web (WWW),2002.
    [108]William, A. and Kirsten, D. Build Your Brand in Bits and Bytes. Morgan Kaufmann. 2007.
    [109]J. Artiles, J. Gonzalo, and F. Verdejo, A testbed for people searching strategies in the www. In Proceedings of the 28th annual International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR), pages 569-570,2005.
    [110]A. Bagga and B. Baldwin, Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL), pages 79-85,1998.
    [111]M. B. Fleischman and E. Hovy. Multi-document person name resolution. In Proceedingsof the Association for Computational Linguistics (ACL), Reference Resolution Workshop,2004.
    [112]R. Bekkerman and A. McCallum, Disambiguating web appearances of people in a social network. In Proceedings of the 14th International World Wide Web Conference (WWW), pages 463-470,2005.
    [113]R. Bunescu and M. Pasca, Using encyclopedic knowledge for named entity disambiguation. In Proceedings of EACL,2006.
    [114]D. Bollegara, Y. Matsuo, and M. Ishizuka, Disambiguating personal names on the web using automatically extracted key phrases. In Proceedings of the biennial European Conference on Artificial Intelligence (ECAI 2006),2006.
    [115]Ravin and Z. Kazi, Is hillary rodham clinton the president? Disambiguating names across documents. In Proceedings of the ACL 1999 Workshop on Conference and its Applications,1999.
    [116]G. S. Mann and D. Yarowsky, Unsupervised personal name disambiguation. In Proceedings of the 7th Conference on Computational Natural Language Learning. Edmonton, Canada,2003.
    [117]H. Han, L. Giles, H. Zha, C. Li, and K. Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In Proceedings of JCDL,2004.
    [118]X. Fan, J. Wang, B. Lv, L. Zhou, and W. Hu. Ghost: An effective graph-based framework for name distinction. In Proceedings of the 10th ACM International Conference on Information and Knowledge Management (CIKM),2008.
    [119]D. V. Kalashnikov, Z. Chen, S. Mehrotra, and R. Nuray-Turan. Web people search via connection analysis. IEEE Transactions on Knowledge and Data Engineering, vol.20, no. 11, pages 1550-1565,2008.
    [120]D. V. Kalashnikov, R. Nuray-Turan, and S. Mehrotra. Towards breaking the quality curse:a web-querying approach to web people search. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008). NewYork, NY, USA:ACM, pages 27-34,2008.
    [121]E. Minkov, W. W. Cohen, and A. Y. Ng. Contextual search and name disambiguation in email using graphs. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR),2006.
    [122]J. Iria, L. Xia, and Z. Zhang. Wit: Web people search disambiguation using random walks. In Proceedings of the 4th International Workshop on Semantic Evaluation (SemEval-2007), pages 480-483.
    [123]J. Artiles, J. Gonzalo, and S. Sekine. The semeval-2007 weps evaluation: Establishing a benchmark for web people search task. In Proceedings of Semeval 2007, Association for Computational Linguistics, pages 9-16,2007.
    [124]A. Javier, J. Gonzalo, and S. Sekine. Weps 2 evaluation campaign:overview of the web people search clustering task. In Proceedings of In 2nd Web People Search Evaluation Workshop (WePS), WWW Conference,2009.
    [125]L. Jiang, J. Wang, N. An, S. Wang, J. Zhan, and L. Li. Two birds with one stone:A graph-based framework for disambiguating and tagging people names in web search. In Proceedings of the 18th Internationa] World Wide Web Conference (WWW),2009.
    [126]C. Niu, W. Li, Srihari., and R. K. Weakly supervised learning for cross-document person name disambiguation supported by information extraction,. ACL Results Rand and Evaluation,2004.
    [127]A. Huang, D. N. Milne, E. Frank, and I. H. Witten, Clustering documents with active learning using Wikipedia. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), pages 839-844,2008.
    [128]P. A. Devijver and J. Kittler. Pattern recognition:A statistical approach. Prentice-Hall, London,1982.
    [129]E. Amigo, J. Gonzalo, J. Artiles, and F. Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval Journal, Springer, Heidelberg,2008.
    [130]Benjelloun, O., Garcia-Molina, H., Menestrina, D., SU, Q., Whang, S., and Widom, J. Swoosh: a generic approach to entity resolution. VLDB J.18,1,2009.
    [131]Bollegala, D., Honma, T., Matsuo, Y., and Ishizuka, M. automatically extracting personal name aliases from the web. Proceedings of the International Conference on Natural Language Processing,2008.
    [132]Bollegala, D., Matsuo, Y., and Ishizuka, M. A co-occurrence graph-based approach for personal name alias extraction from anchor texts. In Proceedings of IJCNLP,2008.
    [133]Brizan, D. G., and Tansel, A. U. A survey of entity resolution and record linkage methodologies. Communications of IIMA 6,3,2006.
    [134]Chaudhuri, S., Ganti, V., and Xin, D. Exploiting web search to generate synonyms for entities. In Proceedings of the World Wide Web Conference,2009.
    [135]William W. Cohen, Pradeep R. and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI Workshop on IIWeb,2003.
    [136]Devijver, P. A., and Kittler, J. Pattern recognition:A statistical approach. Prentice-Hall, London,1982.
    [137]Feng, S., Xiong, Y., Yao, C., Zheng, L., and Liu, W. Acronym extraction and disambiguation in large-scale organizational web pages. In Proceeding of CIKM,2009.
    [138]Ferreira Da Silva, J., and Pereira Lopes, G. A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora. In Proceedings of Sixth MOL. pages 369-381,1999..
    [139]Fouss, F., Pirotte, A., Renders, J.-M., and Saerens, M. Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation. The IEEE Transactions on Knowledge and Data Engineering 19,2006.
    [140]Getoor, L., and Diehl, C. P. Link mining: a survey. SIGKDD Explor. Newsl. 7, 2005.
    [141]Grosvenor, D., and Seaborne, A. Using hybrid search and query for e-discovery identification. Technical Report HPL-2009-155, HP Labs.
    [142]Holzer, R., Malin, B., and Sweeney, L. Email alias detection using social network analysis. In Proceedings of LinkKDD, 2005.
    [143]Antonellis, I., Garcia-Molina, H., and Chang, C. Simrank++: query rewriting through link analysis of the click graph. In Proceedings of the 34th International Conference on Very Large Data Bases, 2008.
    [144]Jeh, G., and Widom, J. SimRank: a measure of structural-context similarity. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 2002.
    [145]Kalashnikov, D., and Mehrotra, S. A probabilistic model for entity disambiguation using relationships. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2005.
    [146]Li, P., Liu, H., Yu, J. X., He, J., and Du, X. Fast single-pair SimRank computation. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2010.
    [147]Makhoul, J., Kubala, F., Schwartz, R., and Weischedel, R. Performance measures for information extraction. In Proceedings of the DARPA Broadcast News Workshop, 1999.
    [148]Sapena, E., Padro, L., and Turmo, J. Alias assignment in information extraction. Procesamiento Del Lenguaje Natural 69, 39, 2007.
    [149]Singla, P., and Domingos, P. Entity resolution with Markov logic. In Proceedings of the International Conference on Data Mining (ICDM), 2006.
    [150]Chaudhuri, S., Ganti, V., and Xin, D. Mining document collections to facilitate accurate approximate entity matching. Proceedings of the VLDB Endowment 2, 1, 2009.
    [151]Winkler, W. E. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research Methods, American Statistical Association.
    [152]Zahariev, M. A (Acronyms). Ph.D. thesis, School of Computing Science, Simon Fraser University, 2004.
    [153]A. Ferreira. On models and algorithms for dynamic communication networks: The case for evolving graphs. In Proceedings of 4e rencontres francophones sur les Aspects Algorithmiques des Telecommunications (ALGOTEL), pages 155-161, 2002.
    [154]J. Sun, C. Faloutsos, S. Papadimitriou, and P. S. Yu. GraphScope: Parameter-free mining of large time-evolving graphs. In KDD, 2007.
    [155]Luhr, S., and Lazarescu, M. Incremental clustering on dynamic data streams using connectivity based representative points. Data & Knowledge Engineering, 68:1-27, 2009.
    [156]Baldridge, J., and Osborne, M. Active learning for HPSG parse selection. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, Volume 4, Association for Computational Linguistics, pages 17-24, 2003.
    [157]Bollegala, D., Honma, T., Matsuo, Y., and Ishizuka, M. Identification of personal name aliases on the web. In Proceedings of World Wide Web, pages 1107-1108, 2008.
    [158]Boongoen, T., Shen, Q., and Price, C. Disclosing false identity through hybrid link analysis. Artificial Intelligence and Law 18, 1, 2010.
    [159]Coimbra, R. S., Vanderwall, D. E., and Oliveira, G. C. Disclosing ambiguous gene aliases by automatic literature profiling. In Proceedings of BMC Genomics, 2010.
    [160]Davis, J., Dutra, I., Page, D., and Santos Costa, V. Establishing identity equivalence in multi-relational domains. In Proceedings of the International Conference on Intelligence Analysis (IA), 2005.
    [161]Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., and Srivastava, D. Approximate string joins in a database (almost) for free. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), pages 491-500, 2001.
    [162]Holzer, R., Malin, B., and Sweeney, L. Email alias detection using social network analysis. In Proceedings of the 3rd International Workshop on Link Discovery (LinkKDD), pages 52-57, 2005.
    [163]Mauricio A. Hernandez and Salvatore J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, pages 9-37, 1998.
    [164]Hsiung, P., Moore, A., Neill, D., and Schneider, J. Alias detection in link data sets. In Proceedings of the International Conference on Intelligence Analysis, 2005.
    [165]Lewis, D. D., and Gale, W. A. A sequential algorithm for training text classifiers. In Croft, W. B. and van Rijsbergen, C. J., editors, Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval, pages 3-12, 1994.
    [166]Oates, T., Bhat, V., and Shanbhag, V. Using latent semantic analysis to find different names for the same entity in free text. In Proceedings of the 4th International Workshop on Web Information and Data Management, pages 31-35, 2002.
    [167]Pantel, P. Alias detection in malicious environments. In Proceedings of the AAAI Fall Symposium on Capturing and Using Patterns for Evidence Detection, pages 14-20, 2006.
    [168]Schrag, R. Eagle y2.5 performance evaluation laboratory documentation version 1.5. Internal report, Information Extraction and Transport, Inc., 2004.
    [169]Tong, S., and Koller, D. Support vector machine active learning with applications to text classification. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 996-1006, 2000.
    [170]Vlachos, A. Active learning with support vector machines. Master of Science, School of Informatics, University of Edinburgh, 2004.