Research on Related Entity Finding and Homepage Finding

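English title: Related Entity Finding and Homepage Finding
Author: Zhou Wenyuan (周文渊)
Degree level: Master's
Discipline: Electronics and Communication Engineering (professional degree)
Chinese keywords (translated): TREC ; REF ; text search ; relevance ; related entity extraction ; Stanford toolkit ; Wikipedia
English keywords: TREC ; REF ; Text Search ; Correlation ; Related Entity Extraction ; Stanford Tools ; Wikipedia
Degree year: 2013
Advisor: Xu Weiran (徐蔚然)
Discipline code: 0852
Degree-granting institution: Beijing University of Posts and Telecommunications
Thesis submission date: 2012-11-25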

Abstract

REF (Related Entity Finding) is a promising research topic within the entity retrieval task of TREC (Text Retrieval Conference), and progress on it is expected to significantly change search engines and the way people process information on the web. Given the information in a topic, the REF task requires a system to extract, from the web and related databases, the related entities that answer the topic together with the homepage of each entity. This thesis surveys the state of research in China and abroad along with several recent algorithms, and analyzes in turn keyword extraction and expansion, text retrieval, paragraph segmentation and relevance computation, named entity recognition, entity ranking, and supporting-document retrieval. The improvements and innovations made in the implementation are as follows:
     (1) Earlier approaches processed the full text of each web page. This work adds processing at the level of short texts, that is, paragraphs, which removes a large amount of irrelevant text, reduces the size of the returned text, and improves the system's processing efficiency.
     (2) Drawing on the structure of Wikipedia, a Wikipedia-based category dictionary is built from its synonyms and hypernyms and used in the entity extraction stage. This accommodates the many fine-grained entity types in this year's REF task and improves the accuracy of entity extraction.
     (3) A word-density-based algorithm is added to check the results of the DCM (document-centric model), with fairly good results. In addition, the parameters in the DCM formula were tuned against the previous year's answers, improving the model.
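
As an illustration of the paragraph-level processing described in item (1), the following is a minimal sketch and not the thesis's actual implementation. It assumes paragraphs are separated by blank lines and scores each paragraph by the fraction of topic keywords it contains. The function names, the scoring rule, and the threshold value are all hypothetical choices made for this example.

```python
import re


def split_paragraphs(text):
    """Split page text into paragraphs on blank lines (an assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def relevance(paragraph, keywords):
    """Fraction of topic keywords that occur in the paragraph (case-insensitive).

    This simple overlap score is an assumption for illustration only.
    """
    tokens = set(re.findall(r"\w+", paragraph.lower()))
    keys = {k.lower() for k in keywords}
    if not keys:
        return 0.0
    return len(keys & tokens) / len(keys)


def filter_paragraphs(text, keywords, threshold=0.5):
    """Return (score, paragraph) pairs at or above the threshold, best first."""
    scored = [(relevance(p, keywords), p) for p in split_paragraphs(text)]
    kept = [(s, p) for s, p in scored if s >= threshold]
    # sorted() is stable, so equally scored paragraphs keep document order.
    return sorted(kept, key=lambda sp: sp[0], reverse=True)
```

Calling `filter_paragraphs(page_text, ["Boeing", "airlines", "aircraft"])` on a page would keep only the paragraphs that mention at least half of the keywords, so later stages such as named entity recognition see far less text than the full page.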

