Relevant Techniques of Named Entity Query Processing for Search Engine

Original title (Chinese): 搜索引擎中命名实体查询处理相关技术研究
Author: Wu Dayong (伍大勇)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: named entity; query segmentation; synonymous attribute; query intent; query pattern
Degree year: 2012
Advisors: Liu Ting (刘挺); Zhang Yu (张宇)
Discipline code: 081203
Degree-granting institution: Harbin Institute of Technology
Thesis submission date: 2012-09-01

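Abstract

The Internet has become an important platform for obtaining information and carrying out transactions. As data and application resources on the Internet grow rapidly, search engines have become a necessary tool for finding information quickly and accurately among massive online resources. Users express their information needs by submitting queries to a search engine, and the search engine analyzes those queries to return the results users need, so the query is the essential channel of information between the user and the search engine. For a search engine to understand accurately the information need that a query expresses, research on automatic query analysis and processing is required.
     Named entity queries are an important class of queries. They account for a large share of search engine queries and have characteristics of their own. Studying techniques for processing named entity queries lets a search engine analyze users' search intent better, return accurate results, and improve the search experience. Named entity query processing usually covers obtaining the semantic segments of a query, recognizing the entities a query contains, and analyzing the search intent of named entity queries. Accordingly, this dissertation studies named entity query processing in the following areas.
     1. Unsupervised automatic query segmentation based on a monolingual word alignment model. Query segmentation is a basic and necessary query processing task: it splits a query, given as a character sequence, into semantic units such as words or phrases. Because the vocabulary that appears in queries is huge and includes many nonstandard words, supervised methods need large amounts of manually annotated training data and therefore adapt poorly to query segmentation. This dissertation proposes an unsupervised query segmentation method based on a monolingual word alignment model. The method trains the segmentation model automatically from the query log alone and combines character co-occurrence, position, and fertility information in the model, which yields good segmentation results. On top of term-level segmentation, the dissertation further segments queries hierarchically, representing a query as a tree of segments; the hierarchical result shows which segments of a query are more closely related. Experiments show that the method segments queries better than existing segmentation methods.
     2. Mining named entities from query logs based on a random walk model on a graph. A query log is a data resource that contains a large number of named entities. Named entities mined from query logs better match the way users write entity names when composing queries, and because query logs are updated continuously they record newly emerging entity names. This makes mining named entities from query logs of practical value for a search engine's handling of named entity queries. The dissertation uses a weakly supervised method: a small number of named entity names from the target category serve as seeds; the candidate named entities extracted from the query log, the context templates of named entities in queries, and the URLs users clicked are used to build a tripartite graph; and a random walk algorithm on the graph retrieves the named entities of the target category. Experiments show that the method combines the named-entity-related information in the query log effectively and improves the accuracy of named entities obtained from query logs.
     3. Acquiring synonymous attribute phrases for named entities from online encyclopedias. Among the attribute phrases of a named entity, phrases that express the same attribute in different forms are called synonymous attribute phrases. Obtaining an entity's synonymous attribute phrases helps in analyzing the search intent of named entity queries: in such queries, users typically build the query with an attribute phrase to express that they want the value of that attribute. The dissertation obtains the attribute phrases of named entities from online encyclopedias and uses a classification framework that combines multiple features to identify the synonymous attribute phrases among them. To our knowledge, this is the first study to propose using online encyclopedias to acquire synonymous attribute phrases. Experiments show that online encyclopedias are an effective resource for acquiring synonymous attribute phrases of entities and that the proposed method acquires a large number of them effectively.
     4. Recognizing the search intent of named entity queries. This work has two parts: classification-based recognition of query intent, and finer-grained recognition of intent based on query search patterns. Query intent classification restricts the category space of the search results and so improves retrieval accuracy. The classifier fuses information from multiple resources: effective intent classification features are obtained by analyzing the query text, the query log, and web search results. From the informational and transactional named entity queries identified by the intent classification model, the dissertation then extracts the query search patterns that users often use and clusters patterns with similar search intent. Query search patterns can be matched against submitted queries to help the search engine analyze their search intent accurately. Patterns are acquired by cascading a graph-model-based method and a similarity-based method. Experiments show that the method acquires query search patterns effectively across multiple entity categories.
     In summary, this dissertation studies several key techniques for named entity query processing. Some of these techniques were designed for broader applicability, so they apply not only to named entity queries but to other queries as well. The research has produced some preliminary conclusions and results, which we hope will benefit named entity query processing in search engines.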

English abstract

The Internet is now an important platform on which people access information and make transactions. As information and application resources on the Internet grow explosively, the search engine has become an indispensable tool that helps people reach the information they need quickly and precisely. Users issue queries to a search engine to represent their information needs, and the search engine provides the results they need by analyzing those queries. Queries are thus the medium through which users' information needs are delivered to a search engine. For a search engine to understand the information needs behind queries better, research on techniques for processing and analyzing queries is necessary.
     Named entity queries are an important type of query and make up a high percentage of search engine queries. They have special features and attributes. Research on named entity query processing helps a search engine better understand the search intent that users express through their queries, which in turn helps it provide more precise search results and a better search experience. Relevant research on named entity query processing includes acquiring semantic segments in queries, recognizing the named entities in queries, and analyzing the search intent of queries. The main contents of our research can be summarized as follows:
     1. Unsupervised query segmentation based on a monolingual word alignment model. Query segmentation, a fundamental and essential query processing task, obtains a sequence of words or phrases by segmenting a sequence of characters. A large number of words appear in queries, and many of them are informal. Supervised segmentation methods need a large amount of manually annotated training data, which makes them unsuitable for query segmentation. We therefore propose an approach to unsupervised query segmentation in which the segmentation model is trained using only the query log. Because it effectively combines information about character co-occurrence, position, and fertility in queries, the query segmentation model achieves good performance. We further study multilevel query segmentation, in which a query is parsed into a tree structure that shows which segments in a query are closely related to each other. The experimental results show that our approach achieves higher accuracy than existing methods.
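The alignment idea can be illustrated with a minimal sketch, assuming whitespace-separated tokens stand in for the characters of a query. It trains an IBM Model 1 style table by EM with each query aligned to itself, then joins adjacent tokens whose alignment probability is high in both directions. The position and fertility information mentioned above is left out, and the function names, the toy log, and the 0.4 threshold are illustrative assumptions rather than the dissertation's implementation.

```python
from collections import defaultdict


def train_self_alignment(queries, iterations=5):
    """EM for an IBM Model 1 style table t[(target, source)].

    Each query is aligned with itself; a token may not align to its own position.
    """
    vocab = {tok for q in queries for tok in q}
    uniform = 1.0 / len(vocab)
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (target, source) links
        total = defaultdict(float)  # expected number of links leaving each source
        for q in queries:
            for j, target in enumerate(q):
                sources = [s for i, s in enumerate(q) if i != j]
                z = sum(t[(target, s)] for s in sources)
                if z == 0.0:
                    continue  # single-token query: nothing to align with
                for s in sources:
                    posterior = t[(target, s)] / z
                    count[(target, s)] += posterior
                    total[s] += posterior
        t = defaultdict(float)
        for (target, s), c in count.items():
            t[(target, s)] = c / total[s]
    return t


def segment(query, t, threshold=0.4):
    """Join adjacent tokens when both alignment directions reach the threshold (assumed rule)."""
    if not query:
        return []
    segments = [[query[0]]]
    for left, right in zip(query, query[1:]):
        score = min(t.get((right, left), 0.0), t.get((left, right), 0.0))
        if score >= threshold:
            segments[-1].append(right)
        else:
            segments.append([right])
    return segments


# Hypothetical toy query log, already tokenized.
LOG = [
    ["new", "york", "hotel"],
    ["new", "york", "weather"],
    ["cheap", "hotel"],
    ["weather", "forecast"],
]

table = train_self_alignment(LOG)
print(segment(["new", "york", "hotel"], table))  # [['new', 'york'], ['hotel']]
```

In this toy log "new" and "york" always occur together, so their mutual alignment probability stays above one half, while "hotel" also has to explain "cheap" and falls below the threshold.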
     2. Mining named entities in query logs based on a random walk on a graph. The queries in a query log contain many named entities. Named entities mined from a query log match the forms that users actually use when constructing queries. The query log of a search engine is constantly updated and can contain a number of new named entities. Mining named entities from it is therefore useful for a search engine that processes named entity queries. This work proposes a weakly supervised method for mining named entities. First, a few manually selected named entities are used as seeds for a given named entity category. Then the context patterns, the candidate named entities, and users' clicked URLs are extracted from the query log using the seeds in a bootstrapping process and used to construct a tripartite graph. Finally, the named entities belonging to the given category are extracted using a random walk algorithm on the graph. The experimental results show that the algorithm effectively exploits the information related to named entities in a query log to improve the performance of named entity mining.
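The graph construction and walk can be sketched as follows. This is a minimal illustration, assuming a random walk with restart at the seeds (the abstract does not say which random walk variant is used) and a hypothetical log of (entity, context pattern, clicked URL) records. The restart probability, the node naming, and the toy data are all assumptions.

```python
from collections import defaultdict


def build_edges(observations):
    """Turn (entity, pattern, url) records into entity-pattern and entity-URL edges."""
    edges = []
    for entity, pattern, url in observations:
        edges.append((("entity", entity), ("pattern", pattern), 1.0))
        edges.append((("entity", entity), ("url", url), 1.0))
    return edges


def random_walk_with_restart(edges, seeds, restart=0.15, iterations=50):
    """Propagate seed mass over an undirected weighted graph; return node scores."""
    neighbors = defaultdict(dict)
    for a, b, weight in edges:
        neighbors[a][b] = neighbors[a].get(b, 0.0) + weight  # repeated records add up
        neighbors[b][a] = neighbors[b].get(a, 0.0) + weight
    restart_dist = {seed: 1.0 / len(seeds) for seed in seeds}
    scores = dict(restart_dist)
    for _ in range(iterations):
        nxt = defaultdict(float)
        for node, mass in scores.items():
            out = neighbors.get(node, {})
            degree = sum(out.values())
            for nb, weight in out.items():
                nxt[nb] += (1.0 - restart) * mass * weight / degree
        for seed, p in restart_dist.items():
            nxt[seed] += restart * p  # jump back to the seed entities
        scores = dict(nxt)
    return scores


def rank_candidates(scores, seeds):
    """Return non-seed entity names, highest score first."""
    seed_set = set(seeds)
    found = [(node, s) for node, s in scores.items()
             if node[0] == "entity" and node not in seed_set]
    found.sort(key=lambda item: item[1], reverse=True)
    return [node[1] for node, _ in found]


# Hypothetical log records; "#" marks the entity slot in a context pattern.
OBSERVATIONS = [
    ("inception", "# tickets", "imdb.com"),
    ("inception", "# cast", "imdb.com"),
    ("avatar", "# tickets", "imdb.com"),
    ("avatar", "# cast", "imdb.com"),
    ("paris", "# tickets", "booking.com"),
    ("paris", "# hotels", "booking.com"),
]
SEEDS = [("entity", "inception")]  # one seed for the target category "movie"

scores = random_walk_with_restart(build_edges(OBSERVATIONS), SEEDS)
print(rank_candidates(scores, SEEDS))  # "avatar" ranks above "paris"
```

A candidate that shares both context patterns and clicked URLs with the seed collects more of the walk's mass than one that shares only an ambiguous pattern, which is how the three kinds of evidence reinforce each other in this sketch.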
     3. Acquiring synonymous attribute phrases for named entities via online encyclopedias. A named entity has a number of attributes that describe its properties or features. Synonymous attribute phrases are phrases that refer to the same attribute of a named entity category with different surface forms. In named entity queries, attribute phrases are usually used to express the intent of finding the corresponding attribute value, so synonymous attribute phrases are beneficial for analyzing the search intent of named entity queries. This work exploits online encyclopedias to acquire the attribute phrases of named entities and identifies the synonymous attributes among them using a classification framework that combines multiple features. To our knowledge, this is the first attempt to acquire synonymous attribute phrases using online encyclopedias. The experimental results show that online encyclopedias are rich resources for acquiring synonymous attribute phrases and that our approach can effectively acquire a large number of them.
     4. Recognizing the intents of named entity queries. This work includes two parts: one identifies query intents by classification, from the perspective of coarse-grained intent analysis; the other acquires search patterns of named entity queries, from the perspective of fine-grained intent analysis. For query intent classification, we adopt a classification approach that combines multiple effective features acquired from different resources, including semantic and syntactic analysis of the query text, information obtained from the query log, and the contents of results returned by a search engine. Query intent classification can limit the search space of a search engine based on the classified information and thus improve the precision of search results. We use the informational and transactional named entity queries recognized by the query intent classification model to extract the query patterns that users often use in queries. The query patterns are clustered into groups, and the patterns in a group share the same search intent. When a query pattern matches a query issued to a search engine, the search engine can accurately capture the search intent of the query. This work proposes a cascade method in which a graph-based method and a similarity-based method are applied successively to extract query patterns from named entity queries. The experimental results demonstrate that our method can effectively acquire query patterns for multiple named entity categories.
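To make the notion of a query pattern concrete, the sketch below replaces the entity mention in a query with a category slot, keeps patterns that recur in a log, and matches new queries against them. It only illustrates what a pattern is and how matching works; it is not the cascade of graph-based and similarity-based methods described above, and the entity dictionary, the log, and the `min_count` cutoff are hypothetical.

```python
from collections import Counter

# Hypothetical entity dictionary, keyed by category.
ENTITIES = {"movie": {"inception", "avatar"}, "city": {"paris", "new york"}}


def to_pattern(query, entities):
    """Replace the longest known entity mention with a slot; return (pattern, entity) or None."""
    tokens = query.lower().split()
    for length in range(len(tokens), 0, -1):  # try the longest mention first
        for start in range(len(tokens) - length + 1):
            mention = " ".join(tokens[start:start + length])
            for category, names in entities.items():
                if mention in names:
                    slotted = tokens[:start] + [f"<{category}>"] + tokens[start + length:]
                    return " ".join(slotted), mention
    return None


def mine_patterns(log, entities, min_count=2):
    """Count patterns over a query log and keep those seen at least min_count times."""
    counts = Counter()
    for query in log:
        result = to_pattern(query, entities)
        if result is not None and " " in result[0]:  # skip bare entity queries
            counts[result[0]] += 1
    return {pattern: c for pattern, c in counts.items() if c >= min_count}


def match_query(query, patterns, entities):
    """Return (pattern, entity) when the query instantiates a mined pattern, else None."""
    result = to_pattern(query, entities)
    if result is not None and result[0] in patterns:
        return result
    return None


# Hypothetical query log.
LOG = [
    "inception trailer", "avatar trailer",
    "inception cast", "avatar cast",
    "watch inception online", "watch avatar online",
    "paris hotels", "avatar",
]

patterns = mine_patterns(LOG, ENTITIES)
print(patterns)
print(match_query("Inception cast", patterns, ENTITIES))  # ('<movie> cast', 'inception')
```

A matched pattern such as `<movie> cast` can then be mapped to the intent of its cluster, so every query that instantiates it inherits that intent.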
     In summary, this dissertation describes research on some crucial techniques of named entity query processing, some of which can be applied not only to named entity queries but also to general queries. The research has achieved some preliminary results, which we hope will be helpful to the task of named entity query processing in search engines.

