Construction of User-Query Semantic Ontology (UQSO) for Personalized Topic Search Engine

Original title: 面向个性化主题搜索的用户—查询词语义本体构建
Author: Mingli Feng (冯明丽)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: User-Query Semantic Ontology ; User query logs ; Clustering ; WordNet ; Ontology building ; Topic search engine
Degree year: 2010
Advisor: Yajun Du (杜亚军)
Discipline code: 081202
Degree-granting institution: Xihua University (西华大学)
Thesis submission date: 2010-05-01

Abstract

Because user queries are short and semantically ambiguous, most search engines struggle to understand what a query means. This thesis addresses two questions. How can a topic search system accurately understand the information need a user submits while also holding semantic knowledge about the sources it searches? And how can it automatically return accurate, relevant results tailored to each user when different users enter the same query keywords, or when the same user enters different query keywords? Most search engines collect large volumes of user query logs. These logs record each user's past queries and clicks and reflect, to varying degrees, the user's interests and domain knowledge; the more records a user has, the more accurately that knowledge can be characterized. An ontology has a well-defined concept hierarchy, supports logical reasoning, and expresses semantics through the relations between concepts, so it can serve as a knowledge base for semantic retrieval and concept retrieval. Lexical databases such as WordNet contain a large number of relations that reflect expert knowledge, including synonymy, near-synonymy, is_a, and part_of. Combining rich query-log information with the semantic relations in WordNet to give topic search an ontology-structured semantic background therefore opens broad possibilities for a new generation of personalized topic information retrieval systems. It becomes particularly important to study the relations between users' query words and clicked web pages in the historical record, and to build a model of the semantic relations among query words that reflects each user's personal knowledge.
     The main contents of this thesis are as follows:
     First, we propose a new personalized semantic clustering method for query words, which groups a user's query words into topics according to the user's personal interests and background knowledge. Search engine query logs contain a wealth of historical access records, which reflect user interests and domain knowledge to some extent. We define three semantic similarity relations between query words based on these logs: similarity of the query words themselves, similarity of the users' click sequences, and similarity of the content of the clicked documents. From an analysis of these three relations we derive a new measure of semantic similarity between user query words and, from it, a cluster similarity function. A hierarchical agglomerative clustering algorithm then groups the query words into semantic topics according to the topics reflected in the query logs, which largely removes their semantic ambiguity.
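The clustering step described above can be pictured with a minimal Python sketch. The abstract does not give the actual similarity formulas, so everything concrete here is an assumption made for illustration: Jaccard overlap for query terms and for clicked URLs (treated as sets rather than ordered sequences), cosine similarity over bag-of-words for clicked content, the weights `(0.2, 0.4, 0.4)`, average linkage, the stopping threshold `0.3`, and the toy log records. The thesis experiments used C++; Python is used here only for brevity.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def query_similarity(q1, q2, weights=(0.2, 0.4, 0.4)):
    """Weighted sum of the three relations (weights are illustrative only)."""
    w_term, w_click, w_content = weights
    return (w_term * jaccard(q1["text"].lower().split(), q2["text"].lower().split())
            + w_click * jaccard(q1["clicks"], q2["clicks"])
            + w_content * cosine(q1["content"], q2["content"]))


def cluster_queries(queries, threshold=0.3):
    """Average-linkage agglomerative clustering; stops when the best
    pair of clusters is less similar than `threshold`."""
    clusters = [[i] for i in range(len(queries))]

    def linkage(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(query_similarity(queries[i], queries[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        (a, b), best = max(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return [[queries[i]["text"] for i in c] for c in clusters]


# Hypothetical log records for one user: query text, clicked URLs, clicked page text.
log = [
    {"text": "jaguar speed", "clicks": ["zoo.example/jaguar"],
     "content": "jaguar big cat runs fast"},
    {"text": "big cat habitat", "clicks": ["zoo.example/jaguar"],
     "content": "big cat jaguar rainforest habitat"},
    {"text": "jaguar price", "clicks": ["cars.example/jaguar"],
     "content": "jaguar car dealer price list"},
]
print(cluster_queries(log))
```

With these toy records the two animal-related queries end up in one cluster and "jaguar price" in another, even though "jaguar speed" and "jaguar price" share a word: the click and content signals outweigh the surface overlap, which is the kind of disambiguation the paragraph describes.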
     Second, we propose a method that uses the topic clusters of query words together with the word-to-word relations in WordNet to build a model of a user's domain knowledge about the topics they are interested in: the User-Query Semantic Ontology (UQSO). A UQSO describes the domain in which a user's interests lie and forms the basis for personalized topic retrieval. Because the ontology expresses a user's interest preferences, it could later be used to derive user groups and group preferences. Applied to a topic search engine, it would lift information gathering from keyword-based relevance matching to semantic-level search, so that the engine can retrieve information better suited to the user's underlying intent and thereby achieve personalized topic search.
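As a rough illustration of how topic clusters and is_a links might be combined into an ontology, the sketch below stores the result as a set of (subject, relation, object) triples. It is not the thesis's construction procedure, which the abstract does not detail. The `TOY_IS_A` table is a hand-written stand-in for WordNet hypernym links, and the relation names `interested_in` and `belongs_to`, the topic labels, and `user_1` are all hypothetical.

```python
# Toy stand-in for WordNet is_a (hypernym) links; a real run would query WordNet.
TOY_IS_A = {
    "jaguar": "big_cat",
    "leopard": "big_cat",
    "big_cat": "feline",
    "feline": "animal",
    "sedan": "car",
    "car": "vehicle",
}


def hypernym_path(term, is_a):
    """Return [term, parent, ..., root] by following is_a links.
    Assumes the is_a table contains no cycles."""
    path = [term]
    while path[-1] in is_a:
        path.append(is_a[path[-1]])
    return path


def build_ontology(user, clusters, is_a):
    """Collect (subject, relation, object) triples for one user's topic clusters."""
    triples = set()
    for topic, terms in clusters.items():
        triples.add((user, "interested_in", topic))
        for term in terms:
            triples.add((term, "belongs_to", topic))
            path = hypernym_path(term, is_a)
            # Each consecutive pair on the path is one is_a edge.
            triples.update((child, "is_a", parent)
                           for child, parent in zip(path, path[1:]))
    return triples


# Hypothetical output of the clustering step: topic label -> query terms.
clusters = {"wildlife": ["jaguar", "leopard"], "autos": ["sedan"]}
ontology = build_ontology("user_1", clusters, TOY_IS_A)
for triple in sorted(ontology):
    print(triple)
```

Triples are a convenient neutral form for a sketch like this because they map directly onto the class and property assertions an ontology editor such as Protégé works with, but the choice of representation here is an assumption.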
     Finally, we validate the approach experimentally using the Protégé-2000 ontology construction tool and C++ (VC++): we cluster one user's set of query words and, with the help of WordNet, build that user's User-Query Semantic Ontology (UQSO). The experiment shows that, with this construction method, the true meaning of a user's query words can be better distinguished according to the user's interests and background knowledge, removing their semantic ambiguity. UQSO therefore lays a foundation for personalized topic search.

