Research on Document Re-ranking Methods Based on Cloud Model Theory
Abstract
In recent years, computers and Internet technology have spread and developed at an unprecedented pace in China's drive toward informatization, and the volume of information has grown steadily as a result. Faced with this ever-expanding mass of information, improving retrieval efficiency and the user's search experience has become a major challenge for information retrieval (IR).
     This thesis first introduces the concept of document re-ranking and the state of research on it. An analysis of the two main families of re-ranking methods, statistics-based and semantics-based, shows that both neglect the uncertainty inherent in natural language. The thesis therefore draws on cloud model theory and studies document re-ranking in information retrieval from the perspective of discovering uncertain knowledge.
     By discovering uncertain knowledge at the query-term level, the thesis proposes a document re-ranking method based on the cloud model. The method obtains the distribution of the query keywords within each document, applies the cloud model's qualitative-quantitative conversion to that distribution to obtain the uncertainty with which the document represents the query, and re-ranks the documents accordingly. By further discovering uncertain knowledge at the query-statement level, the thesis proposes a second re-ranking method based on concept elevation. Starting from the term-level uncertainty, this method applies a cloud synthesis algorithm to elevate the query terms to a higher-level concept, which yields the uncertainty with which the document represents the query at the query-statement level; the documents are then re-ranked using the uncertainty from both levels.
     The thesis also designs and implements an information retrieval system based on cloud model theory. Taking the results of an initial retrieval as input, the system uses the three numerical characteristics of the cloud model to compute, at both the query-term and query-statement levels, the degree of uncertainty with which each document represents the query. It re-ranks the documents in ascending order of this uncertainty and returns the re-ranked list to the user.
     Comparative experiments were run on the NTCIR-5 information retrieval test collection and evaluated according to TREC evaluation standards. The results show that the proposed methods yield improvements under both the relax and rigid assessment criteria, and perform particularly well under the rigid criterion.
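     The qualitative-quantitative conversion described above can be illustrated with a small sketch. It assumes the three numerical characteristics are the usual cloud model triple (expectation Ex, entropy En, hyper-entropy He) and estimates them with the standard certainty-free backward cloud generator. The input format (normalized keyword positions per document), the function names, and the score En + He used for ordering are all hypothetical choices made for this illustration; the abstract does not say how the thesis combines the characteristics into an uncertainty value.

```python
import math


def backward_cloud(samples):
    """Estimate (Ex, En, He) from numeric samples.

    Uses the certainty-free backward cloud generator:
    Ex = sample mean, En = sqrt(pi/2) * mean absolute deviation,
    He = sqrt(|sample variance - En^2|).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    ex = sum(samples) / n
    en = math.sqrt(math.pi / 2) * sum(abs(x - ex) for x in samples) / n
    variance = sum((x - ex) ** 2 for x in samples) / (n - 1)
    he = math.sqrt(abs(variance - en ** 2))
    return ex, en, he


def rerank(docs):
    """Order document ids by ascending uncertainty.

    `docs` maps a document id to the normalized positions (0..1) at which
    the query keywords occur in that document (hypothetical input format).
    The score En + He is an assumption for illustration only.
    """
    def uncertainty(positions):
        _, en, he = backward_cloud(positions)
        return en + he

    return sorted(docs, key=lambda doc_id: uncertainty(docs[doc_id]))


if __name__ == "__main__":
    first_pass = {
        "doc_a": [0.10, 0.50, 0.90],  # keywords scattered across the text
        "doc_b": [0.48, 0.50, 0.52],  # keywords tightly clustered
    }
    print(rerank(first_pass))
```

     In this toy input the tightly clustered document receives the lower score and is ranked first, which matches the ascending-uncertainty ordering mentioned in the abstract; a real system would need the thesis's own definition of the uncertainty degree.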
