Web page search ranking algorithm using automatic classification

Original title (Chinese): 一种自动分类的网页搜索排序算法
Authors: Liu Mingyu (刘铭瑀); Liu Xueliang (刘学亮); Hu Jun (胡骏)
Keywords: domain vector; BM25; softmax regression classification; Web page ranking
Journal (Chinese): JSYJ
Journal (English): Application Research of Computers
Affiliation: School of Computer & Information, Hefei University of Technology
Publication date: 2018-02-08 17:15
Publisher: Application Research of Computers (计算机应用研究)
Year: 2019
Issue: v.36; No.327
Funding: National Natural Science Foundation of China (61472116, 61502139); Natural Science Foundation of Anhui Province (1608085MF128)
Language: Chinese
Page: JSYJ201901020
Number of pages: 4
CN: 01
ISSN: 51-1196/TP
Classification number: 93-96

Abstract

The traditional Web page ranking algorithm Okapi BM25 often suffers from domain drift, in which retrieved pages are unrelated to the domain of the query keywords, and existing improved algorithms require a manually built domain vector. To address both problems, this paper proposes a Web page search ranking algorithm based on BM25 and a softmax regression classification model. The method preprocesses the Web page text and represents it as vectors with the bag-of-words model. It then trains the softmax regression classifier on a small amount of Web page data to predict category scores for the test pages, and combines these category scores with the BM25 retrieval scores to produce the final ranking. Experimental results show that the algorithm achieves good ranking results without a manually built domain vector.
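
The pipeline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two scores are merged, so the linear interpolation with weight `alpha`, the max-normalization of the BM25 scores, the non-negative IDF variant, the toy training pages, and every function name below are hypothetical choices made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of each tokenized document for the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf, s = Counter(d), 0.0
        for t in query:
            if tf[t]:
                # Assumption: log(1 + ...) IDF variant, which never goes negative.
                idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
                s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def bow(tokens, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def predict_proba(W, x):
    xb = x + [1.0]  # append bias input
    return softmax([sum(w * v for w, v in zip(row, xb)) for row in W])

def train_softmax(X, y, n_classes, lr=0.5, epochs=200):
    """Softmax regression fitted by batch gradient descent on cross-entropy."""
    dim = len(X[0]) + 1
    W = [[0.0] * dim for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * dim for _ in range(n_classes)]
        for x, label in zip(X, y):
            p = predict_proba(W, x)
            for c in range(n_classes):
                err = p[c] - (1.0 if c == label else 0.0)
                for j, v in enumerate(x + [1.0]):
                    grad[c][j] += err * v
        for c in range(n_classes):
            for j in range(dim):
                W[c][j] -= lr * grad[c][j] / len(X)
    return W

def rank(query, pages, W, vocab, target_class, alpha=0.5):
    """Assumed combination: alpha * normalized BM25 + (1 - alpha) * class probability."""
    docs = [tokenize(p) for p in pages]
    bm = bm25_scores(tokenize(query), docs)
    top = max(bm) or 1.0
    scored = [(alpha * bm[i] / top
               + (1 - alpha) * predict_proba(W, bow(d, vocab))[target_class], i)
              for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, reverse=True)]

# Toy labeled pages (hypothetical): class 0 = technology, class 1 = food.
train = [("apple iphone laptop software chip", 0),
         ("software update phone chip release", 0),
         ("apple orchard fruit juice harvest", 1),
         ("fruit pie recipe sugar juice", 1)]
vocab = sorted({t for text, _ in train for t in tokenize(text)})
W = train_softmax([bow(tokenize(t), vocab) for t, _ in train],
                  [label for _, label in train], n_classes=2)

pages = ["apple fruit juice recipe apple",  # food page, mentions "apple" twice
         "apple phone software release",    # technology page
         "orchard harvest sugar"]           # does not contain the query term
bm = bm25_scores(tokenize("apple"), [tokenize(p) for p in pages])
ranking = rank("apple", pages, W, vocab, target_class=0)
print("BM25 only:", sorted(range(len(pages)), key=lambda i: -bm[i]))
print("Combined :", ranking)
```

On this toy data BM25 alone places the food page first because it repeats the query term, while the combined score moves the technology page to the top when the technology class is the target. That is the kind of domain drift correction the abstract describes, here without any hand-built domain vector.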

