Research on Session Segmentation of Web Search Query Logs Based on Naive Bayes (基于朴素贝叶斯的网络查询日志session划分方法研究)
  • English title: Research on session segmentation of web search query logs based on naive Bayes
  • Authors: 孙玫 (Sun Mei); 张森 (Zhang Sen); 聂培尧 (Nie Peiyao); 聂秀山 (Nie Xiushan)
  • Keywords: web search; session segmentation; naive Bayes; time interval; query item semantics
  • Journal name (Chinese): NJDZ
  • Journal name (English): Journal of Nanjing University (Natural Science)
  • Institutions: School of Public Finance and Taxation, Shandong University of Finance and Economics; School of Computer Science and Technology, Shandong University of Finance and Economics; School of Information and Intelligence Engineering, Sanya University
  • Publication date: 2018-11-30
  • Publisher: Journal of Nanjing University (Natural Science)
  • Year: 2018
  • Issue: v.54; No.243
  • Funding: Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH042)
  • Language: Chinese
  • Page: NJDZ201806009
  • Number of pages: 9
  • CN:06
  • ISSN:32-1169/N
  • Classification number: 82-90
Abstract
With the rapid development of the Internet, web search query log analysis has become key to improving search engine performance and analyzing user search behavior, and session segmentation is an important step in that analysis. A session in a web search query log is a group of search queries that a user issues with the same or similar intent within a period of time; it is also the basic unit commonly used when processing query logs. The commonly used session segmentation methods rely mainly on the time interval between query items: query items that fall within a certain period of time are treated as one session. This approach is simple to implement, but its accuracy is low, and it cannot meet the needs of applications that require highly accurate session segmentation. This paper therefore proposes a new method: session segmentation of web search query logs based on naive Bayes. The method recasts session segmentation as the problem of judging whether a query item is a session boundary. It analyzes three important factors that affect session segmentation: the time interval between query items, the semantics of query items, and the words added or removed between adjacent query items. To compute the semantic similarity of query items, the word-vector representation from deep learning is adopted and a Query2Vector model is proposed, in which query items are represented as vectors and their similarities are calculated. A naive Bayes classifier then decides whether each query item is a session boundary. Finally, a series of experiments verifies the effectiveness of the method; the results show that it is more precise and credible than other common methods.
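The pipeline the abstract outlines can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: averaged word vectors stand in for the Query2Vector model (whose details the abstract does not give), queries are assumed to be whitespace-tokenized, and the 30-minute gap, the 0.5 similarity cutoff, the binarized features, and all function and class names are hypothetical choices made for the example.

```python
import math

def query_vector(query, word_vectors):
    """Average the vectors of the query's known words (a stand-in for Query2Vector)."""
    vecs = [word_vectors[w] for w in query.split() if w in word_vectors]
    if not vecs:
        return []  # no known words: similarity to anything will be 0.0
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is empty or all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def boundary_features(prev, curr, word_vectors, max_gap=1800, min_sim=0.5):
    """Three binary features for a pair of adjacent (timestamp_seconds, query) items.

    max_gap and min_sim are assumed thresholds, not values from the paper.
    """
    (t0, q0), (t1, q1) = prev, curr
    long_gap = (t1 - t0) > max_gap
    sim = cosine(query_vector(q0, word_vectors), query_vector(q1, word_vectors))
    dissimilar = sim < min_sim
    w0, w1 = set(q0.split()), set(q1.split())
    # Words were only added or only removed between the two queries.
    reworded = w0 != w1 and (w0 <= w1 or w1 <= w0)
    return (long_gap, dissimilar, reworded)

class BinaryNaiveBayes:
    """Naive Bayes over binary features with add-one (Laplace) smoothing."""

    def fit(self, X, y):
        self.n = len(y)
        self.class_count = {}
        self.counts = {}  # (label, feature index, feature value) -> count
        for x, label in zip(X, y):
            self.class_count[label] = self.class_count.get(label, 0) + 1
            for i, value in enumerate(x):
                key = (label, i, value)
                self.counts[key] = self.counts.get(key, 0) + 1
        return self

    def predict(self, x):
        def log_posterior(label):
            total = self.class_count[label]
            lp = math.log(total / self.n)  # log prior
            for i, value in enumerate(x):
                # Each feature has two possible values, hence "+ 2".
                lp += math.log((self.counts.get((label, i, value), 0) + 1) / (total + 2))
            return lp
        return max(self.class_count, key=log_posterior)
```

To use it, compute `boundary_features` for every pair of adjacent query items in a hand-labeled log, fit `BinaryNaiveBayes` on the resulting tuples with `True` marking a session boundary, and call `predict` on the features of each new pair. Chinese queries would first need a word segmenter in place of `str.split`.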
