Research on Employment Vertical Search Engine Based on Nutch (基于Nutch的就业垂直搜索引擎研究)
  • English title: Research on Employment Vertical Search Engine Based on Nutch
  • Authors: 肖红玉; 贺辉; 黄灼东; 蔡昭阳
  • Authors (English): XIAO Hong-yu; HE Hui; HUANG Zhuo-dong; CAI Zhao-yang; School of Information Technology, Beijing Normal University
  • Keywords: vertical search engine; LinkRank algorithm; employment; Nutch
  • Journal code: WJFZ
  • Journal (English name): Computer Technology and Development
  • Institution: School of Information Technology, Beijing Normal University, Zhuhai
  • Publication date: 2018-11-15 15:35
  • Publisher: Computer Technology and Development (计算机技术与发展)
  • Year: 2019
  • Issue: v.29; No.262
  • Funding: Natural Science Foundation of Guangdong Province, Doctoral Start-up Project (2014A030310415); Guangdong Education Research Project (GDJY-2015-C-b048)
  • Language: Chinese
  • Page: WJFZ201902043
  • Number of pages: 5
  • CN: 02
  • ISSN: 61-1450/TP
  • Classification number: 213-217
Abstract
General-purpose search engines lack domain specialization and have low precision. To address this, we build an employment vertical search engine on Nutch, an open-source search engine. Chinese word segmentation uses a forward-iteration, finest-grained segmentation algorithm backed by a local lexicon and a dynamically loaded lexicon. Topic relevance to the employment domain is determined with a vector space model built on feature words and metadata tags. The LinkRank ranking algorithm is improved on MapReduce by introducing an in-link weight factor and a time decay factor. Employment domain ontology information is applied in web page crawling and filtering, employment information search, and feature word recommendation. The user query interface is redeveloped with a Java framework (Spring MVC) and offers extended query functions such as intelligent keyword suggestions, customized crawlers, search within results, date filtering of query results, and subscription queries. Experimental results show that the resulting Nutch-based employment vertical search engine achieves high precision and can meet users' needs for specialized retrieval.
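The segmentation and relevance steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicon, the feature words, their weights, and the function names `segment_finest` and `cosine_relevance` are all hypothetical, the segmenter is a simplified forward dictionary scan that emits every lexicon match, and the page vector is restricted to the feature-word dimensions only.

```python
import math
from collections import Counter

# Hypothetical employment-domain feature words with illustrative weights.
FEATURE_WEIGHTS = {"招聘": 1.0, "就业": 1.0, "岗位": 0.8, "简历": 0.6, "薪资": 0.6}


def segment_finest(text, lexicon, max_len=4):
    """Simplified forward scan: at each position, emit every lexicon word
    that starts there (an approximation of finest-grained segmentation)."""
    tokens = []
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
    return tokens


def cosine_relevance(tokens, feature_weights):
    """Cosine similarity between the page's feature-word counts and the
    topic's feature-word weights; 0.0 if the page has no feature words."""
    tf = Counter(t for t in tokens if t in feature_weights)
    if not tf:
        return 0.0
    dot = sum(count * feature_weights[word] for word, count in tf.items())
    norm_page = math.sqrt(sum(count * count for count in tf.values()))
    norm_topic = math.sqrt(sum(w * w for w in feature_weights.values()))
    return dot / (norm_page * norm_topic)


if __name__ == "__main__":
    lexicon = set(FEATURE_WEIGHTS) | {"高校", "毕业", "毕业生"}
    tokens = segment_finest("高校毕业生就业招聘岗位", lexicon)
    print(tokens, cosine_relevance(tokens, FEATURE_WEIGHTS))
```

A crawler built along these lines would keep a page only when its score exceeds some threshold; the abstract does not state the threshold or the weighting scheme, so both would have to be chosen separately.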