Abstract
Traditional resume information entity extraction methods generalize poorly and are hard to maintain. To tackle these problems, a resume entity extraction method based on a deep neural network is proposed. After preprocessing steps such as data cleaning and word segmentation, the unstructured resume text is represented as a word sequence. Each word is mapped to a low-dimensional real-valued vector through a word-embedding table trained with Word2Vec on a large-scale corpus in an unsupervised manner. A bidirectional LSTM layer fuses the context surrounding each word to be tagged and outputs scores over all possible tag sequences to a CRF layer, which introduces constraints between adjacent tags to solve for the optimal tag sequence. The model is trained with stochastic gradient descent, with dropout applied to prevent overfitting. Experimental results show that the proposed method improves both the parsing and tagging performance and the generalization ability.
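The pipeline described above (pre-trained embeddings, a bidirectional LSTM producing per-token tag scores, and a CRF layer whose transition matrix constrains adjacent tags) can be sketched as follows. This is a minimal illustration in PyTorch, not the paper's implementation: the layer sizes, dropout rate, and random embedding initialization are assumptions, and only CRF Viterbi decoding is shown (training would add the CRF forward-algorithm loss).

```python
import torch
import torch.nn as nn

class BiLSTMCRF(nn.Module):
    """Sketch of the abstract's pipeline: embedding -> BiLSTM -> tag scores,
    decoded with a CRF-style Viterbi search over a learned transition matrix."""
    def __init__(self, vocab_size, tagset_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        # In the paper the embedding table would be initialized from Word2Vec
        # vectors pre-trained on a large corpus; randomly initialized here.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)  # regularization, as in the abstract
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        # transitions[i, j]: score of moving from tag j to tag i;
        # this is where the CRF encodes constraints between adjacent tags.
        self.transitions = nn.Parameter(torch.randn(tagset_size, tagset_size))

    def emissions(self, word_ids):
        # word_ids: (batch, seq_len) -> tag scores (batch, seq_len, n_tags)
        out, _ = self.lstm(self.embed(word_ids))
        return self.hidden2tag(self.dropout(out))

    @torch.no_grad()
    def viterbi_decode(self, emission):
        # emission: (seq_len, n_tags) for one sentence; returns best tag path.
        seq_len, _ = emission.shape
        score = emission[0]          # best score ending in each tag so far
        backptr = []
        for t in range(1, seq_len):
            # total[cur, prev] = score[prev] + transition(prev -> cur)
            total = score.unsqueeze(0) + self.transitions
            backptr.append(total.argmax(dim=1))
            score = total.max(dim=1).values + emission[t]
        # follow back-pointers to recover the optimal tag sequence
        best_tag = int(score.argmax())
        path = [best_tag]
        for ptrs in reversed(backptr):
            best_tag = int(ptrs[best_tag])
            path.append(best_tag)
        return list(reversed(path))
```

At training time, the negative log-likelihood of the gold tag sequence under the CRF would be minimized with stochastic gradient descent, matching the procedure the abstract describes.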