Research on Morph Normalization Based on Joint Learning of Character and Word (基于字词联合的变体词规范化研究)
  • English title: Research on Morph Normalization Based on Joint Learning of Character and Word
  • Authors: SHI Zhen-Hui (施振辉); SHA Ying (沙灜); LIANG Qi (梁棋); LI Rui (李锐); QIU Yong-Qin (邱泳钦); WANG Bin (王斌)
  • Keywords: morph; morph normalization; social network; word embedding; joint character-word training
  • Journal code: XTYY
  • Journal (English name): Computer Systems & Applications
  • Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences
  • Publication date: 2017-10-15
  • Publisher: 计算机系统应用 (Computer Systems & Applications)
  • Year: 2017
  • Volume: 26
  • Funding: National Key Research and Development Program of China (2016YFB0801003); Young Scientists Fund project (61402466)
  • Language: Chinese
  • File ID: XTYY201710005
  • Number of pages: 7
  • Issue: 10
  • CN: 11-2854/TP
  • Pages: 31-37
Abstract
Text in social networks is informal and irregular, and one common phenomenon is the large number of morphs it contains. To evade censorship or to express strong sentiment, people often replace a word with a morph; the original word is called the target word. This paper studies morph normalization, that is, finding the original target word that a morph stands for. Using the time and the semantics of the texts in which a morph appears, together with the morph's part of speech, we propose a method that combines temporal and semantic constraints to collect candidate target words. We then propose a word-embedding method based on joint character-word training to rank the candidates. Our method requires no additional annotated data. Experimental results show that it achieves some improvement in accuracy over the current state-of-the-art method, and that it performs better on morphs that share a character with their target words.
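The candidate-ranking step described in the abstract can be illustrated with a minimal sketch. It assumes a simple composition in which a word's representation is the average of its word vector and the mean of its character vectors, and it ranks candidates by cosine similarity to the morph. The toy vectors, the placeholder "words" (`ab`, `ac`, `xy`) and the helpers `compose`, `cosine` and `rank_candidates` are hypothetical and are not taken from the paper, whose actual model and training procedure may differ.

```python
import math

# Hypothetical toy embeddings; real ones would be trained on a corpus.
# Each "word" is a string whose letters stand in for Chinese characters.
WORD_VECS = {"ab": [1.0, 0.0], "ac": [0.6, 0.8], "xy": [0.8, 0.6]}
CHAR_VECS = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0],
             "x": [0.0, 1.0], "y": [0.0, 1.0]}


def compose(word, use_chars=True):
    """Return the word vector, optionally averaged with the mean of its character vectors."""
    w = WORD_VECS[word]
    if not use_chars:
        return list(w)
    chars = [CHAR_VECS[ch] for ch in word]
    char_mean = [sum(col) / len(chars) for col in zip(*chars)]
    return [0.5 * (wi + ci) for wi, ci in zip(w, char_mean)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(morph, candidates, use_chars=True):
    """Sort candidate target words by similarity to the morph, best first."""
    m = compose(morph, use_chars)
    scored = [(c, cosine(m, compose(c, use_chars))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for flag in (False, True):
        print("use_chars =", flag, rank_candidates("ab", ["ac", "xy"], use_chars=flag))
```

With these toy values, the word-only ranking places `xy` first, while the joint ranking places `ac` first, the candidate that shares the character `a` with the morph. This only mirrors qualitatively the abstract's observation about morphs that share characters with their targets.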
References
1 Huang HZ, Wen Z, Yu D, et al. Resolving entity morphs in censored data. Proc. of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria. 2013. 1083-1093.
2 Zhang BL, Huang HZ, Pan XM, et al. Context-aware entity morph decoding. Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics. Beijing, China. 2015. 586-595.
3 Zhang BL, Huang HZ, Pan XM, et al. Be appropriate and funny: Automatic entity morph encoding. Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics (Short Papers). Baltimore, Maryland, USA. 2014. 706-711.
4 Sha Y (沙灜), Liang Q (梁棋), Wang B (王斌). A survey of the recognition and normalization of Chinese morphs (in Chinese). 信息安全学报 (Journal of Cyber Security), 2016, 1(3): 77-87.
5 Wong KF, Xia Y. Normalization of Chinese chat language. Language Resources and Evaluation, 2008, 42: 219-242. [doi:10.1007/s10579-008-9067-7]
6 Xia YQ, Wong KF, Li WJ. A phonetic-based approach to Chinese chat text normalization. Proc. of the 21st International Conf. on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia. 2006. 993-1000.
7 Chen R (陈儒), Zhang Y (张宇), Liu T (刘挺). Research on filtering techniques for variations of specific Chinese information (in Chinese). 高技术通讯 (Chinese High Technology Letters), 2005, 15(9): 7-12.
8 Sood SO, Antin J, Churchill EF. Using crowdsourcing to improve profanity detection. AAAI Spring Symposium Series. 2012. 69-74.
9 Yoon T, Park SY, Cho HG. A smart filtering system for newly coined profanities by using approximate string alignment. Proc. of 2010 IEEE 10th International Conference on Computer and Information Technology (CIT). Bradford, UK. 2010. 643-650.
10 Wang A, Kan MY, Andrade D, et al. Chinese informal word normalization: An experimental study. Proc. of the 6th International Joint Conference on Natural Language Processing. Nagoya, Japan. 2013.
11 Wang AB, Kan MY. Mining informal language from Chinese microtext: Joint word recognition and segmentation. Proc. of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria. 2013. 731-741.
12 Choudhury M, Saraf R, Jain V, et al. Investigation and modeling of the structure of texting language. International Journal of Document Analysis and Recognition, 2007, 10(3-4): 157-174. [doi:10.1007/s10032-007-0054-0]
13 Han B, Cook P, Baldwin T. Automatically constructing a normalisation dictionary for microblogs. Proc. of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Jeju Island, Korea. 2012. 421-432.
14 Han B, Baldwin T. Lexical normalisation of short text messages: Makn sens a #twitter. Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon. 2011, 1: 368-378.
15 Li ZF, Yarowsky D. Mining and modeling relations between formal and informal Chinese phrases from web corpora. Proc. of the Conference on Empirical Methods in Natural Language Processing. Honolulu, Hawaii. 2008. 1031-1040.
16 Chen XX, Xu L, Liu ZY, et al. Joint learning of character and word embeddings. Proc. of the 24th International Conference on Artificial Intelligence. Buenos Aires, Argentina. 2015. 1236-1242.
17 Lai SW (来斯惟). Research on neural-network-based semantic vector representations of words and documents [Ph.D. thesis] (in Chinese). Beijing: Institute of Automation, Chinese Academy of Sciences, 2016. 1.
18 List of Internet slang in mainland China (中国大陆网络语言列表). https://zh.wikipedia.org/wiki/中国大陆网络语言列表. [2016-12].
