Research on Automatic Recognition Method About the Irrelevant Words in User-oriented Short Text

Original title (Chinese): 用户短文本无关语自动识别方法研究
Authors: CHEN Guo (陈国); LIU Liangliang (刘亮亮); ZHANG Zaiyue (张再跃)
Keywords: short text; irrelevant words; hidden Markov model (HMM); machine learning
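Journal code: JSSG
Journal: Computer & Digital Engineering
Affiliations: College of Computer Science and Engineering, Jiangsu University of Science and Technology; College of Statistics and Information, Shanghai University of International Business and Economics
Publication date: 2019-07-20
Publisher: Computer & Digital Engineering (计算机与数字工程)
Year: 2019
Issue: v.47; No.357
Funding: National Natural Science Foundation of China (grant nos. 61371114, 611170165); project of the Jiangsu Universities High-Tech Ship Collaborative Innovation Center / Marine Equipment Research Institute of Jiangsu University of Science and Technology (grant no. 1174871701-9)
Language: Chinese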
Record ID: JSSG201907037
Page count: 5
Issue number: 07
CN: 42-1372/TP
Pages: 189-193

Abstract

In user short text, sentences with the same meaning can be phrased in many ways, and such sentences often contain a large amount of information unrelated to their meaning; this information is called irrelevant words. To address the low accuracy of common irrelevant-word recognition methods, the paper proposes a method that automatically identifies irrelevant words in user short text with a second-order hidden Markov model (HMM). A standard HMM considers only the previous word as a feature when labeling a corpus, which leads to poor results, so the method extends the model during construction with three features: the word itself, its part of speech, and its relative position in the sentence. Experimental results show that the method avoids the need to hand-write rules for the training text and improves both accuracy and recall to some degree.
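To make the idea in the abstract concrete, the following is a minimal sketch of second-order HMM tagging, not the authors' implementation. It rests on several assumptions that do not come from the paper: the two-state tag set (`R` for relevant, `I` for irrelevant), the three-way `head`/`mid`/`tail` position bucket, the factorization of the emission probability into independent word, part-of-speech, and position terms, add-one smoothing, and the invented English toy corpus `TOY_CORPUS`.

```python
import math
from collections import defaultdict

STATES = ("R", "I")  # assumed tag set: R = relevant, I = irrelevant
START = "<s>"        # padding state used before the first token

def featurize(tokens):
    """Turn (word, pos) pairs into (word, pos, position bucket) observations."""
    n = len(tokens)
    obs = []
    for i, (word, pos) in enumerate(tokens):
        bucket = "head" if 3 * i < n else ("tail" if 3 * i >= 2 * n else "mid")
        obs.append((word, pos, bucket))
    return obs

def train(corpus):
    """Count second-order transitions and per-feature emissions."""
    trans = defaultdict(lambda: defaultdict(int))  # (s[t-2], s[t-1]) -> s[t] -> count
    emit = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    vocab = [set(), set(), set()]
    for tokens, labels in corpus:
        prev2, prev1 = START, START
        for ob, lab in zip(featurize(tokens), labels):
            trans[(prev2, prev1)][lab] += 1
            for k, value in enumerate(ob):
                emit[k][lab][value] += 1
                vocab[k].add(value)
            prev2, prev1 = prev1, lab
    return trans, emit, vocab

def log_trans(trans, context, state):
    """Add-one smoothed log P(state | previous two states)."""
    counts = trans.get(context, {})
    return math.log((counts.get(state, 0) + 1) / (sum(counts.values()) + len(STATES)))

def log_emit(emit, vocab, state, ob):
    """Sum of smoothed per-feature log emissions (assumed independent)."""
    total = 0.0
    for k, value in enumerate(ob):
        counts = emit[k].get(state, {})
        denom = sum(counts.values()) + len(vocab[k]) + 1
        total += math.log((counts.get(value, 0) + 1) / denom)
    return total

def viterbi(model, tokens):
    """Viterbi decoding over state pairs, i.e. a second-order HMM."""
    trans, emit, vocab = model
    obs = featurize(tokens)
    if not obs:
        return []
    best = {(START, s): log_trans(trans, (START, START), s)
            + log_emit(emit, vocab, s, obs[0]) for s in STATES}
    back = []
    for ob in obs[1:]:
        new_best, pointers = {}, {}
        for (u, v), score in best.items():
            for s in STATES:
                cand = score + log_trans(trans, (u, v), s) + log_emit(emit, vocab, s, ob)
                if cand > new_best.get((v, s), -math.inf):
                    new_best[(v, s)] = cand
                    pointers[(v, s)] = u  # remember the state two steps back
        best = new_best
        back.append(pointers)
    pair = max(best, key=best.get)  # (second-to-last state, last state)
    tags = [pair[1]]
    for pointers in reversed(back):
        tags.append(pair[0])
        pair = (pointers[pair], pair[0])
    return tags[::-1]

# Invented toy data: (word, part-of-speech) pairs with one label per token.
TOY_CORPUS = [
    ([("please", "v"), ("tell", "v"), ("me", "r"),
      ("weather", "n"), ("in", "p"), ("Beijing", "ns")], "IIIRRR"),
    ([("could", "v"), ("you", "r"), ("check", "v"),
      ("train", "n"), ("to", "p"), ("Shanghai", "ns")], "IIIRRR"),
    ([("weather", "n"), ("in", "p"), ("Shanghai", "ns"), ("thanks", "v")], "RRRI"),
    ([("train", "n"), ("to", "p"), ("Beijing", "ns"), ("please", "v")], "RRRI"),
]

model = train(TOY_CORPUS)
query = [("please", "v"), ("check", "v"), ("weather", "n"), ("in", "p"), ("Shanghai", "ns")]
print(viterbi(model, query))  # ['I', 'I', 'R', 'R', 'R']
```

Decoding over pairs of states is what makes the model second-order: each transition is conditioned on the two preceding tags rather than one. The composite observation is one simple way to let the word, its part of speech, and its relative position all influence the tag; the paper may combine these features differently.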
