Abstract
In short texts written by users, sentences with the same meaning can be phrased in many different ways, and they often contain information unrelated to that meaning; such information is called irrelevant words. Because common methods identify irrelevant words with limited accuracy, this paper proposes a method that uses a second-order hidden Markov model (HMM) to automatically recognize and label irrelevant words in user short texts. A standard HMM considers only the previous word as a feature when labeling a corpus, which leads to poor results; to address this, the proposed method extends the model during modeling with the word itself, its part of speech, and its relative position as features. Experimental results show that the method avoids the need to hand-write rules for the training texts and improves both accuracy and recall to a certain extent.
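As a rough illustration of the kind of model described above, the textbook decoding objective of a second-order HMM tagger can be written as follows. This is a sketch under assumptions, not the paper's stated formulation: the symbols y_i (label of the i-th token, for example irrelevant or not), w_i (the word), p_i (its part of speech), r_i (its relative position), and the choice to bundle the three features into a single observation x_i are all introduced here for illustration; y_{-1} and y_0 stand for start-of-sentence padding labels.

```latex
% Assumed notation: y_i = label, x_i = (w_i, p_i, r_i) = composite observation
\[
\hat{y}_{1:n}
  = \arg\max_{y_{1:n}} \prod_{i=1}^{n}
    P(y_i \mid y_{i-2},\, y_{i-1}) \; P(x_i \mid y_i),
\qquad x_i = (w_i,\, p_i,\, r_i)
\]
```

Compared with a first-order model, the transition term here conditions on the two preceding labels rather than one, and the emission term can draw on more than the word alone.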