Abstract
To address the shortcomings of existing spam-content detection algorithms, this paper studies a detection method with both a high recognition rate and high accuracy. Two commonly used techniques, the Aho-Corasick (AC) automaton and the Bayesian method, are analyzed and their limitations identified, and a spam-content detection method that combines the two is proposed. First, the AC automaton marks the keywords in the text that appear in a predefined, category-specific keyword library. The strategy obtained by training the Bayesian method then screens those keywords a second time to decide whether the content is spam. Combining the AC automaton with the Bayesian method keeps keyword matching efficient while minimizing false positives, which improves the user experience.
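The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `AhoCorasick`, `NaiveBayes`, and `is_spam`, the 0.5 threshold, the whitespace tokenization, and the toy training sentences are all assumptions made for illustration. The second stage here scores the tokens of any message that contains a keyword, which is one plausible reading of the "second screening"; a Chinese-language system would also need a word segmenter in place of `split()`.

```python
import math
from collections import Counter, deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton over characters (illustrative)."""

    def __init__(self, keywords):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # keywords recognized at each state
        for word in keywords:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(word)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit keywords that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, text):
        """Return every (start_index, keyword) occurrence in text."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                hits.append((i - len(word) + 1, word))
        return hits


class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing (illustrative)."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, tokens, label):
        self.token_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def spam_probability(self, tokens):
        # Assumes at least one training document per class.
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_score = {}
        for label, counts in self.token_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total + len(vocab)))
            log_score[label] = score
        # Turn the two log scores into P(spam | tokens).
        return 1.0 / (1.0 + math.exp(log_score["ham"] - log_score["spam"]))


def is_spam(text, automaton, classifier, threshold=0.5):
    """Stage 1: keyword match. Stage 2: Bayesian check on matched messages."""
    text = text.lower()
    if not automaton.find(text):
        return False  # no keyword from the library, so the message passes
    return classifier.spam_probability(text.split()) >= threshold


# Toy keyword library and training data, invented for this sketch.
ac = AhoCorasick(["free", "winner"])
nb = NaiveBayes()
for sentence, label in [
    ("free prize click now", "spam"),
    ("winner claim free cash", "spam"),
    ("feel free to call me", "ham"),
    ("lunch is free today", "ham"),
]:
    nb.train(sentence.split(), label)

for message in ["claim your free prize now", "feel free to call me today"]:
    print(message, "->", is_spam(message, ac, nb))
```

Both demo messages contain the keyword "free", so keyword matching alone would flag both. The Bayesian stage is what lets the harmless second message through, which is the reduction in false positives that the abstract describes.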