Research on Network Information Filtering Model Based on Genetic Taboo Algorithm

Original title: 基于遗传禁忌算法的网络信息过滤模型研究
Author: 姜沛佩 (Jiang Peipei)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: network information filtering; genetic taboo algorithm; Naive Bayes; lexical analysis; text summarization
Degree year: 2011
Advisor: 刘培玉 (Liu Peiyu)
Discipline code: 081202
Degree-granting institution: 山东师范大学 (Shandong Normal University)
Thesis submission date: 2011-06-16

Abstract

With the development and widespread use of the Internet, online information is growing rapidly and is rich in content and varied in kind. The network is a double-edged sword, however: along with its convenience, it inevitably exposes people to a large amount of harmful information. In addition, because the network is inherently open, dynamic, and heterogeneous, users find it hard to obtain the information they need quickly and accurately, so automatically extracting information that meets a user's personalized needs from a dynamic information stream has become very important. Network information filtering technology emerged to address these problems. It extracts information according to user needs and blocks harmful information, and it mainly studies the acquisition and representation of network information, the construction of user templates, and the classification of incoming documents.
     This thesis covers every stage of network information filtering and, taking the precision and recall of the filtering model as its two guiding metrics, makes the following contributions:
     1. An in-depth study of network information filtering models and their key technologies
     The thesis discusses typical information filtering models and their related algorithms, focusing on the key technologies involved in network information filtering: network data acquisition, word segmentation, feature selection, weight calculation, text representation models, and classification algorithms.
     2. A network information filtering model based on a genetic taboo algorithm
     The thesis examines the basic principles and applications of genetic algorithms. Building on an analysis of their strengths, and to address their weak hill-climbing ability and tendency toward premature convergence, it introduces the taboo (tabu) search algorithm, which has stronger hill-climbing ability, to improve the crossover operator. The resulting taboo crossover operator improves the search capability of the traditional genetic algorithm. In the classification stage of the filtering model, the traditional Naive Bayes classifier cannot handle the problem of single-category words, so the thesis improves the classifier to make it more robust and adaptable.
     3. A text summarization method that applies word combination to sentence extraction
     A text usually contains many sentences, but some of them do not express its topic, and these redundant sentences lower the quality of the user template produced by genetic training. Text summarization, as an information compression tool, compresses the text, removes redundant sentences, and extracts the most concise content. To further improve template quality, the thesis uses text summarization to optimize the corpus. Because the low segmentation accuracy of the lexical analysis system causes semantic relations between feature terms to be lost during extraction, the thesis proposes formulating correction rules based on part of speech and using them to normalize the segmented sentences, so that semantically related words in a sentence are linked. The improved method extracts content that is more concise and more accurate.
     4. Design and implementation of a network information filtering model based on the genetic taboo algorithm
     The system first preprocesses the training corpus with the improved text summarization method, then trains on the texts with the genetic taboo algorithm to form an optimal user template, and finally classifies the texts under test with the improved classification algorithm. The result is a multi-level, multi-strategy, modular network information filtering system based on the genetic taboo algorithm. Testing shows that the system runs reliably, stably, and efficiently and filters network information effectively.
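The abstract says that tabu search is used to improve the crossover operator of the genetic algorithm, but it does not give the operator's details. The following is only a minimal sketch of one common way to combine the two, not the thesis's implementation. It assumes bit-string chromosomes, a toy fitness function (the number of 1-bits), a fixed-length tabu list of recently produced offspring, and an aspiration level equal to the mean fitness of the current population. All function names and parameters are hypothetical.

```python
import random
from collections import deque


def fitness(bits):
    """Toy objective (an assumption for illustration): count the 1-bits."""
    return sum(bits)


def one_point_crossover(a, b, rng):
    """Take the head of parent a and the tail of parent b at a random cut."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def tabu_crossover(a, b, tabu, aspiration, rng, tries=10):
    """Crossover that avoids recently produced offspring.

    A candidate is accepted if it is not in the tabu list, or if its
    fitness exceeds the aspiration level. If every try is rejected,
    the last candidate is used as a fallback.
    """
    child = None
    for _ in range(tries):
        child = one_point_crossover(a, b, rng)
        if tuple(child) not in tabu or fitness(child) > aspiration:
            break
    tabu.append(tuple(child))  # a deque with maxlen drops the oldest entry
    return child


def mutate(bits, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(n_bits=20, pop_size=20, generations=40, tabu_len=30, seed=0):
    """Run a small genetic algorithm that uses the tabu crossover above."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    tabu = deque(maxlen=tabu_len)
    for _ in range(generations):
        aspiration = sum(map(fitness, pop)) / pop_size  # mean fitness
        children = [max(pop, key=fitness)]  # elitism: keep the best as-is
        while len(children) < pop_size:
            # Binary tournament selection for each parent.
            a = max(rng.sample(pop, 2), key=fitness)
            b = max(rng.sample(pop, 2), key=fitness)
            child = tabu_crossover(a, b, tabu, aspiration, rng)
            children.append(mutate(child, 0.02, rng))
        pop = children
    return max(pop, key=fitness)


if __name__ == "__main__":
    best = evolve()
    print(fitness(best), best)
```

The tabu list pushes crossover away from offspring it has just produced, which is one way to counter premature convergence, while the aspiration rule still lets a tabu offspring through when it is better than the population average. In a filtering system the chromosome would encode a candidate user template rather than a bit string, and the fitness would measure how well that template separates the training texts.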

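The abstract states that the traditional Naive Bayes classifier cannot handle single-category words, without describing the thesis's fix. As a hedged illustration, the sketch below reads "single-category word" as a word that occurs in the training texts of only one class (an assumption), which gives it zero likelihood in every other class and wipes out the whole product. It uses standard Laplace (add-alpha) smoothing as a stand-in remedy. This is a well-known technique, not the improvement proposed in the thesis, and the class labels and training data are invented for the example.

```python
import math
from collections import Counter

# Hypothetical training data: (tokens, label). "casino" and "pills" occur
# only in the "block" class; "project" occurs only in the "pass" class.
TRAINING_DOCS = [
    (["cheap", "pills", "offer"], "block"),
    (["offer", "casino", "cheap"], "block"),
    (["meeting", "report", "offer"], "pass"),
    (["project", "report", "meeting"], "pass"),
]


def train(docs):
    """Count documents per class, word occurrences per class, and the vocabulary."""
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def log_posterior(tokens, label, model, alpha=1.0):
    """Unnormalized log P(label | tokens) with add-alpha smoothing."""
    class_docs, word_counts, vocab = model
    counts = word_counts[label]
    denom = sum(counts.values()) + alpha * len(vocab)
    score = math.log(class_docs[label] / sum(class_docs.values()))
    for token in tokens:
        if token in vocab:  # words never seen in training are skipped
            # alpha > 0 keeps the likelihood positive even when the word
            # never occurred in this class.
            score += math.log((counts[token] + alpha) / denom)
    return score


def classify(tokens, model, alpha=1.0):
    """Return the class with the highest smoothed log posterior."""
    class_docs = model[0]
    return max(class_docs, key=lambda c: log_posterior(tokens, c, model, alpha))


if __name__ == "__main__":
    model = train(TRAINING_DOCS)
    print(classify(["casino", "report", "meeting"], model))
```

Without smoothing, the single word "casino" would force the probability of the "pass" class to zero for the example text, even though the other two words point clearly to that class. With smoothing, the evidence from all words is weighed together.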
