基于遗传算法与模糊聚类的网络信息过滤系统的研究

英文题名：The Research of Network Information Filtering System Based on Genetic Algorithm and Fuzzy Clustering
作者：陆宏菊
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：信息过滤 ; 遗传算法 ; 模糊聚类 ; 适应度函数 ; 相似度
英文关键词：Information Filtering ; Genetic Algorithm ; Fuzzy Clustering ; Fitness Function ; Similarity
学位年度：2008
导师：刘培玉
学科代码：081202
学位授予单位：山东师范大学
论文提交日期：2008-04-10

摘要

随着Internet的发展和应用,越来越多的商务、日常活动通过Internet进行,网络与人们的生活越来越紧密。然而,网络是双面的,人们在享受网络所带来便利的同时,不可避免地接触到大量不良信息;另外,基于Internet所固有的开放性、动态性和异构性,用户很难准确快捷地从Internet上获取所需信息。这就需要在浩如烟海的动态信息中过滤掉不符合用户信息需求的有害、无用信息,把不相关信息减至最少。因此,网络信息过滤技术已经成为当前研究的热点之一。
     如何获得用户的兴趣模板,并依据模板对过滤文档分类,是网络信息过滤中的关键技术。目前常采用文本分类中的相关技术来实现,如:Rocchio、K-元最近邻居、贝叶斯、支持向量机以及遗传算法(GA)等方法。GA在网络信息过滤中的应用主要是为了获得用户的兴趣模板,其效果与适应度函数相关。当前的适应度函数多采用以求个体相似度为基础的方法对种群进行评价。这种方法在评价时,重点在种群个体的相似程度评估上,没有对个体的类别属性进行评价,也没有考虑到特征的典型性及特征包含的类别信息方面的内容,所以获得的用户模型在过滤时效果不是很理想。
     1965年,Zadeh提出模糊集理论之后,人们开始用模糊的方法来处理聚类问题,并称之为模糊聚类分析。由于模糊聚类得到了样本属于各个类别的不确定性程度,表达了样本类属的中介性,即建立起了样本对于类别的不确定性描述,能更客观地反映现实世界。因此,在基于遗传算法的信息过滤中,引入模糊聚类技术来评价,能够更多的考虑到各特征项所属类别的非绝对性、特征的典型性及所包含的类别信息,从种群个体的类别属性上进行评价,从而可获得更准确的用户兴趣模板。
     本文在遗传算法中引入了模糊聚类的思想,从模糊聚类的角度对基于GA的信息过滤系统中种群个体进行评价,提出一个基于模糊聚类的遗传算法,然后将该算法应用于信息过滤中,实现了基于遗传算法与模糊聚类的信息过滤系统。最后,在该系统中对其有效性进行了验证。本文具体工作如下:
     1.将模糊聚类技术融入遗传算法,对个体进行评价。在计算适应度之前,先采用个体所选择的特征子集将训练文本表示成向量,然后采用模糊相似矩阵直接聚类法对其聚类,最后根据聚类的效果来计算适应度。这种评价方法从个体对文本类别的判定能力方面评价个体,更多的考虑到特征的典型性及所包含的类别信息方面的内容。
     2.提高了算法的抗干扰性。适应度函数通过对模糊聚类结果的正确率和紧凑程度两个方面评价的综合来计算适应度值。该函数设置了一个w参数。调整w的取值,可以降低适应度函数对训练文本集中干扰文本的敏感程度,从而提高了算法的抗干扰性。
     3.实现了基于遗传算法与模糊聚类的网络信息过滤系统。采用本文中所提出的基于模糊聚类的遗传算法学习训练文本,通过对种群个体进行评估,经过一定代数的迭代训练获得用户的兴趣模板,然后采用改进的Sim函数对待过滤文档比较分类,最终实现信息过滤。通过该系统验证了该方法的有效性。
     文中通过从模糊聚类角度评价种群个体,提出了基于模糊聚类的遗传算法。经试验验证,该算法在准确率和F1测度方面均有明显的提高。
Following the development of Internet, more and more commercial and daily activities are carried out through the Internet. Network becomes closer to people’s daily life. Coin has two sides. When we are enjoying the convenience from the Internet, it also brings some bad information to Internet users. In addition, because the Internet is openly, dynamic and isomerous , it is rather hard to get information what we need. This demands a method to reduce the irrelevant information according to user’s information demand. So information filtering becomes one of the hot research fields.
     Gaining the user’s profiles, expressing user’s interests, using which to classify the documents form the Internet is the key technique of network information filtering. The relevant techniques of text classifier are often used, such as Rocchio, K Nearest Neighbor, Na?ve Bayesian, Support Vector Machine, and Genetic Algorithm (GA). The application of GA in information filtering is to gain the user’s profiles and its effect is determined by GA’s Fitness Function. At present, the Fitness Function often adopts the method that based on computing the similarity of GA’s individuals. The evaluation method pays more attention to individuals’similarity but less to the classificatory attribute of individuals and features, also the typicality of features. Therefore, the effect of users’profiles is not so good.
     After the fuzzy set theory brought forward by Zadeh in 1965, people begin to use fuzzy theory to do clustering problems. Because the fuzzy clustering (FC) can obtain the degree of classificatory indeterminacy, express the samples’medi-attribute, it reflects the realistic world better. So, if we use the fuzzy clustering method to evaluate the GA’s individuals in network information filtering system based on Genetic Algorithm, can considers more the non-absoluteness of the classificatory residing of each feature, features’typicality and the involved classificatory attribute, mean while, can give a classificatory attribute evaluation of individuals to some degree. Accordingly, gain the more veracious user’s profile. This paper uses fuzzy clustering method to evaluate GA’s individuals in information filtering, proposes a genetic training algorithm based on FC, and then applys this algorithm to an information filtering system, forms the GA and FC network information filtering system, using which proves the validity of GA based on FC. The main tasks that this paper has done as follows:
     1. Using the GA combined with FC to evaluate GA’s individuals. Before computing the fitness, expresses the training set as vectors according to one individual, then clusters it using direct clustering method of Fuzzy Similar Matrix, computes the fitness finally by evaluating the result of clustering. This method evaluates the individuals according to its ability of juding texts’sorts, pays more attention to the typicality of features and its classificatory attribute.
     2. Improving the training algorithm’s ability of anti-jamming. The fitness function computes the fitness by combining the correctness and denseness of the result of fuzzy clustering. This function sets an parameter w which can lower the sensitivity to outliers of training text set. Thereby improves the training algorithm’s ability of anti-jamming.
     3. Implementing the network information filtering system based on GA and FC. This system adopting simulated annealing genetic algorithm to training, evaluating individuals by fuzzy clustering, obtaining user’s profiles through certain generations’iterative training, classifying the information according to profiles using the improved Sim function, accomplishing the process of information filtering, presents the experiment results which proves the validity of the GA based on FC.
     This paper presents a genetic algorithm based on fuzzy clustering by evaluating GA’s individuals using fuzzy clustering technique. Testing proves that it has an obvious advantage in the aspect of precision and F1 measure.

引文

[1] 毛颖,周源远,王继成,张福炎.信息过滤技术研究[J].计算机科学,2003.30(8):10-23.
    [2] 程妮,崔建海,王军.国外信息过滤系统的研究综述[J].现代图书情报技术,2005(6):30-38.
    [3] Maes P.,Kozierok R.Learning interface agents[C].Proceeding of AAAI.1993:459-465.
    [4] Shardanand U.,Maes P.Social information filtering algorithms for automating Word of Mouth[C].Proceeding of the 1995 ACM Conference on Human Factors in Computing System.1995:210-217.
    [5] 柳胜国.网络信息过滤方法与技术[J].情报杂志,2005(9):33-34.
    [6] Konstan A.,Bradley N. M.,Maltz D.,Herlocker J. L.,Gordon L. R.,Riedl J.GroupLens:Applying collaborative filtering to usenet news[J].Communication of the ACM,1997.40(3):77-87.
    [7] Pazzani M.,Bilsus D.Learning and revising user profiles:the identification of interesting web sites[J].Machine Learing,1997.27(3):313-331.
    [8] Armstrong R.,Freitag D.,Joachims Tetal.WebWatcher:a learning apprentice for the World Wide Web[C].Working Notes of the AAAI Spring Symposium Series on Information Gathering from Distributed.Heterogeneous Environments,Cambridge: AAAIPress.1995:6-12.
    [9] Lieberman H.,Letizia.An agent that assists web browsing[C].Proceedings of the International Joint Conference on Artificial Intelligence. SanMateo: Morgan Kaufman Publishers. 1995:924-929.
    [10] 卢增祥,关宏超,李衍达.利用 Bookmark 服务进行网络信息过滤[J].软件学报,2000,11(4):545-550.
    [11] Raskutti B.,Beitz A.,Ward B.A feature-based approach to recommending selections based on past preferences[J].User Modeling and User-Adapted Interaction,1997.7(3):179-218.
    [12] Youngsoo K.,Taekyong N.An efficient text filter for adult web documents[C].ICACT.2006:438-440.
    [13] Holland J. H.Adaption in natural and artificial systems[M]:The University of Michigan Press.1975.
    [14] DeJong K. A.An analysis of the behaviour of a class of genetic adaptive system:[ph D Dissertation][J].University of Michigan,1975:76-9381.
    [15] Bethke A. D.Genetic algorithms as function optimizers[J].Dissertation Abstracts International,1981.41(9):3503B.
    [16] 蓝海,王雄,王凌.复杂函数全局最优化的改进遗传退火算法[J].清华大学学报,2002.42(9):1237-1240.
    [17] 陈勇,唐敏,童若锋,董金祥.基于遗传模拟退火算法的不规则多边形排样[J].计算机辅助设计与图形学学报,2003.15(5):598-603.
    [18] 田珂,朱清新,向培素.基于混合遗传算法的工作流重构研究[J].计算机科学,2007.34(1):103-111.
    [19] 周双娥,雷辉.基于改进的遗传-模拟退火的有序任务调度算法[J].微电子学与计算机,2006.23(10):62-64.
    [20] 张戈,胡伟武.高性能通用处理器中的漏电功耗优化[J].计算机学报,2006.29(10):1764-1771.
    [21] 陈水利,李敬功,王向公.模糊集理论及其应用[M].北京:科学出版社.2005,9:95-125.
    [22] Backer E.,Jain A. K.A clustering performance measure based on fuzzy set decomposition[J].IEEE Trans. PAMI,1981.3(1):66-74.
    [23] Zkim Le.Fuzzy relation compositions and pattern recognition[J].Inf. Sci.,1996:107-130.
    [24] Esogbue A. O.Optimal clustering of fuzzy data via fuzzy dynamic programming[J]. FSS,1986.18:283-298.
    [25] 叶林,陈东升.基于 Fuzzy 等价关系的矩形聚类方法[J].数学的实践与认识,2007.37(9):104-108.
    [26] 叶浩,王明文,曾雪强.基于潜在语义的多类文本分类模型研究[J].清华大学学报(自然科学版),2005,45(S1):1818-1822.
    [27] Dik L., Lee H.Doucument ranking and the vector-space modal[J].IEEE software,1997.4:67-75.
    [28] Qing Yang, Fang-min Li.Support vector machine for customized email filtering based on improving latent semantic indexing[C].Proceedings of the Fourth International Conference on Machine Learning and Cybernetics.Guangzhou.2005:3785-3791.
    [29] 杨军,夏清国.Rocchio 算法实现数据库模糊查询[J].微电子学与计算机,2006.23(1):160-161.
    [30] Cheatham M.,Rizki M.Feature and prototype evolution for nearest neighbor classication of web documents[C].Proceedings of the Third International Conference on Information Technology:New Generations (ITNG'06).2006:364-369.
    [31] Fang Yuan,Liu Yang,Ge Yu.Improving the k-nn and applying it to Chinese text classification[J].2005 IEEE,2005:1547-1553.
    [32] 阮彤,冯东雷,李京.基于贝叶斯网络的信息过滤模型研究[J].计算机研究与发展,2002.39(12):1564-1571.
    [33] Hussein F.,N. Kharma,R. Ward.Genetic algorithms for feature selection and weighting,a review and study[J].2001 IEEE:1240-1244.
    [34] Jian-chao Xu,Da-you Liu,Ming Hu.Feature selection and text classification for Chinese web documents[C].Proceeding of the Third International Conference on Machine Learning and Cybernetics.Shanghai.2004:1304-1309.
    [35] 卢苇,彭雅.几种常用文本分类算法性能比较与分析[J].湖南大学学报(自然科学版),2007(6):67-69.
    [36] 刘明吉.基于协同演化的文本特征获取算法[J].计算机工程,2005.31(4):85-87.
    [37] 杜长海,吉根林.模糊聚类在中文文本分类中的应用研究[J].计算机工程与应用,2006(8):170-177.
    [38] 甘利杰,丁明勇,杨永斌.基于 Winsock SPI 技术的包过滤研究[J].计算机科学,2007.34(8):112-122.
    [39] 陆宏菊,刘培玉,崔嘉.基于模糊评判的网络信息过滤系统的研究[J].计算机工程与应用,2007.43(33):164-166.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700