Research on Feature Selection Algorithms in Information Filtering Systems
Abstract
With the rapid development and growing popularity of the Internet, electronic text information has expanded explosively. How to organize and manage this information effectively, and to find what a user needs quickly, accurately, and comprehensively, is a major challenge facing information science and technology today. Network information filtering, a key technology for processing and organizing the huge volume of online information, can relieve information clutter to a large extent and help users locate the information they need. Most current research on information filtering concentrates on devising and improving classification methods; feature selection, however, has always been both a foundational step and a bottleneck of network information filtering, so research on feature selection algorithms is well worthwhile.
     Commonly used feature selection algorithms rely directly on the assumption of conditional independence among features: they construct an evaluation function and score each feature of the feature set in isolation. Because they neither consider a feature's class relevance directly nor account for redundancy within the feature subset, the subsets they select are often redundant in class-discriminating power, which weakens the final classification.
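     Two typical instances of such per-feature evaluation functions, as defined in the text categorization literature, are information gain (IG) and the χ² (CHI) statistic, both of which score a term t in isolation. With m classes, and with A, B, C, D the document counts of the 2×2 term-class contingency table (N = A + B + C + D):

\[ IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t}) \]

\[ \chi^2(t,c) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)} \]

Neither formula looks at any other feature, which is why redundant, highly correlated terms can all receive high scores at once.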
     Focusing on feature selection algorithms in information filtering systems, this thesis studies and discusses the following aspects:
     1. The strengths and weaknesses of commonly used feature selection methods are analyzed, and directions for improvement are pointed out for the identified shortcomings.
     The thesis first gives a comprehensive analysis of feature selection and introduces the framework of feature selection in detail. The feature selection methods in common use each have their strong and weak points; the thesis analyzes their advantages and disadvantages in terms of computational complexity and classification performance and points out the likely causes. In addition, comparative experimental conclusions about these algorithms are compiled from the literature; they agree broadly with the experimental results reported at the end of the thesis.
     2. Starting from definitions of feature relevance and redundancy, a feature selection framework called FSBC (feature selection based on correlation) is proposed. It splits feature selection into two steps: first, select a subset of class-relevant features; second, remove redundant features from that candidate subset through redundancy analysis, yielding an optimized feature subset.
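     As a rough sketch of this two-phase flow (Python; the function names are placeholders rather than the thesis's, and both helpers are fleshed out after the next two paragraphs; docs is a list of token lists, labels the parallel class labels):

    def fsbc(docs, labels, n_candidates, n_clusters):
        # Step 1: keep the terms most correlated with the class labels.
        candidates, relevance = select_class_relevant(docs, labels, n_candidates)
        # Step 2: cluster the candidates with a K-Means variant and keep
        # one representative per cluster, discarding redundant features.
        return remove_redundant(docs, candidates, relevance, n_clusters)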
     First, when selecting class-relevant features, the evaluation function is built on the following principle: if a term t occurs frequently in documents of one class but rarely in the other classes, then t represents that class well; such a term should receive a high weight and be chosen as a feature word of that class, distinguishing its documents from those of other classes. The thesis also borrows the idea of TFIDF weighting, combining term frequency and document frequency as the joint basis for evaluating terms.
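     A minimal sketch of this first step, assuming a tf·idf-style product as the concrete combination of term frequency and document frequency (the thesis's exact weighting formula is not reproduced here):

    from collections import Counter, defaultdict
    from math import log

    def select_class_relevant(docs, labels, k):
        # Score every term per class and keep the k best overall:
        # a term scores high when it is frequent inside one class
        # yet occurs in few documents of the whole collection.
        n = len(docs)
        df = Counter()                    # document frequency per term
        tf = defaultdict(Counter)         # term frequency per class
        for doc, c in zip(docs, labels):
            tf[c].update(doc)
            df.update(set(doc))
        relevance = Counter()
        for c in tf:
            for t, f in tf[c].items():
                # frequent in class c, rare across the collection
                relevance[t] = max(relevance[t], f * log(1 + n / df[t]))
        candidates = [t for t, _ in relevance.most_common(k)]
        return candidates, relevance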
     Second, for the redundancy analysis, the K-Means algorithm widely used in clustering serves as the core redundancy-removal algorithm. The selection of the initial cluster centers and the setting of the initial number of clusters are improved accordingly, so that the K-Means-style algorithm reduces the redundancy of the feature set more effectively.
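     A sketch of the second step under the same assumptions, with scikit-learn's plain k-means++ initialization standing in for the thesis's improved choice of initial centers and cluster count:

    import numpy as np
    from sklearn.cluster import KMeans

    def remove_redundant(docs, candidates, relevance, n_clusters):
        # Cluster candidate terms by their document-occurrence vectors;
        # terms that land in the same cluster behave alike across the
        # corpus, so only the most class-relevant one is kept.
        vecs = np.array([[1.0 if t in doc else 0.0 for doc in docs]
                         for t in candidates])
        km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10)
        cluster_of = km.fit_predict(vecs)
        kept = []
        for j in range(n_clusters):
            members = [t for t, c in zip(candidates, cluster_of) if c == j]
            if members:
                kept.append(max(members, key=lambda t: relevance[t]))
        return kept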
     3. Finally, the proposed feature selection strategy was tested on a network information filtering platform and achieved satisfactory results.
     The FSBC framework is applied to a network information filtering system and compared experimentally with information gain (IG) and the CHI statistic. The experiments show that FSBC outperforms the other two methods in precision and recall, and performs especially well when the feature dimensionality is high.
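     Taking the two reported rates in their usual information-retrieval sense (TP: relevant documents accepted by the filter; FP: irrelevant documents accepted; FN: relevant documents rejected), they are:

\[ \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN} \]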
