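Research on Text Mining Technology and Its Application in the Integrated Risk Information Network

Author: Zhang Xiang
Degree level: Doctoral
Discipline: Computer Software and Theory
Keywords: Integrated Risk Information ; Representation model ; Feature selection ; Classifier ; Key-phrase extraction
Degree year: 2011
Advisor: Zhou Mingquan
Discipline code: 081202
Degree-granting institution: Northwest University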

Abstract

With the rapid development of Internet technology and the explosive growth of electronic text, finding useful knowledge in massive text data has become an important topic in data mining. This thesis is set against a key project of the National Science and Technology Support Program of the 11th Five-Year Plan, "Key Technology Research and Demonstration of Integrated Risk Governance (IRG)" (No. 2006BAD20B02). Targeting the intelligent acquisition and classification of integrated risk information, and taking into account the characteristics of risk and disaster information on the Internet, it studies key text mining techniques: representation models, feature selection, text classification, and text association. The work has both theoretical significance and practical value. The main contributions are as follows:
     1. A representation model for integrated risk information is proposed. The tf*idf weighting method of the vector space model is analyzed and shown to ignore how features are distributed across classes. Because integrated risk information is Web information, a feature weighting method is designed that jointly considers term frequency, inverse document frequency, the category weight of a feature term, and HTML tags. Experiments show that it improves classification performance on risk information.
     2. A feature selection method that combines ReliefF with the RMI evaluation function is proposed. Traditional feature selection methods for text mining ignore the correlation between feature terms, which leaves many redundant features in the selected subset. The proposed combined method first removes irrelevant features with the ReliefF algorithm and then filters redundant features with the RMI evaluation function. Experiments show that it removes redundancy among text features more effectively than traditional feature selection methods.
     3. A confidence-based Attribute Bagging text classification algorithm is proposed. Bagging assigns the same weight to every weak classifier, which is unreasonable, so an improved Bagging algorithm is designed. It obtains multiple training sets by resampling the attributes of the training samples, uses kNN as the weak classifier, computes the confidence of each weak classifier to derive its voting weight, and produces the ensemble classification result according to a voting rule. Experiments show that the resulting text classifier performs better than the Attribute Bagging algorithm.
     4. A key-phrase extraction method based on grey relational analysis is proposed. Key phrases are extracted by computing the grey relational degree between the given key phrases of integrated risk information and the feature terms. Its main advantage is that it overcomes the small-sample problem: it applies regardless of the sample size and of whether the samples follow a regular pattern. It also addresses the problem that key-phrase extraction methods based on mathematical statistics ignore the contribution of low-frequency domain-specific terms.
     5. The research results on these key text mining techniques are applied to the Integrated Risk Information Network. Combined with focused crawler technology, the intelligent collection and classification of integrated risk information on the Internet is designed and implemented, with good results.
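As a rough illustration of the idea behind contribution 4, the sketch below computes the standard grey relational grade in Deng's formulation between a reference sequence and several candidate sequences, then ranks the candidates. It is not the thesis's implementation: the function name `grey_relational_grades`, the toy numbers, and the distinguishing coefficient `rho = 0.5` are assumptions made for this example, and the abstract does not say how the sequences for key phrases and feature terms are built or normalized.

```python
from typing import Dict, Sequence


def grey_relational_grades(
    reference: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    rho: float = 0.5,  # distinguishing coefficient; 0.5 is a common default (assumption)
) -> Dict[str, float]:
    """Return the grey relational grade of each candidate against the reference."""
    n = len(reference)
    if n == 0:
        raise ValueError("reference sequence must not be empty")
    if not candidates:
        return {}
    if any(len(seq) != n for seq in candidates.values()):
        raise ValueError("all sequences must have the same length")

    # Absolute differences between the reference and every candidate sequence.
    deltas = {
        name: [abs(r - c) for r, c in zip(reference, seq)]
        for name, seq in candidates.items()
    }
    all_deltas = [d for row in deltas.values() for d in row]
    d_min, d_max = min(all_deltas), max(all_deltas)
    if d_max == 0.0:
        # Every candidate coincides with the reference.
        return {name: 1.0 for name in candidates}

    grades = {}
    for name, row in deltas.items():
        # Grey relational coefficient at each position, averaged into a grade.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades[name] = sum(coeffs) / n
    return grades


# Hypothetical, already normalized weights of a given key phrase and of two
# feature terms across four documents (toy data, not from the thesis).
key_phrase = [1.0, 0.8, 0.6, 0.9]
feature_terms = {
    "landslide": [0.9, 0.7, 0.6, 0.8],
    "weather": [0.2, 0.9, 0.1, 0.3],
}

grades = grey_relational_grades(key_phrase, feature_terms)
ranking = sorted(grades, key=grades.get, reverse=True)
print(ranking)
```

Feature terms with a higher grade track the given key phrase more closely and would be the stronger key-phrase candidates. Because the grade depends only on pointwise differences between sequences, it can be computed for a handful of documents, which matches the small-sample argument made in item 4.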
