With the rapid development of the Internet, information of every kind grows exponentially each day, and most of these abundant resources still exist in the form of text. How to manage and organize such a huge and growing body of text, and how to mine from it the information people need, has become a problem of high research value that is drawing more and more attention worldwide. Against this background, text classification based on machine learning has emerged. It has become an important basis and prerequisite for information retrieval, information filtering, search engines, text databases, data mining and other fields, and it has broad application prospects.
     Text classification involves many key technologies: Chinese word segmentation, feature selection, the vector space model, the classification model, classification evaluation metrics and so on. Most machine-learning-based automatic text categorization is built on the vector space model (VSM), in which a text is expressed in a form that computers can process. Using a feature weighting algorithm, we select the features that play an important role and represent the text well, and at the same time ignore the features that contribute nothing to categorization. This serves two purposes: it reduces the dimension of the VSM and so improves the efficiency of categorization, and it selects features that express the text better and so improves precision. The feature weighting algorithm is therefore the basis and premise of text categorization and holds an important position. Following this analysis, this dissertation focuses on improving the term-weighting approach. Its contributions are as follows:
     The basic concepts of text classification and its development at home and abroad are introduced briefly.
     The key technologies of text classification are introduced, including text preprocessing, feature dimension reduction, text representation, classification algorithms and evaluation metrics.
     The term-weighting approach TFIDF is introduced and its weaknesses are analyzed; several improved approaches based on TFIDF are presented, and TFIDF-DI, the better of them, is analyzed.
     The concept of the skewed dataset is introduced, and control experiments are carried out using TFIDF and TFIDF-DI; the results are analyzed and the shortcomings of these two approaches on skewed datasets are pointed out.
     An improved method, TFIDF-λDI, is proposed on the basis of TFIDF-DI, and the KNN algorithm is used to compare the new approach with TFIDF and TFIDF-DI; the results show that the improved method brings a certain enhancement in classification performance.
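
     As a rough illustration of the baseline pipeline described above (TFIDF weights in a vector space model, classified with KNN), the following minimal Python sketch may help. It makes several assumptions that are not taken from the dissertation: the IDF variant is log(N/df), vectors are L2-normalized, similarity is cosine, and the function names and toy documents are hypothetical. The abstract does not define TFIDF-DI or TFIDF-λDI, so they are not reproduced here.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Compute IDF = log(N / df) for every term seen in the training docs.
    (One common IDF variant; assumed here, not taken from the dissertation.)"""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))          # count each term once per document
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """Map a tokenized document to an L2-normalized sparse TF-IDF vector.
    Terms unseen during training are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return {}
    return {t: w / norm for t, w in vec.items()}

def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def knn_predict(query_vec, train_vecs, labels, k=3):
    """Majority vote among the k training documents most similar to the query."""
    ranked = sorted(zip(train_vecs, labels),
                    key=lambda pair: cosine(query_vec, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy, already-segmented documents (hypothetical data for illustration only).
train_docs = [
    ["stock", "market", "price", "rise"],
    ["market", "trade", "stock", "fall"],
    ["team", "win", "match", "goal"],
    ["player", "score", "goal", "match"],
]
train_labels = ["finance", "finance", "sports", "sports"]

idf = fit_idf(train_docs)
train_vecs = [tfidf_vector(d, idf) for d in train_docs]
query_vec = tfidf_vector(["stock", "price", "fall"], idf)
predicted = knn_predict(query_vec, train_vecs, train_labels, k=3)
print(predicted)
```

     In this sketch a term that occurs in every training document receives an IDF of zero and so drops out of the vector. For Chinese text, a word segmentation step would have to produce the token lists first.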
