Establishing Chinese Text Labels for the Mayor's Public Access Line
Abstract
With the rapid development of computer networks and the public's growing engagement in political participation and awareness of self-protection, information processing has become an indispensable tool for obtaining useful information. Many cities have set up mayor's public access line service platforms, and as a result the volume of documents from every sector grows sharply each day. Classifying this information by traditional manual means is not only time-consuming but faces mounting difficulties; in particular, adjustments to the responsibilities of government departments make it an urgent research problem to route these records promptly and accurately to the correct handling unit after each reorganization.
     Automated text categorization is a research hotspot and core technique in information retrieval and data mining. Machine-learning-based text categorization, the process of automatically assigning a text to a category from a predefined taxonomy on the basis of its content, is an important direction in information processing research.
     Starting from the practical problem of classifying Chinese texts from the Changchun mayor's public access line, this thesis introduces the concept of automated text categorization and the mayor's public access line system; surveys the key techniques involved in text classification, including word segmentation, feature selection, and feature extraction; discusses labeling texts by semi-supervised learning; studies Chinese text classification based on the EM algorithm, random forests, and boosting; implements all three classifiers in C++; and analyzes the experimental results.
