Research and Implementation of SVM-Based Multi-Class Classification of Chinese Web Pages
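Abstract
With the rapid development of Internet technology, people have moved from an era of information scarcity to a digital era of extreme information abundance, in which ever more digital information is available. Most of this information is semi-structured or unstructured data, and quickly and effectively extracting what is needed from it is very difficult. For this reason, researchers have proposed automatic classification of Chinese web pages and studied its application; the study of Chinese web page classification has both theoretical significance and practical value. Automatic classification can build separate databases of web pages by category, which improves the recall and precision of Chinese search engines, and it can also build automatically classified information resources that give users a categorized directory. Good automatic classification also benefits the subsequent relevance-ranking process to some extent.
     Building on a study of the traditional support vector machine (SVM) classifier model and on existing web page classification techniques, this thesis presents a fairly systematic study of how SVM multi-class classifier models are constructed and proposes an SVM-based algorithm for constructing a multi-class classifier. On this basis, it offers some observations on classification-oriented content extraction from Chinese web pages, Chinese word segmentation, feature selection for Chinese web pages, and SVM classifiers for Chinese web pages.
     (1) Based on the structure and characteristics of Chinese web pages, the information components that contribute to classification are analyzed. The text in the title and in the tags of the page body is used to approximate the topical content of a page, and an algorithm for extracting this text is designed.
     (2) Chinese word segmentation and feature extraction methods are studied in depth. Chinese word segmentation methods are analyzed systematically, and the word segmentation system of the Information Retrieval Laboratory at Harbin Institute of Technology is introduced. An improved χ² (chi-square) statistic is adopted as the feature selection method, and the feature selection algorithm is described.
     (3) SVM multi-class classification methods are studied in depth at the theoretical level, and earlier methods for constructing SVM multi-class classifiers are analyzed. The kernel-induced distance formula in the high-dimensional space is used to compute the shortest distance between classes, and a weighted undirected complete graph is introduced to describe the distance structure among classes in that space. Following the principle that the class or set of classes that is easiest to separate should be split off first, a method for constructing an SVM-based multi-class classifier is proposed.
     (4) Building on the work above, a complete classification system, CWPMCS, is built and tested, and the experimental results are analyzed and evaluated. The results show that the system achieves high classification accuracy, higher than that of the K-nearest neighbor (KNN) classification method.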
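As a rough illustration of the χ² feature selection mentioned in item (2), the sketch below scores each term with the standard χ² statistic and keeps the top-scoring terms. It is not the thesis's improved variant, which the abstract does not specify. The function names, the max-over-classes scoring rule, and the input layout (documents already segmented into token lists) are assumptions made for this example.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square statistic for one (term, class) pair.

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class that lack the term
    d: docs outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k terms with the highest max-over-classes chi-square score.

    docs: list of token lists (assumed to be segmented already)
    labels: class label of each document
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    df = Counter()           # term -> number of docs containing it
    df_in_class = Counter()  # (term, class) -> docs of that class containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[(term, label)] += 1

    scores = {}
    for term in df:
        best = 0.0
        for cls, size in class_sizes.items():
            a = df_in_class[(term, cls)]
            b = df[term] - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Taking the maximum over classes favours terms that are strongly tied to at least one category; averaging over classes is a common alternative.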
With the rapid development of Internet technology, society has moved from an era of information scarcity to a digital era of extreme information abundance. In this era, people can obtain more and more digital information, including text, numbers, graphics, images, sound, and even video. Most of this information is semi-structured or unstructured data, so obtaining the necessary information from it is very difficult. Researchers have therefore proposed automatic web page classification and studied its application; research on Chinese web page classification has both theoretical significance and practical value. Automatic web page classification can build separate databases according to category information, improving the recall and precision of Chinese search engines, and it can also build automatically classified information resources that offer users a categorized information directory. The quality of automatic classification also has some positive effect on the subsequent relevance-ranking process.
     While studying the traditional support vector machine (SVM) classifier model, this thesis draws on existing web page classification techniques, carries out a fairly systematic study of the construction of SVM multi-class classifier models, and proposes an algorithm for constructing multi-class classifier models based on SVM. On this basis, it puts forward some ideas and opinions on obtaining Chinese web page content, Chinese word segmentation, Chinese web page feature selection, and the SVM classifier for Chinese web pages.
     (1) Based on the structure and characteristics of Chinese web pages, the information components that contribute to the classification process are analyzed. The title of a web page and the text in some tags of the main body are used to express the page's topical content approximately, and an algorithm for obtaining the text in the title and these tags is designed.
     (2) Methods for Chinese word segmentation and feature selection are studied in depth. Chinese word segmentation methods are analyzed systematically, the Chinese word segmentation system of the Information Retrieval Research Lab of Harbin Institute of Technology is introduced, the CHI (chi-square) statistic is adopted as the feature selection method, and the feature selection algorithm is described.
     (3) SVM multi-class classification methods are studied in depth at the theoretical level, and earlier methods for constructing SVM multi-class classifiers are analyzed. The kernel-function distance formula in the high-dimensional space is used to calculate the distance between every two classes, and a weighted undirected complete graph is introduced to describe the structure of distances among the classes. A method for constructing multi-class classifier models based on SVM is proposed.
     (4) On the basis of the above analysis, a comparatively complete classification system (CWPMCS) is set up, experiments are carried out, and the experimental results are analyzed and evaluated. The results show that the classification system developed in this thesis has a high classification accuracy, higher than that of the K-nearest neighbor (KNN) classification method.
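The kernel-based class distances and the weighted complete graph in item (3) can be sketched roughly as follows. The feature-space distance uses the standard identity ||φ(x) − φ(y)||² = K(x,x) − 2K(x,y) + K(y,y). Everything else here is an assumption made for illustration, since the abstract gives no details: the RBF kernel and its `gamma`, the use of the minimum sample-to-sample distance as the class distance, and the `split_order` heuristic (split off first the class whose nearest neighbouring class is farthest away). This is not the thesis's actual construction algorithm.

```python
import math
from itertools import combinations


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel; the kernel choice and gamma are assumptions of this sketch."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def feature_space_distance(x, y, kernel):
    """Distance between phi(x) and phi(y), computed from kernel values only."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors


def class_distance_graph(samples_by_class, kernel):
    """Weighted undirected complete graph over classes.

    Edge weight = shortest feature-space distance between any two samples
    of the two classes. Keys are sorted (class_i, class_j) pairs.
    """
    graph = {}
    for ci, cj in combinations(sorted(samples_by_class), 2):
        graph[(ci, cj)] = min(
            feature_space_distance(x, y, kernel)
            for x in samples_by_class[ci]
            for y in samples_by_class[cj]
        )
    return graph


def split_order(graph):
    """Hypothetical heuristic: repeatedly split off the class whose nearest
    remaining class is farthest away, i.e. the easiest one to separate."""
    remaining = {c for edge in graph for c in edge}
    order = []
    while len(remaining) > 1:
        def nearest(c):
            return min(
                w for (a, b), w in graph.items()
                if c in (a, b) and a in remaining and b in remaining
            )
        easiest = max(sorted(remaining), key=nearest)
        order.append(easiest)
        remaining.remove(easiest)
    order.extend(remaining)
    return order
```

In a tree-structured multi-class SVM, an order like this would decide which class each binary SVM separates from the rest at successive nodes. The thesis also allows splitting off sets of classes, which this sketch does not cover.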
