Research on Automatic Webpage Classification Technology Based on Naive Bayes

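English title: Naïve Bayesian-based Automatic Webpage Classification Technology Research
Author: 李晋松
Degree level: Master's
Discipline and major: Control Theory and Control Engineering
Chinese keywords (translated): Data Mining; Webpage Classification; Naive Bayes; Information Filtering
English keywords: Data Mining; Web Classification; Naïve Bayesian; Filtration
Degree year: 2008
Advisor: 薛为民
Discipline code: 081101
Degree-granting institution: Beijing University of Chemical Technology (北京化工大学)
Thesis submission date: 2008-06-04
Chair of the defense committee: 曹柳林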

Abstract

Text and webpage classification is an important research topic in text mining and Web mining, and has become one of the focal points of data mining research. With the rapid development of data processing tools, advanced database technology, and network technology, large volumes of complex data in many forms (such as structured and semi-structured data, hypertext, and multimedia data) continue to emerge. An important problem for data mining is therefore the mining of complex data types, including complex objects, spatial data, multimedia data, time-series data, text data, and Web data. This thesis builds a webpage text classification model on top of a chosen classification algorithm and studies how to make sound use of three kinds of information: page text content, link structure, and user usage. These are integrated so that the category of a page is reflected more completely, and on this basis a filtering system for specific webpage information is built.
     The thesis presents a Chinese webpage classification method that combines user usage information with the hierarchy of the site's link structure. Unlike traditional approaches that analyze only page content or only page links, the proposed method draws on additional Web information, such as user usage information and link hierarchy information, to improve the performance and characteristics of the webpage classifier. Data were collected and experiments carried out on this basis, and analysis of the results indicates that the method is effective.
     Finally, the thesis analyzes the application of the webpage classification method to information filtering; the results show that using user information can improve filtering accuracy.
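
The idea of feeding content, link-hierarchy, and usage information into one naive Bayes classifier can be illustrated with a small sketch. This is not the thesis's implementation, which the abstract does not describe in detail. It assumes, purely for illustration, that each information source is turned into prefixed tokens and that a multinomial naive Bayes model with add-one smoothing is trained on the merged token list. The names `page_features`, `NaiveBayes`, `usage_tags`, and the toy training pages are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def page_features(content_tokens, url_path, usage_tags):
    """Merge three information sources into one token list (illustrative).

    Each source gets its own prefix, so the word "sports" in the page text
    and the path segment "sports" in the URL count as separate features.
    """
    features = [f"text:{t}" for t in content_tokens]
    # Link hierarchy: one feature per directory level of the URL path.
    features += [f"path:{seg}" for seg in url_path.strip("/").split("/") if seg]
    # Usage: assumed here to be labels derived from access logs.
    features += [f"usage:{u}" for u in usage_tags]
    return features


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training pages (made up for this sketch, not data from the thesis).
train = [
    (page_features(["football", "match"], "/sports/soccer/", ["evening"]), "sports"),
    (page_features(["goal", "league", "match"], "/sports/news/", ["evening"]), "sports"),
    (page_features(["stock", "market"], "/finance/markets/", ["workday"]), "finance"),
    (page_features(["bank", "stock", "rate"], "/finance/news/", ["workday"]), "finance"),
]
model = NaiveBayes().fit([f for f, _ in train], [y for _, y in train])

# A page whose text is ambiguous but whose path and usage point to sports.
print(model.predict(page_features(["match", "rate"], "/sports/", ["evening"])))
```

Prefixing the tokens keeps the three sources distinguishable inside a single model. A real system might instead train one classifier per source and combine their scores; the abstract does not say which design the thesis uses.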

