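Study on Web-Pages Classification Based on Rough Set and "Rule+Exception"

Original title (Chinese): 基于粗糙集的“规则+例外”网页分类研究
Author: 刘云霞 (Liu Yunxia)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 文本分类 ; 特征提取 ; 粗糙集 ; 规则归纳 ; 例外分析
English keywords: text classification ; feature extraction ; rough sets ; rule induction ; exception analysis
Degree year: 2007
Advisor: 胡彧 (Hu Yu)
Discipline code: 081202
Degree-granting institution: Taiyuan University of Technology (太原理工大学)
Submission date: 2007-05-01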

Abstract

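With the rapid development of information technology, the volume of information on the Web keeps expanding. How to make this information serve people better has become a research focus for the coming years. On one hand, people want to obtain information quickly, accurately, and comprehensively; on the other, Web information is vast and disorganized. Building a bridge between the two is a major challenge. Automatic web page classification offers a reasonable and effective way of organizing information to address this problem.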
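     To organize and analyze web page information effectively and help users quickly obtain the information they need, this thesis extracts rules tailored to the different needs that different users have for Web information. It also analyzes the exceptions that arise, following the learning theory that rules and exceptions in knowledge complement each other, in order to classify Chinese web page text precisely. The thesis studies classification techniques for Chinese web page text from both theoretical and applied perspectives, proposes applying rough sets together with the rule-and-exception learning theory from natural language processing to Chinese web page classification, and implements a rough-set-based "rule + exception" Chinese web page classification system.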
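     The thesis systematically studies and describes in detail the key techniques of Chinese web page classification, the main content of rough set theory, rule induction, and exception analysis. Guided by this theory, it designs a Chinese web page text classifier that addresses user needs. The main research work is as follows: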
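     Web page text classification first requires collecting Web text, preprocessing it, and saving the text information it contains. For this part, the thesis first implements a preemptive multithreaded collector for Chinese web pages that uses a depth-first algorithm to fetch pages of a specific type. Then, based on the characteristics of HTML tag text, it implements a Web text preprocessor based on non-recursive matching, which extracts the text information in a page together with a defined set of web page tags.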
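     Second, building on a study of text representation and the characteristics of web page information, the thesis improves the weight calculation method used to represent Chinese web page text and designs an attribute reduction algorithm oriented to user needs, which achieved good results in the text classification system. In addition, drawing on rough set theory, the thesis analyzes how rules and exceptions are formed and proposes a reduct-based method for identifying exceptions.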
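     Finally, the thesis designs the overall scheme of a Chinese web page text classification system and, following that scheme, implements the rough-set-based "rule + exception" Chinese web page text classification system. For evaluation, two groups of experiments were conducted and their results compared. The experimental data show that the classifier designed in this thesis improves the efficiency of web page text classification and is of practical value.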
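The abstract mentions a preprocessor that extracts page text and a set of tags by non-recursive matching, but it does not give the algorithm. The following is only a minimal sketch of one way such a step could look, assuming a single left-to-right scan with no recursion; the function name `extract_text`, the `SKIPPED` set, and the sample page are hypothetical and are not taken from the thesis.

```python
# Assumption: elements whose content is not treated as visible text.
SKIPPED = {"script", "style"}


def extract_text(html):
    """Scan the markup once, left to right, without recursion.

    Returns (visible_text, tag_names_in_order_of_appearance).
    """
    text_parts, tags = [], []
    skipping = None  # name of the skipped element we are inside, if any
    i, n = 0, len(html)
    while i < n:
        if html.startswith("<!--", i):
            # Jump over the whole comment.
            end = html.find("-->", i + 4)
            i = n if end == -1 else end + 3
            continue
        if html[i] == "<":
            end = html.find(">", i + 1)
            if end == -1:
                break  # unterminated tag: drop the remainder
            body = html[i + 1:end].strip().rstrip("/")
            closing = body.startswith("/")
            parts = body.lstrip("/").split()
            name = parts[0].lower() if parts else ""
            if skipping is None:
                if name and name[0].isalpha() and not closing:
                    tags.append(name)
                    if name in SKIPPED:
                        skipping = name
            elif closing and name == skipping:
                skipping = None
            i = end + 1
            continue
        # Plain text runs up to the next "<" (or the end of input).
        nxt = html.find("<", i)
        if nxt == -1:
            nxt = n
        if skipping is None:
            text_parts.append(html[i:nxt])
        i = nxt
    # Collapse whitespace so the result is one clean line of text.
    return " ".join(" ".join(text_parts).split()), tags


sample = (
    "<html><head><title>News</title><style>p{color:red}</style></head>"
    "<body><p>Rough <b>sets</b></p><!-- ad --><script>var a = 1;</script>"
    "</body></html>"
)
text, tags = extract_text(sample)
print(text)
print(tags)
```

A flat scan like this keeps memory use constant in the nesting depth of the page, which is one plausible reason to prefer non-recursive matching on malformed real-world HTML. It is not a conforming HTML parser and makes no attempt to handle every edge case.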
With the rapid development of information technology, the amount of information on the network is growing explosively. Making network information easier and more efficient to use has become an active research topic. Information on the Internet is poorly organized and spread across a huge number of pages, while people want to retrieve information quickly and accurately. Automatic web page classification is a promising approach to this problem.
     To organize and analyze massive web information resources effectively and help users promptly obtain the knowledge and information they need, this thesis extracts different rules according to users' different requirements. It also analyzes the exceptions that remain, on the basis of the learning theory that rules and exceptions are complementary, in order to classify pages accurately. The thesis studies Chinese web text mining techniques in depth from both theoretical and applied perspectives, proposes applying rough sets and the "rule + exception" learning theory from natural language processing to Chinese web text mining, and implements a classifier for Chinese web page text.
     The thesis systematically introduces the key techniques of Chinese web page classification and the main theory of rough sets, rule induction, and exception analysis. Guided by this theory, a Chinese web page classifier is designed. The main contributions of this thesis are as follows:
     Unlike general text classification, web page classification requires collecting Chinese web pages, preprocessing them, and saving their text information. First, a preemptive multithreaded web text collector is implemented, which uses a depth-first algorithm to collect web pages of a specific type. In addition, a web text preprocessor is implemented, which removes meaningless HTML tags and extracts web text using a non-recursive matching method.
     Furthermore, the weight calculation algorithm is improved by taking into account the characteristics of text information and web page information. More importantly, an attribute reduction algorithm oriented to users' requirements is proposed, which proved effective in the text classification system. A reduct-based exception analysis method is also proposed, grounded in rough set theory, by analyzing why rules and exceptions arise in web page text classification.
     Finally, the overall design of the Chinese web page text classification system is presented, and the classifier based on rough set theory and "rule plus exception" is implemented according to that design. To evaluate the classifier's performance, two experiments were carried out and their results compared.
     The results show that the system improves both the efficiency and the accuracy of web page text classification, and that this work can serve as a reference in the field of text classification.
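The attribute reduction step named in the abstract is not specified here, so the sketch below does not reproduce the author's user-oriented algorithm. It only illustrates the general rough-set idea of a reduct, using a generic greedy heuristic over discernibility sets: pick attributes until every pair of documents from different classes can still be told apart. The toy decision table, the term names, and the function names are hypothetical, and a greedy cover is not guaranteed to be minimal in general.

```python
from itertools import combinations


def discernibility_sets(rows, labels):
    """For each pair of objects with different labels, collect the
    indices of the attributes on which the two objects differ."""
    sets = []
    for (i, a), (j, b) in combinations(enumerate(rows), 2):
        if labels[i] != labels[j]:
            diff = frozenset(
                k for k, (x, y) in enumerate(zip(a, b)) if x != y
            )
            if diff:  # empty means the pair is inconsistent; skip it
                sets.append(diff)
    return sets


def greedy_reduct(rows, labels):
    """Greedy set cover: repeatedly take the attribute that appears in
    the most still-uncovered discernibility sets."""
    remaining = discernibility_sets(rows, labels)
    chosen = []
    while remaining:
        counts = {}
        for s in remaining:
            for attr in s:
                counts[attr] = counts.get(attr, 0) + 1
        # Ties are broken in favour of the lowest attribute index.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return sorted(chosen)


# Hypothetical decision table: 1 = term occurs in the page, 0 = absent.
terms = ["stock", "match", "goal"]
rows = [
    (1, 0, 0),
    (1, 1, 0),
    (0, 1, 0),
    (1, 0, 1),
]
labels = ["finance", "finance", "sports", "sports"]

kept = greedy_reduct(rows, labels)
print([terms[k] for k in kept])
```

In this toy table the attributes that survive are the ones some pair of pages from different classes differs on exclusively, so dropping either would merge pages of different classes. Pages that still conflict after reduction would be natural candidates for the "exception" side of a rule-plus-exception model.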

