Research on Automatic Web Page Classification and Clustering Based on Web Mining Technology

English title: Research of Automatic Web Page Categorization and Cluster Based on Web Mining Technology
Author: Xie Zhenliang (谢振亮)
Degree level: Master's
Discipline: Computer Applications
Keywords: Text Classification; Text Clustering; Web Mining; Anchor Text
Degree year: 2004
Supervisor: He Pilian (何丕廉)
Discipline code: 081203
Degree-granting institution: Tianjin University
Thesis submission date: 2004-01-01

Abstract

Text classification and text clustering are two important tasks in information processing. Traditional classification and clustering algorithms mainly target plain-text files. With the rapid growth of the Internet, semi-structured Web data has gradually become the main object of information processing, which has driven the further extension and development of text classification and clustering algorithms.
This thesis studies how to use Web mining technology, combined with existing classification and clustering techniques, to classify and cluster Web text data with high accuracy. Its starting point is that a page's position in the site topology, together with the anchor text of the links that other pages point at it, reflects the site administrator's view of that page's content and category; making full use of this information helps to classify and cluster the page. The thesis proposes extracting each page's hierarchical category information within the whole site through Web content mining and Web structure mining, and then using that hierarchical category information to classify and cluster the pages.
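
The two signals named in the abstract, a page's position in the site and the anchor text of links pointing to it, can be illustrated with a minimal Python sketch. This is not the thesis's method, which the abstract does not detail. It assumes that the directory segments of a URL path approximate a page's position in the site hierarchy, and the names `AnchorCollector`, `path_hierarchy`, `inbound_anchor_text`, and the example.com pages are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def path_hierarchy(url):
    """Directory segments of the URL path (assumed proxy for site position)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Drop a trailing file name such as "match1.html".
    return segments[:-1] if segments and "." in segments[-1] else segments


def inbound_anchor_text(pages):
    """Map each link target to the anchor texts other pages use for it."""
    inbound = defaultdict(list)
    for source_url, html in pages.items():
        collector = AnchorCollector()
        collector.feed(html)
        for href, text in collector.links:
            target = urljoin(source_url, href)  # resolve relative links
            if text and target != source_url:
                inbound[target].append(text)
    return dict(inbound)


# Hypothetical two-page site, invented for illustration only.
pages = {
    "http://example.com/index.html":
        '<a href="/sports/football/match1.html">Football match report</a>',
    "http://example.com/sports/index.html":
        '<a href="football/match1.html">Latest <b>football</b> news</a>',
}
target = "http://example.com/sports/football/match1.html"
print(path_hierarchy(target))
print(inbound_anchor_text(pages)[target])
```

In a fuller pipeline, the path segments and inbound anchor texts could be appended to a page's own text as extra features before it is passed to an existing classifier or clustering algorithm.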

