Design and Implementation of an Automatic Chinese Web Page Collection and Classification System
Abstract
With the rapid development of science and technology, we have entered the age of digital information. The Internet, the world's largest repository of information, has become the primary means by which people obtain it. Because online information resources are massive, dynamic, heterogeneous, and semi-structured, and lack unified organization and management, finding the needed information quickly and accurately from this mass of resources has become a pressing problem for network users. Web-based information collection and classification has therefore become a research hotspot.
     The goal of traditional Web information collection is to gather as many pages as possible, even the entire Web, paying little attention to the order of collection or to the topics of the pages collected. The result is a cluttered collection in which a large fraction of the pages see little use, wasting system and network resources. Effective collection methods are therefore needed to reduce clutter and duplication in the collected pages. At the same time, automatically classifying the collected pages is necessary for building faster and more effective search engines. Web page classification is an effective means of organizing and managing information: it goes a long way toward resolving information clutter and helps users locate exactly what they need. The traditional mode of operation is manual classification, but with the explosive growth of information on the Internet, handling it manually alone is unrealistic. Automatic web page classification is therefore not only a method of considerable practical value, but also an effective means of organizing and managing data, and it is an important subject of this thesis.
     This thesis first introduces the background, purpose, and domestic and international research status of the topic, then describes the theories, techniques, and algorithms relevant to web page collection and classification, including web crawling, duplicate page detection, information extraction, Chinese word segmentation, feature extraction, and web page classification. After a comprehensive comparison of several typical algorithms, the thesis adopts a topical (focused) crawler for collection and the KNN method, which performs well for classification, combined with supporting techniques for deduplication, word segmentation, and feature extraction. After analyzing the structure and characteristics of Chinese web pages, it proposes a design and implementation for collecting and classifying Chinese web pages, which is then realized in a programming language. Finally, the system is tested; the test results meet the design requirements, and the application effect is significant.
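The abstract does not specify the implementation language or the crawler's internals. As a minimal illustrative sketch in Python, a topical (focused) crawler can be modeled as best-first search over a priority queue ordered by topic relevance, pruning branches below a relevance threshold. The in-memory page graph, topic keyword set, and threshold below are hypothetical stand-ins for real fetching and scoring:

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch and parse pages over HTTP instead.
PAGES = {
    "seed": ("chinese web page classification survey", ["a", "b"]),
    "a": ("sports news football", ["c"]),
    "b": ("web page classification with knn", ["c", "d"]),
    "c": ("cooking recipes", []),
    "d": ("feature extraction for chinese text classification", []),
}

TOPIC = {"web", "page", "classification", "chinese"}

def relevance(text):
    """Score a page as the fraction of topic keywords it contains."""
    return len(set(text.split()) & TOPIC) / len(TOPIC)

def topical_crawl(seed, threshold=0.25):
    """Visit pages in best-first order, expanding only on-topic pages."""
    visited, results = set(), []
    frontier = [(-relevance(PAGES[seed][0]), seed)]  # max-heap via negation
    while frontier:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        score = -neg_score
        if score < threshold:
            continue  # prune off-topic branches instead of expanding them
        results.append((url, score))
        for link in PAGES[url][1]:
            if link not in visited:
                heapq.heappush(frontier, (-relevance(PAGES[link][0]), link))
    return results
```

In this toy graph the crawler visits the seed, then the on-topic pages "b" and "d", and never expands the off-topic "a" and "c", which is the behavior that distinguishes a topical crawler from exhaustive collection.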
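The classification stage described above pairs Chinese word segmentation and feature extraction with KNN. A minimal sketch, assuming pages have already been segmented into tokens (real Chinese text would pass through a word segmenter first), with TF-IDF weighting and cosine similarity as the distance measure; the tiny training corpus and its labels are invented for illustration:

```python
import math
from collections import Counter

# Tiny labeled corpus; whitespace-separated tokens stand in for the
# output of a Chinese word segmenter.
TRAIN = [
    ("sports", "football match score team"),
    ("sports", "basketball team player score"),
    ("tech", "computer network web page"),
    ("tech", "web crawler page classification"),
]

def build_idf(docs):
    """Smoothed inverse document frequency over the training corpus."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc.split()))
    return {t: math.log((n + 1) / (df[t] + 1)) + 1 for t in df}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector; unseen tokens get zero weight."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(text, k=3):
    """Label a document by majority vote of its k nearest neighbors."""
    idf = build_idf([d for _, d in TRAIN])
    q = tfidf(text.split(), idf)
    scored = sorted(
        ((cosine(q, tfidf(d.split(), idf)), label) for label, d in TRAIN),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

For example, `knn_classify("web page network")` returns "tech": the two tech documents share its tokens and dominate the k = 3 vote. In the real system, feature extraction would also reduce the vocabulary (e.g. by document frequency or chi-square) before the vectors are built.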
