Research and Realization of Page Clustering System Based on Web (基于Web的网页聚类系统的研究与实现)

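English title: Research and Realization of Page Clustering System Based on Web
Author: Wang Huifen (王会芬)
Degree level: Master's
Discipline: Computer Technology
Chinese keywords: 信息检索 ; 文本挖掘 ; web挖掘 ; 聚类
English keywords: information retrieval ; text mining ; web mining ; clustering
Degree year: 2005
Advisors: Zhang Xinrong (张新荣) ; Yu Tiebing (于铁兵)
Discipline code: 081203
Degree-granting institution: Tianjin University
Thesis submission date: 2005-06-01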

Abstract

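In recent years, as the Internet has grown, the volume of information has increased rapidly, and finding the needed information quickly and effectively in this vast sea of information has become a persistent difficulty for Internet users. Existing search engines partly solve the resource discovery problem when users browse Web pages, but their precision is low, they cannot provide structured information, and they offer no document classification or filtering. Text is one of the main forms of information resources, so there is an urgent need for tools that can quickly and effectively discover resources and knowledge in large collections of Web text.
    Through an in-depth study of cluster analysis in data mining, this thesis proposes an intelligent Web page clustering system. With a clustering algorithm at its core, the system automatically groups Web pages with similar content and presents the result in the user interface. The algorithm represents Web page documents with the vector space model, then applies a fuzzy clustering algorithm to find sets of highly similar documents and form an initial division into categories. The evaluation of this "coarse result" is fed back into the fuzzy clustering algorithm, which repeatedly divides the coarsely similar document sets into clusters, so that similarity within a cluster keeps increasing while similarity between clusters keeps decreasing, until a reasonable grouping of like with like is reached.
    Using hierarchical clustering as the basic mining tool, the system largely achieves online, interactive, semantic, and hierarchical clustering of search engine results, and thereby largely resolves the information overload that users encounter during retrieval.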
With the rapid development of networks and the spread of information, Internet users find it difficult to acquire useful information quickly and efficiently in such a sea of information. With existing search engines, users can roughly find what they want on the Internet. However, the resources obtained this way do not fit users' needs exactly, and functions such as structured information, text classification, and filtering are not offered. Because documents are the main form of information resources, tools that let people extract knowledge quickly and efficiently from Web documents are required.
    Based on in-depth research on clustering analysis in the field of data mining, this paper presents a Web clustering system based on agent technology, the core of which is a clustering algorithm. It clusters similar Web pages automatically and finally submits the results to the user interface. The algorithm first applies the vector space model to represent Web documents. A fuzzy clustering algorithm then mines documents of high similarity, divides them into rough clusters, and feeds the evaluation of the rough results back into the fuzzy algorithm, continuously partitioning these roughly similar documents into several clusters so as to increase the similarity of documents within a cluster and reduce it between different clusters. Finally, things of one kind come together.
    With hierarchical agglomerative clustering as the mining tool, search results can be clustered in an online, interactive, textual, and hierarchical manner, so that the difficulties arising from searching can be tackled.
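
The abstract names the building blocks of the clustering step (a vector space model followed by fuzzy clustering) but gives no details. As a rough illustration only, the sketch below assumes smoothed TF-IDF weighting for the vector space model and standard fuzzy c-means (fuzzifier m = 2, Euclidean distance on unit-length vectors) for the fuzzy clustering step. The thesis does not state which weighting or fuzzy variant it uses, and the function names, the choice of initial centers, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenized documents to unit-length TF-IDF vectors (vector space model)."""
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF: an assumed weighting, not one taken from the thesis.
        vec = [tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec)) or 1.0
        vectors.append([w / norm for w in vec])
    return vectors


def memberships(vectors, centers, m=2.0):
    """Fuzzy membership of every vector in every cluster; each row sums to 1."""
    power = 2.0 / (m - 1.0)
    rows = []
    for x in vectors:
        dists = [math.dist(x, c) for c in centers]
        zeros = [j for j, d in enumerate(dists) if d == 0.0]
        if zeros:
            # The vector coincides with one or more centers: share membership among them.
            rows.append([1.0 / len(zeros) if j in zeros else 0.0
                         for j in range(len(centers))])
        else:
            rows.append([1.0 / sum((dj / dk) ** power for dk in dists)
                         for dj in dists])
    return rows


def fuzzy_c_means(vectors, init_idx, m=2.0, max_iter=50, tol=1e-6):
    """Standard fuzzy c-means, with centers initialized at the given documents."""
    dim = len(vectors[0])
    centers = [list(vectors[i]) for i in init_idx]
    u = memberships(vectors, centers, m)
    for _ in range(max_iter):
        new_centers = []
        for j in range(len(centers)):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            new_centers.append([
                sum(w * x[k] for w, x in zip(weights, vectors)) / total
                for k in range(dim)
            ])
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        u = memberships(vectors, centers, m)
        if shift < tol:  # centers have stopped moving
            break
    return u, centers


if __name__ == "__main__":
    # Hypothetical, already tokenized "web pages".
    pages = [
        ["web", "search", "engine", "ranking"],
        ["search", "engine", "query", "web"],
        ["football", "match", "goal"],
        ["football", "league", "goal", "match"],
    ]
    u, _ = fuzzy_c_means(tfidf_vectors(pages), init_idx=[0, 2])
    for page, row in zip(pages, u):
        print(page, [round(v, 3) for v in row])
```

Unlike a hard partition, each page receives a degree of membership in every cluster, which is what allows a coarse first grouping to be refined over successive passes in the way the abstract describes.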
