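Study on Subject Search Technology of Web-oriented Text Mining

Original title (Chinese): 面向Web文本挖掘的主题搜索技术研究
Author: Duan Ping (段平)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Web mining; topic search; web crawler; Chinese word segmentation
Degree year: 2008
Advisor: Liu Zhijing (刘志镜)
Discipline code: 081203
Degree-granting institution: Xidian University (西安电子科技大学)
Thesis submission date: 2008-01-01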

Abstract

With the rapid growth of the Internet, the vast body of Web data has become an important source of knowledge and information. Because Web resources are semi-structured, scattered, time-sensitive, and heterogeneous, users find it difficult to retrieve genuinely valuable information from the Web quickly and accurately. The main way to obtain Web information is through search engines, but today's popular general-purpose search engines offer little support for information structure extraction, classification and filtering of Web text content, or document understanding. How to design search engine technology that is better suited to mining Web resources efficiently has therefore become an active research topic.
     This thesis studies topic search engines oriented to Web text mining, together with the design of such a system. It discusses the core techniques behind current Web mining and search engines, and it designs and implements Label3, a prototype system for topic-specific Web information mining and search. The main work is as follows:
     Topic crawler technology: earlier crawling strategies are improved, a crawler search strategy based on a non-greedy genetic algorithm is proposed, and the algorithms are compared through data analysis and performance evaluation.
     Language filtering and Chinese word segmentation: considering the differences between Latin-script languages and Chinese, the thesis discusses segmentation algorithms for each and, to address the particular characteristics of Chinese, proposes a dictionary-based "word cell" (词元) segmentation algorithm.
     Web data mining algorithms: the collected Web data are clustered and classified to uncover the relationships within the data and to extract the category information of each text, so that users receive better information services.
     Data indexing and retrieval: the indexing mechanism builds the data index with a distinctive inverted-index strategy and refines the text information obtained. The query and retrieval service searches by Web page category, so that users get more precise results.
     Building on these results, the thesis describes the design and implementation details of the prototype system.
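
The abstract does not give the details of the "word cell" segmentation algorithm or of the index layout used in Label3. As a rough illustration of how dictionary-based segmentation can feed an inverted index, the following minimal sketch uses forward maximum matching, a common dictionary-based technique that stands in here for the thesis's own algorithm. The toy dictionary, the sample documents, and the names `segment`, `build_inverted_index`, and `search` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy dictionary; a real system would load a large lexicon.
DICTIONARY = {"网络", "爬虫", "搜索", "引擎", "搜索引擎", "中文", "分词", "主题"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def segment(text, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def build_inverted_index(docs):
    """Map each token to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in segment(text):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def search(index, query):
    """Return ids of documents containing every token of the query (AND semantics)."""
    token_hits = [set(index.get(token, ())) for token in segment(query)]
    if not token_hits:
        return []
    return sorted(set.intersection(*token_hits))


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "主题搜索引擎",
    2: "网络爬虫",
    3: "中文分词搜索",
}
index = build_inverted_index(docs)
```

In this sketch, `search(index, "网络爬虫")` matches document 2 only. Because maximum matching indexes document 1 under the longer token 搜索引擎, a query for 搜索 alone does not reach it, which is one reason the choice of segmentation unit matters for retrieval quality.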

