Research on an Intranet Network Information Supervision System Based on Information Collection and Full-Text Search

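English title: Research of Intranet Information Supervision System Based on Net Crawler and Full-text Search Engine
Author: Fu Xianghua (傅翔华)
Thesis level: Master's
Discipline: Biomedical Engineering
Chinese keywords: Intranet; Web; search engine; information collection; web crawler; XML; full-text search; relational database
English keywords: Intranet; Web; Search engine; Data collection; Web crawler/Web spider; XML; Full-text search engine; RDBMS
Degree year: 2009
Advisor: Guo Wenming (郭文明)
Discipline code: 0831
Degree-granting institution: Southern Medical University (南方医科大学)
Thesis submission date: 2009-04-01
Chair of the defense committee: Yao Guoxiang (姚国翔)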

Abstract

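With the development of computer network technology and the steady deepening of informatization, the level of network application within organizations and departments has risen continuously, and the focus of network development has shifted from the Internet application services of the early build-out period to the expansion of internal Intranet applications. Organizations have generally built a number of networked application systems on their Intranets around their own business work. In these systems the Web has become the mainstream development platform, and as dynamic scripting and database technologies became the mainstream techniques for Web application development, the information publishing capability of the Web environment grew considerably and interactive websites carrying all kinds of information sprang up. With this shift in focus and the adoption of new technologies, the network application level and publishing capability of each organization reached a new level. What followed was an explosive growth of network information in the Intranet environment, and how to supervise and manage this information effectively has become a new problem for network management departments.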
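     Unlike public information on the Internet, information in Intranet applications is closely tied to the work, business, and daily life of the organization itself. As the network, a new medium, plays an increasingly important role in daily life, the importance and influence of this information keep growing. Supervising it effectively has therefore become an urgent problem for network administrators, while the sheer volume of the information and the diversity of its forms add to the technical difficulty of solving it.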
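     In response, this thesis proposes a method for building an Intranet network information supervision system based on information collection and full-text search. It uses computer technology to collect Web information within the Intranet effectively and to perform a first-pass screening of it, giving network administrators a feasible way to supervise information on the Intranet. Using the web crawler (Web Crawler, Web Spider) technology of current search engines, the system's data collection module can collect and organize Web information within the Intranet in a relatively short time; the database's full-text search technology is then used to perform preliminary retrieval and screening of the large volume of collected data. During development, the crawler technology was adapted to the characteristics of Intranet information: a "site-by-site search" approach and configurable "search rules" were adopted to improve the accuracy and efficiency of collection. The system provides a B/S (browser/server) user interface and serves users in the manner of a search engine. On the one hand, this gives Intranet users a practical and convenient network search service. On the other hand, widening the system's use improves its ability to identify sensitive information: the historical keywords produced by users are recorded and analyzed and, combined with the relevant parameter settings of the full-text search engine in the SQL Server database, are used to further improve the scope and degree of the system's coverage of sensitive information.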
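     The thesis first briefly analyzes the situation and difficulties currently facing Intranet information management and summarizes the characteristics of network information in the Intranet environment. On that basis it proposes a technical approach that uses software to supervise network information effectively, offers solutions to several technical difficulties in building the system, and briefly describes the system's software structure and implementation. Finally, it summarizes the goals the system has achieved so far, its remaining problems, and the aspects that need improvement.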
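     The "site-by-site search" and "search rules" ideas named in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual algorithm, so everything below is an assumption made for illustration: the in-memory page table PAGES, the host suffix .corp.local, the exclusion patterns, and the choice to start each newly discovered site from its root URL are all hypothetical. The sketch finishes crawling one site before moving to the next and drops links that violate the rules.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical stand-in for the Intranet: page URL -> hrefs found on that page.
PAGES = {
    "http://news.corp.local/": ["/a.html", "http://bbs.corp.local/", "/login.aspx"],
    "http://news.corp.local/a.html": ["/", "b.pdf"],
    "http://bbs.corp.local/": ["/t1.html", "http://www.example.com/"],
    "http://bbs.corp.local/t1.html": [],
}

# Hypothetical "search rules": stay inside the Intranet, skip login pages and binaries.
ALLOWED_HOST_SUFFIX = ".corp.local"
EXCLUDE_PATTERNS = [re.compile(r"login", re.I), re.compile(r"\.(pdf|zip|exe)$", re.I)]


def fetch_links_stub(url):
    """Return the hrefs of a page; a real crawler would download and parse HTML."""
    return PAGES.get(url, [])


def allowed(url):
    """Apply the search rules to an absolute URL."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    if not (parts.hostname or "").endswith(ALLOWED_HOST_SUFFIX):
        return False
    return not any(p.search(parts.path) for p in EXCLUDE_PATTERNS)


def crawl_site_by_site(seed, fetch_links):
    """Crawl one site to completion (breadth-first) before starting the next."""
    seen_sites = {urlparse(seed).hostname}
    seen_urls = set()
    site_queue = deque([seed])   # sites waiting to be crawled
    visited = []                 # pages in the order they were collected
    while site_queue:
        site_root = site_queue.popleft()
        site_host = urlparse(site_root).hostname
        seen_urls.add(site_root)
        page_queue = deque([site_root])  # pages of the current site only
        while page_queue:
            url = page_queue.popleft()
            visited.append(url)
            for href in fetch_links(url):
                link = urljoin(url, href)  # resolve relative links
                if not allowed(link) or link in seen_urls:
                    continue
                parts = urlparse(link)
                if parts.hostname == site_host:
                    seen_urls.add(link)
                    page_queue.append(link)
                elif parts.hostname not in seen_sites:
                    # Defer the other site; queue its root URL for later.
                    seen_sites.add(parts.hostname)
                    site_queue.append(f"{parts.scheme}://{parts.netloc}/")
    return visited


if __name__ == "__main__":
    for page in crawl_site_by_site("http://news.corp.local/", fetch_links_stub):
        print(page)
```

     Keeping a separate queue per site means that a slow or very large site delays only its own turn, and the rules are checked before a link is queued, so excluded pages are never fetched.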
With the development of network technology and the spread of information-based applications, the level of network and information application has risen steadily. The focus of network development and construction has shifted from the Internet to the Intranet. Most organizations and departments have built a variety of network application systems on their Intranets. New technologies such as dynamic server-side scripting and databases are widely used in Web application development. As a result, Intranet information grows rapidly as the focus of construction shifts and new technologies are applied. How to supervise network information efficiently, especially on the Intranet, has become a challenge that network administrators must face.
     Information on an Intranet differs from information on the Internet: it plays an important role within the organization and has a more significant influence on organizations and departments. Efficient supervision therefore deserves the attention of network management. However, the characteristics of Intranet information make it technically difficult to manage.
     To address these issues, a software method is introduced that builds an Intranet information supervision system on data collection and a full-text search engine. With this system, network administrators gain the ability to collect information and filter data quickly, which helps them supervise Web information on the Intranet. Development adopts several established software technologies, such as the web crawler, which is widely used in Web search engines, and a full-text search engine based on an RDBMS. In addition, considering the characteristics of Intranet information, the researchers take further technical measures to make the system work more efficiently, such as a "site-by-site search mode" and "restriction search rules". After the data collection module completes a search task on the Intranet, the RDBMS-based full-text search engine is used to process the data, for example by merging and filtering, and to extract valuable information. The system also includes a Web module that combines the Web with the full-text search engine and the RDBMS and provides an easy-to-use browser-based interface, giving users convenient access to the system. Users can search the system for any keyword they are interested in. Meanwhile, analyzing the keyword log, which records every keyword users have searched for, helps reveal what users are most interested in. As a result, network administrators can further improve the supervising ability of the system.
     The paper first analyzes the difficult situation that network administrators confront. It then summarizes the characteristics of information on the Intranet. Next, it presents a software solution for supervising Intranet information, along with a description of the system's software architecture and implementation. Finally, it summarizes the goals the system has achieved, its shortcomings, and possible improvements.
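     The keyword-log analysis mentioned in the abstract can be pictured with a short sketch. The abstract does not say how the log is stored or analyzed, so this is only an assumption: the function name suggest_watch_terms, the min_count threshold, and the sample data are hypothetical. The sketch counts how often each keyword was searched and lists the frequent ones that are not yet on the administrator's watch list.

```python
from collections import Counter


def suggest_watch_terms(query_log, watch_list, min_count=2):
    """Return (keyword, count) pairs searched at least min_count times
    that are not already on the watch list, most frequent first."""
    # Normalize case and surrounding whitespace; ignore empty queries.
    counts = Counter(q.strip().lower() for q in query_log if q.strip())
    known = {w.strip().lower() for w in watch_list}
    return [
        (term, n)
        for term, n in counts.most_common()
        if n >= min_count and term not in known
    ]


if __name__ == "__main__":
    # Hypothetical search history and watch list.
    log = ["salary", "Salary ", "layoff", "layoff", "layoff",
           "menu", "", "merger", "merger"]
    print(suggest_watch_terms(log, watch_list=["salary"]))
```

     A reviewer would still decide which of the suggested terms are actually sensitive; the sketch only ranks candidates.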
