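Research and Implementation of a Domain-Oriented Web Information Search System Based on Feature Blocking (基于特征分块的面向专业领域的网络信息搜索系统的研究与实现)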


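English title: Study and Implementation of Speciality-Oriented Web Information Searching System Based on Character Blocking
Author: Xiao Yanhua (肖燕华)
Thesis level: Master's
Discipline: Control Theory and Control Engineering
Chinese keywords: 特征分块; 信息搜索; 专业兴趣度; 用户知识模型
English keywords: character blocking; information searching; interest in special domain; user's knowledge model
Degree year: 2004
Advisor: Shao Shihuang (邵世煌)
Discipline code: 081101
Degree-granting institution: Donghua University (东华大学)
Submission date: 2004-01-01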

Abstract

As industry continues to adopt information technology, more and more enterprises and institutions publish information on the Internet. The Internet holds a large volume of professional information. Obtaining domain-specific information efficiently and conveniently, while overcoming the information overload and disorientation that users currently face when searching the web, has become a new direction in web information search research.
     Domain-specific search is strongly targeted: the search objects are well defined and the resources are relatively concentrated. To some extent this overcomes the scattered, complex, and heterogeneous nature of web resources. It also makes it easier to keep pace with the growth of sites and pages and with content updates, and it makes document analysis, automatic processing, and the construction of a domain-oriented knowledge base comparatively easy to implement.
     In conjunction with the research project "Dynamic Refresh and Automatic Search and Analysis System for a Textile Enterprise Information Base", this thesis studies and implements a domain-oriented web information search system based on feature blocking. It first analyzes the development and current state of web information search, identifies the problems of existing search systems, and proposes directions for future development. Building on a study of web information search techniques, it then proposes a feature extraction and analysis method based on information blocking: a web page is divided into blocks according to the significance of the information each block conveys, and each block is mined according to its content and structure. The thesis also studies techniques for capturing professional information, proposes a vector-space-based model for acquiring domain information from the web, builds an index database centered on the content of domain-oriented websites, and implements a robot crawling algorithm and a specialized search strategy for extracting domain information. In addition, it studies and builds a user knowledge model based on the user's degree of professional interest, in order to provide personalized services, make the system more intelligent, and improve search accuracy. This work covers acquiring the degree of professional interest, building and optimizing the model, and handling user feedback.
     Finally, the thesis presents SOKEY, an application of the approach to the textile and chemical fiber domain, and describes its overall architecture and workflow. On the Windows 2000 Server platform, using Active Server Pages 3.0, VBScript, and other embedded scripting languages together with object-oriented database technology, the system implements functional modules for dynamic refresh of web page information, automatic querying of document information, segmented browsing of results, and sending and receiving of user messages. The automatic web information capture module was implemented in Visual C++ (VC).
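
As a rough illustration of the vector-space idea mentioned in the abstract, the sketch below scores a page against a domain profile using cosine similarity over term-frequency vectors. It is not the thesis's model: the profile terms and weights, the `threshold` value, and the function names `tf_vector`, `cosine`, and `is_relevant` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Build a term-frequency vector (term -> count) from a token list."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_tokens, domain_profile, threshold=0.3):
    """Treat a page as on-topic if it is close enough to the domain profile.

    The threshold is an arbitrary illustrative value, not one from the thesis.
    """
    return cosine(tf_vector(page_tokens), domain_profile) >= threshold


# Hypothetical domain profile for a textile/chemical-fiber crawler.
domain_profile = Counter({"fiber": 3, "textile": 3, "polyester": 2, "yarn": 2})

textile_page = "polyester fiber prices rise as textile mills order more yarn".split()
sports_page = "the home team won the football match last night".split()

print(is_relevant(textile_page, domain_profile))
print(is_relevant(sports_page, domain_profile))
```

A crawler built along these lines could use such a score to decide which fetched pages to index and which links to follow, though a practical system would typically add term weighting (for example TF-IDF) and proper tokenization.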

