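Research on Key Technologies of Focused Search Engines

Official English title: Research on Pivotal Technology of Focused Search Engine
Author: Yang Zhiqiu (杨治秋)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: information clustering; focused search engine; Chinese word segmentation; vector space model; topic dictionary
Degree year: 2006
Supervisor: Yuan Fuyong (原福永)
Discipline code: 081203
Degree-granting institution: Yanshan University (燕山大学)
Thesis submission date: 2006-01-01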

Abstract

With the rapid development of Internet technology, the WWW has become an enormous and indispensable space for exchanging information. Faced with this volume of information, people often lose their way when looking for what they need. Finding the required information quickly and accurately within such vast resources has become a major difficulty for users.
     To address the shortcomings of existing search engines, this thesis proposes a focused (topic-specific) search engine as a solution that meets the need for topic specialization, and studies a series of theoretical and technical problems that the solution involves, mainly the following:
     First, the development model framework of the focused search engine is improved and its working principle is described. Topic-specific search service is implemented on top of a meta search engine.
     Second, automatic text classification is an important step in developing a focused search engine. To address deficiencies in automatic text classification, the thesis concentrates on improvements to feature extraction, feature weighting, stemming, and log analysis, which support higher recall and precision in the designed focused search engine.
     Third, word segmentation is an important research aspect of a focused search engine. For segmentation, the thesis adopts a practical matching method based on data views, which is simple to implement and gives good results. A topic-specific segmentation dictionary is also constructed, which makes retrieval more convenient for users and improves efficiency.
     Finally, based on an analysis of the shortcomings of the traditional k-means clustering method, a text clustering algorithm is proposed that selects better initial cluster centers, giving text clustering a sounder starting point. Experiments show that the algorithm improves the stability of clustering and the quality of the clustering results.
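
The abstract does not state how the better initial cluster centers are chosen. Purely as an illustration, the sketch below assumes a farthest-first heuristic: the first center is the point nearest the overall mean, and each further center is the point farthest from those already chosen. The function names `farthest_first_centers` and `k_means` are hypothetical, and the input vectors stand in for weighted term vectors from a vector space model. It is a sketch under these assumptions, not the algorithm proposed in the thesis.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first_centers(points, k):
    """Pick k initial centers deterministically (assumed heuristic).

    Start with the point closest to the overall mean, then repeatedly add
    the point whose distance to its nearest chosen center is largest.
    """
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    centers = [min(points, key=lambda p: distance(p, mean))]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(distance(p, c) for c in centers))
        )
    return centers


def k_means(points, centers, max_iter=100):
    """Standard k-means iterations starting from the given centers."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


if __name__ == "__main__":
    # Toy 2-D "document vectors" forming three well-separated groups.
    docs = [
        (1.0, 0.0), (1.1, 0.1), (0.9, 0.2),
        (0.0, 1.0), (0.1, 1.1), (0.2, 0.9),
        (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
    ]
    labels, _ = k_means(docs, farthest_first_centers(docs, 3))
    print(labels)
```

Because the initial centers are chosen deterministically instead of at random, repeated runs on the same data give the same partition, which is one simple way to obtain the kind of stability the abstract refers to.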

