Research on Information Acquisition Technology Based on Meta-Search and Content Clustering

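English title: A Study on the Technique of Information Acquirement Based on Meta-Search and Content Clustering
Author: 翁勍力
Thesis level: Master's
Discipline: Information Science
Chinese keywords: 情报获取 (information acquisition); 元搜索引擎 (meta-search engine); 聚类挖掘 (cluster mining); OLAM; 多维分析 (multidimensional analysis)
English keywords: Information Acquirement; Meta-search Engine; Clustering; OLAM; Multidimensional Analysis
Degree year: 2007
Advisor: 赵捧未
Discipline code: 120502
Degree-granting institution: Xidian University (西安电子科技大学)
Thesis submission date: 2007-01-01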

Abstract

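Web information has become a primary source of intelligence, and search engines are one of the main ways of obtaining it. However, information gathered through search engines still suffers from many problems: for example, the volume retrieved is large but little of it is useful, and the results are diverse but users cannot identify groups of related information. Acquiring useful information resources has gradually become a bottleneck for the development of the information industry. How to discard useless information from massive result sets, locate the relevant information groups quickly, and thereby acquire intelligence resources quickly and efficiently, process and organize them, and deliver them to information users is therefore a major challenge for information professionals and a problem in urgent need of a solution.
This thesis takes improving the efficiency and quality of information acquisition as its main goal, and studies and implements an information acquisition system based on meta-search and content clustering. The main contributions are: (1) An overall framework for the information acquisition system is designed, comprising three functional modules (search module, operation module, and user module), and the functional workflow of each module is described. (2) A method based on analyzing web page titles and summaries is proposed for judging the relevance of meta-search engine results. Experimental results show that the average precision of the meta-search engine's results is considerably higher than the average precision of each member engine's results. (3) Combining two of the main current clustering algorithms, K-means partitioning and BIRCH, a method is proposed for clustering on top of processed meta-search results. Experiments show that the method noticeably improves clustering quality and greatly improves efficiency. (4) For the design and implementation of the information acquisition system, design schemes are presented for the database system, the software system, and the human-machine interface. The system implements information retrieval based on web page title and summary analysis, cluster analysis based on meta-search results and the combined K-means and BIRCH algorithm, and multidimensional analysis based on OLAM.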
The Web has become the main resource for acquiring information, and the search engine is the main tool for doing so. However, the information acquired is still unsatisfactory: users cannot pick out the useful information from enormous volumes of unstructured search results. Users want to obtain good information efficiently, overcome information overload, and harness the true power of information.
In this thesis, to improve the efficiency and quality of information acquisition, we study and develop an information acquisition system based on meta-search and content clustering techniques, with the following contributions: (1) We propose an overall framework with three modules (search module, operation module, and user module) and describe their workflows. (2) We propose a method that analyzes the titles and abstracts of web pages to judge the relevance of search results. Experiments show an improvement in average precision compared with the member search engines. (3) We put forward a clustering method based on meta-search results and two clustering algorithms, K-means and BIRCH. Experimental evaluation shows improvements in both clustering results and efficiency. (4) For system design and implementation, we present the database system, software system, and human-machine interface. The system provides three functions: information retrieval based on title and abstract analysis, clustering based on meta-search and the two clustering algorithms (K-means and BIRCH), and multidimensional analysis based on OLAM.
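As an illustration of contribution (2), the following is a minimal sketch of how results from several member engines might be merged and filtered using only each hit's title and abstract. It is not the thesis's actual algorithm, which the abstract does not specify: the `Hit` record, the title and snippet weights, the substring-based term matching, and the `threshold` value are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One result returned by a member search engine (hypothetical record)."""
    engine: str
    url: str
    title: str
    snippet: str


# Assumed weights: a query term found in the title counts twice as much
# as one found in the snippet. These values are illustrative only.
TITLE_WEIGHT = 2.0
SNIPPET_WEIGHT = 1.0


def term_fraction(terms, text):
    """Fraction of query terms that occur (as substrings) in the text."""
    if not terms:
        return 0.0
    text = text.lower()
    return sum(1 for term in terms if term in text) / len(terms)


def relevance(query, hit):
    """Weighted title/snippet match score, normalized to the range [0, 1]."""
    terms = query.lower().split()
    score = (TITLE_WEIGHT * term_fraction(terms, hit.title)
             + SNIPPET_WEIGHT * term_fraction(terms, hit.snippet))
    return score / (TITLE_WEIGHT + SNIPPET_WEIGHT)


def merge_results(query, hits, threshold=0.3):
    """Drop low-scoring hits, deduplicate by URL, and rank by score."""
    best = {}
    for hit in hits:
        score = relevance(query, hit)
        if score < threshold:
            continue  # judged irrelevant from title and snippet alone
        # Keep the best-scoring copy when several engines return the same URL.
        if hit.url not in best or score > best[hit.url][0]:
            best[hit.url] = (score, hit)
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [(hit.url, round(score, 3)) for score, hit in ranked]


hits = [
    Hit("A", "http://x/1", "Meta search engine design", "A survey of meta search"),
    Hit("B", "http://x/1", "Design notes", "meta search"),
    Hit("B", "http://x/2", "Cooking recipes", "pasta"),
    Hit("A", "http://x/3", "Search tips", "none"),
]
merged = merge_results("meta search", hits)
```

Substring matching is used here so that the sketch needs no tokenizer; a real system handling Chinese queries would likely need word segmentation and a more careful similarity measure.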
