Study on Implementation of Enterprise Search Engine Based on Index Cloud (基于索引云的企业搜索引擎实现研究)

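English title: Study on Implementation of Enterprise Search Engine Based on Index Cloud
Author: Chen Xuyi (陈旭毅)
Degree level: Doctoral
Discipline: Management Science and Engineering
Chinese keywords: 企业搜索引擎 ; 索引云 ; 索引组织策略 ; 索引树结构 ; 任务调度设计
English keywords: Enterprise Search Engine ; Index Cloud ; Index Organization Strategy ; Tree-Index Structure ; Task Scheduling Design
Degree year: 2011
Advisor: Zhou Ning (周宁)
Discipline code: 1201
Degree-granting institution: Wuhan University
Submission date: 2011-11-01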

Abstract

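With the advance of enterprise informatization, the data resources held inside enterprises are growing rapidly. Enterprises place higher demands on information management and resource access, so building an internal enterprise search engine is both necessary and in line with the direction of enterprise information resource management. One of the key technologies in implementing an enterprise search engine is constructing index structures over the enterprise's various information resources; how the index is structured largely determines the engine's retrieval performance. In designing an internal enterprise search engine, this dissertation adapts traditional index structures and introduces cloud computing ideas into the index system, proposing a new index framework, the Index Cloud model, and on that basis a new enterprise search engine architecture.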
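     The dissertation first describes the concept and classification of search engines, examines their working principles and technologies, and reviews their development. It then describes the concept and classification of cloud computing and examines its technologies and implementation.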
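     The dissertation studies index organization in detail. It explains the concept of an index and how index files are organized, and discusses several commonly used index organizations: the B-tree, B+ tree, R-tree and R*-tree. It also introduces ways of composing index entries, including the forward index, inverted index, suffix array and signature file. Building on search engine and cloud computing theory and on index theory, it proposes the Index Cloud model. The model is designed around three basic principles (classified data storage, distributed computing and parallel processing) and is characterized by a high degree of virtualization, high performance, high reliability, strong security, good scalability and generality, making it better suited to the needs of enterprise search engines.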
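To make the difference between a forward index and an inverted index concrete, the following is a minimal Python sketch. It is an illustration only and is not taken from the dissertation: the function names, the whitespace tokenizer and the toy corpus are all assumptions.

```python
from collections import defaultdict


def build_forward_index(docs):
    """Forward index: map each document id to the list of terms it contains."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def build_inverted_index(forward_index):
    """Inverted index: map each term to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(inverted_index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(inverted_index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(inverted_index.get(term, []))
    return sorted(result)


# Hypothetical toy corpus, used only for illustration.
docs = {
    1: "index cloud for enterprise search",
    2: "enterprise resource planning",
    3: "cloud computing and search engine index",
}
forward = build_forward_index(docs)
inverted = build_inverted_index(forward)
```

The forward index answers "which terms occur in this document", while the inverted index answers "which documents contain this term", which is the lookup a query needs.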
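     The dissertation studies the Index Cloud model in depth, giving its definition, principles and basic characteristics. To address the retrieval performance and scalability problems of index organization strategies in search engines, it compares the basic strategies and adopts a hybrid distributed index organization strategy in the Index Cloud system. For the Index Cloud data structure it adopts DPIC (Distributed & Paralleling Index Cloud), a framework built on DicB+Tree, a new index tree structure that combines the B+ tree with a dictionary-ordered data structure. Based on DPIC, it designs the core management strategy of the Index Cloud so that system resources are used as fully as possible. It presents the internal processing architecture of the Index Cloud and methods for organizing, distributing, backing up, adjusting and rebuilding index data. It also describes in detail how data retrieval tasks in the Index Cloud are analyzed and scheduled in a distributed manner.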
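The abstract does not describe the internals of DicB+Tree. As a rough illustration of why dictionary-ordered keys are useful in an index, the Python sketch below keeps terms in a sorted list, which stands in for the ordered leaf level of a B+ tree, and answers exact and prefix lookups by binary search. The class `SortedTermDictionary` and its methods are hypothetical and should not be read as the dissertation's data structure.

```python
import bisect


class SortedTermDictionary:
    """Terms kept in lexicographic order, similar to the leaf level of a B+ tree."""

    def __init__(self):
        self._terms = []     # sorted list of distinct terms
        self._postings = {}  # term -> sorted list of document ids

    def add(self, term, doc_id):
        """Record that doc_id contains term, keeping both lists sorted."""
        if term not in self._postings:
            bisect.insort(self._terms, term)
            self._postings[term] = []
        ids = self._postings[term]
        pos = bisect.bisect_left(ids, doc_id)
        if pos == len(ids) or ids[pos] != doc_id:
            ids.insert(pos, doc_id)

    def lookup(self, term):
        """Exact-match lookup of a term's posting list."""
        return list(self._postings.get(term, []))

    def prefix_terms(self, prefix):
        """All terms starting with prefix: one binary search, then an ordered scan."""
        start = bisect.bisect_left(self._terms, prefix)
        matches = []
        for term in self._terms[start:]:
            if not term.startswith(prefix):
                break
            matches.append(term)
        return matches
```

Because the keys are ordered, a prefix or range query touches only a contiguous run of terms. A real B+ tree adds paged internal nodes so that this run can be located with a few disk reads.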
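     The dissertation systematically reviews the characteristics of enterprise search engines and the state of research on enterprise search technology, and analyzes how enterprise search differs from traditional web retrieval in retrieval needs, retrieval methods, retrieval objects and security. Enterprise search systems therefore need to be studied comprehensively in terms of system architecture, index organization strategy, information retrieval algorithms and task scheduling algorithms, and the dissertation proposes combining enterprise search engines with cloud computing.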
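     The dissertation then proposes an enterprise search engine architecture based on the Index Cloud. It describes the three components of the enterprise search engine (a general storage platform, a general service platform and a general application platform) and explains how each is implemented. With a relatively low hardware investment, the architecture addresses index file growth, network bandwidth bottlenecks and disk I/O bottlenecks in full-text search systems, and provides efficient data storage and parallel computing services. A distributed task scheduling design is developed for this architecture. It takes into account both the task load of each index node and index term frequency in order to optimize task allocation, avoid system hot spots, and improve the query speed and reliability of the index system.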
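The abstract names the two inputs to the scheduler (node load and index term frequency) but not the algorithm. The Python sketch below shows one simple way those inputs could be combined. It is an assumption, not the dissertation's method: task cost is estimated as the summed frequency of the query terms, and each task goes greedily to the least-loaded node that holds all of its terms. The names `estimate_cost` and `schedule`, and the fallback to all nodes when no single node holds every term, are hypothetical simplifications.

```python
def estimate_cost(query_terms, term_frequency):
    """Estimate a task's cost as the total frequency of its query terms."""
    return sum(term_frequency.get(term, 0) for term in query_terms)


def schedule(tasks, replicas, term_frequency):
    """Greedy assignment: heaviest task first, each to the least-loaded eligible node.

    tasks: dict task_id -> list of query terms
    replicas: dict term -> set of node names holding that term's index data
    Returns (assignment, load), where load maps node -> accumulated cost.
    """
    load = {node: 0 for nodes in replicas.values() for node in nodes}
    if not load:
        raise ValueError("no index nodes available")
    assignment = {}
    # Place expensive tasks first so that cheap ones can fill the gaps.
    ordered = sorted(tasks, key=lambda t: (-estimate_cost(tasks[t], term_frequency), t))
    for task_id in ordered:
        terms = tasks[task_id]
        candidate_sets = [replicas.get(term, set()) for term in terms]
        eligible = set.intersection(*candidate_sets) if candidate_sets else set()
        if not eligible:
            # Simplification: a real system would split the query across nodes.
            eligible = set(load)
        node = min(eligible, key=lambda n: (load[n], n))
        assignment[task_id] = node
        load[node] += estimate_cost(terms, term_frequency)
    return assignment, load
```

Weighting by term frequency keeps queries on very common terms, which have long posting lists, from piling up on one node, which is the hot-spot problem the abstract mentions.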
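     Using the open-source distributed framework Hadoop and the open-source search engine Lucene, the dissertation builds an enterprise search engine system on an Index Cloud prototype and validates its performance experimentally. It discusses in detail how each part of the experimental system is constructed and evaluates the prototype on three measures: response time, throughput and load balance. The results demonstrate its feasibility and good practical effect.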
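The abstract does not define the three evaluation measures. The Python sketch below shows one common way they might be computed from raw measurements. In particular, using the coefficient of variation of per-node load as the load-balance measure is an assumption, and the function names are hypothetical.

```python
import statistics


def mean_response_time(latencies_s):
    """Average query latency in seconds."""
    return statistics.fmean(latencies_s)


def throughput(num_queries, wall_clock_s):
    """Queries completed per second over a measurement window."""
    return num_queries / wall_clock_s


def load_imbalance(per_node_load):
    """Coefficient of variation of per-node load; 0.0 means perfectly even."""
    mean = statistics.fmean(per_node_load)
    if mean == 0:
        return 0.0
    return statistics.pstdev(per_node_load) / mean
```

A lower `load_imbalance` value indicates that work is spread more evenly across index nodes.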
With the rapid development of enterprise informatization, the internal data resources of enterprises are growing rapidly. Enterprises therefore place higher demands on information management and resource access, which in turn raises the demand for enterprise search engines. The index structure is one of the core technologies of a search engine and directly affects the performance of the whole engine. For the design of an enterprise search engine, this dissertation applies the ideas of cloud computing to the index system and presents a novel index framework: the Index Cloud. It also presents a new enterprise search engine architecture based on the design of the Index Cloud.
     The dissertation first presents the concept and classification of search engines, studies their principles and technologies, and reviews their development. It then presents the concept and classification of cloud computing and studies its core technologies.
     The dissertation then studies index organization in detail. It explains the concept of an index and how index files are organized, and discusses several commonly used index organizations: the B-tree, B+ tree, R-tree and R*-tree. Methods of composing index entries, such as the forward index, inverted index, suffix array and signature file, are also discussed. Building on search engine and cloud computing theory and on index theory, it proposes the Index Cloud model, which is designed around three basic principles: classified data storage, distributed computing and parallel processing. The model offers a high degree of virtualization, high performance, high reliability, strong security, scalability and generality, and is better suited to enterprise search engine requirements.
     The dissertation studies the Index Cloud model in depth, giving the definition, principles and basic characteristics of the Index Cloud. To address the retrieval performance and scalability problems of index organization strategies in search engines, it compares the basic index organization strategies and adopts a hybrid distributed index organization strategy in the Index Cloud system. For the Index Cloud data structure, it uses a new B+ tree-based dictionary index tree (DicB+Tree) to form DPIC (Distributed & Paralleling Index Cloud) and, based on DPIC, designs the core management strategy of the Index Cloud so that system resources are fully utilized. It presents the internal processing architecture of the Index Cloud, the distributed parallel index tree structure, and methods for distributing, replicating, migrating and rebuilding index data. In addition, it describes how data retrieval tasks in the Index Cloud are analyzed and scheduled in a distributed manner.
     The dissertation then systematically reviews the concept and features of enterprise search engines and the state of research on enterprise search technology, and analyzes how enterprise search differs from traditional web search in retrieval needs, retrieval methods, retrieval objects and security. Enterprise search systems therefore need to be studied comprehensively in terms of system architecture, index organization strategy, information retrieval algorithms and scheduling algorithms, and the dissertation proposes combining enterprise search engines with cloud computing.
     The design of the Index Cloud model is based on three fundamentals: classified data storage, distributed computing and parallel processing. It is characterized by virtualization, high performance, high reliability, strong security, easy extensibility and generality; hence it is better suited to the requirements of enterprise search engines.
     An enterprise search engine architecture based on the Index Cloud is then put forward. The new architecture not only resolves problems that exist in full-text search systems, such as index data inflation, network bandwidth bottlenecks and disk I/O bottlenecks, but also provides efficient data storage and parallel computing services. A distributed task scheduling model is established for the architecture. It takes the task load level of each index node and the index term frequency into account in order to optimize task allocation, avoid hot spots and ultimately improve system performance.
     Finally, a prototype Index Cloud system based on Hadoop and Lucene is constructed as a platform for validating system performance. Extensive simulation studies are conducted on response time, throughput, load balance and precision ratio. The experimental results demonstrate the prototype's feasibility and satisfactory practical effect.

