Research and Implementation of Extension Techniques for a Yellow Pages Search Engine System
Abstract
To handle the massive volumes of unstructured full-text data stored in computer systems, researchers in China and abroad have studied full-text database technology extensively. Full-text retrieval, one of the key technologies of full-text databases, has drawn particular attention.
     Building on existing results, this thesis first introduces the Inter-Relevant Successive Tree (IRST), a new full-text retrieval model that is still at an early stage. It compares the performance of IRST and its main improved variants with that of several mainstream full-text retrieval models, including the bitmap, inverted file, and PAT array models. To explore how the performance of a retrieval model depends on the way it is stored, the thesis also studies in detail the performance differences between the IRST and PAT array models under different storage methods.
     The thesis then proposes improvements that address some problems in existing IRST models. It introduces a binary-search-plus-verification retrieval algorithm for the doubly sorted IRST and a retrieval algorithm based on a preprocessed successor-interval table, and it improves the existing collaborative retrieval model for relational and full-text databases (cooperative query of IRST and B-Tree).
     Finally, to advance the basic capability building of China's civil aviation airworthiness certification and to apply advanced technology in engineering practice, the thesis puts forward a preliminary plan for a certification information center, a project urgently needed for that capability building.
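As background for the comparison above, the following is a minimal sketch of one of the mainstream models the abstract names, the PAT array (also known as a suffix array). It is a generic textbook illustration, not the thesis's IRST or its binary-search-plus-verification algorithm; the function names `build_pat_array` and `find_occurrences` are hypothetical and chosen only for this example. The construction is deliberately naive and suits small inputs only.

```python
from typing import List


def build_pat_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Naive construction: sort suffix start positions by the suffix itself.
    # This is simple but slow; production systems use specialised algorithms.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, pat_array: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(pat_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pat_array[mid]:pat_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(pat_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pat_array[mid]:pat_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [start, lo) begins with pattern.
    return sorted(pat_array[start:lo])


if __name__ == "__main__":
    sample = "banana"
    index = build_pat_array(sample)
    print(index)                                   # [5, 3, 1, 0, 4, 2]
    print(find_occurrences(sample, index, "ana"))  # [1, 3]
```

Because the suffixes are sorted, all suffixes that begin with the query form one contiguous block, so two binary searches locate every occurrence. The sketch works on any Python string, including Chinese text, since it compares characters rather than words.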
