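Research and Implementation on Chinese Information Retrieval System Based on Structured Vector Space Model

Original title: 基于结构化向量空间模型的中文信息检索系统研究与实现
Author: Cao Weiping (操卫平)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: information retrieval (IR) ; search engine ; vector space model ; inverted index
Degree year: 2008
Supervisor: Li Yujian (李玉鑑)
Discipline code: 081203
Degree-granting institution: Beijing University of Technology (北京工业大学)
Thesis submission date: 2008-04-01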

Abstract

Information retrieval (IR) is the process of extracting relevant documents and information from a data collection. The emergence of the Internet has given people a new way to retrieve information, and it has shifted the data that IR must handle from structured to semi-structured and even unstructured forms. As the volume of Web text keeps growing, traditional Web retrieval techniques can hardly meet the demand for high-quality query results. This thesis studies text retrieval algorithms for the Web.
     First, the thesis outlines the development of information retrieval technology and compares and analyzes keyword-based and hyperlink-based retrieval algorithms. To address the low recall of keyword retrieval and the tendency of link-analysis methods toward topic drift, it combines the two approaches: the link relations among pages are used to compute a hub value and an authority value for each page, the anchor text of page links and the page's document content are matched against the user query to obtain a relevance weight for each page, and the retrieval results are then ranked on this basis.
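     Hub and authority values of the kind mentioned above are commonly computed with the iteration from Kleinberg's HITS algorithm. The Python sketch below illustrates only that standard iteration and is not the thesis's implementation. The function name `hits`, the `links` input format, and the fixed iteration count are assumptions made for illustration. The abstract does not say how these values are fused with the text-relevance weight, so that step is not shown.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores by repeated mutual reinforcement.

    links: dict mapping a page to the list of pages it links to
           (hypothetical input format chosen for this sketch).
    Returns (hub, authority) dicts with L2-normalized scores.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub of a page: sum of authority scores of the pages it links to.
        new_hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth
```

     For example, with `links = {"a": ["c"], "b": ["c"], "c": []}`, page `c` receives all of the authority weight, while `a` and `b` share the hub weight equally.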
     Second, considering the characteristics of Web information retrieval, the thesis analyzes several problems that the traditional vector space model (VSM) encounters in Web retrieval and improves on it by proposing the structured vector space model (SVSM). Its basic idea is to represent a Web document as a complex vector with a logical structure, that is, a structured vector group. Each group consists of several sub-vectors, and each sub-vector corresponds to a relatively independent text segment of the document, such as the title, subtitles, body text, or anchor text.
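     The idea of a structured vector group can be illustrated with a small Python sketch. It is a minimal illustration under stated assumptions, not the thesis's formulation: each text segment becomes a raw term-frequency sub-vector, the query is compared with each sub-vector by cosine similarity, and the per-segment similarities are merged with a weighted average. The field names, the weights in `FIELD_WEIGHTS`, the whitespace tokenizer, and the merging rule are all hypothetical. Chinese text would first need word segmentation.

```python
import math
from collections import Counter

# Hypothetical per-segment weights; the abstract does not give any values.
FIELD_WEIGHTS = {"title": 3.0, "subtitle": 2.0, "body": 1.0, "anchor": 2.0}


def to_vector(text):
    """Term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse term vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def structured_vectors(doc_fields):
    """Represent a document as one sub-vector per text segment."""
    return {f: to_vector(doc_fields.get(f, "")) for f in FIELD_WEIGHTS}


def svsm_score(query, doc_fields):
    """Weighted average of per-segment cosine similarities to the query."""
    q = to_vector(query)
    vecs = structured_vectors(doc_fields)
    total = sum(FIELD_WEIGHTS.values())
    return sum(FIELD_WEIGHTS[f] * cosine(q, vecs[f]) for f in FIELD_WEIGHTS) / total
```

     With these assumed weights, a document whose title matches the query scores higher than one that contains the same words only in its body. A flat, single-vector representation cannot make that distinction.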
     Third, the thesis describes in detail the page crawler and the indexer of a Web information retrieval system, together with the related principles and techniques. It also discusses how to use the page tag tree to remove noise from Web page content and extract its topic, and presents an implementation approach that improves the quality, efficiency, and compression ratio of the page index.
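     The abstract does not say which index layout or compression scheme the thesis uses. As a hedged illustration of how an inverted index can be compressed, the Python sketch below builds a term-to-document postings list and stores it as variable-byte encoded gaps between document IDs, which is one widely used technique. All function names, the integer document IDs, and the whitespace tokenizer are assumptions for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict of integer doc id -> text. Returns term -> sorted doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte, high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)      # final byte of this number
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """Store a sorted postings list as encoded gaps between successive ids."""
    if not doc_ids:
        return b""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the doc ids by accumulating the decoded gaps."""
    doc_ids, current = [], 0
    for gap in vbyte_decode(data):
        current += gap
        doc_ids.append(current)
    return doc_ids
```

     Because postings are sorted, the gaps are usually much smaller than the IDs themselves, so most of them fit in a single byte.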
     Finally, building on existing information retrieval algorithms, the thesis uses the structured vector space model to combine the keyword-based and hyperlink-based retrieval algorithms, and it designs and implements a Web-based Chinese information retrieval system. Participation in the evaluation of SEWM2007 (Symposium of Search Engine and Web Mining 2007) showed that the system's retrieval algorithm can effectively improve the recall and precision of Web information retrieval.

