Research on Combined Text Indexing Based on N-gram Analysis and Word Frequency Statistics

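Author: 杲晓锋 (Gao Xiaofeng)
Degree level: Master's
Discipline: Information Science
Chinese keywords: 自动标引 ; N元分析 ; 词频统计 ; 文本复合标引
English keywords: Automatic indexing ; N-gram analysis ; Word frequency statistics ; Combined text indexing
Degree year: 2009
Advisor: 李培 (Li Pei)
Discipline code: 120502
Degree-granting institution: 南开大学 (Nankai University)
Thesis submission date: 2009-05-01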

Abstract

Advances in science and technology have brought humanity into an intelligent information society and made information an important resource, but they have also brought explosive growth and seemingly unlimited expansion of information resources. Faced with this volume, people must rely on information processing as the key means of using information effectively. An important task in information processing is to produce concise, accurate indexing from the content of the original text, because the quality of indexing determines, to some extent, the effect of information processing and therefore the value of the information to its users. Against this background, developing low-cost, high-efficiency indexing methods is essential.
     This thesis therefore focuses on automatic indexing techniques and methods. Taking the automatic indexing of text as its object of study, and combining comparative analysis with experimental analysis, it examines N-gram indexing and word-frequency-statistics indexing. On that basis it proposes a new indexing method: combined text indexing based on N-gram analysis and word frequency statistics. The main content is as follows:
     First, starting from an introduction to text and automatic indexing, the thesis systematically reviews and summarizes the development of automatic indexing research. It briefly discusses three aspects: the macro-level division of the basic theory of automatic indexing, a survey of representative methods that have been both innovative and influential, and a roadmap of automatic indexing research. It then points out problems in the development of automatic indexing and possible solutions, which leads to the research topic of this thesis, combined indexing.
     Second, the thesis analyzes and compares word-frequency-statistics indexing and N-gram indexing comprehensively and systematically from three angles: principle, method, and implementation process. It shows that the two methods are consistent in essence and complementary in procedure. By introducing two tools, conditional probability from statistics and information entropy from information theory, it combines N-gram indexing and word-frequency-statistics indexing into a single method. The resulting combined text indexing method, based on N-gram analysis and word frequency statistics, has the advantages of both; the thesis describes it in detail and gives a concrete implementation process.
     Finally, to verify the theoretical soundness of the new method and its feasibility and effectiveness in application, a detailed implementation plan and process for the indexing were realized in a computer program. Comparative experiments then examine the method from a practical standpoint; the results support it and show that it has a certain advantage in automatic indexing performance.
     The research therefore has a degree of originality and offers some reference and guidance for others studying combined approaches to automatic indexing.
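
The combination outlined in the abstract (N-gram candidates, a frequency count, and conditional probability and information entropy as scoring tools) can be illustrated with a small Python sketch. This is not the thesis's implementation, which the abstract does not specify. The function names `ngram_counts`, `conditional_probability`, `right_entropy` and `select_candidates`, the two thresholds, and the choice of right-neighbor entropy as the filtering signal are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count every overlapping character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def conditional_probability(text, prefix, nxt):
    """Estimate P(nxt | prefix) as count(prefix + nxt) / count(prefix)."""
    prefix_count = ngram_counts(text, len(prefix))[prefix]
    if prefix_count == 0:
        return 0.0
    joint_count = ngram_counts(text, len(prefix) + len(nxt))[prefix + nxt]
    return joint_count / prefix_count


def right_entropy(text, gram):
    """Shannon entropy (in bits) of the character that follows `gram`."""
    followers = Counter()
    start = text.find(gram)
    while start != -1:
        nxt = start + len(gram)
        if nxt < len(text):
            followers[text[nxt]] += 1
        start = text.find(gram, start + 1)  # step by 1 to allow overlaps
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


def select_candidates(text, n, min_freq, min_entropy):
    """Keep n-grams that are frequent AND followed by varied characters.

    Assumption for this sketch: a gram whose right context is varied
    (high entropy) is more likely to be a complete unit than a fragment
    of a longer string, so it is kept as an indexing candidate.
    """
    return [
        (gram, freq)
        for gram, freq in ngram_counts(text, n).most_common()
        if freq >= min_freq and right_entropy(text, gram) >= min_entropy
    ]


if __name__ == "__main__":
    sample = "abcabcabd"
    print(select_candidates(sample, n=2, min_freq=2, min_entropy=0.5))
    # prints [('ab', 3)]
```

In the sample string, `bc` and `ca` are as frequent as the threshold requires, but each is always followed by the same character, so their right entropy is zero and only `ab` survives. For Chinese text the same functions apply unchanged, since they operate on characters rather than on pre-segmented words.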

