Research on Key Topics in Chinese Information Processing

Original title (Chinese): 中文信息处理关键问题的研究
Author: Zhu Chong (朱冲)
Degree level: Master's
Discipline: Signal and Information Processing
Keywords: Chinese information processing; basic Chinese phrase analysis; maximum entropy model; frequently-asked question system; web crawler
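Degree year: 2009
Advisor: Zhang Xiangli (张向利)
Discipline code: 081002
Degree-granting institution: Guilin University of Electronic Technology (桂林电子科技大学)
Thesis submission date: 2009-03-01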

Abstract

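The level and volume of automatic computer processing of language and text information has become one of the important criteria for judging whether a country has entered the information society. Because of the inherent complexity of the Chinese language, the level of Chinese Information Processing (CIP) in China lags far behind the pace of China's economic globalization in the 21st century. How to achieve effective understanding of Chinese natural language has therefore become a highly challenging frontier research topic that attracts wide international attention.
     Addressing open problems in the field of Chinese information processing, this thesis focuses on lexical analysis and basic phrase analysis at the syntactic level, on Chinese semantic processing, and on the application of these techniques to information retrieval.
     The main contributions of this thesis are as follows: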
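     1. At the syntactic level, the thesis studies techniques for Chinese lexical analysis and basic phrase analysis. It focuses on the maximum entropy model, giving the necessary mathematical derivations and pseudocode for the IFS, SGC, GIS and IIS algorithms. In view of the characteristics of Chinese, it proposes a basic Chinese phrase analysis model that separates phrase boundary detection from phrase labeling, assumes that the two processes are mutually independent, and builds a separate maximum entropy model for each. The key to a maximum entropy model is selecting effective features; the thesis gives the feature spaces for both steps together with the feature selection procedure and algorithm. Experiments show that the model reaches a precision of 95.27% for phrase boundary detection and 96.20% for phrase labeling.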
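     2. At the application level, the thesis studies the key problems in bringing Chinese information processing into information retrieval. It designs a Frequently-Asked Question (FAQ) system based on Latent Semantic Analysis (LSA) and describes the implementation of each submodule in detail. Within this system, a new semantic matching method is proposed for the natural language interface module, and a new algorithm for judging topic relevance in a focused web crawler is proposed for the data collection subsystem. Experiments in the agriculture domain show that the FAQ system outperforms the FAQ-Finder system.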
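As an illustration of the first contribution, the following is a minimal sketch of a conditional maximum entropy classifier trained with GIS (Generalized Iterative Scaling), applied to the boundary step only. It is not the thesis's model: the part-of-speech tags, the two context predicates, the toy training events and all function names are assumptions made for this example. The labeling step could use a second model of the same form, with phrase labels as outcomes.

```python
import math
from collections import Counter

LABELS = ("B", "N")  # B: a phrase boundary follows the current word, N: none
C = 2                # active features per (context, label) pair, constant for GIS

def predicates(cur_pos, next_pos):
    # Hypothetical feature space: exactly two predicates fire per context,
    # so every (context, label) pair activates C = 2 binary features.
    return (f"cur={cur_pos}", f"next={next_pos}")

def conditional(weights, context):
    # p(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)
    scores = {y: sum(weights.get((p, y), 0.0) for p in predicates(*context))
              for y in LABELS}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def train_gis(events, iterations=200):
    n = len(events)
    # Empirical expectation of each (predicate, label) feature.
    empirical = Counter()
    for context, label in events:
        for p in predicates(*context):
            empirical[(p, label)] += 1.0 / n
    weights = {f: 0.0 for f in empirical}
    for _ in range(iterations):
        # Model expectation of each feature under the current weights.
        expected = Counter()
        for context, _ in events:
            probs = conditional(weights, context)
            for p in predicates(*context):
                for y in LABELS:
                    expected[(p, y)] += probs[y] / n
        # GIS update: lambda_i += (1 / C) * log(E_empirical / E_model)
        for f in weights:
            weights[f] += math.log(empirical[f] / expected[f]) / C
    return weights

# Toy events ((current POS, next POS), label), invented for this sketch.
# Every predicate occurs with both labels, so all 12 features are observed;
# without that, GIS would need a correction feature or smoothing.
toy_events = (
    [(("a", "n"), "N")] * 4 + [(("n", "v"), "B")] * 4 +
    [(("v", "a"), "B")] * 3 + [(("n", "n"), "N")] * 2 +
    [(("v", "n"), "B")] * 2 + [(("a", "a"), "N")] * 1 +
    [(("a", "v"), "B")] * 1 + [(("v", "v"), "N")] * 1
)
weights = train_gis(toy_events)

def predict(context):
    probs = conditional(weights, context)
    return max(probs, key=probs.get)

print(predict(("a", "n")), predict(("n", "v")))
```

Training the boundary model and the labeling model separately keeps each feature space small, which is consistent with the independence assumption stated above.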
The quality and volume of automatic computer processing of language and text information is one of the important criteria for judging whether a country has entered the information age. The level of Chinese Information Processing (CIP) in China cannot meet the needs of the country's global economic development in the 21st century, so achieving effective understanding of Chinese is a real challenge and an active research field.
     In response to these challenges, this dissertation studies Chinese word analysis, basic phrase analysis, Chinese semantic processing, and the application of CIP to information retrieval systems. The major contributions of this dissertation are as follows:
     1. Syntax layer: the dissertation introduces the basic theory and mathematical derivation of the maximum entropy (ME) method, together with pseudocode for the IFS, SGC, GIS and IIS algorithms. It then presents a basic Chinese phrase parsing model that separates the prediction of phrase boundary locations from phrase tagging and solves each with a maximum entropy model. The key to an ME model is selecting useful features, so the procedure and algorithms for feature selection over the feature space are given. Experimental results show 95.27% precision for phrase boundary prediction and 96.20% precision for phrase tagging.
     2. Application research: a general framework for a Frequently-Asked Question (FAQ) system based on Latent Semantic Analysis (LSA) is designed, which uses a new approach to semantic inference for FAQ mining. In the data-gathering system, a new approach to designing a focused web crawler based on an agriculture ontology is given. Experimental results indicate that the FAQ system outperforms the FAQ-Finder system in the agriculture domain.
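To make the LSA-based matching idea concrete, here is a minimal sketch of FAQ retrieval in a latent semantic space. It shows generic LSA, not the new semantic matching method proposed in the thesis. The FAQ questions, the whitespace tokenizer, the choice of k = 2 and all function names are assumptions; the truncated SVD is obtained by power iteration on the document Gram matrix so that the example needs only the standard library.

```python
import math

def tokenize(text):
    return text.lower().split()

def term_doc_matrix(docs):
    vocab = sorted({w for d in docs for w in tokenize(d)})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]  # rows: terms, columns: docs
    for j, d in enumerate(docs):
        for w in tokenize(d):
            matrix[index[w]][j] += 1.0
    return index, matrix

def top_eigenpairs(gram, k, iters=500):
    # Power iteration with deflation on a symmetric positive semidefinite
    # matrix. Assumes k does not exceed its rank and that the all-ones start
    # vector is not orthogonal to the eigenvectors being sought.
    n = len(gram)
    g = [row[:] for row in gram]
    pairs = []
    for _ in range(k):
        v = [1.0] * n
        for _ in range(iters):
            w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(g[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):
            for j in range(n):
                g[i][j] -= lam * v[i] * v[j]
    return pairs

def lsa_basis(matrix, k):
    # The left singular vectors u_i = A v_i / sigma_i span the latent space.
    n_terms, n_docs = len(matrix), len(matrix[0])
    gram = [[sum(matrix[t][a] * matrix[t][b] for t in range(n_terms))
             for b in range(n_docs)] for a in range(n_docs)]
    basis = []
    for lam, v in top_eigenpairs(gram, k):
        sigma = math.sqrt(lam)
        basis.append([sum(matrix[t][j] * v[j] for j in range(n_docs)) / sigma
                      for t in range(n_terms)])
    return basis

def project(basis, term_vector):
    return [sum(u[t] * term_vector[t] for t in range(len(u))) for u in basis]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical agriculture FAQ questions, invented for this sketch.
faq_questions = [
    "rice pest control",
    "pest insect spray rice",
    "wheat fertilizer nitrogen",
    "soil fertilizer wheat",
]
index, matrix = term_doc_matrix(faq_questions)
basis = lsa_basis(matrix, k=2)
doc_vecs = [project(basis, [row[j] for row in matrix])
            for j in range(len(faq_questions))]

def query_vector(query):
    vec = [0.0] * len(index)
    for w in tokenize(query):
        if w in index:
            vec[index[w]] += 1.0
    return vec

query = "insect spray"
q_vec = project(basis, query_vector(query))
sims = [cosine(q_vec, d) for d in doc_vecs]
best = max(range(len(sims)), key=sims.__getitem__)
print(faq_questions[best], [round(s, 3) for s in sims])
```

The query shares no word with "rice pest control", yet the two are close in the latent space because "insect" and "spray" co-occur with "pest" and "rice" elsewhere in the collection. Plain keyword overlap would miss this match.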

