信息距离理论及其在问答系统中的应用研究

英文题名：Research on Information Distance Theory and Its Application in Question Answering System
作者：张显
论文级别：博士
学科专业名称：计算机科学与技术
中文关键词：Kolmogorov复杂性 ; 距离度量 ; 信息距离 ; 问答系统 ; 段落检索
英文关键词：Kolmogorov complexity ; distance metric ; information distance ; question answering ; passage retrieval
学位年度：2008
导师：朱小燕
学科代码：081203
学位授予单位：清华大学
论文提交日期：2008-06-01

摘要

计算词与词、句与句等文本片段之间的相似度或相关性是自然语言问答系统的核心任务之一。不仅如此,相似度或相关性计算在信息提取、信息检索等很多领域也具有重要的意义。从根本上说,相似度或相似性计算都可以抽象成度量两个实体在某种意义下的距离。本文就集中于建立和完善能够计算对象间距离的统一理论——信息距离理论,并对各种情况下使用信息距离度量文本片段之间的相似度或相关性做出了深入探索,最后在此基础上设计和实现了自然语言问答原型系统QUANTA。本文的主要工作如下:
     ·以传统的max型信息距离理论为基础,提出了基于Kolmogorov复杂性的min型信息距离度量。新的度量解决了传统信息距离在解决实际问题时遇到的部分匹配问题,三角不等式问题和密度问题。在正规化信息距离的普适性方面,我们证明了一系列定理,为传统理论中的遗留问题做出了确定性结论。最后,我们发展了基于条件模式的条件信息距离理论。
     ·在信息距离理论的指导下,对词与词之间、句与句之间的相似性进行了深入研究。基于模式的条件信息距离相比传统信息距离可以提供更强的语义信息,据此我们设计了一套条件模式计算词之间的语义相似度。基于最大交迭原则和min型信息距离的原理,我们提出了估计条件Kolmogorov复杂性的算法,以计算句子与句子之间的相似性。
     ·答案确认是问答系统中的关键环节之一。本文提出了基于条件信息距离的答案确认算法,利用条件信息距离的稳定性以及刻画对象之间相关度时的灵活性,将计算问题与答案相关性的问题转化成为计算问题的中心对象与答案之间关于特定条件模式的条件信息距离的问题。
     ·采用自然语言处理、文本分类和信息检索领域的一系列技术,以信息距离理论为基础,设计并实现了事实型问题回答原型系统QUANTA。系统通过问题预处理、检索条目生成、文档/段落检索、备选答案生成和答案确认等五个模块回答自然语言提出的事实型问题。
One of the most important task in Question answering system is to calculate the similarity/relationship between text segments, such as words or sentences. Besides, similarity/relationship calculation is also important in many domains including Information Extraction, Information Retrieval, Knowledge Representation and Knowledge Reasoning. Theoretically, the similarity/relationship calculation can be treated as a unified problem of measuring the distance between two entities, under some distance measure. This thesis focuses on establishing and completing the unified distance theory——Information Distance theory, to solve this problem. Distance measures betweentext segments under different aspects are discussed using Information Distance and Conditional Information Distance. Based on the theory, a natural language question answering prototype system QUANTA is designed and implemented.
     ·Extending the traditional work on max information distance theory, a Kol-mogorov Complexity based min information distance is proposed. This new measure solves several problems that traditional information distance metric encounters in application, including partial matching problem, triangle inequality problem and density problem. The weak universality of the max normalized information distance is proved, and a conclusion is given for the cases that universality doesn't hold. For the min normalized information distance, the proof of the strict universality is given. Besides, the conditional information distance theory is fully discussed and developed on its physical and mathematical characteristics.
     ·Based on the information distance theory, the similarity between words and sentences are discussed. Word semantic similarity can be measured by pattern-based conditional information distance from variant aspects. Conditional Kolmogorov complexity estimation method based on maximum overlap rule and min distance theory is proposed to calculate the sentence semantic similarity, and this is suc- cessfully applied in passage retrieval task in question answering system.
     ·Answer validation is one of the most important stages in question answering system. In this thesis, this problem is solved by calculating the conditional information distance between the question focus and the answer over certain condition patterns. Extensive experiments are conducted to justify the method.
     ·An open-domain factoid question answering prototype system——QUANTAis designed and implemented combining Natural Language Processing, Text Categorization, Information Retrieval, and Information Extraction technologies. The system answers natural language questions through Question Preprocessing, Query Formulation, Doc/Passage Retrieval, Candidate Generation and Answer Validation stages.

引文

[1]Li M,Vitanyi P M.An Introduction to Kolmogorov Complexity and Its Applications.Berlin:Springer-Verlag,1993.
    [2]H.B C,P.G,M.L,et al.Information Distance.IEEE Trans.Information Theory,1998,44(4):1407-1423.
    [3]Li M,Chen X,Li X,et al.The similarity metric.In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms,Philadelphia,PA,USA:Society for Industrial and Applied Mathematics,2003.863-872.
    [4]Keogh E,Lonardi S,Ratanamahatana C A.Towards parameter-free data mining.In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining,New York,NY,USA:ACM,2004.206-215.
    [5]Cilibrasi R,Vitanyi P.Clustering by compression.Information Theory,IEEE Transactions on,April 2005,51(4):1523-1545.
    [6]Cilibrasi R,Vitanyi P,Wolf R.Algorithmic Clustering of Music,wedelmusic,2004,00:110-117.
    [7]Kaabneh K,Abdullah A,Al-Halalemah Z.Video Classification Using Normalized Information Distance,gmai,2006,0:34-40.
    [8]Santos C C,Bernardes J,Vitanyi P,et al.Clustering Fetal Heart Rate Tracings by Compression,cbms,2006,0:685-690.
    [9]Zhang L,Zhuang Y,Yuan Z.A Program Plagiarism Detection Model Based on Information Distance and Clustering,ipc,2007,0:431-436.
    [10]Kulkami A,Bush S.Detecting Distributed Denial-of-Service Attacks Using Kolmogorov Complexity Metrics.J.Netw.Syst.Manage.,2006,14(l):69-80.
    [11]Cilibrasi R L,Vitanyi P M.The Google Similarity Distance.IEEE Transactions on Knowledge and Data Engineering,2007,19(3):370-383.
    [12]Tan P N,Kumar V,Srivastava J.Selecting the right interestingness measure for association patterns.In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining,New York,NY,USA:ACM,2002.32-41.
    [13]G.L.An Introduction to Arithmetic Coding.IBM J.Res.and Dev,1984,28(2):135-149.
    [14]李明,PM．B．威塔涅．描述复杂性．北京：科学出版社,1998．
    [15]Li M,Badger J H,Chen X,et al.An information-based sequence distance and its application to whole mitochondrial genome phylogeny.Bioinformatics,2001,17(2):149-154.
    [16]Li M,Chen X,Li X,et al.The similarity metric.IEEE Trans.Information Theory,2004,50(12):3250-3264.
    [17]郝宇.基于Kolmogorov复杂性的知识获取方法研究[博士学位论文].北京:清华大学,2005.
    [18]H.B C,LI M,MA B.Chain letters and evolutionary histories.Scientific American,2003,288(6):64-69.
    [19]Zhang X,Hao Y,Zhu X,et al.Information distance from a question to an answer.In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining,New York,NY,USA:ACM,2007.874-883.
    [20]Voorhees E M,Dang H T.Overview of the TREC 2005 Question Answering Track.In Proceedings of the 14th Text Retrieval Conference.NIST,2005.
    [21]Turing A.Computing Machinery and Intelligence.Mind,1950,59(236):433-460.
    [22]Searle J R.Minds,brains,and programs.Mind design,1985.282-307.
    [23]陆汝衿.世纪之交的知识工程与知识科学.北京:清华大学出版社,2001.
    [24]杜永萍.基于模式知识厍的问题回答关键技术研究[博士学位论文].上海:复旦大学,2005.
    [25]王树西.基于文本模式推理的问答系统研究[博士学位论文].北京:中国科学院计算技术研究所,2005.
    [26]Huang G T,Yao H H.Chinese question-answering system.J.Comput.Sci.Technol.,2004,19(4):479-488.
    [27]Voorhees E M.Overview of the TREC 2001 Question Answering Track.In Proceedings of Text REtrieval Conference,2001.
    [28]Voorhees E M.Overview of the TREC 2003 Question Answering Track.In Proceedings of Text REtrieval Conference,2003.
    [29]Dang H,Lin J,Kelly D.Overview of the TREC 2006 Question Answering Track.In Proceedings of the 15th Text Retrieval Conference.NIST,2005.
    [30]Magnini B,Giampiccolo D,al P F.Overview of the CLEF 2006 Multilingual Question Answering Track.In Proceedings of Evaluation of Multilingual and Multi-modal Information Retrieval,2007.223-256.
    [31]白硕.语言计算与基于内容的文本处理.北京:清华大学出版社,2003.
    [32]游斓,周雅倩,黄萱菁,等.基于最大熵模型的QA系统置信度评分算法.软件学报,2005,16(8):1407-1414.
    [33]张亮.面向开放域的中文问答系统问句处理相关技术研究[博士学位论文].南京:南京理工大学,2005.
    [34]李良寓,樊孝忠,李宏乔.语义相似计算驱动领域自动问答.北京理工大学学报,2005,25(11):958-962.
    [35]史忠机.知识发现.北京:清华大学出版社,2002.
    [36]Weizenbaum J.ELIZA—a computer program for the study of natural language communication between man and machine.Commun.ACM,1966,9(1):36-45.
    [37]Colby K,Hilf F,Weber S,等.Turing-like Indistinguishability Tests for the Calidation of a Computer Simulation of Paranoid Processes.Artificial Intelligence,1972,3(1):199-221.
    [38]Shi Z.Autonomic Semantic Grid.In Proceedings of IFIP AIAI2005,2005.
    [39]曹存根.面向专家的知识获取.北京:科学出版社,1998.
    [40]陆汝衿.人工智能.北京:科学出版社,2000.
    [41]宋柔.北京语言文化大学计算机系语言信息处理研究所当前研究工作介绍.第二届中日自然语言处理专家研讨会论文集,2002.65-72.
    [42]Lin J,Katz B.Question answering from the web using knowledge annotation and knowledge mining techniques.In Proceedings of the twelfth international conference on Information and knowledge management,New York,NY,USA:ACM,2003.116-123.
    [43]Zheng Z.AnswerBus question answering system.In Proceedings of the second international conference on Human Language Technology Research,San Francisco,CA,USA:Morgan Kaufmann Publishers Inc.,2002.399-404.
    [44]郑灾福,刘挺.自动问答综述.中文信息学报,2002,16(6):46-52.
    [45]Azari D,Horvitz E,Dumais S,et al.Actions,answers,and uncertainty:a decision-making perspective on Web-based question answering.Inf.Process.Manage.,2004,40(5):849-868.
    [46]Zhou Y,Yuan X,Cao J,et al.FDUQA on TREC2006 QA Track.In Proceedings of the 15th Text REtrieval Conference.NIST,2006.
    [47]Cui H,Li K,Sun R,et al.National University of Singapore at the TREC-13 Question Answering Main Task.In Proceedings of the 13th Text REtrieval Conference.NIST,2004.
    [48]Jing He Y L.Tianwang at TREC-2006 QA Track.In Proceedings of the 15th Text REtrieval Conference.NIST,2006.
    [49]Qiu x,Li B,Shen C,et al.FDUQA on TREC2007 QA Track.In Proceedings of the 16th Text REtrieval Conference.NIST,2007.
    [50]Zhao Y,Xu Z,Guan Y,et al.Insun05QA on QA track of TREC2005.In Proceedings of the 14th Text REtrieval Conference.NIST,2005.
    [51]Cui H,Sun R,Li K,et al.Question answering passage retrieval using dependency relations.In Proceedings of SIGIR '05:the 28th annual international ACM SIGIR conference on Research and development in information retrieval,New York,NY,USA:ACM,2005.400-407.
    [52]董振东,董强.知网.http://www.keenage.com,1999..
    [53]Pedersen T,Patwardhan S,Michelizzi J.WordNet::Similarity - Measuring the Relatedness of Concepts.In:Susan Dumais D M,Roukos S,(eds.).Proceedings of HLT-NAACL 2004:Demonstration Papers,Boston,Massachusetts,USA:Association for Computational Linguistics,2004.38-41.
    [54]Banerjee S,Pedersen T.An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet.In Proceedings of CICLING'02:In Proceeding of the Fourth International Conference on Computational Linguistics and Intelligent Text Processing,Mexico City,2002.
    [55]Z.W,M P.Verb Semantics and Lexical Selection.In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics,Las Cruces,New Mexico.
    [56]Resnik L.Disambiguating noun groupings with respect to WordNet senses.In Proceedings of the Third Workshop on Very Large Corpora.Association for Computational Linguistics,1995.54-68.
    [57]J.J,D C.Semantic similarity based on corpus statistics and lexical taxonomy.In Proceedings of International Conference on Research in Computational Linguistics,Taiwan,1997.
    [58]Azari D,Horvitz E,Dumais S,et al.Actions,answers,and uncertainty:a decision-making perspective on Web-based question answering.Inf.Process.Manage.,2004,40(5):849-868.
    [59]Li S,Zhang J,Huang X,et al.Semantic computation in a Chinese question-answering system.J.Comput.Sci.Technol.,2002,17(6):933-939.
    [60]鲁松,白硕.基于向量空间模型中义项词语的无导词义消歧.软件学报,2002,13(6):1082-1089.
    [61]鲁松,白硕.基于向量空间模型的有导词义消歧.计算机研究与发展,2001,38(6):662-667.
    [62]李涓子,黄昌宁.语言模型中一种改进的最大熵方法及其应用.软件学报,1999,10(3):257-263.
    [63]Lin D.An Information-Theoretic Definition of Similarity.In Proceedings of ICML '98:the Fifteenth International Conference on Machine Learning,San Francisco,CA,USA:Morgan Kaufmann Publishers Inc.,1998.296-304.
    [64]Selvi P,Gopalan N R Sentence Similarity Computation Based on Wordnet and Corpus Statistics.In Proceedings of ICCIMA '07:the International Conference on Computational Intelligence and Multimedia Applications(ICCIMA 2007),Washington,DC,USA:IEEE Computer Society,2007.9-14.
    [65]闫宏飞,陈翀.词汇与中心词的距离信息对问句相似度匹配的影响.清华大学学报(自然科学版),2005,45(S1):1873-1877.
    [66]Levenshtein V I.Binary codes capable of correcting deletions,insertions,and reversals.Soviet Physics Doklady,1966,10:707-710.
    [67]车万翔,刘挺,秦兵,等.基于改进编辑距离的中文相似句子检索.高技术通讯,2004,14(7):15-19.
    [68]Song F,Croft W B.A general language model for information retrieval.In Proceedings of CIKM '99:the eighth international conference on Information and knowledge management,New York,NY,USA:ACM,1999.316-321.
    [69]Gao J,Nie J Y,Wu G,et al.Dependence language model for information retrieval.In Proceedings of SIGIR '04:the 27th annual international ACM SIGIR conference on Research and development in information retrieval,New York,NY,USA:ACM,2004.170-177.
    [70]Attardi G,Cisternino A,Formica F,et al.PiQASso:Pisa Question Answering System.In Proceedings of the 10th Text REtrieval Conference.NIST,2001.
    [71]A Penas F V.Overview of the Answer Validation Exercise 2007.In Proceedings of Working Notes for the CLEF 2007 Workshop,2007.
    [72]nas A P,Rodrigo,Sama V,et al.Overview of the Answer Validation Exercise 2006.In:al.C P,(eds.).Proceedings of CLEF2006.Springer,2007.257-264.
    [73]Sutcliffe R F E,White K,Gabbay I,et al.Question Answering using the DLT System at TREC 2006.In Proceedings of the 15th Text REtrieval Conference.NIST,2006.
    [74]Clarke C L A,Cormack G V,Lynam T R.Exploiting redtmdancy in question answering.In Proceedings of SIGIR '01:the 24th annual international ACM SIGIR conference on Research and development in information retrieval,New York,NY,USA:ACM,2001.358-365.
    [75]Echihabi A,Marcu D.A noisy-channel approach to question answering.In Proceedings of ACL '03:the 41st Annual Meeting on Association for Computational Linguistics,Morristown,NJ,USA:Association for Computational Linguistics,2003.16-23.
    [76]Fagin R,Stockmeyer L.Relaxing the Triangle Inequality in Pattern Matching.Int.J.Comput.Vision,1998,30(3):219-231.
    [77]Veltkamp R C.Shape Matching:Similarity Measures and Algorithms.In Proceedings of SMI '01:the International Conference on Shape Modeling & Applications,Washington,DC,USA:IEEE Computer Society,2001.188.
    [78]Leacock C,Miller G A,Chodorow M.Using corpus statistics and WordNet relations for sense identification.Comput.Linguist.,1998,24(1):147-165.
    [79]Budanitsky A,Hirst G.Evaluating WordNet-based Measures of Lexical Semantic Relatedness.Comput.Linguist.,2006,32(1):13-47.
    [80]王斌.汉英双语语料厍自动对齐研究[博士学位论文].北京:中国科学院计算技术研究所,1999.
    [81]Lin D.Automatic retrieval and clustering of similar words.In Proceedings of the 17th international conference on Computational linguistics,Morristown,NJ,USA:Association for Computational Linguistics,1998.768-774.
    [82]Hearst M A.Automatic acquisition of hyponyms from large text corpora.In Proceedings of the 14th conference on Computational linguistics,Morristown,NJ,USA:Association for Computational Linguistics,1992.539-545.
    [83]Miller G,Charles W.Contextual correlates of semantic similarity.Language and Cognitive Processes,1991,6(l):l-28.
    [84]Jarmasz M,Szpakowicz S.Roget's Thesaurus and Semantic Similarity.In Proceedings of the International Conference on Recent Advances in Natural Language Processing(RANLP-03),2003.212-219.
    [85]Li Y,Bandar Z,Mclean D.An approach for measuring semantic similarity between words using multiple information sources.Knowledge and Data Engineering,IEEE Transactions on,July-Aug.2003,15(4):871-882.
    [86]吕学强,任飞亮,黄忠丹,等．句子相似模型和最相似句子查找算法．东北大学学报(自然科学版),2003,24(6)：531-534．
    [87]Tellex S,Katz B,Lin J,et al.Quantitative evaluation of passage retrieval algorithms for question answering.In Proceedings of SIGIR'03:the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval,New York,NY,USA:ACM,2003.41-47.
    [88]Light M,Mann G S,Riloff E,et al.Analyses for elucidating current question answering technology.Nat.Lang.Eng.,2001,7(4):325-342.
    [89]Lee G G,Seo J,Lee S,et al.SiteQ:Engineering high performance QA system using lexico-semantic pattern matching and shallow NLP.In Proceedings of Text REtrieval Conference,2001.442-451.
    [90]Berger A,Lafferty J.Information retrieval as statistical translation.In Proceedings of SIGIR'99:the 22nd annual international ACM SIGIR conference on Research and development in information retrieval,New York,NY,USA:ACM,1999.222-229.
    [91]Al-Onaizan Y,Curin J,Jahr M,et al.Statistical machine translation.Technical report,Final Report,JHU Summer Workshop,1999.
    [92]Moon T.The expectation-maximization algorithm.Signal Processing Magazine,IEEE,Nov 1996,13(6):47-60.
    [93]Brill E,Dumais S,Banko M.An analysis of the AskMSR question-answering system.In Proceedings of EMNLP '02:the ACL-02 conference on Empirical methods in natural language processing,Morristown,NJ,USA:Association for Computational Linguistics,2002.257-264.
    [94]Lin J,Katz B.Building a reusable test collection for question answering.J.Am.Soc.Inf.Sci.Technol,2006,57(7):851-861.
    [95]Tsuruoka Y,Tsujii J.Bidirectional inference with the easiest-first strategy for tagging sequence data.In Proceedings of HLT '05:the conference on Human Language Technology and Empirical Methods in Natural Language Processing,Morristown,NJ,USA:Association for Computational Linguistics,2005.467-474.
    [96]Ramshaw L,Marcus M.Text Chunking Using Transformation-Based Learning.In:Yarovsky D,Church K,(eds.).Proceedings of the Third Workshop on Very Large Corpora,Somerset,New Jersey:Association for Computational Linguistics,1995.82-94.
    [97]Klein D,Smarr J,Nguyen H,et al.Named entity recognition with character-level models.In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003,Morristown,NJ,USA:Association for Computational Linguistics,2003.180-183.
    [98]Lin D.Principle-based parsing without overgeneration.In Proceedings of ACL-93,Columbus,Ohio.,1995.112-120.
    [99]Li X,Roth D.Learning question classifiers.In Proceedings of the 19th international conference on Computational linguistics,Morristown,NJ,USA:Association for Computational Linguistics,2002.1-7.
    [100]C.C.Chang,Lin C.LIBSVM:a library for support vector machines.http://www.csie.ntu.edu.tw/cjlin/libsvm,2001..
    [101]Metzler D,Croft W B.Analysis of Statistical Question Classification for Fact-Based Questions.Inf.Retr.,2005,8(3):481-504.
    [102]Nguyen M L,Nguyen T T,Shimazu A.Subtree Mining for Question Classification Problem.In Proceedings of the 20th International Conference on Artificial Intelligence,2007.1695-1700.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700