A Pure High-Order Interaction Model Based on Information Geometry and Its Applications
Abstract
In statistical modeling one often needs to analyze the interactions among multiple random variables. Based on the theory of information geometry, this thesis defines the pure high-order interaction among multiple random variables, proposes a novel method for extracting variables with pure high-order interactions, and validates the method on concrete text processing tasks.
Information geometry studies probability and statistics with the tools of differential geometry: a family of probability distributions is treated as a Riemannian manifold in a high-dimensional space, and the geometric structure of that manifold is analyzed to reach deeper conclusions about the distributions. Building on the orthogonality of coordinates in information geometry, this thesis derives a mixed coordinate system that properly represents the pure interactions among multiple variables, uses it to construct a test for pure high-order interacting variables, and provides a theoretical basis for setting the parameters of the algorithm. In addition, the implication relations among several kinds of high-order interactions are derived theoretically, laying a foundation for further research.
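The abstract does not reproduce the coordinate construction, so the following is a minimal worked sketch in the standard log-linear parameterization of information geometry (the theta/eta notation follows Amari's literature and is an assumption here, not the thesis's own formulas). For three binary variables $x_1, x_2, x_3$ the joint distribution expands as

\[
\log p(x_1, x_2, x_3) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij}\, x_i x_j + \theta_{123}\, x_1 x_2 x_3 - \psi(\theta),
\]

with $\psi(\theta)$ the normalizer, and the dual expectation coordinates are $\eta_i = E[x_i]$, $\eta_{ij} = E[x_i x_j]$. The mixed coordinates

\[
\xi = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23};\ \theta_{123})
\]

are orthogonal under the Fisher information metric, so $\theta_{123}$ quantifies the third-order interaction "purely": it is unaffected by the marginal means and pairwise correlations fixed by the $\eta$-part, and testing $\theta_{123} = 0$ tests for pure triplewise dependence.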
In text processing tasks, the presence or absence of a word in a document can be treated as a Boolean random variable. Naive models typically assume that these variables are completely independent, or tied only by some simple low-order dependence, which is unreasonable in many cases. Based on information geometry, this thesis proposes an efficient heuristic algorithm for mining word groups with pure high-order interactions, and substantially improves its running time with sliding windows and bottom-up incremental computation. Text classification experiments on the Reuters-21578 and 20 newsgroups datasets, together with experiments using pure high-order interactions to improve the classical N-gram language model, show that the algorithm is effective and practical.
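As a concrete illustration of the central quantity such a mining algorithm must evaluate, here is a short, self-contained Python sketch that estimates $\theta_{123}$ for one word triple from presence/absence counts over a document collection. It is a minimal sketch under the log-linear parameterization above, not the thesis's actual algorithm; the function name, smoothing constant, and toy data are hypothetical.

    from collections import Counter
    from itertools import product
    from math import log

    def theta_123(docs, w1, w2, w3, alpha=0.5):
        """Estimate the pure third-order interaction theta_123 of three words.

        docs: iterable of sets of words, one set per document.
        alpha: additive smoothing pseudo-count so every cell of the
        2x2x2 presence/absence contingency table stays positive.
        (Hypothetical illustration, not the thesis's implementation.)
        """
        counts = Counter()
        n = 0
        for doc in docs:
            counts[(int(w1 in doc), int(w2 in doc), int(w3 in doc))] += 1
            n += 1
        # Smoothed cell probabilities p(x1, x2, x3) of the 2x2x2 table.
        p = {cell: (counts[cell] + alpha) / (n + 8 * alpha)
             for cell in product((0, 1), repeat=3)}
        # In the log-linear expansion, theta_123 is the alternating sum of
        # log cell probabilities: cells with an even number of zeros enter
        # with +, cells with an odd number of zeros with -.
        return (log(p[1, 1, 1]) + log(p[1, 0, 0]) + log(p[0, 1, 0]) + log(p[0, 0, 1])
                - log(p[1, 1, 0]) - log(p[1, 0, 1]) - log(p[0, 1, 1]) - log(p[0, 0, 0]))

    # Toy check: words that co-occur only as a full triple yield a clearly
    # positive theta_123.
    docs = [{"a", "b", "c"}] * 6 + [{"a"}, {"b"}, {"c"}, set()] * 3
    print(theta_123(docs, "a", "b", "c"))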
