A Pure High-Order Interaction Model Based on Information Geometry and Its Applications
Abstract
In statistical modeling one often needs to analyze the interactions among multiple random variables. Based on the theory of information geometry, this thesis defines the pure high-order interaction among multiple random variables, proposes a novel method for extracting variables with pure high-order interactions, and validates the method on concrete text processing tasks.
Information geometry studies probability and statistics with the tools of differential geometry: a family of probability distributions is treated as a Riemannian manifold in a high-dimensional space, and the geometric structure of that manifold is analyzed to reach deeper conclusions about the distributions. Building on the orthogonality of coordinates in information geometry, this thesis derives a mixed coordinate system that properly represents the pure interactions among multiple variables, uses it to construct a test for pure high-order interacting variables, and provides a theoretical basis for setting the parameters of the algorithm. In addition, the implication relations among several kinds of high-order interactions are derived theoretically, laying a foundation for further research.
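The abstract does not reproduce the coordinate construction, so the following is a minimal worked sketch in the standard log-linear parameterization of information geometry (the theta/eta notation follows Amari's literature and is an assumption here, not the thesis's own formulas). For three binary variables $x_1, x_2, x_3$ the joint distribution expands as

\[
\log p(x_1, x_2, x_3) = \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij}\, x_i x_j + \theta_{123}\, x_1 x_2 x_3 - \psi(\theta),
\]

with $\psi(\theta)$ the normalizer, and the dual expectation coordinates are $\eta_i = E[x_i]$, $\eta_{ij} = E[x_i x_j]$. The mixed coordinates

\[
\xi = (\eta_1, \eta_2, \eta_3, \eta_{12}, \eta_{13}, \eta_{23};\ \theta_{123})
\]

are orthogonal under the Fisher information metric, so $\theta_{123}$ quantifies the third-order interaction "purely": it is unaffected by the marginal means and pairwise correlations fixed by the $\eta$-part, and testing $\theta_{123} = 0$ tests for pure triplewise dependence.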
In text processing tasks, the presence or absence of a word in a document can be treated as a Boolean random variable. Naive models typically assume that these variables are completely independent, or tied only by some simple low-order dependence, which is unreasonable in many cases. Based on information geometry, this thesis proposes an efficient heuristic algorithm for mining word groups with pure high-order interactions, and substantially improves its running time with sliding windows and bottom-up incremental computation. Text classification experiments on the Reuters-21578 and 20 newsgroups datasets, together with experiments using pure high-order interactions to improve the classical N-gram language model, show that the algorithm is effective and practical.
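As a concrete illustration of the central quantity such a mining algorithm must evaluate, here is a short, self-contained Python sketch that estimates $\theta_{123}$ for one word triple from presence/absence counts over a document collection. It is a minimal sketch under the log-linear parameterization above, not the thesis's actual algorithm; the function name, smoothing constant, and toy data are hypothetical.

    from collections import Counter
    from itertools import product
    from math import log

    def theta_123(docs, w1, w2, w3, alpha=0.5):
        """Estimate the pure third-order interaction theta_123 of three words.

        docs: iterable of sets of words, one set per document.
        alpha: additive smoothing pseudo-count so every cell of the
        2x2x2 presence/absence contingency table stays positive.
        (Hypothetical illustration, not the thesis's implementation.)
        """
        counts = Counter()
        n = 0
        for doc in docs:
            counts[(int(w1 in doc), int(w2 in doc), int(w3 in doc))] += 1
            n += 1
        # Smoothed cell probabilities p(x1, x2, x3) of the 2x2x2 table.
        p = {cell: (counts[cell] + alpha) / (n + 8 * alpha)
             for cell in product((0, 1), repeat=3)}
        # In the log-linear expansion, theta_123 is the alternating sum of
        # log cell probabilities: cells with an even number of zeros enter
        # with +, cells with an odd number of zeros with -.
        return (log(p[1, 1, 1]) + log(p[1, 0, 0]) + log(p[0, 1, 0]) + log(p[0, 0, 1])
                - log(p[1, 1, 0]) - log(p[1, 0, 1]) - log(p[0, 1, 1]) - log(p[0, 0, 0]))

    # Toy check: words that co-occur only as a full triple yield a clearly
    # positive theta_123.
    docs = [{"a", "b", "c"}] * 6 + [{"a"}, {"b"}, {"c"}, set()] * 3
    print(theta_123(docs, "a", "b", "c"))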
