An Improved Method for Monte Carlo Bayesian Classification
Abstract
With the development of information technology and the widespread application of database technology, the amount of accumulated information keeps growing, and extracting the knowledge we are interested in from such massive information has become a pressing problem for society. Knowledge discovery in databases (KDD) emerged in response to this need and is now one of the most active research topics. KDD identifies valid, novel, potentially useful, and ultimately understandable information in databases. Data mining is the core step of KDD, drawing on databases, artificial intelligence, mathematical statistics, visualization, parallel computing, and related fields.
    Classification is an important task in data mining. By constructing a classification function or model (commonly called a classifier), it maps each data item in a database to one of a set of given classes, so that the model can predict the classes of objects whose labels are unknown. Among the many classification methods, Bayesian classification has attracted attention for its simple structure and good performance. Unlike other methods, it rests on a solid foundation of mathematical statistics: Bayes' theorem for computing posterior probabilities. In theory, it is optimal whenever its assumptions are satisfied.
    Monte Carlo is a family of methods that approximately solve mathematical or physical problems through statistical sampling. Applied to Bayesian classification, it first obtains, from the known prior probabilities, the conditional probability distributions of the classes with unknown labels; it then uses some sampler to draw random data satisfying those distributions; finally, tallying these random draws yields the posterior probability distribution of each unlabeled class. Running a suitably constructed Markov chain makes it easy to obtain random samples from a particular distribution, so Markov chain Monte Carlo (MCMC) is the most common Monte Carlo approach to Bayesian classification.
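The MCMC procedure described above can be illustrated with a minimal random-walk Metropolis sampler. This is a sketch only: the toy Gaussian model, the data, the proposal step size, and the burn-in length are illustrative assumptions, not details from the thesis.

```python
import math
import random

random.seed(0)

def metropolis_hastings(log_post, n_draws, x0=0.0, step=0.5):
    """Random-walk Metropolis: returns draws whose stationary
    distribution is proportional to exp(log_post)."""
    x, draws = x0, []
    for _ in range(n_draws):
        prop = x + random.gauss(0.0, step)          # propose a move
        log_ratio = log_post(prop) - log_post(x)
        if log_ratio >= 0 or random.random() < math.exp(log_ratio):
            x = prop                                # accept the proposal
        draws.append(x)                             # else keep current state
    return draws

# Toy model (assumed for illustration): y ~ N(theta, 1), prior theta ~ N(0, 10^2).
data = [2.1, 1.9, 2.3, 2.0, 1.8]

def log_posterior(theta):
    log_prior = -theta**2 / (2 * 10.0**2)
    log_lik = -sum((y - theta)**2 for y in data) / 2.0
    return log_prior + log_lik

draws = metropolis_hastings(log_posterior, 5000)
burned = draws[1000:]                               # discard burn-in
post_mean = sum(burned) / len(burned)
```

Averaging the retained draws approximates the posterior mean, which for this toy data set lies near the sample mean of `data`; the same machinery, run per class, yields the posterior class probabilities the abstract refers to.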
    MCMC can reduce the time and space costs of data mining, but for massive datasets it remains computationally impractical. This thesis improves the MCMC method so that it can be applied to mining massive datasets. The proposed approach partitions the dataset and changes MCMC's scanning strategy, splitting it into two nested loops: the outer loop scans the dataset, while the inner loop scans the draws from the posterior distribution. Furthermore, the thesis evaluates the sampling efficiency and the effective sample size, and applies particle filtering to further strengthen the practical capability of data mining on massive datasets.
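The two-loop scan with effective-sample-size monitoring and particle-filter resampling can be sketched as follows. This is a schematic reconstruction under stated assumptions: the data chunking, the toy Gaussian likelihood, and the `n/2` resampling threshold are illustrative choices, not the thesis's exact algorithm.

```python
import math
import random

random.seed(1)

def ess(weights):
    """Effective sample size of normalized importance weights."""
    return 1.0 / sum(w * w for w in weights)

def sequential_scan(chunks, particles, log_lik):
    """Outer loop: one pass over the data chunks.
    Inner loop: reweight every posterior draw (particle) by the
    likelihood of the current chunk; resample when the ESS drops."""
    n = len(particles)
    logw = [0.0] * n
    for chunk in chunks:                               # outer loop: scan data
        logw = [lw + log_lik(p, chunk)                 # inner loop: scan draws
                for lw, p in zip(logw, particles)]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]          # stabilized weights
        total = sum(w)
        weights = [x / total for x in w]
        if ess(weights) < n / 2:                       # particle-filter step
            particles = random.choices(particles, weights=weights, k=n)
            logw = [0.0] * n
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    total = sum(w)
    return particles, [x / total for x in w]

# Toy run (assumed): particles are candidate means; chunks are N(2, 1) draws.
chunks = [[random.gauss(2.0, 1.0) for _ in range(50)] for _ in range(10)]
particles = [random.uniform(-5, 5) for _ in range(1000)]

def log_lik(theta, chunk):
    return -sum((y - theta)**2 for y in chunk) / 2.0

parts, wts = sequential_scan(chunks, particles, log_lik)
post_mean = sum(w * p for w, p in zip(wts, parts))
```

Because each chunk of the dataset is visited only once in the outer loop, the full data never has to be rescanned per posterior draw, which is the source of the claimed savings on massive datasets; the ESS check decides when the weighted draws have degenerated enough to warrant resampling.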
