Data Mining Algorithms Based on Information Theory
Abstract
Several concepts from information theory can measure the correlation and diversity among the objects under study, as well as the distance between distributions; these techniques have been widely applied throughout computer science. In this thesis we use information-theoretic techniques to define several data mining problems and propose corresponding mining algorithms. The problems we address include mining correlation patterns, mining diversity patterns, feature selection, and correlation clustering. We also discuss the privacy-disclosure risks that arise when data is publicly released to supply real data for data mining applications, continuing the study of the t-closeness privacy preservation model.
     The main contributions of this thesis can be summarized as follows:
     1. Based on conditional entropy, which measures the dependence between random variables, we introduce a symmetric information distance satisfying the triangle inequality; with this distance we define new dependency trees and correlation patterns, propose corresponding mining algorithms, and measure the correlation between features for feature selection.
     2. Based on the joint entropy of random variables, we introduce the problem of mining entropy diversity patterns over binary data. By establishing relationships among the joint entropies of different random variables, we propose fast diversity pattern mining algorithms based on the resulting upper and lower bounds, and on this foundation an improved algorithm for mining non-redundant interacting feature subsets.
     3. Based on the Kullback-Leibler divergence between continuous distributions, we propose a new nonlinear correlation clustering algorithm.
     4. Based on the Kullback-Leibler divergence between discrete distributions, we introduce a new t-closeness privacy preservation model that remedies flaws in existing approaches, and we discuss its connection to semantic privacy.
     In each of these works we give the problem definition, analyze the problem or the properties of the objects studied, and propose mining or implementation algorithms; finally, experiments on synthetic and real data verify the efficiency of our algorithms and the utility of the mined objects.
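The abstract does not spell out the symmetric information distance of contribution 1. A common construction from conditional entropies, d(X, Y) = H(X|Y) + H(Y|X) (the "variation of information"), is sketched below as an assumption; the thesis's exact definition may differ. The sample sequences are hypothetical.

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy (in bits) of a sequence of symbols."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def info_distance(xs, ys):
    """d(X, Y) = H(X|Y) + H(Y|X) = 2*H(X,Y) - H(X) - H(Y).

    This quantity is symmetric and satisfies the triangle inequality.
    """
    joint = entropy(list(zip(xs, ys)))
    return 2 * joint - entropy(xs) - entropy(ys)

# hypothetical binary samples
x = [0, 0, 1, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 1, 1, 0, 0]
z = [0, 1, 0, 1, 0, 1, 0, 1]

assert abs(info_distance(x, y) - info_distance(y, x)) < 1e-9          # symmetry
assert info_distance(x, z) <= info_distance(x, y) + info_distance(y, z) + 1e-9
```

Because the distance is a true metric, mining algorithms built on it can prune candidates with the triangle inequality, which is one reason such a distance is preferable to raw mutual information.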
Several concepts in information theory can be used to measure the correlation and diversity among the objects under study, as well as the distance between probability distributions. These techniques have found many applications across computer science. In this thesis, we define several data mining problems based on information theory and develop techniques for these tasks. The problems we address include mining correlation patterns and diversity patterns, feature selection, and correlation clustering. We also discuss privacy preservation when data is published publicly for data mining applications, focusing on the t-closeness privacy preservation model.
     The main contributions of this thesis can be summarized as follows:
     1. Based on conditional entropy, we introduce a symmetric information distance that satisfies the triangle inequality, define the problems of finding novel dependency trees and correlation patterns, and propose algorithms for these mining tasks. We also propose a feature selection algorithm that uses this distance to measure the correlation between features.
     2. Based on the joint entropy of random variables, we introduce the problem of finding entropy diversity patterns in binary data. By establishing several bounds relating the entropies of different random variables, we propose efficient algorithms to find these diversity patterns. We also develop an improved mining algorithm for non-redundant interacting feature subsets.
     3. Based on the Kullback-Leibler divergence between continuous distributions, we develop a novel nonlinear correlation clustering algorithm.
     4. Based on the Kullback-Leibler divergence between discrete distributions, we introduce a novel t-closeness privacy preservation model that addresses drawbacks of previous approaches. We also discuss the relationship between our new model and semantic privacy.
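As a hedged illustration of contribution 4: t-closeness requires that the sensitive-attribute distribution inside each equivalence class stay close to the table-wide distribution, and the thesis measures that closeness with KL divergence. The helper names, distributions, and threshold below are hypothetical, not taken from the thesis.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_i p_i * log2(p_i / q_i), in bits.

    Requires q_i > 0 wherever p_i > 0; zero-probability terms of P
    contribute nothing.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def satisfies_t_closeness(class_dist, global_dist, t):
    """A KL-based t-closeness check: the divergence from an equivalence
    class's sensitive-value distribution to the global distribution must
    not exceed the threshold t."""
    return kl_divergence(class_dist, global_dist) <= t

global_dist = [0.5, 0.3, 0.2]   # hypothetical table-wide sensitive-value frequencies
class_dist  = [0.6, 0.2, 0.2]   # one equivalence class's frequencies

print(kl_divergence(class_dist, global_dist))            # ≈ 0.041 bits
print(satisfies_t_closeness(class_dist, global_dist, t=0.1))  # True
```

Note that KL divergence is asymmetric, so which argument holds the class distribution and which holds the global one matters to the check.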
     For each of these works, we present the problem definition, analyze the problem or the properties of the objects studied, and develop the mining or implementation algorithms. The efficiency and effectiveness of each technique are verified through experiments on both synthetic and real data sets.
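For contribution 3, the KL divergence between continuous distributions has a closed form in the Gaussian case, which is what makes it practical inside a clustering loop. The sketch below shows that standard univariate formula only; the thesis's clustering algorithm itself is not reproduced here.

```python
import math

def kl_gaussian(mu0, sigma0, mu1, sigma1):
    """Closed-form D(N(mu0, sigma0^2) || N(mu1, sigma1^2)) in nats."""
    return (math.log(sigma1 / sigma0)
            + (sigma0 ** 2 + (mu0 - mu1) ** 2) / (2 * sigma1 ** 2)
            - 0.5)

# identical Gaussians have zero divergence
assert abs(kl_gaussian(0.0, 1.0, 0.0, 1.0)) < 1e-12

# KL divergence is asymmetric in general
print(kl_gaussian(0.0, 1.0, 1.0, 2.0))   # ≈ 0.443
print(kl_gaussian(1.0, 2.0, 0.0, 1.0))   # ≈ 1.307
```

A clustering algorithm comparing fitted Gaussians can evaluate this expression directly instead of integrating densities numerically.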
