面向隐私保护的关联规则挖掘研究

英文题名：Research on Privacy Preserving Association Rules Mining
作者：盛荣华
论文级别：硕士
学科专业名称：管理科学与工程
中文关键词：隐私保护 ; 数据挖掘 ; 关联规则 ; 矩阵变换 ; 知识共享
英文关键词：Privacy Preservation ; Data Mining ; Association Rule ; Matrix Transformation ; Knowledge Sharing
学位年度：2012
导师：郑建国
学科代码：1201
学位授予单位：东华大学
论文提交日期：2011-12-01

摘要

数据挖掘目前是数据库研究中最活跃的分支之一,不论科学研究还是商业应用,数据挖掘都取得了可喜的成果。但与此同时,数据挖掘也面临着很多问题的挑战。其中,数据挖掘的个人隐私与信息安全问题尤其得到关注。误用、滥用数据挖掘可能导致用户数据特别是敏感信息的泄漏,越来越多的人们对此表示担忧,甚至拒绝提供真实的数据。如何在不暴露用户隐私的前提下进行数据挖掘,也就成了人们非常感兴趣的课题。
     本文针对关联规则挖掘中的隐私保护问题进行研究。首先介绍了相关背景知识,对现有的隐私保护关联规则挖掘作了分析和介绍。接着详细阐述并分析了典型的Apriori算法。以及对隐私保护关联规则挖掘算法MASK算法作了详细介绍,并且对MASK算法和Apriori算法在运行时间上作了个比较；针对MASK算法其存在的问题及其原因进行了详细分析。
     在此基础上,从隐私保护对象为原始数据集的角度出发,针对关联规则挖掘中如何保护隐私数据信息的问题,首先从数据存储结构角度进行改进,利用数学集合理论,改变数据存储方式,从而减少了重构原数据支持度过程中的扫描数据库的数目,消除了重构原数据项支持度的指数复杂度,并给出了其描述；其次从概率变换矩阵角度出发,采用随机参数扰动方法对数据进行歪曲,然后对概率矩阵进行变换,再进行关联规则的挖掘,并使用传统隐私保护度评价方法与矩阵变换的方向隐私保护度相结合的方法评价变换的隐私保护度。有效地解决了按照一般的隐私保护度的评价方法会产生一些特殊值与实际值不符的情况,以及在数据集容量很大的情况下运算量大的问题。通过理论分析和实验论证,证明了该方法具有很好的隐私性、高效性和适用性。
     本文最后将基于改进的隐私保护关联规则挖掘算法应用到协同商务知识共享中,分析了算法的应用背景,然后详细说明了算法的应用过程,并对算法的应用情况作出了初步的评价。
Data mining has long been an active area of database research.In the field of science research or business application, data mining both has gained pleasing achievement, however, accompanying such benefits are concerns about information privacy.Because of these concerns, some people might decide to give false information in fear of privacy problem, or they might simply refuse to divulge any information at all.So privacy is an important issue in data mining and knowledge discovery.Design and analysis of privacy preserving data mining is meaningful and has attracted much interest in this field.
     In this thesis, the author studies privacy preserving association rules mining. First introduces the relevant background knowledge and analyzes and introduces the existing typical privacy preserving association rules mining, and then analyzes the characteristics and limitations of a typical algorithm called Apriori. As well as describing privacy protection association rule mining algorithm called MASK algorithm in detail, and makes a comparison between MASK algorithm and Apriori algorithm in running time; MASK algorithm for the existence of problems and their causes are analyzed in detail.
     On this basis, Object from the privacy point of view of the original data set for mining association rules on how to protect data privacy issues, first from the perspective of improving the data storage structure, the use of mathematical set theory, changes the data storage means, thereby reducing the reconstruction of the original data support the process of scanning the number of databases, eliminating data entry support reconstruction of the original complexity of index, and its description is given. Then, in the transformed dataset, first performs cluster analysis to obtain normalized data, and then mines association rules, and evaluates the privacy preserving degree of the matrix transformation using the combination of the traditional evaluation method and the direction privacy preserving degree. The algorithm has solved some problems effectively, such as the problem that some special values are inconsistent with the actual values according to the traditional evaluation method and the computational problems when handling large data sets. Theoretical analysis and demonstrations show that the method in this paper has very good privacy, efficiency and applicability.
     The thesis concludes with an application of the algorithm in this paper in knowledge sharing of collaborative commerce. It analyzes the application background of this algorithm, and then presents elaborate detailed application process of the algorithm, puts up a preliminary evaluation of the results.

引文

[1]任静涵,张保稳,陈晓桦.隐私保护数据挖掘研究进展[J].信息安全与通信保密,2008,05：77-80.
    [2]Du Wenliang,Zhan Zhijun.Using randomized response techniques for privacy-preserving data mining.The 9th ACM SIGKDD Int'l Conf.Knowledge Discovery in Databases and Data Mining,Washinton,D.C.,2003:505-510.
    [3]罗永龙,黄刘生,荆巍巍等.一个保护私有信息的布尔关联规则挖掘算法[J].电子学报,2005,(10)：900-903.
    [4]Alexandre Evfimievski,Ramakrishnan Srikant,Rakesh Agrawal,Johannes Gehrke.Privacy preserving mining of association rules.Information Systems 29(2004):343-364.
    [5]Paillier P. Composite-residuosity Based Cryptography:An Overview. CryptoBytes, 2002,5(1):20-26.
    [6]吴媛媛,沈雪明.基于隐私保护的决策树构造[J].计算机工程,2006,32(03)：96-98.
    [7]Kaya S V, Pedersen T B. Efficient Privacy Preserving Distributed Clustering Based on Secret Sharing. Emerging Technologies in Knowledge Discovery and Data Mining,2007,4819(07):280-289.
    [8]蒋栋栋,孙志挥,汪晓刚等.水平分布数据集的隐私保护关联挖掘算法[J].计算机工程,2009,35(2)：60-62.
    [9]Mchmcd K.数据挖掘-概念、模型、方法和算法[M].闪四清.第一版.北京：清华大学出版社,2003.
    [10]韩家炜,堪博.数据挖掘：概念与技术[M].范明,孟小峰译.北京：机械工业出版社.2007,8：23-41.
    [11]Chen, M. S, Han, J. and Yu, P. S. Data Mining:An Overview from a Database Perspective[J], IEEE Transactions on Knowledge and Data Engineering,1998,8(6): 866-883.
    [12]Park J-H,Kanitkar V,Delis A. Logically Clustered Architec-tures for Networked Databases[J].Distributed and Parallel Databases,2001,10(2):161-198.
    [13]冯玉才,万春.基于集群的数据库系统原型[J].计算机工程与学,2005,27(3)：56-57.
    [14]邹志文,朱金伟.数据挖掘算法研究与综述[J].计算机工程与设计,2005,26(9)：2304-2307.
    [15]黄伟伟,柏文阳.聚类挖掘中隐私保护的几何数据转换方法[J].计算机应用研究,2006,(6)：180-184
    [16]黄毅群,卢正鼎,胡和平等.分布式环境下保持隐私的关联规则挖掘算法[J].计算机工程,2006,32(13)：12-14.
    [17]邵峰晶,于忠清编著.数据挖掘原理与算法[M].北京：中国水利水电出版社,2003.91-92.
    [18]姚瑶,吉根林.一种于隐私保护的分布式聚类算法[J].计算机科学.2009,36(3)：100-102.
    [19]方炜炜,胡健,杨炳儒等.分布式决策树挖掘的隐私保护研究[J].计算机科学,2009,36(4)：239-242.
    [20]Estivill-Castro,V. Private Representative-Based Clustering for Vertically Partitioned Data [C]. In Proceedings of the 5th Mexican International Conference in Computer Science (ENC'04), Colima, Mexico, September 2004,pp.160-167.
    [21]温晗.保护隐私的数据挖掘方法研究[D].杭州：浙江大学,2006.
    [22]高源,马朝红.隐私保持关联规则挖掘方法[J].燕山大学学报(自然版),2004,28(1)：74-79.
    [23]杨晓春,刘向宇,工斌等.支持多约束的K-匿名化方法[J].软件学报,2006,17(5)：1222-1231.
    [24]周水庚,李丰,陶宇飞等.面向数据库应用的隐私保护研究综述[J].计算机学报.2009,32(5)：847-861.
    [25]V.S.Verykios,E.Bertino,I.N.Fovino,Y.Saygin,Y.Th eodoridis.State-of-the-art in privacy preserving data mining.ACM SIGMOD Record.2004,vol.33:50-57.
    [26]焦李成,刘芳,缑水平,刘静,陈莉.智能数据挖掘与知识发现[M].西安电子科技大学出版社,2006.
    [27]杨晓春,刘向宇,王斌,于戈.支持多约束的K一匿名化方法[J].软件学报,2006,27(5)：1222-1231.
    [28]吕品,陈年生,董武世.面向隐私保护的数据挖掘技术研究[J].计算机技术与发展,2006,16(7)：147-149.
    [29]Agrawal, R. and Srikant, R. Fast Algorithms for Mining Association Rules in Large Database[C]. in Proceedings of the 20th International Conference on Very Large Data Bases,1996:487-499.
    [30]Han, J, Pei, J, Yin, Y and Mao, R. Mining Frequent Patterns without Candidate Generation:a Frequent-Pattern Tree Approach[J]. Data Mining and Knowledge Discovery,2004,8(1):53-87.
    [31]胡可云,田凤占,黄厚宽.数据挖掘理论与应用[M].清华大学出版社,北京交通大学出版社,2008.
    [32]Holt, J. D. and Chung, S. M. Mining Association Rules Using Inverted Hashing and Pruning[J]. Information Processing Letters,2002,83:211-220.
    [33]Li, Z. C., He, P. L. and Lei, M. A High Efficient AprioriTid Algorithm for Mining Association Rule[C]. in Proceedings of the Fourth International Conference on Machine Learning and Cybernetics,2006:1812-1815.
    [34]Tsay, Y. J. and Chang-Chien, Y. W. An Efficient Cluster and Decomposition Algorithm for Mining Association Rules[J]. Information Sciences,160,161-171.
    [35]刘华婷,郭仁祥,姜浩.关联规则挖掘算法的研究与改进[J].计算机应用与软件,2009,26(1)：146-149.
    [36]李新良,陈湘涛.数据挖掘中关联规则算法的研究[J].计算机工程与科学,2007,29(12)：111-116.
    [37]韩秋明,李微,李华锋.数据挖掘技术应用实例[M].机械工业出版社,2009,4：254-258.
    [38]Verykios VS, Bertino E, Fovino IN, Provenza LP, Saygin Y, Theodoridis Y. State-of-the-Art in privacy preserving data mining[J].SIGMOD Record, 2004,33(1):50-57.
    [39]Johnsten T, Raghavan V. A methodology for hiding knowledge in databases[C]/ /Proc of IEEE International Conference on Privacy,Security and DataMining. Darlinghurst:Australian Computer Society,2004:9-17.
    [40]Lindell Y, Pinkas B. Privacy preserving data mining[J]. Journal of Crypto logy, 2005,15(3):177-206.
    [41]Kantarcioglu M, Clifton C. Privacy-Preserving distributed mining of association rules on horizontally partitioned data. IEEE Trans.on Knowledge and Data Engineering,2006,16(9):1026-1037.
    [42]Agrawal R,Srikant R.Privacy-preserving Data Mining[C]//Proc.of ACM SIGMOD Intl.Conf.on Management of Data,2000-05.
    [43]Saygin Y,Verykios V,Elmagarmin A.Privacy Preserving Association Rule Mining[C]//Proc.of 12th Intl. Workshop on Research Issues in Data Engineering(RIDE),2004-02.
    [44]Rizvi S J,Haritsa J R.Maintaining Data Privacy in Associotion Rule Mining[C]//Proc.of 28th Intl.Conf.on Very Large Data Bases(VLDB),2004-09.
    [45]郭宇红,童云海,唐世渭.数据库中的知识隐藏综述[J].软件学报,2007,18(11)：2782-2799.
    [46]KAB IR SM A, YOUSSEF AM, ELHAKEEM A K. On data distortion for privacy p reserving data mining[C]//Proc of Canadian Conference on Electrical and Computer Engineering.2007:308-311.
    [47]Verykios V S, Elmagarmid A, Bertino E, et al. Association Rule Hiding[J]. IEEE Trans. on Knowledge and Data Engineering,2006,16(4):434-447.
    [48]Oliveira S R M, Zaiane O R. Algorithms for Balancing Privacy and Knowledge Discovery in Association Rule Mining[C]//Proc. of the 7th Int'l Database Engineering and Application Symp.. Hong Kong,China:IEEE Computer Society, 2003.
    [49]Oliveira S R M, Zalane O R. A Unified Framework for Protecting Sensitive Association Rules in Business Collaboration[J]. Int'l Journal of Business Intelligence and Data Mining,2006, 1(3):247-287.
    [50]李霞.陈子军.吕庆春.基于移项的隐私保护关联规则挖掘算法[J].计算机工程,2009.6.35(12)：59-61.
    [51]张鹏,童云海,唐世渭,杨冬青,马秀莉.一种有效的隐私保护关联规则挖掘方法[J].软件学报,2006,8(8)：1764-1774.
    [52]郑利荣,印鉴.一种基于隐私保护的关联规则挖掘算法[J].现代计算机,2009,6：10-14.
    [53]WANG S L, MASKEY R, JAFAR IA, et al. Efficient sanitization of informative association ruleswith updates [C]//Proc of International Conference on Information and Automation.2006:331-336.
    [54]欧阳金亮,陆黎明.基于隐私保护的关联规则挖掘算法[J].计算机与数字工程,2010,38(8)：55-58.
    [55]AGRAWAL R, SR IKANT R. Privacy-preserving data mining [J].ACM SIGMOD Record,2000,29 (2):439-450.
    [56]Evfimievski A,Srikant R,Agrawal R,et al.Privacy Preserving Mining of Association Rules[C]//Proc of 8th ACM SIGKDD international Conference on Knowledge Discovery and Data Mining.Edmonton,Alberta,Canada,2002,217-228.
    [57]李乃乾,沈钧毅,田絮资.基于隐私保护的跨表关联规则挖掘[J].模式识别与人工智能,2005,12.16(4)：418-422.
    [58]Rizvi SJ, Haritsa JR. Maintaining Data Privacy in AssociationRule Mining[C]. In: Bernstein PA, Ioannidis YE, Ramakrishnan R, Papadias D, eds. Proc. of the 28th Int'lConf. on Very Large Databases. Hong Kong:Morgan KaufmannPublishers, 2002:682～693.
    [59]陈芸,张伟,邹汉斌.基于多参数随机扰动的布尔关联规则挖掘[J].计算机工程,2006,32(10)：63-65.
    [60]GUO Yu-hong, TONG Yu-hai, TANG Shi-wei, et al. A FP-tree-based method for inverse frequent set mining[C]//Proc of the 23 rd British National Conference on Databases. Berlin:Springer-Verlag,2006:152-163.
    [61]陈晓明,李军怀,彭军,刘海玲,张璟.隐私保护数据挖掘算法综述[J].计算机科学,2007,34(6)：183-187.
    [62]桂琼,程小辉.一种隐私保护的分布式关联规则挖掘方法[J].微电子学与计算机,2009.9,26(9)：57-60.
    [63]Oliveria S R M,Zaiane O R. Privacy Preserving Frequent Itemset Mining. In:Workshop on Privacy,Security,and Data Mining at The 2002 IEEE International Conference on Data Mining (ICDM'02),Maebashi City,J apan,December 2002.
    [64]Vassilios S.Veyrkios,Elisa Bertino,Igor Nai Fovino,Loerdana Parasiliti Porvenza,YueeI Syagin,Yannis Theodoridis. State-of-the-art in Privacy Persevring Data Mining. ACM SIGMOD Record, v.33n.I,March 2006.
    [65]王恩彬,吴陈,曾庆军.分布式环境下保持隐私的关联规则挖掘算法[J].科学技术与工程,2009.1,9(1)：135-140.
    [66]张瑞,郑诚.基于隐私保护的关联规则挖掘算法[J].计算机工程,2009.2,35(4)：78-82.
    [67]李学明,刘志军,秦东霞.隐私保护数据挖掘[J].计算机应用研究,2008.12,25(12)：3550-3555.
    [68]李云峰等.关联规则挖掘的研究及对Apriori算法的改进[J].计算机工程与科学,2002,24(6)：176-179.
    [69]熊励,陈子辰,梅益.协同商务理论与模式.上海社会科学出版社,2006.6：15-23.
    [70]Ling Qiu, Yingjiu Li, Xintao Wu:An Approach to Outsourcing Data Mining Tasks while Protecting Business Intelligence and Customer Privacy. ICDM Workshops 2006:551-558.
    [71]范力.协同商务平台数据交换系统安全性应用研究[D].西南交通大学,2006.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700