多参数扰动的隐私保护关联规则挖掘算法研究

英文题名：Research on Multi-parameters Perturbation Privacy Preserving Association Rules Mining Algorithm
作者：李威
论文级别：硕士
学科专业名称：计算机软件与理论
中文关键词：数据挖掘 ; 关联规则 ; 隐私保护 ; 多参数扰动
英文关键词：Data mining ; Association rule ; Privacy preservation ; Multi-parameters perturbation
学位年度：2010
导师：刘杰
学科代码：081202
学位授予单位：哈尔滨工程大学
论文提交日期：2010-01-01

摘要

随着信息技术、网络技术、数据存储技术和高性能处理器技术的进步,数据资料的规模急速膨胀,从而促进了数据挖掘(Data Mining,DM)技术的产生和飞速发展。数据挖掘在不断的挖掘出知识和规律,为政府、企业和个人带来便利的同时,也不可避免的涉及到人们的隐私问题。同时,随着社会的进步,人们对隐私的重视程度也越来越高,这也给数据挖掘带来了新的困难。隐私保护数据挖掘就是为了解决隐私保护和数据挖掘之间的矛盾而产生和发展的。
     本文首先阐述了数据挖掘、关联规则挖掘的基本理论和隐私保护数据挖掘的主要技术。在此基础上,对隐私保护关联规则挖掘的经典算法MASK算法进行了深入浅出的介绍和分析,并对多参数扰动算法做了详细的研究。
     与MASK算法相比,多参数扰动算法提高了隐私保护度和数据挖掘的准确度,但多参数扰动算法的频繁项集还原部分仍存在时间效率不高的问题,尤其是随着项集的增大,这个问题变的越来越严重。针对这个问题,本文对多参数随机扰动算法进行了较深入的研究,并根据该算法频繁项集还原模型的特点提出了两个改进的方法。方法一把求解转换矩阵逆矩阵的过程由两步变为一步,从而提高了时间效率。方法二由要求出转换矩阵逆矩阵的所有元素变为只求出转换矩阵逆矩阵的首行元素,从而又进一步提高了时间效率。最后通过理论分析和实验结果,表明方法一改进后的算法的时间效率高于原算法的时间效率,方法二改进后的算法的时间效率高于方法一改进后的算法的时间效率。另外,方法二改进后的算法在空间效率方面比原算法也有一定的改进。因为各种多参数扰动算法的频繁项集还原模型是一样的,所以对多参数随机扰动算法的改进也可以应用到别的多参数扰动算法上。
As information technology, network technology, data storage technology and high-performance processor technology advance, the size of data expands quickly, and all these contribute to the generation and the rapid development of the data mining technology. While data mining helps government, enterprises and individuals to get knowledge and rules, and brings benefits to them, it inevitably involves people’s privacy. Meanwhile, with the progress of society, people pay more and more attention to privacy, and new difficulties was brought to data mining. In order to get across the gap between data privacy and data mining, privacy preserving data mining emerged and has been developing quickly.
     Firstly, the basic theory of data mining, association rules mining and the main technology of privacy preserving association rules mining is described in this thesis. Then, one of the classic privacy preserving association rules mining algorithms-MASK algorithm is introduced simply, and multi-parameters perturbation algorithms are studied carefully.
     Compared with MASK algorithm, multi-parameters perturbation algorithms improve the degree of privacy preserving and data mining accuracy. However, the time-efficiency of restoring the frequent itemsets in multi-parameters perturbation algorithms is still not high, and the problem becomes more and more serious with the increase of itemset. Addressing this issue, multi-parameters randomized perturbation algorithm is dissected carefully. Two methods are proposed in this thesis to improve the time efficiency of multi-parameters randomized perturbation algorithm according to the characteristics of the model to restore frequent itemsets. The first method improves the time efficiency by merging the process of inversing the transformation matrix from two steps into one step. The second method improves the time efficiency further by getting the elements of the first line of the inversed matrix of transformation matrix, while the first method need get all the elements of the inversed matrix of transformation matrix. Finally, both theoretical analysis and experimental results indicate that the first improved algorithm is more efficient than the original algorithm and the second improved algorithm is more efficient than the first improved algorithm. In addition, the second improved algorithm is more space-efficient than the original algorithm. Because the models of restoring frequent itemsets for a variety of multi-parameters perturbation algorithms are same, so the improvements can also be applied to other multi-parameters perturbation algorithms.

引文

[1] L.F.Cranor,J.Reagle,M.S.Ackerman.Beyond concern :Understanding net users’attitudes about online privacy.Technical report,AT&T Labs-Research.1999:6-7
    [2]任静涵,张保稳,陈晓桦.隐私保护数据挖掘研究进展.信息安全与通信保密.2008年05期:77-79
    [3] S.L.Warner.Randomized response:A survey technique for eliminating evasive answer bias.Journal of the American Statistical Association.1965,vol.60:63-69
    [4] Du Wenliang,Zhan Zhijun.Using randomized response techniques for privacy-preserving data mining.The 9th ACM SIGKDD Int’1 Conf.Knowledge Discovery in Databases and Data Mining,Washinton,D.C.,2003:505-510
    [5]罗永龙,黄刘生,荆巍巍,等.一个保护私有信息的布尔关联规则挖掘算法.电子学报,2005,(10):900-903
    [6] Saygin Y,Verykios S V,Elmagarmid K A.Privacy preserving association rule mining.In Proceedings of the 12th International Workshop on Research Issues in Data Engineering(2002):151-158
    [7] Stanley R,Oliveira M,Za lane R O.Achieving Privacy Preservation When Sharing Data For Clustering.In Proc. of the International Workshop on Secure Data Management in a Connected World(SDM'04) in conjunction with VLDB,Canada,2004:67-82
    [8]黄伟伟,柏文阳.聚类挖掘中隐私保护的几何数据转换方法.计算机应用研究,2006,(6):180-184
    [9]张国荣,印鉴.应用正交变换保护数据中的隐私信息.计算机应用研究,2006,(10):95-97
    [10] Agrawal R,Srikant R.Privacy-preserving data mining.In Proceedings ofthe ACM SIGMOD Conference on Management of Data,2000:439-450
    [11]葛伟平,汪卫,周皓峰,等.基于隐私保护的分类挖掘.计算机发展与研究,2006,43(1):39-45
    [12] Alexandre Evfimievski,Ramakrishnan Srikant,Rakesh Agrawal, Johannes Gehrke.Privacy preserving mining of association rules. Information Systems 29(2004):343-364
    [13] Clifton C,Kantarcioglou M,Zhu Y M.Tools for privacy preserving distributed data mining.SIGKDD Explorations 4,2002,(2):28-34
    [14] Du Wenliang,Zhan Zhijun.Building decision tree classifier on private data.In Proceedings of the IEEE ICDM Workshop on Privacy,Security and Data Mining,2002:1-8
    [15] Lindell Y,Pinkas B.Privacy preserving data mining.In Advances in Cryptology-CRYPTO 2000,2000:36-54
    [16] Kantarcioglou M,Clifton C.Privacy-preserving distributed mining of association rules on horizontally partitioned data.In Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery,2002:24-31
    [17]宋宝莉,覃征.分布式环境下关联规则的安全挖掘算法.计算机工程,2006,32卷(21):35-37
    [18] V.S.Verykios,E.Bertino,I.N.Fovino,L.P.Provenza,Y.Saygin,Y.Theodoridis.State-of-the-art in privacy preserving data mining.ACM SIGMOD Record. 2004,vol.33:50-57
    [19]焦李成,刘芳,缑水平,刘静,陈莉.智能数据挖掘与知识发现.西安电子科技大学出版社,2006:2-28
    [20]迟晓明.基于数据仓库与数据挖掘的航空货运分析CRM应用研究.中国海洋大学硕士学位论文.2009:16-17
    [21]王锐,刘杰.隐私保护关联规则挖掘算法的研究.计算机工程与应用.2009,45(26):126-128
    [22]胡可云,田凤占,黄厚宽.数据挖掘理论与应用.清华大学出版社,北京交通大学出版社,2008:1-30
    [23]彭斌.基于关联规则的基因芯片数据挖掘与应用.第三军医大学硕士学位论文.2008:28-35
    [24] Agrawal R.,Imielinski T.,Swami A.Mining association rules between sets of items in large databases.In Proceedings of 1993 ACM-SIGMOD International Conference on Management of Data, 1993:207-216
    [25] Savasere A,Omiecinski E,Navathe S.An efficient algorithm for mining association rules in large database.In:Proc.of 21Int. Conf.on Very Large Data Base. Zurich,Switzerland,1995:432-444
    [26] Park J.S.,Chen M.,Yu P.S.An effiective hash-based algorithm for mining association rules.ACM-SIGMOD International Conference Management of Data,1995:175-186
    [27] Jiawei Han,Jian Pei,Yiwen Yin,Runying Mao.Mining Frequent Patterns without Candidate Generation.Data Mining and Knowledge Discovery.2004,8(1):53-87
    [28] Cheung DW.,Han J..A Fast Distributed Algorithm for Mining Association Rules.Proceeding of 1996 International Conference on parallel and distributed information system,Miami Beach,Florida,USA.1996:1-43
    [29] H.Toivonen.Sampling large database for association rules.In Proc.of the 22nd International Conference on Very Large Database,Bombay,India.1996:134-145
    [30]李锋.面向数据挖掘的隐私保护方法研究.上海交通大学博士学位论文.2008:19-22
    [31] S.Oliveira,O.R.Zaiane.Privacy preserving clustering by data transformation.Proc.of the 18th Brazilian Symposium on Databases.2003:304–318
    [32] S.Agrawal,V.Krishnan,J.R.Haritsa.On addressing efficiency concerns in privacy-preserving mining.Proc.of 9th Intl.Conf.on Database Systems for Advanced Applications(DASFAA).2004:113–124
    [33] Au Wai-Ho,Keith C.C.Chan.Mining changes in association rules:afuzzy approach.Fuzzy Sets and Systems,2005,25(149):87-104
    [34] S.Oliveira , O.R.Zaiane.Privacy-preserving clustering by object similarity-based representation and dimensionality reduction transformation.Proceedings of the Workshop on Privacy and Security Aspects of Data Mining(PSDM’04)in conjunction with the Fourth IEEE International Conference on Data Mining(ICDM’04).2004:21-30
    [35]周志纯.隐私保护数据挖掘研究.合肥工业大学硕士学位论文,2008:17-19
    [36] Wang K , Fung B.Anonymizing sequential release.Proc of KDD 2006.New York:ACM,2006:414-423
    [37]张鹏,童云海,唐世渭,杨冬青,马秀莉.一种有效的隐私保护关联规则挖掘方法.软件学报.2006,17(8):1765-1774
    [38] Garofalakis M.Mining sequential patterns with regular expression constraints.IEEE Transactions on Knowledge and Data Engineering.2002,14(3):120-136
    [39] Yi Xun,Zhang Yanchun.Privacy-preserving Distributed Association Rule Mining via Semi-trusted Mixer.Data & Knowledge Engineering. 2007,63(2):550-567
    [40] Zhu Ping,He Yanxiang,Xiang Guangli.Homomorphic Encryption Scheme of the Rational.Wireless Communications,Networking and Mobile Computing,2006:1-4
    [41] Z.Huang , W.Du , B.Chen.Deriving private information from randomized data.Proceedings of the 2005 ACM SIGMOD international conference on Management of data.2005:37-48
    [42] K.Liu,H.Kargupta,J.Ryan.Random Projection-Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining.IEEE Transactions on Knowledge and Data Engineering.2006,18(1):92-106
    [43] Paul Bunn , Rafail Ostrovsky.Secure two-party k-means clustering.Proceedings of the 14th ACM conference on Computer and communications security,Alexandria,Virginia,USA,2007:486-495
    [44] S.R.M.oliveira , O.R.Zaiane.Privacy preserving frequent itemset mining.In Proc of the IEEE international conference on Privacy,Security,Data mining.2002:43-54
    [45] J.Vaidya,C.Clifton.Privacy preserving association rule mining in vertically partitioned data.In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Minig.2002:639-644
    [46] Chang Li Wu,Moskowitz I S.Parsimonious downgrading and decisions trees applied to the inference problem.In:Proceedings of the 1998 New Security Paradigms Workshop.1998:82-89
    [47] H.Kargupta,S.Datta,Q.Wang.On the privacy-preserving properties of random data perturbation techniques.IEEE International Conference on Data Mining.Melbourne,Florida USA:IEEE.2003:99-106

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700