Efficient algorithms for data mining with federated databases .

详细信息

作者：Young ; Barrington R. St. A.
学历：Doctor
年：2007
导师：Bhatnagar, Raj
毕业院校：University of Cincinnati
专业：Computer Science.
ISBN：9780549027539
CBH：3264456
Country：USA
语种：English
FileSize：1556038
Pages：137

文摘

The Internet era and high speed networks have ushered in the capabilities to have ready access to large amounts of geographically distributed data. Individuals, businesses, and governments recognize the value of this available resource to those who can transform the data into information. These databases, though valuable as individual entities, become significantly more valuable when they function as parts of a federated database and their data can be aggregated for collective mining or computations. This requires data mining algorithms to shift their focus from working with single databases to efficiently working with federated databases. Any such methodology must address issues relating to security and privacy of the data, cost of communication among the database nodes and the overlap of attributes among the databases. Most of the algorithms designed for mining single databases are not easily adaptable for federated databases.;Some existing approaches for mining horizontally and vertically partitioned datasets work by first sampling local databases, and then transferring the distorted versions of samples to a centralized node. The goal is to add noise to the data in such a way that some desirable statistical properties of the data remain very close to their original values. Addition of noise helps protecting the data privacy but there is some loss of information and communication cost can be large. If the data is not distorted and secure multi-party protocols are used to communicate for joint computation, the protocols are exponential, and they result in extremely heavy network traffic. To minimize the network traffic, it is desirable to maximize the amount of local computations on each of the participating nodes. We present in this work a new perspective on these two competing demands. In our approach, scalable and secure computation can be achieved on a general vertical partitioning of an implicit global database among the explicit federated components. We use an implicit aggregation space with the necessary data privacy designed in this space. Our algorithms efficiently compute first and second order statistical quantities, and inter-tuple distance based clusterings for data in the federated databases. We demonstrate these by computing the covariance matrix and the k-nearest neighbor and other clusterings for databases.;The information exchanged between the nodes hosting the databases is minimized and consists of summaries derived from multiple tuples of the databases. Privacy of the individual data tuples is preserved and secure multi party computation protocols may be applied to secure the sharing of summaries. Our algorithms are more capable than the other research in this area in that they can handle the more naturally occurring vertical partitions of an implicit global database with an arbitrary overlap in their attribute sets.;Some pattern recognition techniques use Principal Components Analysis (requiring covariance matrices) to identify features in the data and use these features in either classifier building or clustering. Some other pattern recognition techniques, such as hierarchical clustering or K-nearest neighbor algorithm, require computation of inter-tuple or tuple-cluster distances. Our algorithms enable both of these types of algorithms to be efficiently and privately computed in the context of federated databases.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700