Cloud Computing : Data Privacy Preserving Methods

for Outsourced Computation

Under the assumption of untrusted cloud providers, outsourcing data intensive computation

such as database services and data mining to public clouds raises privacy concerns.

The

primary purpose of this assignment is to understand the data privacy problem with

outsourced computation in public clouds and implement several methods to mitigate this

problem. This project consists of two parts. (1) Multiplication perturbation methods for

outsourced data mining; and (2) Crypto-index for outsourced database services.

Part 1: Multiplication perturbation for outsourced data mining.

The idea of data perturbation is to change the original values of a dataset that is prepared

for outsourced data mining, so that the untrusted cloud provider cannot figure out any

original value while working on the perturbed data. The basic principle is to preserve both

data privacy and data utility – the data owner can still obtain high quality data mining

models with the perturbed data. In this experiment, we will be looking at one convenient

method – geometric data perturbation (GDP). GDP is designed to preserve the utility for

Euclidean-distance based data mining models such as kNN classifiers, linear classifiers,

Support Vector Machine (SVM) classifiers (with certain kernels), and clustering with

Euclidean distance as the similarity measure. In this project, we consider only classification

modeling.

Let’s define GDP first. The data for mining is often a table with k columns and n rows,

denated as a k x n matrix. Let x be any record of the table, which is represented as a kelement

column vector. Let R be a secret random orthogonal matrix (so that R times R’s

transpose is the identity matrix I) and t be a secret random k-element column vector, both

of which are shared by all original records. Let d be a randomly generated k-element noise

vector, individually for each record, drawn from a normal distribution N(0, σ2

), where the

variance σ2

is determined by the user. Let y be the perturbed vector for the original vector x.

The geometric data perturbation is defined as

y=Rx+t+d

Classification modeling is to find the best function that can predict the label of each

record, by learning from a set of given examples in the form (record, label) – we call it

training data. The given training datasets are in the SVM format. Each line of the data file

represents a record, which looks like

where index:value indicates the value for the specific dimension “index” of the record. We

will use three datasets in the experiments: breast cancer, diabetes, and heart disease. You

can use this Python program or this Java program for reading the data.

When you apply GDP to protect classification modeling, you perturb only the records but

keep the label unchanged. As a result, each row the outsourced training data looks like

(perturbed record, label). In this way, the cloud side can still work on the perturbed records

label index:value index:value ..to train a classification model, which should have close accuracy as the one trained with the

original data (the accuracy is dependent on the setting of sigma). However, the cloud

provider cannot observe the original record, neither can they figure out the original

classification model, as long as the perturbation parameters R and t are not disclosed (under

certain security assumptions).

You will need to implement the GDP perturbation method and conduct some experiments to

understand the important features. Specifically, you need to finish the following tasks.

You need to implement the GDP perturbation program that can be run with the

following command line format.

where sigma_seting is the sigma value for generating the noise vector d, input_file is

one of the given data in SVM format, output_file is the perturbed data. Python and

Java are recommended for implementing the perturbation method.

A few classification methods: k nearest neighbor, SVM, and AdaBoost are given for

experiments, which are located at nimbus17:/home/hadoop/project4/. They can be

used in the following way.

The program has simplified all the parameter settings for training the classifiers. For

example, it uses five-fold cross-validation; the setting k of knnc is limited to less than

21; adaboost uses 500 trees; and svm parameters are not optimized. In the output of

the algorithms, you will find the accuracy and standard deviation, which will be

needed in your experiments. You will need to run several experiments to see the

accuracy of the models trained with (1) the original data, (2) the perturbed datasets

generated with sigma value of the noise distribution N(0, σ2

) set to 0.05, 0.1, 0.15,

0.2, 0.25, respectively. Replace the “training_data” file with the corresponding files

when you run the tests.

Answer the questions:

Question 1.1 Implement the GDP algorithm and paste your GDP implementation code here.

Question 1.2 Conduct the accuracy evaluation on the three datasets and report your

experimental results with the following table for each dataset. Each cell represents the

accuracy and standard deviation (e.g., in the form 0.82+/-0.02) for the corresponding

classification method and the sigma setting. Note that SVM output does not have standard

deviation, which is fine. Intepret the table and write down your observations and

conclusions.

gdp sigma_setting original_data_file perturbed_data_file

./knnc training_data

./adaboost -o training_data

./svm-train -v 5 training_datamethods

no

perturbatio

n

Sigma=0.0

5

Sigma=0.

1

Sigma=0.1

5

Sigma=0.

2

Sigma=0.2

5

kNN

SVM

AdaBoos

t

Question 1.3 The provided classificaiton program has hidden the details of “crossvalidation”.

The default setting is set to five folds. Find the definition of cross-validation from

the Web. Describe the procedure of cross-validation with your own words, and answer why

we need cross-validation.

Question 1.4 Under what assumptions the security of the GDP perturbation method is

guaranteed? What are the potential attacks? – describe a few that you can think of.

Question 1.5 How much time did you spend on Question 1.1 and 1.2, respectively?

Question 1.6 How useful is this task to your understanding of the problem of privacy

preserving outsourced mining in the cloud?

Part 2: Crypto-index for range query processing

Database management is resource consuming. An appealing solution is to export database

services to the cloud, which raises privacy concerns, too. Among the existing approaches,

crypto-index is a simple solution (not so secure) that transforms a domain to a set of

random IDs. It is easy to process queries on transformed data. However, the result may

have low precision. In this task, you will implement a simple crypto-index method and

experiment wtih single-dimensional range query on transformed data.

For simplicity, let’s work on the continuous data. Assume one dimensional of the data, for

example, the three datasets you have seen, is in the domain [min, max]. The crypto-index

method partitions the domain into a few bins and use a random ID to represent each bin.

Correspondingly, when you process a range query, say finding records with dimension i in

the range [u, v], you will transform the query to the transformed data domain as “finding

records with dimension i in the set of bin IDs [id1, id2,…]“.Let’s assume the input data is in the SVM format. The following transformation program will

transform one dimension of the data. Its incomplete python implementation is given as a

reference (you should revise test the code before using it, and you can implement your own

Java version as well if you prefer Java).

where dimension_i is the selected dimension for transformation (starting from 0), bin_width

is a percentage of the domain, i.e., 0.1 or 10%, indicating the whole domain is partitioned

into 10 equi-width bins, bin_encoding_file contains the bin encoding information for

dimension_i:

and output_data is the file with values in dimension_i replaced with random_IDs according

to the bin encoding scheme. Now you need to implement other programs as follows.

Question 2.1 Develop a query evaluation program, the usage of which is

where [start, end] of dimension_i is the original range for search, bin_encoding is the one

generated from the previous program, and transformed_data is the output_data from the

previous program. The program will print the precision of the query, which is defined as (the

number of records exactly in [start, end])/(the number of records returned by query

processing on transformed data). Post the code here.

Question 2.2 Develop a simple script that processes a number n, e.g., n=10, of randomly

generated ranges for your queries. Note these ranges should be within the domain of

dimension_i. Compute the average query result precision of the n ranges, for different

bin_width settings: 0.02, 0.04, 0.06, 0.08, 0.1. You should try your script on the three

datasets. Report the result precision and standard deviation (e.g., 0.82+/-0.05) in the

following table, and also post your script here.

bin_width 0.02 0.04 0.06 0.08 0.1

breast_cancer

diabetes

heart

crypto-index-transdata input_data dimenion_i bin_width bin_encoding_file output_data

start_1 end_1 random_ID1

start_2 end_2 random_ID2

…

crypto-index-query original_data transformed_data bin_encoding_file dimenion_i start endQuestion 2.3 Discuss the potential attacks to the crypto-index method that you can come

up with.

Question 2.4 How much time did you spend on each of the tasks 2.1-2.3, respectively?

Deliverables Turn in the PDF report that answers all the questions, including the code for

some questions.