Amazon Web Services (AWS) Research Grants Proposal


Who are we?

Dr. Anirban Basu

Post-doctoral researcher (Department of Communication and Network Engineering)
Tokai University
2-3-23 Takanawa
Minato-ku
Tokyo 108-8619
Japan

Dr. Anirban Basu is a Post-doctoral Researcher in the Kikuchi lab at Tokai University, working on a project funded by the Japanese Ministry of Internal Affairs and Communications in collaboration with Waseda University, Hitachi, NEC and KDDI. He is also a Visiting Research Fellow at the University of Sussex, working with the Foundations of Software Systems group. He holds a Ph.D. in Computer Science and a Bachelor of Engineering (Hons.) in Computer Systems Engineering from the University of Sussex. His research interests are computational trust management, privacy and security, and peer-to-peer networks. He has several years of experience in academic research at the University of Sussex, where he was involved in two EPSRC funded research projects and one EU IST FP5 funded research project alongside his doctoral research.

Dr. Jaideep Vaidya

Associate Professor (Management Science and Information Systems Department)
Rutgers, The State University of New Jersey
1 Washington Park
Newark
New Jersey 07102
USA

Dr. Jaideep Vaidya is an Associate Professor of Computer Information Systems at Rutgers University. He received his Master's and Ph.D. degrees in Computer Science from Purdue University and his Bachelor's degree in Computer Engineering from the University of Mumbai. His research interests are privacy, security, data mining, and data management. He has published over 60 papers in international conferences and archival journals, and has received three best paper awards from the premier conferences in data mining, databases, and digital government research. He is also the recipient of an NSF CAREER Award and a Rutgers Board of Trustees Research Fellowship for Scholarly Excellence.

Dr. Hiroaki Kikuchi

Professor (Department of Communication and Network Engineering)
Tokai University
1117 Kitakaname
Hiratsuka
Kanagawa 259-1292
Japan

Dr. Hiroaki Kikuchi received B.E., M.E. and Ph.D. degrees from Meiji University. He is currently a Professor in the Department of Communication and Network Engineering, School of Information and Telecommunication Engineering, Tokai University. He was a visiting researcher at the School of Computer Science, Carnegie Mellon University, in 1997. His main research interests are fuzzy logic, cryptographic protocols, and network security. He is a member of the Institute of Electronics, Information and Communication Engineers of Japan (IEICE), the Information Processing Society of Japan (IPSJ), the Japan Society for Fuzzy Theory and Systems (SOFT), IEEE and ACM. He is an IPSJ Fellow.

Other collaborators

Theo Dimitrakos and Srijith Nair from British Telecom are our external industry collaborators, supported in part by the EU IST Framework Programme 7 integrated project OPTIMIS, which is partly funded by the European Commission.

What are we doing?

The widespread adoption of services implemented on cloud computing infrastructures has posed interesting privacy and security challenges that are very different from those faced in non-cloud based services. We are looking into various privacy preserving data processing and data publishing techniques in the context of cloud computing, such as privacy of data in virtualised server environments and privacy-preserving collaborative filtering.

Why do we need AWS?

One of the difficulties facing new theoretical models in privacy preserving data mining and publishing is demonstrating their applicability to cloud computing. In particular, most theoretical researchers do not have the resources to test their models on real world cloud computing platforms such as Amazon Web Services (AWS).

We have started testing some of our models on another cloud platform: Google App Engine (GAE). However, our preliminary experimental results are heavily affected by the various restrictions of the free quota on the GAE. As a result, we feel that the results are not necessarily indicative of realistic deployment (either on the GAE platform or on others).

In comparison with the GAE (which is a specialised PaaS offering for building SaaS applications), AWS provides a much larger and more flexible set of cloud services, including EC2, S3, Elastic Beanstalk and Elastic MapReduce amongst others. However, the free quota of AWS is also very restrictive, and thus experimental results obtained within it will fall short of showing what real world deployments of our theoretical models can achieve.

While there are high-performance cloud computing testbeds such as Open Cirrus, testing our theoretical models with implementations on a widely adopted real world cloud computing platform (i.e. AWS) will go the extra mile in bridging the chasm between theoretical research and actual implementation. Since our focus is on the development of privacy preserving data processing techniques for cloud computing, we feel that it is absolutely essential to show the applicability of any theoretical model on a real cloud computing platform. Furthermore, a real world prototype implementation will be extremely useful, during dissemination of our research results, to implementers who may plan to adopt such techniques. Additionally, it may spur the development of new techniques based on the identification of other related problems in this realistic environment.

What do we need from AWS?

We plan to use a variety of billable services in the AWS IaaS cloud. We foresee the following usage in particular. To prevent network latency from introducing performance degradation that we cannot account for, it is important to note that most of our experiments will be run from the Tokyo / Kanagawa area and the New York / New Jersey area.

Elastic Compute Cloud (EC2)

We will need EC2 instances as the basic building blocks for running our experiments. For certain scalability tests, we may need to subject our experimental setup to a high volume of incoming requests, say through Elastic Beanstalk. We would also like to test with a number of different configurations of EC2 instances. For example, a typical scenario could need up to 50 EC2 instances at a given time for processing a MapReduce parallelised task, while during an off-peak time there may be only one EC2 instance serving all the requests.

In terms of the type of instances, we will be particularly interested in the standard, extra large and extra large high CPU instances. As of now, we are not looking into the GPU and the clustered instances.

In terms of CPU hours, we may have some experiments running for several hours while others last only a few seconds. In fact, it is impossible to estimate the computing need in advance without knowing the type of load our experiments will be subject to. Thus, we will require unlimited CPU hours and I/O for each instance.

Simple Storage Service (S3)

Although we plan to experiment initially with reasonably sized datasets, we will eventually consider large datasets. Besides, the user data stored in our experiments is likely to grow over time. Keeping this in mind, we will require unlimited I/O access to S3 and at least 50 GB of storage space to start with, growing as necessary.

Relational Database Service (RDS)

Much of our user specific data is to be stored in the RDS. Storage requirements will vary depending on the type of data stored. For example, for much of our encrypted storage, we may have to resort to arbitrary precision big integers. Thus, we will require an initial 100 GB of storage and unlimited network I/O.
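
As a rough illustration of why encrypted values inflate database storage, the following sketch estimates the size of big-integer ciphertexts. It assumes, purely for illustration, a Paillier-style additively homomorphic scheme in which each ciphertext is an integer modulo n^2 and so occupies twice the bit length of the modulus n. The scheme, the 1024-bit modulus and all function names are assumptions made for this example and are not fixed by our models.

```python
def ciphertext_bytes(modulus_bits: int) -> int:
    """Bytes needed for one ciphertext modulo n^2 (assumed Paillier-style scheme)."""
    return (2 * modulus_bits + 7) // 8


def encode(value: int, modulus_bits: int) -> bytes:
    """Serialise an arbitrary precision integer to a fixed-width big-endian
    byte string, suitable for a binary column (e.g. VARBINARY or BLOB in MySQL)."""
    return value.to_bytes(ciphertext_bytes(modulus_bits), "big")


def decode(blob: bytes) -> int:
    """Recover the integer from its fixed-width byte representation."""
    return int.from_bytes(blob, "big")


def ciphertexts_per_gb(modulus_bits: int, gigabytes: int = 1) -> int:
    """How many ciphertexts fit in the given storage (1 GB taken as 10**9 bytes),
    ignoring row and index overhead."""
    return (gigabytes * 10**9) // ciphertext_bytes(modulus_bits)


if __name__ == "__main__":
    bits = 1024  # hypothetical modulus size
    print(ciphertext_bytes(bits), "bytes per ciphertext")
    print(ciphertexts_per_gb(bits, 100), "ciphertexts in 100 GB")
```

Under these assumptions a single encrypted value takes 256 bytes, compared with 4 or 8 bytes for a plain integer, which is why the storage estimate above is much larger than the plaintext data alone would suggest.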

In terms of instances, we will require access to small and large DB instances with standard and multi-AZ deployments.

Since we are not comparing relational database management systems, we will require access to MySQL instances only.

Elastic Beanstalk

We plan to use Elastic Beanstalk heavily because it gives us a point of comparison with the Google App Engine for Java. It also enables us to build easy-to-use web-based user interfaces for extensive user testing and data collection. At any given time, we expect to experiment with no more than 10 applications, where each application is an implementation of one of our models and we will upload only one version of it at a time.

MapReduce

Some of our dataset processing tasks will require high computing power. We plan to optimise them through MapReduce parallelism. Depending on the number of concurrent subtasks into which our experimental workloads are divided, we may require a reasonably large number of parallel EC2 instances, say 50 at any point in time.
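
The way a task is divided into concurrent subtasks can be sketched as follows. This is a minimal, single-machine illustration of the map and reduce steps, assuming a hypothetical workload of (item, rating) records as might arise in collaborative filtering. The function names and the data layout are assumptions made for this example. On AWS, each chunk would be processed by a separate worker instead of in a local loop.

```python
from functools import reduce


def split(records, n_tasks):
    """Divide the records into n_tasks chunks, one per (hypothetical) worker."""
    return [records[i::n_tasks] for i in range(n_tasks)]


def map_task(chunk):
    """Map step: per-item (rating total, rating count) for one chunk."""
    partial = {}
    for item, rating in chunk:
        total, count = partial.get(item, (0, 0))
        partial[item] = (total + rating, count + 1)
    return partial


def reduce_task(left, right):
    """Reduce step: merge two partial results item by item."""
    merged = dict(left)
    for item, (total, count) in right.items():
        prev_total, prev_count = merged.get(item, (0, 0))
        merged[item] = (prev_total + total, prev_count + count)
    return merged


def average_ratings(records, n_tasks=50):
    """Average rating per item, computed from independently mapped chunks."""
    partials = map(map_task, split(records, n_tasks))
    totals = reduce(reduce_task, partials, {})
    return {item: total / count for item, (total, count) in totals.items()}


if __name__ == "__main__":
    sample = [("a", 4), ("b", 2), ("a", 2), ("b", 5), ("c", 3)]
    print(average_ratings(sample, n_tasks=3))
```

Because each chunk is mapped independently, the number of chunks (50 in this sketch, matching the instance count mentioned above) determines how many instances can usefully work in parallel.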

What is the outcome of this research?

Our publicly available academic work will benefit from having actually been tested on AWS. This will inform future researchers of the advantages as well as the limitations of AWS, which is a well-known suite of cloud computing services. Our work will also inform future implementers about the feasibility of certain theoretical approaches in real world cloud computing environments. Our experimental implementations can also serve as starting points for rapid prototyping, given that the experimental setups will be on realistic cloud computing platforms.

We also plan to publish exhaustive comparative test results for various test settings on AWS as well as on competing platforms, such as the Google App Engine.

What is the time frame for completion?

We expect to have preliminary but conclusive results by March 2013. Based on the value demonstrated by our results, we will attempt to make a convincing case for having continued access to Amazon Web Services, and explore further problems in this field.