Elastic MapReduce (EMR) Encryption

The use of Big Data is becoming commonplace within many organizations, which use Big Data solutions to perform large-scale data analysis with business intelligence toolsets to gain a deeper understanding of the data gathered.

Within AWS, this data can be stored, distributed and consumed by a variety of services, many of which provide features ideal for Big Data analysis. These huge data sets often include sensitive information, such as customer details or financial information.

With this in mind, security surrounding this data is of utmost importance, and where sensitive information exists, encryption should be applied against the data.

This course firstly provides an explanation of data encryption and the differences between symmetric and asymmetric cryptography. This provides a good introduction before understanding how AWS implements different encryption mechanisms for many of the services that can be used for Big Data. These services include:

  • Amazon S3
  • Amazon Athena
  • Amazon Elastic MapReduce (EMR)
  • Amazon Relational Database Service (RDS)
  • Amazon Kinesis Firehose
  • Amazon Kinesis Streams
  • Amazon Redshift

The course covers encryption options for data both at rest and in transit and contains the following lectures:

  • Introduction: This lecture introduces the course objectives, topics covered and the instructor
  • Overview of Encryption: This lecture explains data encryption and when and why you may need to implement data encryption
  • Amazon S3 and Amazon Athena Encryption: This lecture dives into the different encryption mechanisms of S3, from both a server-side and client-side perspective. It also looks at how Amazon Athena can analyze data sets stored on S3 with encryption
  • Elastic MapReduce (EMR) Encryption: This lecture focuses on the different methods of encryption when utilizing EMR in conjunction with other services such as EBS and S3. It also looks at application-specific options with Hadoop, Presto, Tez, and Spark
  • Relational Database Service (RDS) Encryption: This lecture looks at the encryption within RDS, focusing on its built-in encryption plus Oracle and SQL Server Transparent Data Encryption (TDE) encryption
  • Amazon Kinesis Encryption: This lecture looks at both Kinesis Firehose and Kinesis Streams and analyzes the encryption of both services
  • Amazon Redshift Encryption: This lecture explains the four-tiered encryption structure when working with Redshift and KMS. It also explains how to encrypt data when using CloudHSM with Redshift
  • Summary: This lecture highlights the key points from the previous lectures


Hello and welcome to this lecture where I'll discuss different encryption options available for the Amazon Elastic MapReduce service, EMR. EMR is a managed service from AWS comprised of a highly scalable cluster of EC2 instances used to process and run big data frameworks such as Apache Hadoop and Spark.

From EMR version 4.8.0 onwards, you have the ability to create a security configuration specifying different settings for how to manage encryption of your data within your clusters. You can encrypt your data at rest, in transit, or, if required, both together. The great thing about these security configurations is that they're not actually part of your clusters.

They exist as a separate entity within EMR and therefore you can reuse the same security configuration for both existing and future clusters created. One key point of EMR is that by default, the instances within a cluster do not encrypt data at rest. The instances used within EMR are created from pre-configured AMIs, Amazon Machine Images that have been published and released by AWS.
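To make the reuse point concrete, here is a minimal sketch of the JSON document that such a security configuration is built from, covering both at-rest and in-transit settings. The bucket name, certificate zip file, and KMS key ARN are placeholder values, not real resources:

```python
import json

# A minimal EMR security configuration enabling both at-rest and
# in-transit encryption. All ARNs and S3 paths are placeholders.
security_configuration = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
            },
        },
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://example-bucket/certs/my-certs.zip",
            }
        },
    }
}

# Because the configuration exists separately from any cluster, the same
# document can be registered once and reused for every cluster, e.g.:
#   boto3.client("emr").create_security_configuration(
#       Name="my-sec-config",
#       SecurityConfiguration=json.dumps(security_configuration))
print(json.dumps(security_configuration, indent=2))
```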

However, if you need to ensure that the EBS root device volume is encrypted for your EC2 instances within a cluster, then you must use Amazon EMR version 5.7.0 or later and specify a custom AMI which will allow you to encrypt this volume. You may need this additional level of encryption at root volume level for specific compliance reasons.
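As a sketch of what specifying that custom AMI looks like when launching a cluster, the parameters below could be passed to the EMR `run_job_flow` API. The AMI ID is a hypothetical placeholder standing in for an AMI built from an encrypted root volume snapshot; the role names are the EMR defaults:

```python
# Cluster parameters for an encrypted EBS root device volume:
# EMR 5.7.0 or later plus a custom AMI. The AMI ID is a placeholder.
cluster_params = {
    "Name": "encrypted-root-cluster",
    "ReleaseLabel": "emr-5.7.0",             # custom AMIs require 5.7.0+
    "CustomAmiId": "ami-0123456789abcdef0",  # AMI with encrypted root volume
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.large", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m4.large", "InstanceCount": 2},
        ],
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Would be launched with:
#   boto3.client("emr").run_job_flow(**cluster_params)
print(cluster_params["ReleaseLabel"], cluster_params["CustomAmiId"])
```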

Although EMR does not encrypt data at rest by default, there are a number of mechanisms you can use to enforce encryption. If you decide to use Elastic Block Store, EBS, as persistent storage rather than S3 or DynamoDB, then there are a number of options that can work together if you enable local disk encryption at rest in your EMR security configuration.

However, these are not possible for EBS root device volumes. Once enabled, the following features are available. Linux Unified Key Setup (LUKS): EBS cluster volumes can be encrypted using this method, whereby you can specify AWS KMS as your key management provider, or use a custom key provider.

Open-source HDFS encryption: this provides two Hadoop encryption options. Secure Hadoop RPC, which would be set to privacy and uses the Simple Authentication and Security Layer, and data encryption of HDFS block transfer, which would be set to true and uses the AES-256 algorithm.
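These two HDFS options map onto standard open-source Hadoop properties, sketched below as the classification-style settings EMR applies under the hood. The property names for RPC protection and block-transfer encryption are standard Hadoop configuration keys; the cipher-suite key is my assumption of how AES-256 is selected:

```python
# The open-source Hadoop properties behind EMR's two HDFS encryption
# options when local-disk encryption at rest is enabled.
hdfs_encryption_properties = {
    "core-site": {
        # Secure Hadoop RPC: "privacy" enables SASL with encryption
        "hadoop.rpc.protection": "privacy",
    },
    "hdfs-site": {
        # Encrypt HDFS block data transfer...
        "dfs.encrypt.data.transfer": "true",
        # ...using AES (assumed cipher-suite key for AES selection)
        "dfs.encrypt.data.transfer.cipher.suites": "AES/CTR/NoPadding",
    },
}

for classification, props in hdfs_encryption_properties.items():
    for key, value in props.items():
        print(f"{classification}: {key} = {value}")
```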

If S3 is used, you can use S3's own encryption tools, discussed in a previous lecture. EMR supports the use of SSE-S3 or SSE-KMS to perform the encryption server-side at rest. Alternatively, you could encrypt your data client-side before storing it on S3, using CSE-KMS or CSE-Custom, where it would remain stored in an encrypted form.
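A small helper sketches how these four modes appear in the S3 portion of a security configuration; the server-side KMS and client-side KMS modes additionally carry a key ARN (the ARN below is a placeholder):

```python
# Build the S3EncryptionConfiguration block of an EMR security
# configuration for a given mode (sketch, placeholder ARN).
def s3_encryption_config(mode, kms_key_arn=None):
    config = {"EncryptionMode": mode}
    # SSE-KMS (server-side) and CSE-KMS (client-side) reference a KMS key
    if mode in ("SSE-KMS", "CSE-KMS"):
        config["AwsKmsKey"] = kms_key_arn
    return config

print(s3_encryption_config("SSE-S3"))
print(s3_encryption_config("SSE-KMS",
      "arn:aws:kms:us-east-1:111122223333:key/example-key-id"))
print(s3_encryption_config("CSE-Custom"))  # would also need a key-provider JAR
```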

From an encryption in transit perspective, you can enable open-source transport layer security (TLS) encryption features and select a certificate provider type. This can be either PEM, where you manually create PEM certificates, bundle them into a zip file and then reference the zip file in S3, or Custom, where you add a custom certificate provider as a Java class that provides encryption artifacts.
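The two provider types can be sketched as the in-transit portion of a security configuration. The bucket paths and Java class name are hypothetical, and the exact field names for the custom provider are my assumption based on the PEM form:

```python
# In-transit TLS certificate provider, PEM variant: a zip of manually
# created PEM certificates, referenced from S3 (placeholder path).
pem_provider = {
    "TLSCertificateConfiguration": {
        "CertificateProviderType": "PEM",
        "S3Object": "s3://example-bucket/certs/my-certs.zip",
    }
}

# Custom variant: a Java class supplying the encryption artifacts,
# shipped as a JAR on S3 (field names assumed, placeholder values).
custom_provider = {
    "TLSCertificateConfiguration": {
        "CertificateProviderType": "Custom",
        "S3Object": "s3://example-bucket/artifacts/my-cert-provider.jar",
        "CertificateProviderClass": "com.example.MyCertProvider",
    }
}

print(pem_provider["TLSCertificateConfiguration"]["CertificateProviderType"])
print(custom_provider["TLSCertificateConfiguration"]["CertificateProviderType"])
```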

Once the TLS certificate provider has been configured in the security configuration, the following application-specific encryption features can be enabled, which will vary depending on your EMR version. Hadoop: Hadoop MapReduce encrypted shuffle uses TLS. Both secure Hadoop RPC, which uses the Simple Authentication and Security Layer, and data encryption of HDFS block transfer, which uses AES-256, are activated when at-rest encryption is enabled in the security configuration.

Presto: when using EMR version 5.6.0 and later, any internal communication between Presto nodes uses SSL/TLS. Tez: the Tez Shuffle Handler uses TLS. And Spark: the Akka protocol uses TLS, the Block Transfer Service uses the Simple Authentication and Security Layer and 3DES, and the external shuffle service uses the Simple Authentication and Security Layer.

When using encryption at rest with KMS Customer Master Keys, you need to ensure that the role assigned to the EC2 instances within your cluster has the relevant permissions to access the Customer Master Key. This is done by adding the relevant role to the key users for the CMK. Finally, EMR has the option of implementing transparent encryption in HDFS.
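Adding a role as a key user amounts to a statement like the following in the CMK's key policy. The account ID is a placeholder, and the default EMR instance role name is used for illustration; the action list matches what KMS grants key users added through the console:

```python
# Key-policy statement granting the cluster's EC2 instance role the
# standard key-user permissions on the CMK. Account ID is a placeholder.
key_user_statement = {
    "Sid": "AllowEMRInstanceRoleToUseTheKey",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/EMR_EC2_DefaultRole"
    },
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey",
    ],
    "Resource": "*",  # in a key policy, "*" means this key itself
}

print(key_user_statement["Sid"])
```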

This offers end to end encryption, applying both encryption at rest and in transit. When implemented, data is encrypted and decrypted transparently without requiring any change to application code. This is made possible by using HDFS encryption zones, each having its own KMS key. By default, EMR uses the Hadoop KMS, but you can select an alternative if required.

Each file within an encryption zone is encrypted with a different data key, which is in turn encrypted by that zone's encryption zone key. With this in mind, it is not possible to move files between encryption zones, as the data key and encryption zone key will not match. For details on how to configure this method of encryption, see the AWS documentation link here.
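The HDFS-side setup described above can be sketched as the commands an administrator would run on the master node, shown here as a Python list of strings. The key and zone names are hypothetical:

```python
# Steps to set up an HDFS transparent-encryption zone (hypothetical
# key and path names), using the standard Hadoop/HDFS CLI tools.
commands = [
    # 1. Create an encryption zone key in the (default) Hadoop KMS
    "hadoop key create myZoneKey",
    # 2. Create the empty directory that will become the zone
    "hdfs dfs -mkdir /securedata",
    # 3. Turn the directory into an encryption zone tied to the key;
    #    files written here get per-file data keys wrapped by myZoneKey
    "hdfs crypto -createZone -keyName myZoneKey -path /securedata",
]

for command in commands:
    print(command)
```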

This lecture has covered the encryption mechanisms that you can choose to apply across EMR, which, remember, is not encrypted by default. For more information on how to set up these different methods of encryption in detail, I recommend you visit the relevant AWS documentation pages on EMR. Coming up in the next lecture, I will look at encryption options when using the Relational Database Service, RDS.


About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.