Elastic MapReduce (EMR) Encryption

Outline: Designing Secure Applications and Architectures.
In this course, we learn to recognize and explain what encryption is at a high level. We will then cover the various encryption options provided by AWS and how we can secure application tiers by applying encryption to AWS services such as Amazon S3, Amazon Athena, Amazon Kinesis, Amazon Elastic MapReduce, Amazon RDS and Amazon Redshift.

If you are new to Amazon Athena, Amazon Kinesis, Amazon EMR or CloudHSM, please see our existing courses here:

Amazon Athena

Amazon Kinesis

Amazon EMR

Amazon CloudHSM


Hello and welcome to this lecture, where I'll discuss the different encryption options available for the Amazon Elastic MapReduce service, EMR. EMR is a managed AWS service made up of a highly scalable cluster of EC2 instances that runs big data frameworks such as Apache Hadoop and Spark.

From EMR version 4.8.0 onwards, we have the ability to create a security configuration that specifies how encryption is managed for the data within your clusters. You can encrypt your data at rest, your data in transit, or, if required, both together. The great thing about these security configurations is that they're not actually a part of your EMR clusters.

They exist as a separate entity within EMR, and therefore you can reuse the same security configuration for both existing and future clusters. One key point of EMR is that, by default, the instances within a cluster do not encrypt data at rest. The instances used within EMR are created from pre-configured Amazon Machine Images (AMIs) that have been published and released by AWS.

However, if you need to ensure that the EBS root device volume is encrypted for the EC2 instances within a cluster, then you must use Amazon EMR version 5.7.0 or later and specify a custom AMI, which will allow you to encrypt this volume. You may need this additional level of encryption at the root volume level for specific compliance reasons.
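As a rough sketch of the reusable security configuration described above, the Python below builds the JSON document and, separately, registers it with EMR through boto3. The helper names and the default region are hypothetical choices made for illustration; the `create_security_configuration` call needs boto3 and AWS credentials, so it is wrapped in a function that is not run automatically.

```python
import json


def build_security_configuration(at_rest=None, in_transit=None):
    """Return the JSON text for an EMR security configuration.

    at_rest and in_transit are optional dicts holding the bodies of
    AtRestEncryptionConfiguration and InTransitEncryptionConfiguration.
    Passing None leaves that type of encryption disabled.
    """
    encryption = {
        "EnableAtRestEncryption": at_rest is not None,
        "EnableInTransitEncryption": in_transit is not None,
    }
    if at_rest is not None:
        encryption["AtRestEncryptionConfiguration"] = at_rest
    if in_transit is not None:
        encryption["InTransitEncryptionConfiguration"] = in_transit
    return json.dumps({"EncryptionConfiguration": encryption}, indent=2)


def register_security_configuration(name, config_json, region="us-east-1"):
    """Store the configuration in EMR under a name (hypothetical helper)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    return emr.create_security_configuration(
        Name=name, SecurityConfiguration=config_json
    )
```

Because the configuration is stored under its own name, a cluster would then refer to it by that name when it is launched, for example through the `SecurityConfiguration` parameter of boto3's `run_job_flow`.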

Although EMR does not encrypt data at rest by default, there are a number of mechanisms you can use to enforce encryption. If you decide to use Elastic Block Store (EBS) as persistent storage rather than S3 or DynamoDB, then there are a number of options that can work together once you enable local disk encryption at rest in your EMR security configuration.

However, these options do not apply to EBS root device volumes. Once enabled, the following features are available. Linux Unified Key Setup (LUKS): EBS cluster volumes can be encrypted using this method, whereby you can specify AWS KMS as your key management provider or use a custom key provider.

Open-source HDFS encryption: this provides two Hadoop encryption options. The first is secure Hadoop RPC, which is set to privacy and uses the Simple Authentication and Security Layer. The second is data encryption of HDFS block transfer, which is set to true and uses the AES-256 algorithm.
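To make the local disk options a little more concrete, here is a minimal sketch of the `LocalDiskEncryptionConfiguration` part of a security configuration, covering both the AWS KMS and custom key provider choices. The function name is hypothetical, and any key ARN, JAR location, or class name you pass in is your own. The dictionary at the end lists the two Hadoop properties that correspond to the settings named above; it is shown for reference only, on the assumption that EMR applies them for you when at-rest encryption is enabled.

```python
def local_disk_encryption(kms_key_arn=None, provider_jar_s3=None, provider_class=None):
    """Build a LocalDiskEncryptionConfiguration dict (hypothetical helper).

    Pass kms_key_arn to use AWS KMS as the key provider, or pass both
    provider_jar_s3 and provider_class to use a custom key provider.
    """
    if kms_key_arn:
        return {
            "EncryptionKeyProviderType": "AwsKms",
            "AwsKmsKey": kms_key_arn,
        }
    if provider_jar_s3 and provider_class:
        return {
            "EncryptionKeyProviderType": "Custom",
            "S3Object": provider_jar_s3,  # S3 location of the provider JAR
            "EncryptionKeyProviderClass": provider_class,
        }
    raise ValueError("Supply a KMS key ARN, or a custom provider JAR and class")


# Hadoop properties matching the two open-source HDFS encryption options,
# grouped by the Hadoop configuration file they belong to. Reference only.
HDFS_OPEN_SOURCE_ENCRYPTION = {
    "core-site": {"hadoop.rpc.protection": "privacy"},
    "hdfs-site": {"dfs.encrypt.data.transfer": "true"},
}
```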

If S3 is used, you can use S3's own encryption tools discussed in a previous lecture. EMR supports the use of SSE-S3 or SSE-KMS to perform server-side encryption at rest. Alternatively, you could encrypt your data on the client before storing it on S3, using CSE-KMS or CSE-Custom, where it would remain stored in an encrypted form.
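The S3 choices above map onto the `S3EncryptionConfiguration` part of a security configuration. The sketch below is one way to build that fragment for each mode. The function and parameter names are hypothetical, and `CSE-Custom` is the value EMR uses for client-side encryption with your own key provider.

```python
def s3_encryption(mode, kms_key_arn=None, provider_jar_s3=None, provider_class=None):
    """Build an S3EncryptionConfiguration dict for the given mode (hypothetical helper)."""
    if mode == "SSE-S3":
        # S3 manages the keys, so nothing else needs to be supplied.
        return {"EncryptionMode": "SSE-S3"}
    if mode in ("SSE-KMS", "CSE-KMS"):
        if not kms_key_arn:
            raise ValueError(f"{mode} needs a KMS key ARN")
        return {"EncryptionMode": mode, "AwsKmsKey": kms_key_arn}
    if mode == "CSE-Custom":
        if not (provider_jar_s3 and provider_class):
            raise ValueError("CSE-Custom needs a provider JAR location and class name")
        return {
            "EncryptionMode": "CSE-Custom",
            "S3Object": provider_jar_s3,
            "EncryptionKeyProviderClass": provider_class,
        }
    raise ValueError(f"Unknown S3 encryption mode: {mode}")
```

The returned dictionary would sit under `AtRestEncryptionConfiguration`, next to any local disk settings.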

From an encryption in transit perspective, you can enable the open-source Transport Layer Security (TLS) encryption features and select a certificate provider type. This can be either PEM, where you manually create PEM certificates, bundle them into a zip file and then reference the zip file in S3, or custom, where you add a custom certificate provider as a Java class that provides the encryption artifacts.
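For the PEM option, the bundling step could look something like the sketch below, which zips the certificate files and builds the matching in-transit fragment of the security configuration. The file names `privateKey.pem`, `certificateChain.pem` and `trustedCertificates.pem` follow my reading of the AWS documentation and should be checked against it. The function names are hypothetical, and uploading the zip to S3 is left out.

```python
import zipfile
from pathlib import Path

# Assumed file names inside the zip; verify against the AWS documentation.
REQUIRED_FILES = ("privateKey.pem", "certificateChain.pem")
OPTIONAL_FILES = ("trustedCertificates.pem",)


def bundle_pem_certificates(cert_dir, zip_path):
    """Zip the PEM files found in cert_dir so the zip can be uploaded to S3."""
    cert_dir = Path(cert_dir)
    missing = [name for name in REQUIRED_FILES if not (cert_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing certificate files: {missing}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for name in REQUIRED_FILES + OPTIONAL_FILES:
            pem_file = cert_dir / name
            if pem_file.is_file():
                # Store each file at the top level of the zip.
                bundle.write(pem_file, arcname=name)
    return zip_path


def tls_configuration(s3_object, provider_class=None):
    """Build the InTransitEncryptionConfiguration body.

    With only s3_object (the zip in S3), the PEM provider is used. With
    provider_class as well, s3_object is the JAR of a custom provider.
    """
    tls = {"CertificateProviderType": "PEM", "S3Object": s3_object}
    if provider_class:
        tls["CertificateProviderType"] = "Custom"
        tls["CertificateProviderClass"] = provider_class
    return {"TLSCertificateConfiguration": tls}
```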

Once the TLS certificate provider has been configured in the security configuration, the following application-specific encryption features can be enabled, which will vary depending on your EMR version. Hadoop: Hadoop MapReduce encrypted shuffle uses TLS. Secure Hadoop RPC, which uses the Simple Authentication and Security Layer, and data encryption of HDFS block transfer, which uses AES-256, are both activated when at-rest encryption is enabled in the security configuration.

Presto: when using EMR version 5.6.0 or later, any internal communication between Presto nodes uses SSL/TLS. Tez: the Tez shuffle handler uses TLS. And Spark: the Akka protocol uses TLS, the block transfer service uses the Simple Authentication and Security Layer and 3DES, and the external shuffle service also uses the Simple Authentication and Security Layer.

When using encryption at rest using KMS Customer Master Keys, you need to ensure that the role assigned to your EC2 instances within your cluster has the relevant permissions to enable access to the Customer Master Key. This is done by adding the relevant role to the Key users for the CMK. Finally, EMR has the option of implementing Transparent Encryption in HDFS.
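As an illustration of the KMS permissions point, adding a role to the Key users of a CMK results in a key policy statement along the lines of the one built below. This is a sketch rather than the exact policy AWS generates: the function name is hypothetical, and the role ARN you pass in stands for whichever instance role your cluster uses.

```python
def key_user_statement(instance_role_arn):
    """Return a key policy statement letting the role use the CMK (hypothetical helper)."""
    return {
        "Sid": "Allow use of the key",
        "Effect": "Allow",
        "Principal": {"AWS": instance_role_arn},
        "Action": [
            "kms:Encrypt",
            "kms:Decrypt",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:DescribeKey",
        ],
        # In a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
    }
```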

Transparent encryption in HDFS offers end-to-end encryption, applying both encryption at rest and in transit. When implemented, data is encrypted and decrypted transparently without requiring any change to application code. This is made possible by using HDFS encryption zones, each having its own key held in a key management server (KMS). By default, EMR uses the Hadoop KMS, but you can select an alternative if required.

Each file within an encryption zone is encrypted with its own data key, and these data keys are in turn encrypted by the HDFS encryption zone key. With this in mind, it is not possible to move files between encryption zones, as the data key and encryption zone key will not match. For details on how to configure this method of encryption, see the AWS documentation link here.
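To give a feel for what setting up an encryption zone involves, the sketch below assembles the standard Hadoop commands for creating a zone key and then a zone that uses it. The key name and path are made-up examples, and the function name is hypothetical. The commands are only printed here; actually running them assumes a shell on the cluster with the right HDFS permissions, and a target directory that is empty.

```python
import shlex


def encryption_zone_commands(key_name, zone_path):
    """Return the commands that create a key and an HDFS encryption zone."""
    return [
        # Create the encryption zone key in the configured KMS.
        ["hadoop", "key", "create", key_name],
        # The zone is created on an empty directory.
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        # Turn the directory into an encryption zone that uses the key.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
    ]


if __name__ == "__main__":
    # Example values only; substitute your own key name and path.
    for command in encryption_zone_commands("zone1-key", "/data/zone1"):
        print(shlex.join(command))
```

Once the zone exists, `hdfs crypto -listZones` can be used to confirm it.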

This lecture has covered the mechanisms you can choose from to apply encryption across EMR, which, remember, is not provided by default. For more information on how to set up these different methods of encryption in detail, I recommend you visit the relevant AWS documentation pages on EMR. Coming up in the next lecture, I will look at encryption options when using the Relational Database Service, RDS.


About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.
