Hello and welcome to this lecture where I'll discuss the different encryption options available for the Amazon Elastic MapReduce service, EMR. EMR is a managed service by AWS and consists of a highly scalable cluster of EC2 instances used to process and run big data frameworks such as Apache Hadoop and Spark.
From EMR version 4.8.0 and onwards, we have the ability to create a security configuration specifying different settings on how to manage encryption for your data within your clusters. You can either encrypt your data at rest, data in transit, or if required, both together. The great thing about these security configurations is they're not actually a part of your EC2 clusters.
They exist as a separate entity within EMR and therefore you can reuse the same security configuration for both existing and future clusters created. One key point of EMR is that by default, the instances within a cluster do not encrypt data at rest. The instances used within EMR are created from pre-configured AMIs, Amazon Machine Images that have been published and released by AWS.
However, if you need to ensure that the EBS root device volume is encrypted for your EC2 instances within a cluster, then you must use Amazon EMR version 5.7.0 or later and specify a custom AMI which will allow you to encrypt this volume. You may need this additional level of encryption at root volume level for specific compliance reasons.
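To make the idea of a reusable security configuration a little more concrete, here is a minimal Python sketch that builds one as JSON. The key names follow the EMR security configuration format as I understand it, and the individual settings are covered in the rest of this lecture. The KMS key ARN is a placeholder, and `register_security_configuration` is a hypothetical helper that assumes boto3 is installed and AWS credentials are available.

```python
import json


def build_security_configuration(kms_key_arn):
    """Return an EMR security configuration (as a JSON string) with at-rest encryption on."""
    config = {
        "EncryptionConfiguration": {
            "EnableAtRestEncryption": True,
            "EnableInTransitEncryption": False,
            "AtRestEncryptionConfiguration": {
                # Data that EMRFS writes to S3
                "S3EncryptionConfiguration": {"EncryptionMode": "SSE-S3"},
                # Local disks and attached EBS cluster volumes
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                },
            },
        }
    }
    return json.dumps(config, indent=2)


def register_security_configuration(name, config_json):
    """Hypothetical helper: store the configuration in EMR so clusters can reference it by name."""
    import boto3  # assumption: boto3 is installed and credentials are configured

    emr = boto3.client("emr")
    emr.create_security_configuration(Name=name, SecurityConfiguration=config_json)


if __name__ == "__main__":
    # Placeholder ARN, for illustration only
    print(build_security_configuration("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"))
```

Because the configuration is stored under its own name rather than inside a cluster, the same name can be passed when launching any number of clusters.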
Although EMR does not encrypt data at rest by default, there are a number of mechanisms you can use to enforce encryption. If you decide to use Elastic Block Store, EBS, as persistent storage rather than S3 or DynamoDB, then there are a number of options that can work together if you enable local disk encryption at rest in your EMR security configuration.
However, these are not possible for EBS root device volumes. Once enabled, the following features are available. Linux Unified Key Setup. EBS cluster volumes can be encrypted using this method whereby you can specify AWS KMS to be used as your key management provider, or use a custom key provider.
Open-source HDFS encryption. This provides two Hadoop encryption options: Secure Hadoop RPC, which would be set to privacy and uses the Simple Authentication and Security Layer (SASL), and data encryption of HDFS block transfer, which would be set to true to use the AES-256 algorithm.
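To show what those two Hadoop settings look like in practice, here is a small Python sketch that expresses them as EMR configuration classifications. The security configuration is meant to apply these for you, so treat this purely as an illustration. The property names are standard Hadoop properties; the cipher suite and key length entries are my assumption of how AES-256 is selected for block transfer.

```python
def hdfs_encryption_classifications():
    """Return the two open-source HDFS encryption options as configuration classifications."""
    return [
        {
            "Classification": "core-site",
            # Secure Hadoop RPC: "privacy" adds SASL-based encryption to RPC traffic
            "Properties": {"hadoop.rpc.protection": "privacy"},
        },
        {
            "Classification": "hdfs-site",
            "Properties": {
                # Encrypt HDFS block data transfer
                "dfs.encrypt.data.transfer": "true",
                # Assumption: AES with a 256-bit key is selected through these two properties
                "dfs.encrypt.data.transfer.cipher.suites": "AES/CTR/NoPadding",
                "dfs.encrypt.data.transfer.cipher.key.bitlength": "256",
            },
        },
    ]


if __name__ == "__main__":
    for item in hdfs_encryption_classifications():
        print(item["Classification"], item["Properties"])
```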
If S3 was used, you could use S3's own encryption tools discussed in a previous lecture. As a result, EMR supports the use of SSE-S3 or SSE-KMS to perform server-side encryption at rest. Alternatively, you could encrypt your data on the client before storing it on S3, using CSE-KMS or CSE-Custom, where it would remain stored in an encrypted form.
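The S3 options just described can be sketched as a small Python helper that builds the S3 part of a security configuration for each mode. This is a sketch under my own assumptions: the function and its parameter names are hypothetical, the ARN and S3 paths in the usage are placeholders, and the custom client-side mode is written as `CSE-Custom`, which is how I understand it to appear in the EMR configuration format.

```python
def s3_encryption_configuration(mode, kms_key_arn=None, provider_s3_object=None, provider_class=None):
    """Build an S3EncryptionConfiguration block for the given encryption mode."""
    if mode == "SSE-S3":
        # S3 manages the keys; nothing else to supply
        return {"EncryptionMode": "SSE-S3"}
    if mode in ("SSE-KMS", "CSE-KMS"):
        if not kms_key_arn:
            raise ValueError(mode + " requires a KMS key ARN")
        return {"EncryptionMode": mode, "AwsKmsKey": kms_key_arn}
    if mode == "CSE-Custom":
        if not (provider_s3_object and provider_class):
            raise ValueError("CSE-Custom requires a key provider JAR location and class name")
        return {
            "EncryptionMode": "CSE-Custom",
            "S3Object": provider_s3_object,
            "EncryptionKeyProviderClass": provider_class,
        }
    raise ValueError("unknown encryption mode: " + mode)


if __name__ == "__main__":
    print(s3_encryption_configuration("SSE-S3"))
    # Placeholder ARN, for illustration only
    print(s3_encryption_configuration("CSE-KMS", kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"))
```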
From an encryption in transit perspective, you could enable open-source Transport Layer Security (TLS) encryption features and select a certificate provider type. This can be either PEM, where you will need to manually create PEM certificates, bundle them up in a zip file and then reference the zip file in S3, or custom, where you would add a custom certificate provider as a Java class that provides encryption artifacts.
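As a rough sketch of the PEM option, the Python below bundles certificate files into a zip and builds the matching in-transit settings. The member names `privateKey.pem`, `certificateChain.pem` and the optional `trustedCertificates.pem` are what I understand EMR to expect inside the archive, so check them against the AWS documentation. The helper names are hypothetical, and the demo uses dummy file contents rather than real certificates.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_pem_certificates(private_key, certificate_chain, zip_path, trusted_certificates=None):
    """Zip the PEM files under the member names assumed to be expected by EMR."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.write(private_key, arcname="privateKey.pem")
        bundle.write(certificate_chain, arcname="certificateChain.pem")
        if trusted_certificates is not None:
            bundle.write(trusted_certificates, arcname="trustedCertificates.pem")
    return Path(zip_path)


def tls_certificate_configuration(s3_zip_uri):
    """In-transit settings that reference the uploaded zip file in S3."""
    return {
        "TLSCertificateConfiguration": {
            "CertificateProviderType": "PEM",
            "S3Object": s3_zip_uri,
        }
    }


def demo_bundle():
    """Create a bundle from dummy files in a temporary directory and list its members."""
    with tempfile.TemporaryDirectory() as tmp:
        key = Path(tmp, "key.pem")
        chain = Path(tmp, "chain.pem")
        key.write_text("dummy private key\n")
        chain.write_text("dummy certificate chain\n")
        archive = bundle_pem_certificates(key, chain, Path(tmp, "my-certs.zip"))
        with zipfile.ZipFile(archive) as bundle:
            return bundle.namelist()


if __name__ == "__main__":
    print(demo_bundle())
```

Uploading the zip to S3 is left out here; the resulting S3 URI is what you would pass to `tls_certificate_configuration`.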
Once the TLS certificate provider has been configured in the security configuration file, the following application-specific encryption features can be enabled, which will vary depending on your EMR version. Hadoop. Hadoop MapReduce encrypted shuffle uses TLS. Secure Hadoop RPC, which uses the Simple Authentication and Security Layer (SASL), and data encryption of HDFS block transfer, which uses AES-256, are both activated when at-rest encryption is enabled in the security configuration.
Presto. When using EMR version 5.6.0 and later, any internal communication between Presto nodes will use SSL/TLS. Tez. The Tez shuffle handler uses TLS. And Spark. The Akka protocol uses TLS. The block transfer service uses the Simple Authentication and Security Layer (SASL) and 3DES. The external shuffle service uses the Simple Authentication and Security Layer.
When using encryption at rest using KMS Customer Master Keys, you need to ensure that the role assigned to your EC2 instances within your cluster has the relevant permissions to enable access to the Customer Master Key. This is done by adding the relevant role to the Key users for the CMK. Finally, EMR has the option of implementing Transparent Encryption in HDFS.
This offers end to end encryption, applying both encryption at rest and in transit. When implemented, data is encrypted and decrypted transparently without requiring any change to application code. This is made possible by using HDFS encryption zones, each having its own KMS key. By default, EMR uses the Hadoop KMS, but you can select an alternative if required.
Each file within the encryption zone is encrypted by a different data key, and these data keys are then encrypted by the HDFS encryption zone key. With this in mind, it is not possible to move files between encryption zones, as the data key and encryption zone key will not match. For details on how to configure this method of encryption, see the AWS documentation link here.
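The relationship between data keys and encryption zone keys can be illustrated with a toy Python model. This is only a sketch of the idea: the cipher is a simple SHA-256 keystream with an HMAC check standing in for real AES, all function names are hypothetical, and in a real cluster the zone key stays inside the KMS rather than being passed around as it is here.

```python
import hashlib
import hmac
import os


def _keystream(key, nonce, length):
    """Derive a toy keystream from SHA-256 (illustration only, not AES)."""
    blocks = []
    for counter in range((length + 31) // 32):
        blocks.append(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
    return b"".join(blocks)[:length]


def toy_encrypt(key, plaintext):
    """Encrypt with the toy cipher and append an HMAC tag so a wrong key is detected."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    tag = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def toy_decrypt(key, blob):
    """Verify the tag, then decrypt; raises ValueError if the key does not match."""
    nonce, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or corrupted data")
    stream = _keystream(key, nonce, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


def write_file(zone_key, data):
    """Encrypt data with a fresh per-file data key, then wrap that key with the zone key."""
    data_key = os.urandom(32)
    encrypted_data_key = toy_encrypt(zone_key, data_key)
    return encrypted_data_key, toy_encrypt(data_key, data)


def read_file(zone_key, encrypted_data_key, encrypted_data):
    """Unwrap the data key with the zone key, then decrypt the file contents."""
    data_key = toy_decrypt(zone_key, encrypted_data_key)
    return toy_decrypt(data_key, encrypted_data)


if __name__ == "__main__":
    zone_a_key = os.urandom(32)
    zone_b_key = os.urandom(32)
    wrapped_key, stored = write_file(zone_a_key, b"sensitive record")
    print(read_file(zone_a_key, wrapped_key, stored))
    try:
        read_file(zone_b_key, wrapped_key, stored)
    except ValueError as error:
        print("zone B cannot read the file:", error)
```

In this model, a file written under zone A's key cannot be read with zone B's key, because its wrapped data key only unwraps with the key of the zone that created it.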
This lecture has covered the mechanisms that you can choose to apply encryption across EMR, which, remember, is not provided by default. For more information on how to set up these different methods of encryption in detail, I recommend you visit the relevant AWS documentation pages on EMR. Coming up in the next lecture, I will look at encryption options when using the Relational Database Service, RDS.
Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.