hands-on lab

Transforming Data With Apache Spark and Amazon EMR

Up to 1h 30m
Get guided in a real environment: Practice with a step-by-step scenario in a real, provisioned environment.
Learn and validate: Use validations to check your solutions every step of the way.
See results: Track your knowledge and monitor your progress.


Amazon EMR (formerly known as Amazon Elastic MapReduce) is a big data platform that supports many popular open-source data processing frameworks, including Apache Spark. Amazon EMR simplifies the configuration, provisioning, and scaling of clusters for data analysis and processing workloads.

Learning Amazon EMR is useful for anyone who wants to understand how big data processing is performed in practice.

In this hands-on lab, you will tour an Amazon EMR cluster, place data and a script in a location accessible to Amazon EMR, submit a workload to an Amazon EMR cluster, and examine the results.

Please note that an Amazon EMR cluster takes approximately ten minutes to create and become usable, so ensure you have enough time available before starting the lab.

Learning objectives

Upon completion of this beginner-level lab, you will be able to:

  • Understand the configuration of an Amazon EMR cluster
  • Upload a script and data file to an Amazon S3 bucket
  • Submit work to a cluster by adding a step
  • Inspect the results of an Amazon EMR step
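
The workflow in these objectives (upload a script and data file to Amazon S3, add a step, inspect the result) can also be expressed programmatically. The following is a minimal sketch using the boto3 SDK for Python. It is not the lab's own procedure, which may well be carried out in the AWS console. The bucket name, cluster ID, file names (transform.py, data.json), S3 key layout, and the --input/--output script arguments are all hypothetical placeholders.

```python
from typing import Any, Dict, List


def build_spark_step(name: str, script_uri: str, script_args: List[str]) -> Dict[str, Any]:
    """Build an EMR step definition that runs a PySpark script via spark-submit."""
    return {
        "Name": name,
        # CONTINUE leaves the cluster running if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run commands such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri, *script_args],
        },
    }


def upload_and_submit(cluster_id: str, bucket: str) -> str:
    """Upload the script and data to S3, add a step, and return the step ID."""
    # Imported here so build_spark_step can be used without boto3 installed.
    import boto3

    # Hypothetical local file names and S3 keys.
    s3 = boto3.client("s3")
    s3.upload_file("transform.py", bucket, "scripts/transform.py")
    s3.upload_file("data.json", bucket, "input/data.json")

    step = build_spark_step(
        "Transform JSON data",
        f"s3://{bucket}/scripts/transform.py",
        ["--input", f"s3://{bucket}/input/data.json", "--output", f"s3://{bucket}/output/"],
    )

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    step_id = response["StepIds"][0]

    # Print the current step state, for example PENDING, RUNNING, COMPLETED, or FAILED.
    status = emr.describe_step(ClusterId=cluster_id, StepId=step_id)
    print(status["Step"]["Status"]["State"])
    return step_id
```

Keeping the step definition in its own function makes it easy to review the exact spark-submit arguments before anything is sent to the cluster. Calling upload_and_submit requires valid AWS credentials and an existing cluster and bucket.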

Intended audience

  • Candidates for AWS Certified Data Engineer Associate certification
  • Cloud Architects
  • Data Engineers
  • DevOps Engineers
  • Machine Learning Engineers


Prerequisites

Familiarity with the following will be beneficial but is not required:

  • Amazon EMR
  • Amazon Simple Storage Service (S3)
  • The Python scripting language
  • The JavaScript Object Notation (JSON) data format

Environment before

Environment after

About the author

Andrew is a Labs Developer with previous experience in the internet service provider, audio streaming, and cryptocurrency industries. He has also been a DevOps Engineer and enjoys working with CI/CD and Kubernetes.

He holds multiple AWS certifications, including Solutions Architect Associate and Professional.

Lab steps

Logging In to the Amazon Web Services Console
Touring an Amazon EMR Cluster
Uploading Files to Amazon S3
Submitting a Job to an Amazon EMR Cluster
Examining the Results