In this course for the Big Data Specialty Certification, we learn how to identify the appropriate data processing technologies for big data scenarios. We explore how to design and architect a data processing solution, and we define the operational characteristics of big data processing.
Learning objectives
- Recognize and explain how to identify the appropriate data processing technologies needed for big data scenarios.
- Recognize and explain how to design and architect a data processing solution.
Intended audience
This course is intended for students wanting to extend their knowledge of the data processing options available in AWS.
Prerequisites
While there are no formal prerequisites for this course, students will benefit from having a basic understanding of cloud computing services. If you would like to gain a solid foundation in compute fundamentals, then check out our Compute Fundamentals For AWS course.
This Course Includes
75 minutes of high-definition video.
What You'll Learn
- Course Intro: What to expect from this course.
- Amazon Elastic MapReduce Overview: In this lesson, we discuss how EMR allows you to store and process data.
- Amazon Elastic MapReduce Architecture: In this lesson, you’ll learn about EMR’s clustered architecture.
- Amazon Elastic MapReduce in Detail: In this lesson, we’ll dig deeper into EMR storage options, resource management, and processing options.
- Amazon Elastic MapReduce Reference Architecture: Best practices for using EMR.
- AWS Lambda Introduction: This lesson will kick off our discussion of Lambda and how it’s used in Big Data scenarios.
- AWS Lambda Overview: This lesson discusses how Lambda allows you to run code for virtually any type of application or backend service with no administration.
- AWS Lambda Architecture: In this lesson, we’ll discuss generic Lambda architecture and Amazon’s serverless service.
- AWS Lambda in Detail: In this lesson, we’ll dig into Events and Service Limits.
- AWS Lambda Reference Architecture: In this lesson, we'll look at a real-life scenario of how Lambda can be used.
Welcome to Big Data on AWS. We're looking at processing data with Amazon EMR. At the end of this module, you'll be able to describe in detail how Amazon EMR can be used to process data within a big data solution. We've already had a look at the storage options for big data on AWS, so let's now have a look at the options for processing data on AWS.
We'll start off by looking at Amazon EMR, and then we'll move on to a module that looks at AWS Lambda. Amazon EMR is primarily designed to process data and make this data accessible to users such as data scientists. Amazon EMR allows you to store as well as process data, and it's underpinned by the Apache Hadoop ecosystem, so it is often used as the core service within a big data analytics solution.
Amazon EMR is targeted at providing processing patterns at a speed and scale that relational databases cannot achieve. When choosing a big data processing solution from the available AWS service offerings, it is important to determine whether you need the response from the process in seconds, minutes, or hours.
This will typically drive the decision on which AWS service is best for processing that pattern of data or use case. Amazon EMR is primarily designed to deliver batch-oriented processing. As well as processing data, Amazon EMR can also store data. When choosing a big data storage solution from the available AWS service offerings, it is important to determine whether the data sources we are primarily storing contain structured, semi-structured, or unstructured data. This will typically drive the decision on which AWS service is best for that data pattern or use case.
Amazon EMR is primarily designed to manage semi-structured data, and it is designed for schema on read. Schema on read is where you apply the structure to the data as you read it, so you're effectively creating and applying the structure within your code rather than defining the structure in the database before you load the data. Amazon EMR provides a framework that allows you to easily create, customize, and manage big data processing clusters based on the Apache Hadoop ecosystem. EMR stands for Elastic MapReduce. Underlying your EMR environment is a cluster of Amazon EC2 instances that house the Hadoop ecosystem of open-source applications you need to access, process, and manage large volumes of data.
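To make the schema on read idea a little more concrete, here is a minimal sketch in plain Python. It is not EMR or Hadoop code, and the record contents, the `SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names made up for illustration. The point it tries to show is that the raw data is stored untouched and the structure is only applied by the code that reads it.

```python
import json

# Hypothetical raw records, stored as-is; nothing validated them at load time.
RAW_LINES = [
    '{"user": "alice", "amount": "12.50", "ts": "2017-03-01"}',
    '{"user": "bob", "amount": "7"}',
    '{"user": "carol", "amount": "3.25", "extra": {"note": "ignored"}}',
]

# The schema lives in the reading code: field name -> (converter, default).
SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
    "ts": (str, None),
}


def read_with_schema(line, schema=SCHEMA):
    """Parse one raw JSON line and apply the schema at read time."""
    raw = json.loads(line)
    row = {}
    for field, (convert, default) in schema.items():
        value = raw.get(field)
        # Missing fields fall back to the default; unknown fields are dropped.
        row[field] = convert(value) if value is not None else default
    return row


rows = [read_with_schema(line) for line in RAW_LINES]
total = sum(row["amount"] for row in rows)
```

If a different analysis needs a different structure, only the `SCHEMA` mapping in the reading code would change; the stored records stay exactly as they were loaded.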
Shane has been immersed in the world of data, analytics and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.