The course is part of these learning paths
Amazon Web Services Elastic MapReduce
Amazon Web Services Lambda
In this course for the Big Data Specialty Certification, we learn how to identify the appropriate data processing technologies needed for big data scenarios. We explore how to design and architect a data processing solution, and explore and define the operational characteristics of big data processing.
Intended audience: This course is intended for students who want to extend their knowledge of the data processing options available in AWS.
While there are no formal prerequisites, students will benefit from having a basic understanding of cloud computing services. Recommended course: Compute Fundamentals
- Recognize and explain how to identify the appropriate data processing technologies needed for big data scenarios.
- Recognize and explain how to design and architect a data processing solution.
This Course Includes:
75 minutes of high-definition video.
What You'll Learn:
- Course Intro: What to expect from this course
- Amazon Elastic MapReduce Overview: In this lesson we discuss how EMR allows you to store and process data.
- Amazon Elastic MapReduce Architecture: In this lesson you’ll learn about EMR’s clustered architecture.
- Amazon Elastic MapReduce in Detail: In this lesson we’ll dig deeper into EMR storage options, resource management, and processing options.
- Amazon Elastic MapReduce Reference Architecture: Best practices for using EMR.
- AWS Lambda Introduction: This lesson will kick off our discussion of Lambda and how it’s used in Big Data scenarios.
- AWS Lambda Overview: This lesson discusses how Lambda allows you to run code for virtually any type of application or backend service with no administration.
- AWS Lambda Architecture: In this lesson we’ll discuss the generic Lambda architecture and Amazon’s serverless service.
- AWS Lambda in Detail: In this lesson we’ll dig into Events and Service Limits.
- AWS Lambda Reference Architecture: In this lesson we'll look at a real-life scenario of how Lambda can be used.
Okay, so let's start by having a look at the Lambda architecture. The first thing we need to understand is that Lambda is both a generic architecture and a serverless processing service from Amazon. It is important not to get the two mixed up. Let's look at the generic Lambda architecture first to get an idea of what it is trying to achieve. Nathan Marz came up with the term Lambda Architecture for a generic, scalable, and fault-tolerant data processing architecture, based on his experience working on distributed data processing systems at BackType and Twitter. Incidentally, he was also heavily involved in the creation of Apache Storm as part of the Twitter team. The Lambda Architecture is a data processing framework that handles a massive amount of data and integrates batch and real-time processing within a single framework. It is split into three layers: the batch layer, the serving layer, and the speed layer. The batch layer is responsible for two things. The first is to store the immutable, constantly-growing master dataset, and the second is to compute new derived values from this dataset. These new derived values need to be calculated using data from the entire dataset, and therefore the batch layer is not typically able to update the calculated values frequently. As we saw in the previous module, that is where Amazon EMR comes in, providing the capability to perform this batch processing at scale. However, even with Amazon EMR, depending on the size of your dataset and the cluster, each calculation iteration could still take hours. The output from the batch layer is a set of data containing the precomputed values. The serving layer is responsible for indexing and exposing the data so that it can be queried. An example of this would be to provision Impala inside your Amazon EMR cluster; users would then be able to use Impala to query the data immediately.
Often, the serving layer will have views defined to make accessing the underlying data easier for users. Unfortunately, the batch and serving layers alone do not satisfy any real-time requirement, because MapReduce, by design, is a batch process with high latency, and it could take a few hours for new data to be represented in the derived values and made available via the serving layer. This is why we need the speed layer. The speed layer is similar to the batch layer in that it creates new derived values based on the data it receives, but unlike the batch layer, the speed layer is designed to derive this data with as minimal latency as possible. It would be used where there is a requirement for the user to get access to the derived values in milliseconds. While the batch layer will continuously recompute the derived data from scratch, the speed layer uses an incremental model, where the real-time views are incremented as and when new data is received. These real-time views that are created in the speed layer are intended to be transient, not persisted permanently. As soon as the new data and the resulting derived values are propagated through the batch and serving layers, the corresponding results in the real-time views can be discarded. The speed layer uses the concept of a sliding window: it calculates and holds the derived data for as long as the batch layer's processing window needs to derive it. One of the challenges is managing consistency for users when they need to query data that combines both the data persisted permanently in the batch layer and the data that is temporarily stored in the speed layer. When we map the Lambda architecture against the Amazon big data services, we see that we can actually use one of the components in Amazon EMR to deliver it. A typical solution would be to use MapReduce with EMR for the batch layer, Impala for the serving layer, and Apache Storm for the speed layer.
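The relationship between these layers can be sketched in a few lines of Python. This is a hypothetical illustration, not part of the course materials: the batch view holds precomputed page-view counts derived from the full master dataset, the speed layer incrementally counts events that arrived after the last batch run, and a query merges the two at read time.

```python
# Hypothetical sketch of the Lambda Architecture's query path.
# The batch layer recomputes counts from the entire (immutable)
# master dataset; the speed layer only increments for new events.

from collections import defaultdict

# Batch view: precomputed by the batch layer, updated infrequently.
batch_view = {"home": 1000, "checkout": 250}

# Speed layer: transient, incremental real-time view.
speed_view = defaultdict(int)

def speed_layer_ingest(page):
    """Incrementally update the real-time view as events arrive."""
    speed_view[page] += 1

def query(page):
    """Merge the batch view with the real-time view at query time."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

def batch_run_complete(new_batch_view):
    """Once a batch run has absorbed the new data, the corresponding
    transient real-time views can be discarded."""
    batch_view.update(new_batch_view)
    speed_view.clear()

# Two events arriving after the last batch run:
speed_layer_ingest("home")
speed_layer_ingest("home")
print(query("home"))  # 1002: batch view (1000) + speed view (2)
```

The consistency challenge mentioned above lives in `query`: the caller sees one merged number, but its two halves come from stores with very different update latencies.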
Or, we can use AWS Lambda as part of the speed layer. Let's have a look at an example of what a Lambda architecture in Amazon, using AWS Lambda, could look like. Here is an example of an architecture from an Amazon customer, SmartNews, that combines the Lambda architecture approach of a batch and a speed layer together, while also using AWS Lambda as part of the speed layer solution, as well as many of the other Amazon big data services we have already discussed, or will discuss shortly. In fact, this is typical for using big data within Amazon: it is a case of coupling together multiple services to achieve your specific goal. In this case, SmartNews is using AWS Lambda to stream alerts based on log data. Let's have a look at how the AWS Lambda service itself is architected. AWS Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you. As such, AWS Lambda is based on a platform-as-a-service style architecture, where you determine the size of the capacity you require to execute your code, and the architecture and the components are automatically provided. You have no need, or ability, to change the way these architectural components are deployed, and you pay only for the compute time you consume. There is no charge when your code is not running. AWS Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of your compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, as well as code monitoring and logging. All you need to do is supply your code in one of the languages that AWS Lambda currently supports, such as Node.js, Java, and Python, and AWS Lambda does the rest. When using AWS Lambda, you are responsible only for your code. AWS Lambda manages the compute fleet that offers a balance of memory, CPU, network, and other resources.
This is in exchange for flexibility, which means you cannot log in to the compute instances or customize the operating systems or the language runtimes. AWS Lambda has built-in fault tolerance that maintains compute capacity across multiple Availability Zones in each region to help protect your code against individual machine or data center facility failures. There are no maintenance windows or scheduled downtimes. AWS Lambda invokes your code only when needed and automatically scales to support the rate of incoming requests without requiring you to configure anything. In AWS Lambda, Lambda functions and event sources are the core components of the AWS Lambda architecture. An event source is the entity that publishes events, and a Lambda function is the custom code that processes the events. The code you run on AWS Lambda is called a Lambda function. After you create your Lambda function, it is always ready to run as soon as it is triggered, similar to a formula in a spreadsheet. An interesting fact is that AWS Lambda actually stores your Lambda function code in Amazon S3 and encrypts it at rest. Your Lambda functions can be triggered by many event sources, for example, HTTP requests, changes to data in an Amazon S3 bucket, or the insert of data into an Amazon DynamoDB table. When you provision a Lambda function within your Amazon account, in the background, AWS Lambda creates the ability to execute this code within a container. When it is executing your code, AWS Lambda executes the function code in this container, which isolates it from other functions and provides the defined resources, such as the memory specified in the Lambda function's configuration. You have no control or visibility over this container; it is all handled in the background by AWS Lambda. It is also important to note that this container is not persistent, so you need to ensure your function code is written to be managed in a stateless way.
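To make this concrete, here is a minimal, stateless Lambda handler in Python that processes an S3 event of the kind just described. The `(event, context)` handler signature follows AWS Lambda's Python programming model; the bucket and object names are hypothetical, and the `sample_event` is a trimmed-down version of the record structure S3 delivers.

```python
# Minimal, stateless AWS Lambda handler for an Amazon S3 event.
# Nothing is kept between invocations, because the container that
# runs the function is not persistent.

def lambda_handler(event, context):
    results = []
    # An S3 event delivers one or more records identifying the
    # bucket and object that triggered the function.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        results.append(f"{bucket}/{key}")
    return {"processed": results}

# Hypothetical, trimmed-down example of the event shape S3 sends:
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-log-bucket"},
                "object": {"key": "logs/2016-01-01.gz"}}}
    ]
}
print(lambda_handler(sample_event, None))
```

Because the handler derives everything it needs from the incoming event, it behaves identically whether Lambda reuses a warm container or starts a fresh one.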
Before we go into each of the options in detail, let's have a quick look at how AWS makes things easier for you. One of the great things about AWS is that they always try to make things easy for you, so when you go to create a new Lambda function in the console, there are a number of premade blueprints that will help you with scenarios where Lambda is commonly used. Blueprints provide example code to do some minimal processing. Most blueprints process events from specific event sources, such as Amazon S3, DynamoDB, or custom applications. If you select the blank function, the AWS Lambda wizard will take you through the steps you need to follow to get your code executing. The first step is to decide what will trigger the execution of your code. If we choose S3 as the source of the trigger event, then we'll be asked for the details of the S3 bucket that will be monitored for the trigger, and the S3 event that will cause the trigger; in this case, an object being removed from the S3 bucket. Next, we need to define the code that will become the Lambda function and identify the runtime engine the code is based upon, for example, Node.js or Python. You also need to define the amount of resources you wish this Lambda function to consume. As AWS Lambda is a serverless service, you define the maximum amount of memory the function can consume, and the AWS Lambda service takes care of everything else. AWS Lambda allocates CPU power proportionally to the memory, using the same ratio as a general-purpose Amazon EC2 instance type, such as an M3 type. For example, if you allocate 256 megabytes of memory, your Lambda function will receive twice the CPU share it would if you allocated only 128 megabytes. The other key configuration on this screen is the timeout. You pay for the Amazon resources that are used to run your Lambda function.
To prevent your Lambda function from running indefinitely, you specify a timeout, and when the specified timeout is reached, AWS Lambda will automatically terminate your Lambda function.
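In code, the remaining timeout budget is available through the `context` object's `get_remaining_time_in_millis()` method, which is part of Lambda's Python programming model. Here is a sketch of a handler that stops work cleanly before Lambda terminates it; the `FakeContext` class is our own stand-in so the sketch can run outside of Lambda.

```python
import time

def lambda_handler(event, context):
    """Process items until we are close to the configured timeout,
    leaving a safety margin so we can exit cleanly instead of being
    terminated mid-work by AWS Lambda."""
    processed = 0
    for item in event.get("items", []):
        # Stop if less than 500 ms of the timeout budget remains.
        if context.get_remaining_time_in_millis() < 500:
            break
        processed += 1
    return {"processed": processed}

# Hypothetical stand-in for Lambda's context, for local testing only.
class FakeContext:
    def __init__(self, timeout_ms):
        self.deadline = time.monotonic() * 1000 + timeout_ms

    def get_remaining_time_in_millis(self):
        return max(0, self.deadline - time.monotonic() * 1000)

result = lambda_handler({"items": [1, 2, 3]}, FakeContext(timeout_ms=3000))
print(result)  # all three items fit comfortably within the budget
```

Checking the remaining time inside the loop lets the function return a partial result (which a caller could use to resume later) rather than losing all of its work to a hard termination.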
About the Author
Shane has been immersed in the world of data, analytics, and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.