In this course for the Big Data Specialty Certification, we learn how to identify the appropriate data processing technologies needed for big data scenarios. We explore how to design and architect a data processing solution, and we define the operational characteristics of big data processing.
Learning objectives
- Recognize and explain how to identify the appropriate data processing technologies needed for big data scenarios.
- Recognize and explain how to design and architect a data processing solution.
Intended audience
This course is intended for students wanting to extend their knowledge of the data processing options available in AWS.
Prerequisites
While there are no formal prerequisites for this course, students will benefit from having a basic understanding of cloud computing services. If you would like to gain a solid foundation in compute fundamentals, then check out our Compute Fundamentals For AWS course.
This Course Includes
75 minutes of high-definition video.
What You'll Learn
- Course Intro: What to expect from this course
- Amazon Elastic MapReduce Overview: In this lesson, we discuss how EMR allows you to store and process data.
- Amazon Elastic MapReduce Architecture: In this lesson, you’ll learn about EMR’s clustered architecture.
- Amazon Elastic MapReduce in Detail: In this lesson, we’ll dig deeper into EMR storage options, resource management, and processing options.
- Amazon Elastic MapReduce Reference Architecture: Best practices for using EMR.
- AWS Lambda Introduction: This lesson will kick off our discussion of Lambda and how it's used in Big Data scenarios.
- AWS Lambda Overview: This lesson discusses how Lambda allows you to run code for virtually any type of application or backend service with no administration.
- AWS Lambda Architecture: In this lesson, we’ll discuss generic Lambda architecture and Amazon’s serverless service.
- AWS Lambda in Detail: In this lesson, we’ll dig into Events and Service Limits.
- AWS Lambda Reference Architecture: In this lesson, we'll look at a real-life scenario of how Lambda can be used.
Let's have a look at some of the AWS Lambda core concepts in more detail. The code you run on AWS Lambda is uploaded as a Lambda Function. Each function has associated configuration information such as its name, description, entry point, and resource requirements.
To create a Lambda Function, you first package your code and dependencies in a deployment package. Then you upload the deployment package to AWS Lambda to create your Lambda Function. When you create Lambda Functions using the console, the console creates the deployment package for you and then uploads it to create your Lambda Function. You can author your Lambda Function code in any of the languages that are supported by AWS Lambda.
At this time, this includes Node.js, Java, C#, and Python. The function code must be written in a stateless style, which means it should assume there is no affinity to the underlying compute infrastructure. Objects such as local file system access, child processes, and similar artifacts may not extend beyond the lifetime of the request, and any persistent state should be stored in Amazon S3, Amazon DynamoDB, or another persistent and accessible storage service.
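As a minimal sketch of this stateless style, here is a Python handler that keeps nothing on the local machine and writes anything that must persist to DynamoDB. The table name is hypothetical, and the handler name is whatever entry point you configure for the function:

```python
import json

import boto3  # AWS SDK for Python, available in the Lambda runtime

# Hypothetical table used for persistent state; any durable store works.
TABLE = boto3.resource("dynamodb").Table("example-results")

def handler(event, context):
    # Nothing is kept on the container between invocations; anything
    # that must persist is written to DynamoDB instead.
    TABLE.put_item(Item={
        "id": context.aws_request_id,   # unique per invocation
        "payload": json.dumps(event),   # the incoming event data
    })
    return {"status": "stored"}
```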
When executing your code, AWS Lambda launches a container and executes the function code in that container. The container isolates the executing function from other functions and provides the resources, such as memory, specified in the function's configuration. Think of it as a non-persistent EC2 instance or a Docker container. The first time a function executes after being created, or after having its code or resource configuration updated, a new container with the appropriate resources will be created to execute it, and the code for the function will be loaded into the container.
The second time your function gets executed, Lambda may reuse the current container or it may create a new container all over again. If you have changed your code, AWS Lambda will definitely create a new container. If you haven't changed the code and not too much time has gone by, AWS Lambda may reuse the previous container. It takes time to set up a container and do the necessary bootstrapping, which adds some latency each time the Lambda Function is invoked.
We typically see this latency when a Lambda Function is invoked for the first time or after it has been updated. AWS Lambda tries to reuse the container for subsequent invocations of the Lambda Function to reduce this latency. After a Lambda Function is executed, AWS Lambda maintains the container for some time in anticipation of another Lambda Function invocation. In effect, the service freezes the container after the Lambda Function completes, and thaws it for reuse if AWS Lambda chooses to reuse the container when the Lambda Function is invoked again.
This reuse provides some potential performance benefits for your functions. For example, if your Lambda Function establishes a database connection, instead of re-establishing the connection, the original connection is used in subsequent invocations. You can add logic in your code to check whether a connection already exists before creating a new one. However, when you write your Lambda Function code, do not assume that AWS Lambda will always reuse the container, because AWS Lambda may choose not to. Depending on various other factors, AWS Lambda may simply create a new container instead of reusing an existing one.
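A sketch of that connection-reuse pattern, with a hypothetical connect_to_database helper standing in for whatever your database driver provides:

```python
# Module-level state survives for the life of the container, so a
# reused ("thawed") container skips the reconnect entirely.
connection = None

def connect_to_database():
    # Hypothetical helper; substitute your database driver's connect call.
    raise NotImplementedError

def handler(event, context):
    global connection
    if connection is None:
        # Fresh container: establish the connection once.
        connection = connect_to_database()
    # ... use `connection` to process the event ...
    return "ok"
```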
In AWS Lambda, Lambda Functions and Event Sources are the core components of the architecture. An Event Source is the entity that publishes events, and a Lambda Function is the custom code that processes the events. There are a number of supported Event Sources that are preconfigured to work with AWS Lambda. The configuration is referred to as an Event Source mapping, which maps an Event Source to a Lambda Function. It enables automatic invocation of your Lambda Function when the events occur.
Each Event Source mapping identifies the type of events to publish and the Lambda Function to invoke when the event occurs. The specific Lambda Function then receives the event information as a parameter, and your Lambda Function code can then process the data generated by the event. Event Sources can be either another Amazon big data service, for example S3 or DynamoDB, or a custom application. Within the Amazon big data services, there are two ways your Lambda Function can be invoked.
One is by an event trigger, which pushes the event data to Lambda and invokes the function; the other is by a stream, where AWS Lambda polls the stream, pulling the event data and invoking the function. The Amazon service you map to the Lambda Function determines whether a push or pull paradigm is used. For the stream-based services, AWS Lambda maintains the Event Source mapping and invokes the function when it detects the relevant event. For non-stream-based services, the Event Source invokes the Lambda Function when the Event Source detects the event.
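For the pull model, the Event Source mapping lives in Lambda itself. A sketch of creating one with the AWS SDK for Python (boto3); the stream ARN and function name are placeholders for your own resources:

```python
import boto3

lambda_client = boto3.client("lambda")

# Map a Kinesis stream (pull model) to a Lambda Function.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    FunctionName="example-function",
    StartingPosition="LATEST",  # only read records that arrive from now on
    BatchSize=100,              # max records handed to each invocation
)
```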
In addition to invoking Lambda Functions using Event Sources, you can also invoke your Lambda Function on demand. You don't need to preconfigure any Event Source mapping in this case. However, make sure that the custom application has the necessary permissions to invoke your Lambda Function. AWS Lambda supports synchronous and asynchronous invocation of a Lambda Function. The invocation type that these Event Sources use when invoking a Lambda Function is preconfigured. For example, Amazon S3 always invokes a Lambda Function asynchronously, and Amazon Cognito invokes a Lambda Function synchronously.
For stream-based Amazon big data services such as Amazon Kinesis Streams and Amazon DynamoDB Streams, AWS Lambda polls the stream and invokes your Lambda Function synchronously. You can also manually invoke a Lambda Function, for example using the AWS Command Line Interface for testing purposes. You can control the invocation type when you manually invoke a Lambda Function.
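A sketch of such an on-demand invocation using boto3 rather than the CLI; the function name is a placeholder:

```python
import json

import boto3

lambda_client = boto3.client("lambda")

# "RequestResponse" invokes synchronously and returns the function's
# result; "Event" invokes asynchronously and returns immediately.
response = lambda_client.invoke(
    FunctionName="example-function",
    InvocationType="RequestResponse",
    Payload=json.dumps({"test": True}),
)
print(response["Payload"].read())
```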
By default, a failed Lambda Function invoked asynchronously is retried twice, and then the event is discarded. Using dead letter queues, you can indicate to Lambda that unprocessed events should be sent to an Amazon SQS queue or an Amazon SNS topic instead, where you can take further action.
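Configuring a dead letter queue is a one-line change to the function's configuration. A sketch with boto3, using placeholder names and ARNs:

```python
import boto3

lambda_client = boto3.client("lambda")

# Events that still fail after the retries are sent to this SQS queue;
# an SNS topic ARN works the same way.
lambda_client.update_function_configuration(
    FunctionName="example-function",
    DeadLetterConfig={"TargetArn": "arn:aws:sqs:us-east-1:123456789012:example-dlq"},
)
```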
There are a large number of Amazon big data services you can use to provide an event trigger that will instruct your Lambda Function to execute. After configuring the Event Source mapping when you register your Lambda Function, your Lambda Function will automatically be invoked when these Event Sources detect the event. With Amazon S3, you can route S3 bucket events, such as object-created or object-deleted events, to Lambda Functions. For example, when a user uploads a photo to a bucket, you might want Amazon S3 to invoke your Lambda Function so that it reads the image and creates a thumbnail for the photo.
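A sketch of a handler for that scenario. The event layout is the standard S3 notification format; the resize step itself is left as a comment:

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # An S3 notification can carry several records per invocation.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        image = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # ... resize `image` and put the thumbnail back to S3 ...
```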
You can use AWS Lambda Functions as triggers for your Amazon DynamoDB tables. Triggers are custom actions you take in response to updates made to a DynamoDB table. To create a trigger, first enable Amazon DynamoDB Streams for your table; AWS Lambda then polls the stream, and your Lambda Function processes any updates published to that stream.
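A minimal sketch of such a trigger function; each stream record carries the event name and, depending on the stream view type, the new and/or old item images:

```python
def handler(event, context):
    for record in event["Records"]:
        action = record["eventName"]  # INSERT, MODIFY, or REMOVE
        # NewImage is present when the stream view type includes new images.
        new_image = record["dynamodb"].get("NewImage", {})
        print(f"{action}: {new_image}")
```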
You can also configure AWS Lambda to automatically poll your Amazon Kinesis stream and process new records, such as social media feeds. Once configured, AWS Lambda will poll the stream periodically, in fact multiple times per second, for new records that have arrived in the stream.
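A sketch of a handler for those polled Kinesis records, which arrive base64-encoded in the event batch:

```python
import base64

def handler(event, context):
    for record in event["Records"]:
        # Kinesis record data is delivered base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        print(payload.decode("utf-8"))  # e.g. one social media item
```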
You can also write Lambda Functions to process Amazon Simple Notification Service notifications. When a message is published to an Amazon SNS topic, the service can invoke your Lambda Function by passing the message payload as a parameter. Your Lambda Function code can then process the event, for example publishing the message to another Amazon SNS topic or sending the message to another Amazon big data service. When you use Amazon Simple Email Service to receive messages, you can configure Amazon SES to call your Lambda Function when a new message arrives. The service can then invoke your Lambda Function by passing in the incoming email event as a parameter.
The Amazon Cognito events feature enables you to run Lambda Functions in response to events in Amazon Cognito. For example, you can invoke a Lambda Function for the Sync Trigger event that is published each time a dataset is synchronized. As part of deploying AWS CloudFormation stacks, you can specify a Lambda Function as a custom resource to execute any custom commands.
Associating a Lambda Function with a custom resource enables you to invoke your Lambda Function whenever you create, update, or delete AWS CloudFormation stacks. You can also use AWS Lambda Functions to perform custom analysis on Amazon CloudWatch Logs using CloudWatch Logs subscriptions. CloudWatch Logs subscriptions provide access to a real-time feed of log events from CloudWatch Logs and deliver it to your AWS Lambda Function for custom processing, analysis, or loading into other systems.
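A sketch of a subscription handler; the log events arrive gzip-compressed and base64-encoded in the awslogs field of the event:

```python
import base64
import gzip
import json

def handler(event, context):
    # The subscription delivers gzip-compressed, base64-encoded log data.
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    for log_event in payload["logEvents"]:
        print(log_event["message"])  # custom analysis would go here
```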
Amazon CloudWatch Events helps you respond to state changes in your AWS resources. When your resources change state, they automatically send events into the event stream. You can create rules that match selected events in the stream and route them to your AWS Lambda Function to take action. For example, you can automatically invoke an AWS Lambda Function to log the state of an EC2 instance or an Auto Scaling group.
You can also create a trigger for an AWS CodeCommit repository, so that events in the repository will invoke a Lambda Function. For example, you can invoke a Lambda Function when a branch or a tag is created or when a push is made to an existing branch. You can use AWS Lambda Functions to evaluate whether your AWS resource configurations comply with your custom rules.
As resources are created, deleted, or changed, AWS Config records these changes and sends the information to your Lambda Functions. Your Lambda Functions can then evaluate the change and report the results to AWS Config. You can then use AWS Config to assess overall resource compliance: you can learn which resources are noncompliant and which configuration attributes are the cause of the noncompliance.
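A sketch of a custom AWS Config rule handler in this style; the compliance test shown is deliberately trivial and purely illustrative:

```python
import json

import boto3

config = boto3.client("config")

def handler(event, context):
    # AWS Config passes the changed resource as a JSON string.
    item = json.loads(event["invokingEvent"])["configurationItem"]
    # Illustrative check only: flag everything that isn't an EC2 instance.
    compliance = ("COMPLIANT" if item["resourceType"] == "AWS::EC2::Instance"
                  else "NON_COMPLIANT")
    config.put_evaluations(
        Evaluations=[{
            "ComplianceResourceType": item["resourceType"],
            "ComplianceResourceId": item["resourceId"],
            "ComplianceType": compliance,
            "OrderingTimestamp": item["configurationItemCaptureTime"],
        }],
        ResultToken=event["resultToken"],
    )
```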
You can use AWS Lambda Functions to build services that give new skills to Alexa, the voice assistant on Amazon Echo. The Alexa Skills Kit provides the APIs, tools, and documentation to create these new skills, powered by your own service running as a Lambda Function. Amazon Lex is an AWS service for building conversational interfaces into applications using voice and text. Amazon Lex provides pre-built integration with AWS Lambda.
This allows you to create Lambda Functions for use as code hooks for your Amazon Lex bot. In your intent configuration, you can identify your Lambda Function to perform initialization and validation, fulfillment, or both. You can also invoke an AWS Lambda Function over HTTPS. You can do this by defining a custom REST API and endpoint using Amazon API Gateway.
You map individual API operations, such as GET and PUT, to specific Lambda Functions. When you send an HTTPS request to the API endpoint, the Amazon API Gateway service invokes the corresponding Lambda Function.
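A sketch of a function behind such an endpoint, assuming the common proxy-style integration in which the function's return value supplies the HTTPS response:

```python
import json

def handler(event, context):
    # With the proxy-style integration, the return value supplies the
    # status code and body of the HTTPS response.
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "hello from Lambda"}),
    }
```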
As with all Amazon big data services, there are a number of limits within the AWS Lambda service you need to be aware of. As I've outlined, you choose the amount of memory you want your Lambda Function to have when creating it. You can set the memory in 64-megabyte increments, from 128 megabytes to 1.5 gigabytes.
Each Lambda Function receives 512 megabytes of non-persistent disk space in its own /tmp directory. This cannot be extended, so you need to manage the temporary output requirements of your Lambda Functions carefully. All calls made to AWS Lambda must complete within 300 seconds. The default timeout is three seconds, but you can set the timeout to any value between one and 300 seconds. If the Lambda Function does not complete its execution within this time frame, it will be terminated.
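A sketch of adjusting both of those settings with boto3; the function name is a placeholder:

```python
import boto3

lambda_client = boto3.client("lambda")

# Memory is set in 64 MB increments; timeout in seconds, up to the
# 300-second ceiling described above.
lambda_client.update_function_configuration(
    FunctionName="example-function",
    MemorySize=512,  # megabytes
    Timeout=60,      # seconds
)
```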
There are a number of further limits within the AWS Lambda service you need to be aware of. When these limits have been reached, attempts to add additional resources will fail with an exception. AWS Lambda is designed to run many instances of your functions in parallel. However, AWS Lambda has a default safety throttle of 600 concurrent executions per account per region. Any synchronous invocation that causes your function's concurrent executions to exceed the safety limit is throttled; the invoking application receives a 429 error and AWS Lambda does not execute your function.
Lambda Functions being invoked asynchronously can absorb reasonable bursts of traffic for approximately 15 to 30 minutes, after which incoming events will be rejected as throttled. If the Lambda Function is being invoked in response to Amazon S3 events, events rejected by AWS Lambda may be retained and retried by S3 for 24 hours. Events from Amazon Kinesis Streams and Amazon DynamoDB Streams are retried until the Lambda Function succeeds or the data expires; both services retain data for 24 hours. When deploying a Lambda Function, there are also limits on what you can deploy.
For example, your function's deployment package cannot be larger than 50 megabytes. Functions that exceed any of the listed limits will fail with an exceeded-limit exception. Apart from the concurrent executions limit, these limits are fixed and cannot be changed at this time. For example, if you receive an error message similar to "code storage limit exceeded" from AWS Lambda, you need to reduce the size of your code storage.
Shane has been immersed in the world of data, analytics, and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However, you're more likely to see him partaking of a good long black or an even better craft beer.