In this course for the Big Data Specialty Certification, we learn how to identify the appropriate data processing technologies needed for big data scenarios. We explore how to design and architect a data processing solution, and explore and define the operational characteristics of big data processing.
- Recognize and explain how to identify the appropriate data processing technologies needed for big data scenarios.
- Recognize and explain how to design and architect a data processing solution.
This course is intended for students wanting to extend their knowledge of the data processing options available in AWS.
While there are no formal prerequisites for this course, students will benefit from having a basic understanding of cloud computing services. If you would like to gain a solid foundation in compute fundamentals, then check out our Compute Fundamentals For AWS course.
This Course Includes
75 minutes of high-definition video.
What You'll Learn
- Course Intro: What to expect from this course
- Amazon Elastic MapReduce Overview: In this lesson, we discuss how EMR allows you to store and process data
- Amazon Elastic MapReduce Architecture: In this lesson, you’ll learn about EMR’s clustered architecture.
- Amazon Elastic MapReduce in Detail: In this lesson, we’ll dig deeper into EMR storage options, resource management, and processing options.
- Amazon Elastic MapReduce Reference Architecture: Best practices for using EMR.
- AWS Lambda Introduction: This lesson will kick off our discussion of Lambda and how it’s used in Big Data scenarios.
- AWS Lambda Overview: This lesson discusses how Lambda allows you to run code for virtually any type of application or backend service with no administration.
- AWS Lambda Architecture: In this lesson, we’ll discuss generic Lambda architecture and Amazon’s serverless service.
- AWS Lambda in Detail: In this lesson, we’ll dig into Events and Service Limits.
- AWS Lambda Reference Architecture: In this lesson, we'll look at a real-life scenario of how lambda can be used.
So, let's have a look at the Amazon EMR options in a little more detail. Amazon EMR provides three storage options: HDFS, EMRFS, and the local file system. The Hadoop Distributed File System, or HDFS, is a distributed, scalable file system for Hadoop.
HDFS distributes the data it stores across the nodes within the EMR cluster, keeping multiple copies of the data on different nodes to ensure that no data is lost if an individual node fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster, so when you terminate your cluster, you lose this data. HDFS is very useful for caching intermediate results during processing or for workloads that have significant random I/O. Using the EMR File System, or EMRFS, Amazon EMR extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS. When you use EMRFS, a metadata store is transparently built in DynamoDB to help manage the interactions with Amazon S3, and it allows you to have multiple EMR clusters easily use the same EMRFS metadata and storage on Amazon S3.
The local file system refers to locally connected disk. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store. Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance. With the local file system, you lose the data replication benefits of HDFS storage, and you also lose the persistent storage that you gain if you use EMRFS.
Amazon does not recommend using the local file system option. Note that previously Amazon EMR used the S3 native file system with the URI scheme s3n. While this still works, Amazon recommends that you use the s3 URI scheme for the best performance, security, and reliability. Amazon recommends you use either HDFS or the EMR File System (EMRFS), which is effectively Amazon S3 as the file system, in your cluster. An advantage of HDFS is the data awareness it delivers between the Hadoop nodes managing the cluster.
Another advantage is that it's very fast. A disadvantage is that it's ephemeral storage, which is reclaimed when the cluster ends: once you terminate the EMR cluster, the data disappears along with the cluster's EC2 instances. HDFS is best used for caching results produced by intermediate job-flow steps. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop, while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
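As a rough sketch of the pattern this implies, persistent data lives on Amazon S3 via EMRFS while intermediate results live on HDFS. The helper below is illustrative only; the bucket name and paths are made-up placeholders, not anything from the course.

```python
# Illustrative only: route data sets to persistent storage (EMRFS / Amazon S3)
# or ephemeral storage (HDFS). Bucket name and paths are placeholders.

def data_path(name: str, intermediate: bool = False) -> str:
    """Return an s3:// URI for persistent data, hdfs:// for intermediates."""
    if intermediate:
        # HDFS storage is reclaimed when the cluster terminates
        return f"hdfs:///tmp/{name}"
    # S3 storage persists independently of any cluster's lifetime
    return f"s3://my-example-bucket/{name}"

print(data_path("logs/input"))                           # s3://my-example-bucket/logs/input
print(data_path("shuffle-stage-1", intermediate=True))   # hdfs:///tmp/shuffle-stage-1
```

The same job can then read its input from the s3:// path and spill intermediates to the hdfs:// path without caring which file system backs each one.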
Most often, Amazon S3 is used to store input and output data, and intermediate results are stored in HDFS. The resource management layer is responsible for managing cluster resources and scheduling the jobs for processing data.
Now part of the core Hadoop project, YARN stands for Yet Another Resource Negotiator and is the architectural center of Hadoop that allows multiple data processing engines to execute their code. By default, Amazon EMR uses YARN to centrally manage cluster resources for multiple data processing frameworks. Amazon EMR also has an agent on each node which administers YARN components, keeping the cluster healthy and in communication with the Amazon EMR service.
Apache Hadoop MapReduce provides three core components: one, the end-user MapReduce API for programming the desired MapReduce application; two, the MapReduce framework, which is the runtime implementation of the various phases such as the map phase, the sort, shuffle, merge, and aggregation phases, and the reduce phase; and three, the MapReduce system, which is the backend infrastructure required to run the user's MapReduce application, manage the cluster resources, and schedule thousands of concurrent jobs.
In the initial releases of Hadoop, the MapReduce component handled the resource management requirements using the JobTracker and TaskTrackers, effectively creating a typical master and slave architecture. The JobTracker was responsible for resource management (managing the worker nodes, called TaskTrackers), tracking resource consumption and availability, and job life-cycle management such as scheduling individual tasks of a job, tracking progress, and providing fault tolerance for tasks. The TaskTracker had simple responsibilities: launching or tearing down tasks on orders from the JobTracker, and periodically providing task status information to the JobTracker.
This distributed implementation of resource management within MapReduce is found in the original version of Hadoop, version 1, and is sometimes referred to as MRv1. In Hadoop version 2, often referred to as MRv2, the resource management responsibility was moved to YARN. The fundamental idea of YARN is to split out the two major responsibilities of the JobTracker, i.e. resource management and job scheduling and monitoring, into separate components: a global ResourceManager and a per-application ApplicationMaster. The ResourceManager and the per-node slave, the NodeManager, form the new generic system for managing applications in a distributed manner.
The simplest way to understand the difference between MRv1 and MRv2 is that the architecture went from being a single-use system that only handled batch jobs to a multi-purpose, multi-tenant system that could not only run the batch jobs Hadoop always did, but also more interactive, online, or streaming jobs as well. YARN keeps track of all the resources across the cluster and ensures these resources are dynamically allocated to accomplish the tasks in your processing job. YARN is able to manage Hadoop as well as other distributed frameworks such as Apache Spark.
However, there are other frameworks and applications offered in Amazon EMR that do not use YARN as a resource manager. Amazon EMR also has an agent on each node which administers the YARN components, keeping the cluster healthy and in communication with the Amazon EMR service. As we saw on the Amazon EMR provisioning screen, there are four main processing engines that you can employ on your EMR cluster: Hadoop MapReduce, Presto, HBase, and Apache Spark.
Hadoop MapReduce is an open-source programming model for distributed computing that has been the de facto standard for big data processing for a while. It is at the heart of the Hadoop ecosystem along with HDFS. MapReduce simplifies the process of writing parallel distributed applications by handling all the logic, while HDFS handles the data storage.
MapReduce was popularized by Google as a way of processing large data sets in a highly parallel and scalable way. Hadoop MapReduce processes workloads using a framework that breaks down jobs into smaller pieces of work that can be distributed across the nodes in your Amazon EMR cluster. It is built with the expectation that any given machine in a cluster could fail at any time, and is designed for fault tolerance.
If a server running a task fails, Hadoop reruns the task on another machine until completion. MapReduce provides a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely map and reduce, hence the name. The map function maps data into sets of key-value pairs called intermediate results, which can be sent to each of the EMR slave nodes for processing, enabling the processing to be distributed across the cluster and therefore returning the result faster than processing on a single node.
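As a toy, self-contained Python sketch of the two phases using the classic word-count example: everything here runs locally, whereas on a real EMR cluster the map calls would be distributed across the slave nodes.

```python
# Toy sketch of the map and reduce phases (word count).
from collections import defaultdict

def map_phase(line):
    """Emit intermediate (key, value) pairs: one (word, 1) per word."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Combine intermediate results, summing the counts for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

lines = ["big data on emr", "big clusters process big data"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(intermediate))  # {'big': 3, 'data': 2, 'on': 1, ...}
```

In a real cluster, the framework also sorts and shuffles the intermediate pairs so that all values for the same key arrive at the same reducer.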
The reduce function combines the intermediate results, applies additional algorithms, and produces the final output. As an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times. The census bureau would dispatch its people to each of the cities in the empire. Each census taker in each city would be tasked to count the number of people in that city and then return the results to the capital city. There, the results from each city would be reduced to a single count, a sum across the cities, to determine the overall population of the empire. This mapping of people to cities in parallel, and then combining the results, effectively reducing them, is much more efficient than sending a single person to count every person in a serial fashion.

TEZ is an extensible framework for building high-performance batch and interactive data processing applications.
It is designed to improve the performance of MapReduce. TEZ improves the MapReduce paradigm by improving its speed while maintaining MapReduce's ability to scale to terabytes of data. YARN manages the resources in the Hadoop cluster based on the cluster's capacity and load. The TEZ execution engine efficiently acquires resources from YARN and reuses every component in the pipeline so that no operation is duplicated unnecessarily.
Importantly, Hadoop ecosystem projects like Hive and Pig use TEZ, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem. HBase is a column-oriented data management system that runs on top of HDFS. It is well suited for sparse data sets, which are common in many big data use cases. Unlike relational databases, HBase does not support a structured query language like SQL; in fact, HBase isn't a relational data store at all.
An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a primary key, and all access to HBase tables must use this primary key. An HBase column represents an attribute of an object. Take, for example, a table storing diagnostic logs from servers in your environment, where each row might be a log record. A typical column in such a table would be the timestamp of when the log record was written, or perhaps the server name where the record originated. HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to the MapReduce framework and execution engine. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for JDBC connectivity.
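The sparse, row-key-oriented model can be pictured as a map of row key to a map of columns, where absent columns simply don't exist. This is a conceptual sketch only (not the HBase API); the table and column names are made up for the server-log example above.

```python
# Conceptual sketch of HBase's sparse model: row key -> {column -> value}.
# Not the real HBase client API; names are illustrative.
from collections import defaultdict

log_table = defaultdict(dict)   # row key -> sparse set of columns

def put(row_key, column, value):
    log_table[row_key][column] = value

def get(row_key, column):
    return log_table[row_key].get(column)   # rows need not share columns

put("log-0001", "meta:timestamp", "2017-03-01T10:15:00Z")
put("log-0001", "meta:server", "web-01")
put("log-0002", "meta:timestamp", "2017-03-01T10:16:00Z")  # no server column

print(get("log-0001", "meta:server"))   # web-01
print(get("log-0002", "meta:server"))   # None (sparse: column absent)
```

Because missing columns store nothing, a table with millions of possible columns costs only what each row actually uses.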
Presto is a distributed SQL query engine optimized for ad hoc analysis. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources, including HDFS and Amazon S3. The Presto processing framework is fundamentally different from that of MapReduce: Presto has a custom query and execution engine where the stages of execution are pipelined and all processing occurs in memory to reduce disk I/O.
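Pipelined execution can be illustrated with Python generators: each stage streams rows to the next as they become available, rather than materializing a full intermediate data set between stages. This is a toy analogy, not Presto's actual engine.

```python
# Toy analogy for pipelined execution: generator stages stream rows onward
# instead of writing a full intermediate result between stages.

def scan(rows):               # stage 1: read source rows one at a time
    for row in rows:
        yield row

def transform(rows):          # stage 2: starts as soon as stage 1 yields
    for row in rows:
        yield row * 2

def aggregate(rows):          # stage 3: consumes the stream incrementally
    total = 0
    for row in rows:
        total += row
    return total

print(aggregate(transform(scan([1, 2, 3, 4]))))  # 20
```

Contrast this with a batch model, where each stage would finish completely (often spilling to disk) before the next stage begins.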
This pipelined execution model can run multiple stages in parallel and streams data from one stage to another as the data becomes available. This reduces end-to-end latency and makes Presto a great tool for ad hoc data exploration over large data sets. While MapReduce is very effective for summarizing and querying large sets of data, the computations Hadoop enables on MapReduce are relatively slow and limited, which is where Spark comes in.
Developed at UC Berkeley's AMPLab in 2009 and open-sourced in 2010, Apache Spark is a powerful Hadoop data processing engine designed to execute both batch and streaming workloads faster than MapReduce. Apache Spark uses in-memory caching and optimized execution to deliver this improved performance, and it can be used for general batch processing, streaming analytics, machine learning, graph databases, and ad hoc queries. Spark also supports multiple interactive query modules such as Spark SQL. Apache Spark on Hadoop YARN is natively supported in Amazon EMR, and when you run Spark on Amazon EMR, you can use EMRFS to directly access your Amazon S3 data.
As well as the different options to store and process data, Amazon EMR also supports a variety of other popular applications and tools in the Hadoop ecosystem to query and access the data. These access components allow you to submit code to your Amazon EMR cluster via a range of different languages and options. Apache Hive allows you to query Hadoop in your Amazon EMR cluster using SQL. Hive provides a JDBC connection, which means you can access Hadoop via Hive using your standard SQL-based tools.
Apache Pig provides a coding language called Pig Latin, which allows you to create a sequence of code steps that can be executed on your Amazon EMR cluster to process and transform the data. Once you've created your Pig Latin code, it is compiled into MapReduce jobs, which are then executed on the cluster. One of the key features is that this code is executed in a distributed manner across all the EMR nodes. Apache Spark provides a component on top of its core called Spark SQL, which provides access to both structured and semi-structured data and allows access to EMR via ODBC or JDBC.
Spark also provides you with the ability to write applications quickly in Java, Scala, Python, or R and then execute them via Spark, allowing you to process and access the EMR data. Impala provides interactive ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional RDBMSs. With Impala, you can query your data in HDFS or HBase tables, leveraging Amazon EMR's ability to process diverse data types and provide schema at query time. This typically enables you to use Impala to undertake interactive, low-latency queries. Impala also offers access to Amazon EMR data via standard SQL queries and ODBC and JDBC connections.
Apache Flink is a dataflow engine targeted at leveraging your Amazon EMR cluster to transform data that is streamed. Hue, which stands for Hadoop User Experience, is a browser-based graphical user interface which can be used with Amazon EMR. Hue provides query access to a number of the Hadoop ecosystem components; for example, it allows you to run interactive queries. Mahout is a machine learning library with tools for clustering, classification, and several types of recommenders, including tools to calculate most-similar items or build recommendations for users. Mahout employs the Hadoop framework to distribute calculations across a cluster, and now includes additional work-distribution methods, including Spark. Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, or via custom MapReduce applications.
A number of the Hadoop ecosystem applications you can install on your Amazon EMR cluster publish user interfaces as websites hosted on the master node. For security reasons, these websites are only available on the master node's local web server and are not publicly available over the internet. The Hadoop applications can also publish user interfaces as websites hosted on the core and task (slave) nodes. These websites are also only available on local web servers on the nodes.
As these web interfaces are isolated to the servers rather than being publicly accessible, you will typically need to implement SSH tunneling over a connection from your local client to allow access to these interfaces via a standard web browser. You can create your Amazon EMR cluster as either a long-running cluster or a transient cluster. Think of a long-running cluster as a typical database server that sits there and waits for jobs; when it receives a job, it processes it and then waits for the next job.
With a transient cluster, once the processing you've requested the cluster to run is completed, the cluster will automatically terminate itself. You define a long-running cluster by disabling the auto-terminate function, or by using the keep-alive parameter via the API. Typically, you will have Hadoop applications such as Hive or Pig installed on long-running clusters to submit jobs, either scripted or interactively.
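In the EMR API, this choice surfaces as the KeepJobFlowAliveWhenNoSteps flag in a RunJobFlow-style request. The sketch below builds the request as a plain dictionary so the shape is visible; the name and instance settings are placeholder values, not a complete, submittable request.

```python
# Sketch of how the long-running vs transient choice appears in an EMR
# RunJobFlow-style request. Only the keep-alive flag is the point here;
# the other values are illustrative placeholders.

def cluster_request(name: str, long_running: bool) -> dict:
    return {
        "Name": name,
        "Instances": {
            "InstanceCount": 3,
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            # True  -> long-running: cluster waits for more work when idle
            # False -> transient: cluster auto-terminates after its steps
            "KeepJobFlowAliveWhenNoSteps": long_running,
        },
    }

req = cluster_request("nightly-etl", long_running=False)
print(req["Instances"]["KeepJobFlowAliveWhenNoSteps"])  # False
```

A dictionary like this is what you would pass (with real instance settings) to an SDK call such as boto3's `run_job_flow`.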
There are two ways to process data in your Amazon EMR cluster. One is by submitting jobs or queries directly to the applications that are installed on your cluster. The other is by running steps in the cluster. For long-running clusters, you submit jobs by interacting directly with the software that is installed on your Amazon EMR cluster, for example, Hive.
To do this, you typically connect to the master node via a secure connection and access the interfaces and tools that are available for the software that runs directly on your cluster. For transient clusters, you can submit one or more ordered steps. Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.
An example of a four-step process would be: first, submitting an input data set for processing; second, processing the submitted data using a Pig program; third, processing a second input data set using a Hive program; and lastly, writing out the result data sets to Amazon S3. The steps are submitted with a specific sequence defined, and each step waits for the previous step to change to a state of completed before it will run.
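Those four steps could be expressed as EMR step definitions, the shape used by the AddJobFlowSteps API. The sketch below builds them as plain dictionaries; the script locations are made-up placeholders, and the exact command-runner arguments vary by EMR release, so treat the Args lists as illustrative rather than copy-paste ready.

```python
# Sketch of the four ordered steps as EMR-style step definitions.
# S3 paths are placeholders; Args are illustrative, not release-exact.

def make_step(name, args):
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",   # later steps depend on earlier ones
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

steps = [
    make_step("1-load-input",   ["s3-dist-cp", "--src", "s3://example/input", "--dest", "hdfs:///input"]),
    make_step("2-pig-program",  ["pig-script", "--run-pig-script", "--args", "-f", "s3://example/etl.pig"]),
    make_step("3-hive-program", ["hive-script", "--run-hive-script", "--args", "-f", "s3://example/report.q"]),
    make_step("4-write-output", ["s3-dist-cp", "--src", "hdfs:///output", "--dest", "s3://example/output"]),
]
print([s["Name"] for s in steps])
```

Submitted in this order, EMR runs each step only after the previous one reaches a completed state, exactly as the four-step example describes.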
It is important to remember that with a transient cluster, once the cluster is terminated, all data stored in the cluster itself is deleted; information stored in other locations, such as your Amazon S3 buckets, of course persists. It is also important to understand the lifecycle Amazon EMR follows in provisioning a cluster. First, Amazon EMR provisions a cluster with your chosen applications, such as Hadoop or Spark.
During this phase, the cluster state is starting. Next, any user-defined actions, called bootstrap actions, such as installing additional applications, run on the cluster. During this phase, the cluster state is bootstrapping. After all the bootstrap actions are successfully completed, the cluster state is running. The cluster sequentially runs all steps during this phase. After the steps run successfully, the cluster either goes into a waiting state or a shutting-down state, depending on whether you've defined it as a long-running or a transient cluster.

There are a number of use cases where Amazon EMR is the perfect processing solution, and a number where an alternative Amazon solution would potentially provide a better fit.
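The provisioning lifecycle just described can be summarized as a simple state sequence. The sketch below uses state names in the EMR console's vocabulary (STARTING, BOOTSTRAPPING, RUNNING, WAITING, TERMINATING), which is an assumption about naming rather than anything quoted from the course.

```python
# Sketch of the EMR cluster lifecycle: STARTING -> BOOTSTRAPPING -> RUNNING,
# then WAITING (long-running) or TERMINATING (transient). State names follow
# the EMR console's vocabulary and are assumptions of this sketch.

def lifecycle(long_running: bool):
    states = ["STARTING", "BOOTSTRAPPING", "RUNNING"]
    states.append("WAITING" if long_running else "TERMINATING")
    return states

print(lifecycle(long_running=True))   # ['STARTING', 'BOOTSTRAPPING', 'RUNNING', 'WAITING']
print(lifecycle(long_running=False))  # ['STARTING', 'BOOTSTRAPPING', 'RUNNING', 'TERMINATING']
```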
If you are dealing with large volumes of data, or data that is primarily semi-structured, then Amazon EMR is a good choice for your big data service. If you have traditionally structured data, where you need the data to be accessed by traditional BI tools and the data volumes are not too large, then Amazon Redshift is probably a more suitable data service. Typically, you would combine the two and have the large volume of data going to Amazon EMR and a subset of the data flowing to Amazon Redshift.
Shane has been immersed in the world of data, analytics and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you to create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and kiwifruit. However, you're more likely to see him partake of a good long black or an even better craft beer.