In this course, we will explore the Analytics tools provided by AWS, including Elastic MapReduce (EMR), Data Pipeline, Elasticsearch, Kinesis, Amazon Machine Learning, and QuickSight, which is still in preview mode.
We will start with an overview of data science and analytics concepts to give beginners the context they need to be successful in the course. The second part of the course will focus on the AWS offering for analytics, that is, how AWS structures its portfolio across the different processes and steps of big data and data processing.
As a fundamentals course, the requirements are kept simple so you can focus on understanding the different services from AWS. However, a basic understanding of the following topics is necessary:
- As we are talking about technology and computing services, general IT knowledge is necessary, that is, the basics of programming logic, algorithms, and learning or working experience in the IT field.
- We will give you an overview of data science concepts, but if these concepts are already familiar to you, it will make your journey smoother.
- It is not mandatory, but it would be helpful to have general knowledge about AWS, specifically how to access your account and how to use services such as S3 and EC2.
The following two courses from our portfolio can help you better understand the basics of AWS if you are just starting out:
If you have thoughts or suggestions for this course, please contact Cloud Academy at email@example.com.
Welcome to the AWS Analytics Fundamentals course. In this video, we are going to cover the Amazon Elasticsearch Service. By the end of this video, you'll understand the basics of Elasticsearch and be able to set up a simple cluster.
Amazon Elasticsearch Service is one of the newest services from AWS, launched in October 2015. It's a managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. But what exactly is Elasticsearch? Elasticsearch is a very popular open-source engine for search and analytics. Common use cases for Elasticsearch are log analytics, log forensics, security log analytics and event management, real-time application monitoring, mission-critical application alerting, and clickstream processing.
Now we are going to see a little bit more about Elasticsearch. Elasticsearch was not developed from scratch by AWS; it is not an AWS product. What AWS has done is automate the cluster configuration and infrastructure tasks for you. This is similar to what AWS did with the Hadoop framework, encapsulating it in the Elastic MapReduce service. So Elasticsearch was not created by AWS, but as it was a highly used tool and several customers asked for an easier way to deploy it, AWS listened to these demands and provided an automated and fully orchestrated Elasticsearch service for you.
Elasticsearch is usually combined with its companions, Logstash and Kibana, providing a full stack for analytics: Logstash acts as the data ingestion mechanism, Elasticsearch handles analytics, indexing, and searching, and Kibana provides visualization. This is commonly called the ELK stack. Among several use cases, Elasticsearch helps us solve full-text search, intrusion detection, and batch data analytics problems, and it works as a cluster for distributed analytics. As we have already seen, Amazon Elasticsearch was launched to reduce the complexity of the setup and operation tasks on infrastructure, usually some of the most time-consuming tasks in analytics. This allows data scientists and IT engineers to focus on the real problems, not the infrastructure itself. Elasticsearch also integrates well with IAM, with CloudWatch for detailed monitoring, and with CloudTrail for auditing; these are native AWS services.
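The ELK flow described above can be made concrete with a minimal Logstash pipeline. This is a hedged sketch, not taken from the course: the log path, endpoint hostname, and index name are all illustrative placeholders.

```conf
# Illustrative Logstash pipeline: ingest log lines and ship them to an
# Elasticsearch endpoint; Kibana then visualizes the resulting indices.
input {
  file { path => "/var/log/app/*.log" }                  # hypothetical log source
}
output {
  elasticsearch {
    hosts => ["https://my-es-endpoint.example.com:443"]  # placeholder endpoint
    index => "app-logs-%{+YYYY.MM.dd}"                   # one index per day
  }
}
```

With a pipeline like this in place, Logstash handles ingestion, Elasticsearch handles indexing and search, and Kibana reads the `app-logs-*` indices for visualization.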
We can also scale Amazon Elasticsearch clusters by adding new nodes and storage on the fly. The main features of Amazon Elasticsearch are its simple cluster and infrastructure deployment and configuration; support for the ELK stack (as I said: Elasticsearch, Logstash, and Kibana), keeping compatibility with Logstash and being fully integrated with Kibana; improved security through integration with IAM, CloudWatch, and CloudTrail; and integration with AWS services such as S3, Kinesis Streams, and DynamoDB.
The image here shows an Elasticsearch domain. An Elasticsearch domain encapsulates the Amazon Elasticsearch engine: the instances that process Amazon Elasticsearch requests, the data that you want to index and search, snapshots of the domain, access policies, and metadata. You can create an Amazon Elasticsearch domain by using the AWS Management Console or the AWS SDKs. So this is basically the main logical organization for a domain. We have the instances, which are integrated with CloudWatch to present monitoring metrics, and with CloudTrail to record all API calls and actions performed against the domain for auditing.
In front of the instances, we have an Elastic Load Balancer to balance the load across them and also to provide an endpoint for access. In front of the ELB we have IAM for access control, so we can control in a very granular way the access to the cluster, the API calls, the indexing access, and so on, ensuring compliance and security. Optionally, we can configure Route 53 as a DNS service to provide a customized hostname for the cluster.
Now let's go to the practical part, where I show you the creation process for an Elasticsearch domain and the cluster itself, and provide some more detailed information. Okay, first of all we're going to open the AWS console. Okay, now we are in the console. We are going to go to the Analytics area and find the Elasticsearch service. Click on it and we go to the Getting Started page. If this is your first time, this is the page you will get. Let's click here and we are directed to the Create Elasticsearch domain page.
First thing, we have to type the domain name. The domain name has some naming restrictions. Let's type anything here. Go next. It will probably throw, yes, an error. Why? Because the first letter is a capital letter. This is not allowed; as it says here, the domain name must start with a lowercase letter, and be at least 3 and no more than 28 characters long. Valid characters are a to z (lowercase only), 0 to 9, and hyphen. So we will name it cloudacademy-test. This should be allowed.
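The naming rule just quoted can be expressed as a simple check. This is a minimal sketch, not part of the console itself; the function name and regular expression are my own encoding of the rule (lowercase first letter, 3 to 28 characters, only a-z, 0-9, and hyphen):

```python
import re

# Encode the naming rule: a lowercase letter, then 2 to 27 more characters
# drawn from lowercase letters, digits, and hyphen (3 to 28 total).
DOMAIN_NAME_RE = re.compile(r"^[a-z][a-z0-9-]{2,27}$")

def is_valid_domain_name(name: str) -> bool:
    """Return True if `name` satisfies the Amazon ES domain naming rule."""
    return DOMAIN_NAME_RE.fullmatch(name) is not None

print(is_valid_domain_name("Cloudacademy-test"))  # False: capital first letter
print(is_valid_domain_name("cloudacademy-test"))  # True
```

This mirrors the console's validation: the first attempt in the demo fails only because of the capital first letter.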
Now we are going to the Configure cluster area. Here we can configure all the settings for our cluster. We can select from a set of EC2 instances of different sizes. As you can see, they are especially developed for Elasticsearch. We are going to get a small three-node cluster. We can have up to 10 instances per cluster; this is another important hint. Then there is the dedicated master. The master is responsible for the metadata and for controlling the stability of the cluster, and as mentioned, we should have at least three dedicated masters for each production domain. Okay, Enable dedicated master is checked, with the count set to two. Enable zone awareness will place your instances in different Availability Zones.
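The same cluster settings can also be expressed through the API. This is a hedged sketch assuming the boto3 SDK's `es` client; the instance type and counts mirror the demo and are illustrative, and the snippet only builds the configuration dictionary rather than calling AWS:

```python
# Sketch of the demo's cluster settings as the ElasticsearchClusterConfig
# structure accepted by boto3's es.create_elasticsearch_domain.
# Illustrative values only; no API call is made here.
cluster_config = {
    "InstanceType": "t2.micro.elasticsearch",      # a small ES-specific instance type
    "InstanceCount": 3,                            # three data nodes
    "DedicatedMasterEnabled": True,
    "DedicatedMasterType": "t2.micro.elasticsearch",
    "DedicatedMasterCount": 2,                     # the demo sets two; three is safer for production
    "ZoneAwarenessEnabled": True,                  # spread instances across Availability Zones
}

# To actually create the domain, you would pass this to boto3, e.g.:
#   boto3.client("es").create_elasticsearch_domain(
#       DomainName="cloudacademy-test",
#       ElasticsearchClusterConfig=cluster_config,
#   )
print(cluster_config)
```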
Next is the storage configuration; here you can keep the instance store (the default). In fact, we selected small, and small has no instance store, so let's change it to medium so we have an instance store. This also affects the cost: in some cases, it's better to get a larger instance that has an instance store than to use EBS storage, which incurs additional costs. And for snapshots, we can select the hour at which we want an automated snapshot of our cluster to be taken. Let's keep the default.
The access policy. For our purpose, we'll allow access from everywhere: allow open access to the domain. Okay, this is not recommended for a production cluster, as it will be open to the world. And here we have the review page. We can change the settings now if we want or, if everything is okay, we can confirm and create. The domain status is Loading, and here we are in the dashboard, where we can see our only domain in the Loading state. After a while, our cluster is ready for indexing and searching. As you can see, the domain status changed to Active and we received two endpoints: one for Elasticsearch itself, and the other for the Kibana visualization platform.
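The "allow open access" option generates a resource-based policy along these lines. This is a hedged sketch: the region, account ID, and domain name in the ARN are placeholders, and, as noted above, a wildcard principal like this is not recommended for production.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "*" },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:123456789012:domain/cloudacademy-test/*"
    }
  ]
}
```

For a production domain, you would replace the wildcard principal with specific IAM users or roles, or restrict access by source IP.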
Let's explore a bit what we have here. In this first tab we have the cluster health: the number of nodes is five. As you remember, we have three data nodes and two master nodes to control stability. We are not going into the sharding information, as this is an advanced topic. Then we have the indices: I have created a new index called movies with two documents in it, and the mappings contain these [inaudible 0006:59], plus the default Kibana index as well. And for monitoring, here we are integrated with CloudWatch: the cluster's overall status, the node count, free storage space, CPU utilization, memory, and so on. So you can have, at a glance, an overview of the entire cluster.
Now let's play a little bit with Elasticsearch itself. Remember that when we created our cluster, we kept it open to the world, so anybody could make queries or searches against it. Here in this console screen, you can see that I have made some curl requests; using curl, I am sending data to our Elasticsearch cluster, and as it's open to the world, it does not require any kind of authentication. As I said before, this is highly discouraged for production loads, but for testing it's okay.
What have I done here? I have inserted two JSON documents. These JSON documents contain movie information. Here are the two requests. So if I click on the endpoint of our Elasticsearch cluster, I just receive some information about the cluster itself. And remember, we have the Kibana visualization platform. Here is where I can see and explore the results of the Elasticsearch indexing. For example, I already have an index pattern for movies, and I'll just show you a very simple visualization. I choose a metric count for the movies, and remember, we have two movies, so the count is two. We will not explore the Kibana visualization any deeper here, as this would require a longer explanation about Elasticsearch indexing and how Kibana searches it, but this topic will be covered in a future Cloud Academy course.
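The two document inserts can be sketched as follows. This is an illustrative reconstruction, not the exact commands from the demo: the endpoint, the movie contents, and the `movies/movie` index/type path are assumptions, and the snippet only builds the requests (printing the curl equivalents) rather than sending them.

```python
import json

# Build the two movie documents and the request targets used in the demo.
# The endpoint is a placeholder; substitute your own domain's endpoint.
ENDPOINT = "https://search-cloudacademy-test-xxxx.us-east-1.es.amazonaws.com"

movies = [
    {"title": "The Godfather", "year": 1972},  # illustrative documents; the
    {"title": "Goodfellas", "year": 1990},     # transcript doesn't name them
]

for doc_id, movie in enumerate(movies, start=1):
    url = f"{ENDPOINT}/movies/movie/{doc_id}"  # index/type/id path (ES 1.x/2.x style)
    body = json.dumps(movie)
    # Equivalent shell command (no auth, since the domain is open to the world):
    #   curl -XPUT '<url>' -d '<body>'
    print("PUT", url, body)
```

Because the access policy is wide open, no signing or authentication headers are needed, which is exactly why this setup is for testing only.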
Last but not least, how we delete the cluster. We delete our domain here, so we will not incur further charges for this cluster. If we have manually created snapshots, those snapshots must be deleted manually as well.
Okay, that was the focus of this session. I hope you have learned a bit about the Elasticsearch service. Thank you for watching, and see you in the next video.
Fernando has solid experience with infrastructure and application management in heterogeneous environments, and has worked with Cloud-based solutions since the beginning of the Cloud revolution. Currently at Beck et al. Services, Fernando helps enterprises make a safe journey to the Cloud, architecting and migrating workloads from on-premises to public Cloud providers.