In this course, we will explore the analytics tools provided by AWS, including Elastic MapReduce (EMR), Data Pipeline, Elasticsearch, Kinesis, Amazon Machine Learning, and QuickSight, which is still in preview mode.
We will start with an overview of data science and analytics concepts to give beginners the context they need to be successful in the course. The second part of the course will focus on the AWS offering for analytics, that is, how AWS structures its portfolio around the different processes and steps of big data and data processing.
As a fundamentals course, the requirements are kept simple so you can focus on understanding the different services from AWS. But, a basic understanding of the following topics is necessary:
- As we are talking about technology and computing services, general IT knowledge is necessary, that is, the basics of programming logic, algorithms, and learning or working experience in the IT field.
- We will give you an overview of data science concepts, but if these concepts are already familiar to you, it will make your journey smoother.
- It is not mandatory, but it would be helpful to have general knowledge of AWS, specifically how to access your account and services such as S3 and EC2.
The following two courses from our portfolio can help you better understand the basics of AWS if you are just starting out:
If you have thoughts or suggestions for this course, please contact Cloud Academy at email@example.com.
Welcome to the AWS Analytics Fundamentals course. In this video, we will cover fundamental concepts needed for a better understanding of the AWS Analytics architecture and services. By the end of this video, you'll be able to identify different data types and categories, classify big data problems, and select the best AWS services to solve a problem.
Basically, analytics or data analytics is the science of data transformation: transforming data or raw data into meaningful information and insights. Here we refer to data as any input you have, like a spreadsheet, a CSV file, historic sales information, a database with tables about your college grades, and all [inaudible 00:00:42] structured database, raw research data, text file, book search and so on. With this basic concept in mind, let's explore the analytics concept a bit further.
Everything starts with the questions about the problems you have. We'll talk a lot during this course about asking the right questions about the problem we want to solve, so we can choose the right tools and collect or clean the data in a proper fashion. As an input for our analytics, we have the data, which can be organized into different categories and types. For example, we have qualitative and quantitative data, and a classification into structured and unstructured data. Don't panic. These concepts might appear a little bit strange now, but we'll explain them in a minute.
Then we have the analytics process itself, which takes this data as input, applies statistical and mathematical models, classifies, extracts correlations, and organizes the data in order to answer your questions. We will cover the analytics solutions from AWS in the next videos.
So let's talk a little bit about the input. We have two basic data types by which the data is organized: quantitative and qualitative. Quantitative, as the name says, refers to numbers, the amount of a certain value, like the number of citizens in a given geographical area. Qualitative data refers to attributes of the population that are not expressed directly in numbers, like eye color, satisfaction levels and so on, which qualify the attribute. We also have three main classifications for the data format: structured, semi-structured, and unstructured data. We will look at each of these kinds in detail.
Structured data, as the name says, refers to data with a defined data model, like SQL databases, where we have tables and a fixed database model and schema. On AWS, for example, RDS, the Relational Database Service, is a very complete example of a structured store.
With semi-structured data, we basically have a flexible data model or tagging mechanism that allows semantic organization and some kind of hierarchy discovery from the data without the fixed and rigid rules of a SQL database, for example. XML, JSON, and CSV files are good examples of semi-structured data. NoSQL databases can also be structured, but usually they are used in a flexible way to work around the limitations of a SQL database. DynamoDB allows each record to have a different number of columns but gives fixed indexes for searching. This provides a very flexible schema.
And we have unstructured data, where all kinds of text information without a data model are classified. Here we have all kinds of documents without a proper data model or schema, like books, natural language processing, and all sorts of text processing.
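To make the three format classes more concrete, here is a minimal Python sketch using only the standard library. The table, field names, and sample sentence are invented for illustration and do not come from the course; an in-memory SQLite database stands in for a relational store such as RDS.

```python
import json
import sqlite3

# Structured: a fixed schema, every row has exactly the same columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grades (student TEXT NOT NULL, course TEXT NOT NULL, grade REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO grades VALUES (?, ?, ?)",
    [("ana", "math", 9.5), ("bo", "math", 7.0)],
)
structured_rows = conn.execute(
    "SELECT student, grade FROM grades ORDER BY student"
).fetchall()

# Semi-structured: self-describing records whose fields may differ per record.
raw_records = [
    '{"id": 1, "name": "ana", "tags": ["vip"]}',
    '{"id": 2, "name": "bo", "email": "bo@example.com"}',
]
semi_structured = [json.loads(raw) for raw in raw_records]
all_keys = sorted({key for record in semi_structured for key in record})

# Unstructured: free text with no data model; any structure must be derived.
unstructured = "Ana wrote that the course was great, but Bo found the pace too fast."
word_count = len(unstructured.split())

print(structured_rows)
print(all_keys)
print(word_count)
```

Note how the second JSON record carries an `email` field that the first one lacks, which a fixed SQL schema would not allow without a schema change.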
Data generation has grown exponentially in recent years. We generate data from the moment we wake up to the moment we go to bed, and even while we sleep, sensors can be collecting data from our body and environment to improve a series of apps and services. And this is just the beginning of the Internet of Things era. We're generating too much data, much more than we can probably analyze.
And now, please take a careful look at the graphic below and pay attention to the information in the dashed line. The growth in the number of data scientists and analysts is much lower than the growth in the amount of generated data that needs proper analysis. What a nice opportunity for us, right? So let's move forward and explore the possibilities.
To be classified as a big data problem, a problem usually must pass the three Vs challenge. This is not a general rule, but it helps us define whether our problem needs to be solved by big data tools or whether we can handle it with traditional data parsing and programming.
The first is volume, which refers to the size of the data set. Size matters when deciding on the right tool to analyze it. Usually, a big data problem will range in scale from gigabytes to petabytes of data.
We also have what we call the velocity of the data, which means how quickly you need to get answers from your data and is also related to the age of your data: for example, historical records from past years versus real-time alerting and information. This has a great influence on the toolset used to analyze it. Depending on whether you need answers in real time or a waiting time is acceptable for you, we can decide which tool and which technique is best for the problem.
And the last V, variety, refers to the data classification and the source of your data, whether structured or unstructured. More often than not, big data problems will have sources of several types, like BI platform data, blogs, CSV data, texts, and any sort of structured or unstructured data.
The evolution of analytics is not only temporal but also one of complexity. The power of the answers grows when we move from batch analytics to predictive analytics, but as always, your problem dictates the best method. Batch analytics processes historical data through a job and delivers the result data after a certain time. It's commonly used when we're doing reporting or some kind of BI analysis. So we have years of data in our data warehouse, in logs, or in spreadsheets, and we want to get reports or correlate this data to try to find interesting patterns, like potential sales, potential profits, or potential insights from research data.
This is quite different from real-time analytics, where we need to get answers now. You and your application cannot wait hours or days to get an answer, because losing time can have serious consequences. Examples are alerting in production facilities, fast reaction to security alerts from IDS systems, or reaction to an ad campaign.
And we also have the last type, predictive analytics, which takes historical data as input, learns from the history, and gives us predictions of future behavior. This is a common case for machine learning, like spam detection, where based on past behavior we identify malicious messages, predicting and avoiding the spread of spam messages.
First of all, when we have a problem, we usually define it as a question to start our quest into the analytics field. With our problem ready, we need the source data, our starting point. This can be a data warehouse database, relational database tables, a NoSQL store, a CSV file, books, text files, in short, every readable format we can use as an input. It will depend on our problem. If our problem is to count words in a book by Shakespeare, that's okay. If our problem is to analyze DNA to find patterns, that's also okay. So the type of our problem will dictate the data and also the processing algorithm.
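As a small illustration of how the question dictates both the input and the algorithm, the word-count problem mentioned above can be sketched in a few lines of Python. The function name `count_words` and the tokenization rule (lowercase letters and apostrophes) are assumptions made for this example, not something prescribed by the course.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Count how often each word occurs, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


sample = "To be, or not to be, that is the question."
counts = count_words(sample)

# Show the two most frequent words in the sample line.
print(counts.most_common(2))
```

On a whole book, the same function would simply be fed the full text read from a file; at a much larger scale, this kind of counting is the classic use case for batch tools such as Hadoop.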
With our input ready, we need to store it in a place accessible to the tools, then process it, analyze it, and show the results. The separation between process and analyze is based on the fact that some analytics solutions from AWS require prior cleaning or pre-processing of the data for better results and accuracy. Keep in mind the diagram from the previous slides: AWS has structured its portfolio around the collect, store, analyze, and visualize methodology, providing loosely coupled but integrated services for each step.
The first step in our analytics process is to collect the data that we'll use as input. Data collection is also called ingestion, which is the act of acquiring data and storing it for later usage. In data collection, we have different types of ingested data. We can have transactional data, which is represented by traditional relational database reads and writes. We can also have file ingestion, reading data from file sources such as logs, texts, CSV files, book contents and so on. And we can also have streamed data, represented by any kind of streamed content like click streams and events on a website, Internet of Things devices, and so on.
The toolset AWS currently offers can ingest data from many different sources. For example, with Kinesis Streams or Firehose we can easily work with streamed data from any source, even if it is on-premises. AWS Direct Connect, on the other hand, is a dedicated network line that connects our data center to the AWS cloud and delivers reliable and consistent high-speed performance. Snowball is also a new service from AWS. We also have Amazon Simple Queue Service, where we can queue our input data for further processing and analytics.
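As a rough sketch of what streamed ingestion can look like in code, the function below sends a single click event to a Kinesis stream through the `put_record` call of a boto3 Kinesis client. The stream name, the event fields, and the choice of `user_id` as partition key are assumptions made for this example. The client is passed in as a parameter so the function can be tried without AWS credentials.

```python
import json
from typing import Any, Dict


def send_click_event(client: Any, stream_name: str, event: Dict[str, Any]) -> Dict[str, Any]:
    """Send one click event to a Kinesis stream.

    `client` is expected to behave like boto3.client("kinesis").
    The event layout (user_id, page) is hypothetical.
    """
    return client.put_record(
        StreamName=stream_name,
        # Kinesis stores the payload as opaque bytes, so we serialize to JSON.
        Data=json.dumps(event).encode("utf-8"),
        # Records with the same partition key are routed to the same shard.
        PartitionKey=str(event["user_id"]),
    )


# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   send_click_event(kinesis, "example-clicks", {"user_id": 42, "page": "/home"})
```

Passing the client in, rather than creating it inside the function, is only a design choice to keep the sketch easy to test with a stand-in object.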
As we are focusing on the analytics services portfolio, out of all these ingestion tools we will only talk about Kinesis.
After the data is generated or acquired, we need to store it in a place accessible to AWS. This is usually called a data lake: the big pool where your services go to get the source data and to deliver the results. S3 is one of the core storage services from AWS, a highly durable object store integrated seamlessly with all the other AWS analytics services for data loading and storage. We'll make use of S3 throughout our next videos when demonstrating some features of the services. You can also keep data on RDS if it has a structured format, or on Redshift if it's there for data warehousing and BI. If it has no fixed data model but a basic structure, we can use DynamoDB, the NoSQL solution from AWS, to store it. And if your data is accessed very infrequently, we can use Glacier, the archive service from AWS.
Remember what we told you before: the right service or tool depends on the type of problem you have and on how quickly you need your answers. It depends on whether you can wait for a while, whether you need real-time answers to your problems, or, as the last option, whether you want to predict future behavior. With this said, we will now take a quick tour of the AWS tools for processing and analysis.
If your goal is to provide reports based on batch processing of historical data, or just to identify patterns in large data sets without the need for real-time answers, you can benefit from batch analysis through EMR, the Amazon Elastic MapReduce service, based on the proven Hadoop framework.
If you need real-time answers to your questions, or the results must be displayed on live dashboards, then you might benefit from stream-based processing with Amazon Kinesis, AWS Lambda, or the Elasticsearch service. Kinesis provides streams to load, store, and easily consume live stream data, and AWS Lambda can react to these stream events, executing functions you define. Elasticsearch, for its part, can index the data and give you insights about it, allowing you to query it with its flexible query language or display it on the integrated Kibana visualization platform.
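To give an idea of how a Lambda function can react to stream events, here is a minimal sketch of a Python handler for a Kinesis-triggered function. Kinesis delivers each record's payload base64-encoded under `Records[].kinesis.data`. The JSON payload layout (a `page` field) and the per-page counting are assumptions made for this example.

```python
import base64
import json
from typing import Any, Dict


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Count page views in one batch of Kinesis records.

    Each record payload is assumed to be a JSON object with a "page" field.
    """
    page_views: Dict[str, int] = {}
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded inside the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        click = json.loads(payload)
        page = click.get("page", "unknown")
        page_views[page] = page_views.get(page, 0) + 1
    return page_views
```

In a real deployment the counts would typically be written somewhere, for example to a dashboard store, rather than just returned; that part is left out here.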
For predictive analytics, where you need to forecast an event based on historical cases, you might benefit from Amazon Machine Learning to build highly available prediction applications. And how about Data Pipeline? We can use Data Pipeline to orchestrate all these services. As a framework for data-driven workflows, Data Pipeline can be used to automate your database loads. And for visualization, to get a nice overview and a nice dashboard of your results, you can make use of third-party software, or export the results to S3, back to Redshift, or to any relational database.
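The mapping from analytics type to service described in this tour can be summarized in a small lookup. This is only a sketch of the decision logic as presented in this video: the enum, the function name `suggest_services`, and the two boolean inputs are invented for illustration, and a real decision would weigh more factors.

```python
from enum import Enum
from typing import List


class AnalyticsType(Enum):
    BATCH = "batch"
    REAL_TIME = "real_time"
    PREDICTIVE = "predictive"


# Services named in this video for each type of analytics.
SERVICES = {
    AnalyticsType.BATCH: ["Amazon EMR"],
    AnalyticsType.REAL_TIME: ["Amazon Kinesis", "AWS Lambda", "Amazon Elasticsearch Service"],
    AnalyticsType.PREDICTIVE: ["Amazon Machine Learning"],
}


def classify(needs_answers_now: bool, needs_forecast: bool) -> AnalyticsType:
    """Pick an analytics type from two simplified questions about the problem."""
    if needs_forecast:
        return AnalyticsType.PREDICTIVE
    if needs_answers_now:
        return AnalyticsType.REAL_TIME
    return AnalyticsType.BATCH


def suggest_services(needs_answers_now: bool, needs_forecast: bool) -> List[str]:
    """Return the candidate services for the classified analytics type."""
    return SERVICES[classify(needs_answers_now, needs_forecast)]


print(suggest_services(needs_answers_now=False, needs_forecast=False))
```

For example, a monthly sales report that can wait for a nightly job would fall under batch analytics, while live alerting on sensor data would land in the real-time group.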
Also in the visualization spectrum, AWS launched its new BI tool, Amazon QuickSight, which allows you to create rich visualizations from your data, for example, from the previous analytics step. This service is still in preview as we speak. Elasticsearch also provides a visualization tool, Kibana, which is integrated with it to allow highly flexible and graphically rich visualizations. Or you can build your own visualization graphs based on the data in an S3 bucket, for example.
So, as we said in the beginning, this is just an overview of some concepts and of the AWS portfolio. In the next videos, we're going to go deeper into each of these analytics services, explaining how they can be consumed. Thank you. See you in the next video.
About the Author
Fernando has solid experience with infrastructure and application management in heterogeneous environments, and has been working with Cloud-based solutions since the beginning of the Cloud revolution. Currently at Beck et al. Services, Fernando helps enterprises make a safe journey to the Cloud, architecting and migrating workloads from on-premises to public Cloud providers.