In this course, we will explore the Analytics tools provided by AWS, including Elastic Map Reduce (EMR), Data Pipeline, Elasticsearch, Kinesis, Amazon Machine Learning, and QuickSight, which is still in preview mode.
We will start with an overview of Data Science and Analytics concepts to give beginners the context they need to be successful in the course. The second part of the course will focus on the AWS offering for Analytics, that is, how AWS structures its portfolio across the different processes and steps of big data and data processing.
As a fundamentals course, the requirements are kept simple so you can focus on understanding the different services from AWS. But, a basic understanding of the following topics is necessary:
- As we are talking about technology and computing services, general IT knowledge is necessary, that is, the basics of programming logic, algorithms, and learning or working experience in the IT field.
- We will give you an overview of data science concepts, but if these concepts are already familiar to you, it will make your journey smoother.
- It is not mandatory, but it would be helpful to have general knowledge of AWS, more specifically of how to access your account and services such as S3 and EC2.
The following two courses from our portfolio can help you better understand the basics of AWS if you are just starting out:
If you have thoughts or suggestions for this course, please contact Cloud Academy at email@example.com.
Welcome to the AWS Analytics Fundamentals course. In this video, we will cover the Amazon Machine Learning Service. By the end of this video, you will have seen an overview of the Machine Learning Service, as well as a demo showing a complete Machine Learning model.
Amazon Machine Learning allows us to build and train predictive applications and host scalable predictive models in a cloud-based environment. Remember our analytics concepts video. In that video, we showed some very important concepts, including the analytics evolution, where we saw three categories of analytical methods: batch analytics, usually used for reporting and identifying patterns in historical data; real-time analytics, focusing on alerting; and predictive analytics, whose focus is predicting future events based on past occurrences, what we call forecasting. Amazon Machine Learning, or Amazon ML, falls into this third category, predictive analytics. But how does Amazon ML do it? Amazon ML is based on historical data. This means you must have past observations of the problem so we can predict similar future cases. We are going to see later that there are some best practices to get the best from your model, like processing the input and making decisions about the training set.
We will now provide some basic general Machine Learning concepts so you can become familiar with the logic behind ML. As we have seen before, Amazon ML algorithms require, as input, historical data from situations similar to the ones you want to predict. This means that they require valid inputs and outputs for past situations. For example, if you want to detect spam in a mailbox, you need previous messages tagged as spam so the algorithm can learn the patterns, common wording, and common subjects of valid spam messages, and so be able to detect future spam messages correctly. In fact, spam detection is one of the most traditional machine learning problems, and most, if not all, spam detection techniques used worldwide rely on machine learning.
Each historical event is called an observation. An observation contains attributes, each of which is a uniquely named property within the observation. In tabular formatted data, such as spreadsheets or comma separated values (CSV) files, the column headings represent the attributes and the rows contain the values for each attribute. The diagram in this slide shows the Machine Learning process at a very high level: starting with historical data for a problem and the right ML algorithm, we build the prediction model, which can then be used to predict future events related to the learned data.
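To make the observation and attribute vocabulary concrete, here is a minimal Python sketch that reads a tiny CSV and shows the header as attribute names and each row as one observation. The column names and values are invented for illustration and are not taken from the course data set.

```python
import csv
import io

# Hypothetical sample data: column names and values are made up for illustration.
SAMPLE_CSV = """age,education,hours_per_week,income_over_50k
39,Bachelors,40,0
52,HS-grad,45,1
"""


def load_observations(csv_text):
    """Return one dict per row, keyed by the attribute names in the header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(reader)


observations = load_observations(SAMPLE_CSV)
attributes = list(observations[0].keys())

print("Attributes:", attributes)
print("First observation:", observations[0])
```

Each dictionary is one observation, and its keys are the attributes taken from the column headings.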
It's important to note that you cannot predict what you haven't learned before. This means that if my input data refers to stock indexes for a specific set of stocks, I cannot predict the behavior of stocks that were not in the training set. This is very similar to human behavior. You cannot say anything about a subject that you have never seen before. As we already said, you can only apply Amazon ML to problems where we have valid examples of questions and answers, that is, valid observations.
Amazon ML is based on supervised learning. This means the learning algorithm needs a training set with clearly labeled inputs and desired outputs, and on this basis the algorithm learns. The Amazon ML engine currently supports three types of supervised learning algorithms: binary classification, which, as the name says, predicts between two valid values, usually true or false; multiclass classification, which predicts one output from a limited set of more than two possible values; and regression, which predicts a numeric value that is not restricted to any specific set.
The cartoon below shows the two different learning models. As we have already told you, supervised learning requires labeled valid inputs. This is represented by the teacher: the algorithm already knows what a valid input is and what the expected output is. In unsupervised learning, on the other hand, you have all the information, the data, and the algorithm needs to infer structure from it or cluster it on its own.
It's also important to know when Machine Learning is not a good option to solve your problem. This simple diagram helps us explain it. As we said before, you need to have historical data about your problem, otherwise you cannot benefit from Amazon ML. With historical data in hand and the problem defined, you have to think about whether your problem is easily coded. When you can get answers with a simple set of conditions and loops, and these conditions do not grow with time or change too often, then you probably don't need to use Amazon ML. The same applies if the problem rules are not overlapping or too complex. In these cases, it's not recommended to use Amazon Machine Learning.
On the other hand, if your problem requires too many conditions and changes often, then Machine Learning can be suitable for you. Take, for example, credit card fraud detection, where we want to positively identify a set of transactions as fraud. An ML model requires historical user behavior, so all previous credit card use. Based on the past observations, such as the average price of the items, the frequency of transactions, and the location, the ML model can predict whether a current request is possibly fraud or not. This would be just too complex to code with deterministic rules. Another important factor is that supervised machine learning, as we have in Amazon ML, requires a consistent set of observations. This means your historical data must be accurate, or the entire application will suffer from poor prediction performance, and you'll see a lot of false positives.
Now we are going to get into the specifics of the Amazon Machine Learning Service, starting with a simple overview. How can we get a highly effective Machine Learning application? First of all, we have to define a problem. If you watched our previous videos, you have already heard this sentence several times. And yes, a good problem definition really matters, as the problem dictates the data we need and the questions we want to answer. After the problem is defined, we need to organize and prepare the historical data. After this, we have to clean this data, removing the incomplete sets and fixing inconsistencies.
Amazon Machine Learning provides a series of insights and hints to help you get a clean input set of observations. With a consistent input, we create a prediction model. We first create a data source, which is composed of the schema and the data model for our application, and split the input observations into training and evaluation sets. You'll see more about this data split later. After the model is created and evaluated, we can start to use it for new predictions with other observations.
Now we are going to investigate a bit deeper how we can create an Amazon ML model and apply it to our business problems. We will go through each of the three main steps, giving you an overview of every task required to achieve a highly effective ML model. Step one deals with data source creation and data preparation. First, you have to clean and prepare your data. Input data is the data that you use to create a data source. You must save your input data in comma separated values format, the traditional CSV file. Each row in the CSV file is a single data record or observation. Each column in the CSV file contains an attribute of the observation.
Also take care to have consistent information in your input. For example, if your input has an attribute named city, avoid having different names for the same city, like New York City in one record, just New York in another, and just the letters NYC in a third. These will be viewed as three different cities. Make sure to clean the data and format it properly, upload it to S3, then create a data source. A data source contains information about your schema and data model, and it also computes a series of descriptive statistics. It doesn't contain your data, only a pointer to the S3 location.
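The city example can be cleaned up before the upload with a few lines of Python. This is only a sketch: the alias table is hypothetical and would need to be filled in with the variants that actually appear in your own data.

```python
# Hypothetical alias table: extend it with the variants found in your own data.
CITY_ALIASES = {
    "new york city": "New York",
    "new york": "New York",
    "nyc": "New York",
}


def normalize_city(raw):
    """Map known spelling variants of a city to a single canonical name."""
    # Lowercase and collapse whitespace so that " new  york" matches "New York".
    key = " ".join(raw.strip().lower().split())
    # Unknown cities are kept as they are, only trimmed.
    return CITY_ALIASES.get(key, raw.strip())


records = ["New York City", " new york", "NYC", "Boston"]
cleaned = [normalize_city(r) for r in records]

print(sorted(set(cleaned)))
```

After normalization, the three spellings collapse into a single city value instead of three.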
In the second step, we are going to create the ML model and perform supervised training. After the training, the evaluation will take place. For training and evaluation, your input data is split into two sets: the first set is used for training and the second for evaluation. Evaluation is the validation of the training. You validate the training by sending observations, and Amazon Machine Learning will verify whether the training is satisfactory. An important hint: Amazon Machine Learning can train on data sets of up to 100 gigabytes.
Another important remark is about errors. If your training set has more than 10,000 records with errors, or errors in more than 10% of the total data set, the training will stop. To avoid these kinds of errors, you have to clean your data before the data source creation and rely on the data insights to verify inconsistencies such as empty fields or wrong data types.
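As a rough pre-flight check, you could count problem rows yourself before creating the data source. The sketch below uses the two limits quoted above; the definition of a bad row (an empty field or a wrong number of fields) is a simplifying assumption and not the service's actual validation logic.

```python
import csv
import io

MAX_BAD_RECORDS = 10_000  # limit quoted in the narration
MAX_BAD_FRACTION = 0.10   # limit quoted in the narration


def count_bad_rows(csv_text):
    """Count rows with an empty field or a wrong number of fields.

    This is a simplified stand-in for real validation.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    bad = sum(
        1
        for row in data
        if len(row) != len(header) or any(cell.strip() == "" for cell in row)
    )
    return bad, len(data)


def would_stop_training(bad, total):
    """Return True if either limit from the narration is exceeded."""
    return bad > MAX_BAD_RECORDS or (total > 0 and bad / total > MAX_BAD_FRACTION)


# Hypothetical sample: one of four records has an empty city field.
sample = "age,city\n30,Boston\n41,\n25,Austin\n37,Denver\n"
bad, total = count_bad_rows(sample)
print(bad, total, would_stop_training(bad, total))
```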
In the third step, you have a trained and evaluated ML model, so you're ready to start making predictions. They can be real-time or batch predictions. Real-time predictions are possible through an endpoint, where you can perform up to 200 transactions per second, with an average latency of 100 milliseconds.
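For readers who want to call the real-time endpoint from code, here is a hedged sketch built around the `predict` operation of the boto3 `machinelearning` client. The model ID, endpoint URL, and record fields are placeholders, and the fake client is only an offline stand-in that mimics the shape of the response so the example runs without AWS credentials.

```python
def predict_realtime(client, model_id, endpoint_url, record):
    """Call the real-time Predict operation and return the Prediction part."""
    response = client.predict(
        MLModelId=model_id,
        # The API expects attribute values as strings.
        Record={name: str(value) for name, value in record.items()},
        PredictEndpoint=endpoint_url,
    )
    return response["Prediction"]


class FakeClient:
    """Offline stand-in for the real client; the returned values are made up."""

    def predict(self, MLModelId, Record, PredictEndpoint):
        self.last_record = Record
        return {"Prediction": {"predictedLabel": "0", "predictedScores": {"0": 0.12}}}


# With real AWS credentials, the client would instead be created like this:
#   import boto3
#   client = boto3.client("machinelearning")
client = FakeClient()

# Placeholder model ID, endpoint URL, and attributes.
result = predict_realtime(
    client,
    model_id="ml-EXAMPLE",
    endpoint_url="https://example.invalid",
    record={"age": 39, "education": "Bachelors"},
)
print(result["predictedLabel"])
```

Passing the client in as a parameter keeps the function easy to try out locally before pointing it at a real endpoint.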
For batch predictions, you need to make your request through the API and specify a bucket for the output.
Now we're going to move on to our demo session. In this demo, we are going to build a Machine Learning prediction application from scratch. We will create a prediction model that identifies the salary range of individuals based on a series of attributes like age, education, current work sector, and so on. This is based on real USA census data available from the University of California, Irvine. They have some very nice data sets for machine learning, so later I'll post the link.
Here we can see the raw data for this experiment. You might notice that the last column is not represented purely in numbers. We will change it to zero for an income lower than $50,000 per year, and one for an income larger than $50,000 per year. This will enable binary prediction for our model. The data set contains some missing values for other attributes, but they do not compromise the training, as they are less than 10%. In real world scenarios, try as much as possible to reduce and mitigate these kinds of inconsistencies.
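The label conversion described here can be sketched in a few lines of Python. The label strings `>50K` and `<=50K` (optionally followed by a period) are an assumption about how the raw file spells the income column; adjust them to match the file you actually download.

```python
def encode_income(label):
    """Map the textual income label to 1 (over $50,000) or 0 (otherwise).

    The label spellings are assumed, not confirmed by the narration.
    """
    cleaned = label.strip().rstrip(".")
    if cleaned == ">50K":
        return 1
    if cleaned == "<=50K":
        return 0
    raise ValueError(f"unexpected income label: {label!r}")


def encode_row(row):
    """Replace the last field of a row (the income label) with 0 or 1."""
    return row[:-1] + [str(encode_income(row[-1]))]


# Hypothetical row, shortened to a few attributes.
print(encode_row(["39", "Bachelors", "40", " >50K"]))
```

Raising an error on unknown labels is a deliberate choice here, so that unexpected values surface during preparation rather than during training.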
Now let's build our Machine Learning model. First we have to open the AWS console and go to S3. In our example, we'll create a folder named ML inside our bucket to store our training data set, and upload our CSV file containing the data. After the upload, we go to the Machine Learning console. If this is your first time, you'll see the get started page. You can click on it and follow along. First we have to create our data source. To do this, we click on the Amazon Machine Learning menu and select data sources. Then we go to the create new button and select data source. We do this step by step, first creating a data source and later the Machine Learning model, so we can have a complete overview of the process.
For the input data, you can currently select from two sources: an S3 bucket CSV file or a Redshift query result. For our purposes, we will use the S3 bucket. In the S3 location field, you have to type the entire CSV file location. AWS helps us by automatically suggesting the available paths and buckets. Then we have to type a data source name to identify our data source. After this is done, we can click on verify. Amazon Machine Learning automatically validates the input data and discovers the data model.
As the validation was successful, we can click on continue to go to the schema discovery area. As we said, Amazon Machine Learning discovers the data types and schema information for you. It's important to set the column names identification option to yes. This will take the headers from your CSV and label the attributes accordingly. If the attributes and data types are correct, we can click on continue.
Here we select which attribute we want to predict. In our case, it's the salary year variable, which, once clicked, is automatically identified as binary, with the binary classification method assigned. We can then click on continue, and our data source will be in a pending state. This means it's being created by Amazon Machine Learning.
While it's pending, we can already go further and create our Machine Learning model. Go again to the Amazon Machine Learning menu and select ML models. On the ML models page, we click on the create a new ML model button. First we have to select our input. In our case, we already have a data source created, so we can select this option and find our new data source in the pane below. We can already use it even if it still has the pending marker. With the data source selected, we click continue. The summary page for the data source appears, and here we can just click on continue again.
Then we arrive at the ML model settings. Here we can see that it detected the model type, in our case binary. The model identified the field we selected for prediction, the salary year. We can also set a name for the ML model and the evaluation. For training and evaluation, the data source will be split in two parts: 70% will be used for training and 30% for evaluation. You can also choose between default and customized settings. We proceed and click on review, where we see the last page before model creation. Here we can review all the settings applied, and then hit finish. Now, in normal situations, we could grab a coffee and walk around, because the model takes a while to be created. But we will skip that for you and go directly to the ML predictions.
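To picture what a 70/30 split does, here is a small, self-contained Python sketch. It shuffles the observations before cutting, which is an assumption made for illustration; how the service itself orders and splits the data is not described in this video.

```python
import random


def split_observations(rows, train_percent=70, seed=42):
    """Split rows into (training, evaluation) lists by percentage."""
    shuffled = list(rows)
    # Shuffling with a fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating point rounding surprises.
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten observations
train, evaluation = split_observations(rows)

print(len(train), "training observations,", len(evaluation), "for evaluation")
```

The key property is that no observation ends up in both sets, so the evaluation measures how the model handles records it has not seen during training.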
Our model is ready. Now we can start making predictions with it. Wait, not yet. Let's first take a look at the model performance. We go to the ML model, then we select evaluations and explore performance. We will not go into details here, as this is a fundamentals course. What's important to know here is the chart. This chart shows the prediction power. You can move this marker in the middle up or down according to your needs. It can be adjusted to define what's accepted as a false reply and what's accepted as a true reply. We will keep it as it is and move on to make the predictions.
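The marker on the chart works like a cut-off applied to the model's raw score. The sketch below assumes a score between zero and one and a default cut-off of 0.5; the scores are invented, and the point is only to show how moving the cut-off changes which replies count as true.

```python
def label_from_score(score, cutoff=0.5):
    """Return 1 when the score reaches the cut-off, otherwise 0."""
    return 1 if score >= cutoff else 0


# Hypothetical raw scores for four predictions.
scores = [0.12, 0.48, 0.51, 0.93]

default_labels = [label_from_score(s) for s in scores]
lenient_labels = [label_from_score(s, cutoff=0.4) for s in scores]
strict_labels = [label_from_score(s, cutoff=0.9) for s in scores]

print(default_labels)
print(lenient_labels)
print(strict_labels)
```

A lower cut-off accepts more records as true at the cost of more false positives, and a higher cut-off does the opposite.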
For our testing purposes, we will use the try real-time predictions link. For real world situations, you will have to create an endpoint for your real-time predictions or submit a batch prediction job to the API. On the try real-time predictions page, we see a form containing the attributes needed for our prediction. We can also submit a record in CSV format. In our case, we will submit a CSV formatted record. We will take a record that was not included in the training set. This is very important, as our goal is to predict different or new cases, right? We take one record from the testing data set provided by the University of California at Irvine. This is totally separate from the training set we used. We copy the CSV line, taking care to remove the last column, as this is what we want to predict, and paste it in the appropriate field. Then we click the create prediction button.
Well, we get a reply. The predicted value is zero. As you can see, the raw score is a little bit larger than zero, zero point something, and this something is nearer to zero than to one. As we defined before, zero or near zero means less than $50,000 per year. Let's confirm this by going back to the data set. Here in the original data set we can see that this record is indeed labeled as less than $50,000 per year. This means our model is behaving correctly, giving the right answers for our observations. So we could build a successful, working ML model in a couple of minutes. That's quite easy, right?
Now it's your turn to proceed and learn by practicing and checking out all the resources from CloudAcademy. Some recommendations are the blog series covering Machine Learning topics, and the advanced training covering the same topic. Thanks for watching this video. I hope you have learned something and enjoyed it. See you in the next one.
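Removing the last column from a CSV line by hand is easy to get wrong, so here is a small Python sketch that does it with the standard `csv` module. The sample line is hypothetical and shortened to a few attributes.

```python
import csv
import io


def strip_target(csv_line):
    """Return the CSV line without its last field (the value to predict)."""
    # skipinitialspace drops the blank that follows each comma in the input.
    row = next(csv.reader(io.StringIO(csv_line), skipinitialspace=True))
    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow(row[:-1])
    return out.getvalue()


# Hypothetical test record; the last field is the known answer.
line = "25, Private, 11th, 40, <=50K."
print(strip_target(line))
```

The known answer stays in the original file, so it can still be compared with the prediction afterwards.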
About the Author
Fernando has solid experience with infrastructure and application management in heterogeneous environments, and has been working with Cloud-based solutions since the beginning of the Cloud revolution. Currently at Beck et al. Services, Fernando helps enterprises make a safe journey to the Cloud, architecting and migrating workloads from on-premises to public Cloud providers.