When we saw how incredibly popular our blog post on Amazon Machine Learning was, we asked data and code guru James Counts to create this fantastic in-depth introduction to the principles and practice of Amazon Machine Learning so we could completely satisfy the demand for ML guidance within AWS.
James has got the subject completely covered:
- What exactly machine learning can do
- Why and when you should use it
- Working with data sources
- Manipulating data within Amazon Machine Learning to ensure a successful model
- Working with machine learning models
- Generating accurate predictions
Welcome to our series on Amazon Machine Learning. In this lecture, we'll get a brief overview of the course as a whole, and then we'll start with a basic introduction to Machine Learning. We'll cover the types of problems that machine learning can solve, and we'll discuss when you should use machine learning and when a simpler solution is better.
Finally, we'll dive deep into the Machine Learning services offered by Amazon as part of AWS. We'll talk about the steps required to frame a Machine Learning problem, what you need to do to prepare your data for use with Amazon Machine Learning, how to construct additional data features with feature processing, how to create a learning model in Amazon ML, and how to evaluate that model's performance.
If our model doesn't perform well, we'll talk about ways to improve the model. And we'll top everything off by generating predictions using both options available in Amazon ML, Batch Predictions and Real-Time Predictions.
So what is machine learning? Machine Learning helps you to use historical data to make better business decisions. ML algorithms discover patterns in data and construct a mathematical model based on these discoveries. This model can be used to make predictions based on probability.
Machine learning lets you make your decisions based on what is likely to happen, not on what has already happened.
More formally, machine learning is the ability of computer systems to gain knowledge from experience.
A machine learning system consists of data and the model that a machine learning algorithm creates from it. To create the model, you feed input data into statistical and data mining algorithms. In the past, building a machine learning solution required specialized knowledge and custom software. These factors combined to increase the expense of implementing these solutions. However, services like Amazon Machine Learning and other cloud-based offerings have opened the door to low-cost, accessible solutions.
Machine Learning differs from traditional business analytics in the type of questions that it can answer. A typical business analytics question concerns known information. For example, we might ask, what was the most common female baby name in the United States during the 21st century? It may take some time to gather, process and query this data, but in the end, you will find a definite answer to the question. Business analytics answers questions about past events.
On the other hand, machine learning models answer different types of questions, including predictive analytics and classification. Predictive analytics aims to find answers to business questions based on probability. For example, we might ask, how likely is it that a newborn baby girl will be named Elsa in 2015?
However, you might want to be even more specific and include demographic data about the parents. A simple example would be, how likely is it that a newborn baby girl with middle-class suburban parents in their early 20s will be named Elsa in 2015? Predictive analytics is used to analyze questions about expected or future events. Classification questions use experience with previous examples to classify new examples. For instance, you can take collections of animal pictures and divide them into three groups: cat, dog and other.
Then you can analyze these groups with a machine learning system to create a model that will be able to classify new pictures into these groups. Classification problems work with probabilities just like predictive problems do. In this case, a probability for each animal type is assigned to the new picture and the picture is categorized or labeled with the type that has the highest probability.
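The labeling step just described can be sketched in a few lines of Python. This is only an illustration: the `classify` function and the probability values are hypothetical and do not come from Amazon ML or any real model.

```python
def classify(probabilities):
    """Return the label whose probability is highest."""
    return max(probabilities, key=probabilities.get)


# Hypothetical probabilities a model might assign to one new picture.
scores = {"cat": 0.72, "dog": 0.21, "other": 0.07}

label = classify(scores)
print(label)
```

The picture is labeled with whichever group scores highest, so the quality of the label depends entirely on how good the model's probabilities are.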
To create machine learning systems capable of answering these types of questions, we follow three steps: gather data, create model and perform predictions. Data can be gathered from our existing historical systems, or we can design a new data capturing system to feed into our ML algorithm. We usually need to format and clean our data to make it ready as a learning source. This includes removing incomplete records or filling in valid data for missing or incorrect variables.
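As a rough illustration of that cleaning step, here is a minimal Python sketch of the two options mentioned: dropping incomplete records, or filling a missing value with a stand-in (here, the mean of the known values). The records, field names, and helper functions are all hypothetical, and the mean is just one possible fill strategy.

```python
from statistics import mean

# Hypothetical historical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 41, "income": 48000},
]


def drop_incomplete(rows):
    """Keep only the rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fill_missing(rows, field):
    """Replace missing values of one field with the mean of the known values."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


complete = drop_incomplete(records)    # two rows survive
filled = fill_missing(records, "age")  # all four rows kept, age filled in
```

Dropping rows is simpler but throws data away; filling keeps every record at the cost of introducing an estimated value.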
We may also combine, analyze, or group variables in order to express relationships not readily apparent in the data, such as sorting ages into bins or creating a product that combines job title and income.
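Both ideas can be sketched in plain Python. The bin edges, the income cutoff, and the function names below are assumptions chosen for illustration; they are not Amazon ML's own feature-processing syntax.

```python
def age_bin(age, edges=(18, 30, 45, 65)):
    """Map an age to a labeled range such as '30-44'."""
    if age < edges[0]:
        return f"<{edges[0]}"
    for low, high in zip(edges, edges[1:]):
        if low <= age < high:
            return f"{low}-{high - 1}"
    return f"{edges[-1]}+"


def income_band(income, cutoff=60000):
    """Reduce an income to a coarse band (the cutoff is hypothetical)."""
    return "high" if income >= cutoff else "low"


def job_income_feature(job_title, income):
    """Combine job title and income band into a single new feature."""
    return f"{job_title}_{income_band(income)}"


print(age_bin(34))
print(job_income_feature("engineer", 75000))
```

Binning lets a model treat everyone in an age range alike, and the combined feature lets it learn that a given job title means something different at different income levels.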
Next, we select an algorithm to analyze the data and produce a model. We train a model by feeding the input data into the machine learning algorithm. In order to evaluate our model, we hold back some of the data where we already know the correct answer. Once the model is created, we can use the held-back data to evaluate the model's predictions against the ground truth values. If our model does not perform well, we need to re-examine our data and algorithms and then try again after making adjustments.
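To make the hold-back idea concrete, here is a small Python sketch. The "algorithm" is a deliberately trivial stand-in (a cutoff halfway between the two class means), the data is invented, and the 70/30 split is an assumption; none of this reflects how Amazon ML trains or evaluates models internally.

```python
def train_threshold(rows):
    """Learn a cutoff halfway between the mean of each class."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict(threshold, x):
    """Predict class 1 when the value is at or above the cutoff."""
    return 1 if x >= threshold else 0


def accuracy(threshold, rows):
    """Fraction of rows where the prediction matches the known answer."""
    correct = sum(1 for x, y in rows if predict(threshold, x) == y)
    return correct / len(rows)


# Hypothetical labeled examples: (feature value, known correct answer).
data = [(1.0, 0), (8.0, 1), (2.0, 0), (9.0, 1), (3.0, 0),
        (7.5, 1), (2.5, 0), (8.5, 1), (6.0, 0), (4.0, 1)]

# Train on the first 70% and hold back the rest for evaluation.
split = int(len(data) * 0.7)
train, held_back = data[:split], data[split:]

model = train_threshold(train)
score = accuracy(model, held_back)
print(f"cutoff={model:.2f} accuracy={score:.2f}")
```

With this invented data, the model gets only one of the three held-back examples right, which is exactly the kind of result that would send us back to re-examine the data and the algorithm. In practice the data would usually be shuffled before splitting so that the held-back portion is representative.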
Finally, when we have a model that performs well, we can use it to make predictions against new data. The remainder of this series will focus on the specific steps necessary to build a model and use it to make predictions using the Machine Learning web service offered by Amazon.
James is most happy when creating or fixing code. He tries to learn more and stay up to date with recent industry developments.
James recently completed his Master’s Degree in Computer Science and enjoys attending or speaking at community events like CodeCamps or user groups.
He is also a regular contributor to the ApprovalTests.net open source projects, and is the author of the C++ and Perl ports of that library.