Case Study: Heart Disease
Machine Learning Concepts & Models
This course explores the core concepts of machine learning, the models available, and how to train them. We’ll take a deeper look at what it means to train a machine learning model, as well as the data and methods required to do so. We’ll also provide an overview of the most common models you’re likely to encounter, and take a practical approach to understand when and how to use them to solve business problems.
In the second half of this course, you will be guided through a series of case studies that will show you how to apply the concepts covered in this course to real-life examples.
If you have any feedback relating to this course, feel free to contact us at firstname.lastname@example.org.
- Understand the key concepts and models related to machine learning
- Learn how to use training data sets with machine learning models
- Learn how to choose the best machine learning model to suit your requirements
- Understand how machine learning concepts can be applied to real-world scenarios in property prices, health, animal classification, and marketing activities
This course is intended for anyone who is:
- Interested in understanding machine learning models on a deeper level
- Looking to enrich their understanding of machine learning and how to use it to solve complex problems
- Looking to build a foundation for continued learning in the machine learning space and data science in general
To get the most out of this course, you should have a general understanding of data concepts as well as some familiarity with cloud providers and their managed services, especially Amazon or Google. Some experience in data or development is preferable but not essential.
For a slightly more complex example, let's look at another common machine learning application: predicting whether a patient is at risk for a certain disease. In this case, we want a simple yes or no answer to the question, is this patient at high risk for heart disease?
Notice how we asked for a simple yes or no. We could also ask, what is their risk score? But in this case, we're asking for a categorical answer, high risk or not high risk, rather than a score from 0 to 100% for the likelihood of developing heart disease.
So this shows you how the same type of problem can call for a different engineering and data science approach, depending on whether the business requirement is a yes or no answer or a score.
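To make the difference between a score and a yes or no answer concrete, here is a minimal Python sketch. It assumes a model has already produced a risk score between 0 and 1; the function name `risk_label`, the 0.5 cutoff, and the example scores are all made up for illustration and do not come from the course.

```python
def risk_label(score: float, threshold: float = 0.5) -> str:
    """Collapse a 0..1 risk score into a categorical label.

    The threshold is an illustrative assumption; in practice the
    business requirement decides where the cutoff sits.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return "high risk" if score >= threshold else "not high risk"


# Hypothetical scores for three patients (invented values).
scores = [0.12, 0.50, 0.87]
labels = [risk_label(s) for s in scores]
print(labels)
```

A regression-style approach would stop at the score itself, while the binary classification used in this case study only needs the final label.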
In more complex problems, you'll often have a blend of lots of different types of features in order to predict your outcome, aka your label. And here, we have a good mix of categorical features, such as gender and whether the patient is a smoker, and numeric features, such as weight, cholesterol, and age.
This blend can make it a little overwhelming to start on machine learning projects, so feel free to disregard some of the features at first, but know that your model accuracy may suffer. Once again, don't be afraid to iterate in order to start building your understanding of the problem.
To put these features in a more structured format, you can see here we have the features on the left (gender, age, weight, cholesterol, and blood pressure) and the label on the right. Notice how we're asking, does the patient have heart disease? It is a categorical label. We're not measuring heart health or other factors. It's simply, do they have a heart disease diagnosis? And it is a binary decision.
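The features-and-label layout described above could be represented in Python roughly as follows. This is only a sketch: the patient rows are invented for illustration, the field names (such as `weight_kg`) are assumptions, and encoding gender as 0 or 1 is just one simple way to turn a categorical feature into a number.

```python
# Invented example rows; these are not real patient data.
patients = [
    {"gender": "F", "age": 54, "weight_kg": 70.0,
     "cholesterol": 210, "blood_pressure": 130, "heart_disease": False},
    {"gender": "M", "age": 61, "weight_kg": 92.5,
     "cholesterol": 265, "blood_pressure": 150, "heart_disease": True},
    {"gender": "M", "age": 38, "weight_kg": 80.0,
     "cholesterol": 180, "blood_pressure": 118, "heart_disease": False},
]

NUMERIC_FEATURES = ["age", "weight_kg", "cholesterol", "blood_pressure"]


def encode(patient):
    """Return (feature_vector, label) for one patient record."""
    # Categorical feature: map gender to 0.0 / 1.0 (an assumed encoding).
    features = [1.0 if patient["gender"] == "M" else 0.0]
    # Numeric features are passed through as floats.
    features += [float(patient[name]) for name in NUMERIC_FEATURES]
    # Binary label: 1 means a heart disease diagnosis, 0 means none.
    label = 1 if patient["heart_disease"] else 0
    return features, label


encoded = [encode(p) for p in patients]
X = [features for features, _ in encoded]
y = [label for _, label in encoded]
print(X[0], y[0])
```

Each row of `X` holds the features on the left of the table, and the matching entry of `y` holds the binary label on the right.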
To start again with our flow chart, we have a good business understanding. We want to know yes or no, is this patient high risk for heart disease? We have a labeled training set, which means we can go with supervised machine learning. And due to the business requirements, we have a binary classification. Do they or don't they have a high risk for heart disease?
Now remember, we might have to go with multi-class classification or regression for different types of requirements from the business, but in this example, it's yes or no, so it's a binary classification problem. And finally, in terms of algorithm, we don't have a clear linear relationship to associate this with.
So in this case, we need to start thinking about different types of algorithms. K-nearest neighbor, or KNN, is an index-based algorithm that uses non-parametric methods for classification. Simply put, this means that the data is plotted, and the distance between different data points on multiple axes is used to determine the "K" points that are closest to the sample point.
Basically, we're measuring how far away any test or production data point is from the training data, and then simply associating it with the closest training points. K-nearest neighbor algorithms can get pretty complex quite quickly, so we're going to avoid going into too much depth here, but it's very important to notice K. This is what's called a hyperparameter, and it is a setting that fundamentally controls how the algorithm works.
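As a rough illustration of the idea, here is a minimal K-nearest neighbor classifier in plain Python. It is a sketch under simplifying assumptions, not the implementation used by any particular library or cloud service: the two features are assumed to be already scaled to the 0 to 1 range, the training points are invented, and the function name `knn_predict` is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, sample, k=3):
    """Label `sample` by majority vote among its k closest training points."""
    # Euclidean distance from the sample to every training point.
    distances = sorted(
        (math.dist(point, sample), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and take the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented, pre-scaled (age, cholesterol) pairs for illustration only.
train_points = [(0.20, 0.30), (0.25, 0.20), (0.30, 0.35),
                (0.80, 0.90), (0.75, 0.80), (0.90, 0.70)]
train_labels = ["no", "no", "no", "yes", "yes", "yes"]

prediction = knn_predict(train_points, train_labels, (0.70, 0.75), k=3)
print(prediction)
```

Using an odd K for a two-class problem avoids tied votes, which this sketch does not handle in any special way.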
Basically, by setting K equal to one or more than one, we control how many of its nearest neighbors the algorithm considers and how it handles clusters of nearby points. This is often set by the data scientists themselves, or, with some of the more modern tools, you have what's called hyperparameter tuning, where the system automatically tries multiple values of K and reports back to you which model has the best fit.
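The idea of trying several values of K and reporting the best one can be sketched in a few lines of Python. This is an illustrative simplification, not how any specific managed service performs hyperparameter tuning: the data is invented (including one deliberately mislabeled training point), the helper names `knn_predict`, `accuracy`, and `tune_k` are hypothetical, and "best fit" is assumed to mean the highest accuracy on a small held-out validation set.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, sample, k):
    """Majority vote among the k training points closest to `sample`."""
    distances = sorted(
        (math.dist(point, sample), label)
        for point, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, val_X, val_y, k):
    """Fraction of validation samples that are labeled correctly."""
    correct = sum(
        knn_predict(train_X, train_y, x, k) == y
        for x, y in zip(val_X, val_y)
    )
    return correct / len(val_y)


def tune_k(train_X, train_y, val_X, val_y, candidates=(1, 3, 5)):
    """Try each candidate K and return (best_k, accuracy per K)."""
    scores = {k: accuracy(train_X, train_y, val_X, val_y, k)
              for k in candidates}
    # On ties, max() keeps the first (smallest) candidate K.
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Invented, pre-scaled data; the last training point is a noisy label.
train_X = [(0.10, 0.20), (0.20, 0.10), (0.30, 0.30),
           (0.70, 0.80), (0.80, 0.70), (0.90, 0.90), (0.75, 0.75)]
train_y = ["no", "no", "no", "yes", "yes", "yes", "no"]
val_X = [(0.74, 0.76), (0.15, 0.15), (0.85, 0.85), (0.25, 0.20)]
val_y = ["yes", "no", "yes", "no"]

best_k, scores = tune_k(train_X, train_y, val_X, val_y)
print(best_k, scores)
```

With K equal to one, a single noisy neighbor can decide the answer; larger values of K smooth that out, which is why the choice of K changes how well the model fits.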
Just know that some machine learning models, in addition to training, need specific types of tuning through what are called hyperparameters. That's a little beyond the scope of this class. Knowing exactly what role those hyperparameters play in machine learning is really higher level, level three and level four stuff.
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.