Case Study: Animal Classification

This course explores the core concepts of machine learning, the models available, and how to train them. We’ll take a deeper look at what it means to train a machine learning model, as well as the data and methods required to do so. We’ll also provide an overview of the most common models you’re likely to encounter, and take a practical approach to understanding when and how to use them to solve business problems.

In the second half of this course, you will be guided through a series of case studies that will show you how to apply the concepts covered in this course to real-life examples.


Learning Objectives

  • Understand the key concepts and models related to machine learning
  • Learn how to use training data sets with machine learning models
  • Learn how to choose the best machine learning model to suit your requirements
  • Understand how machine learning concepts can be applied to real-world scenarios in property prices, health, animal classification, and marketing activities

Intended Audience

This course is intended for anyone who is:

  • Interested in understanding machine learning models on a deeper level
  • Looking to enrich their understanding of machine learning and how to use it to solve complex problems
  • Looking to build a foundation for continued learning in the machine learning space and data science in general


Prerequisites

To get the most out of this course, you should have a general understanding of data concepts as well as some familiarity with cloud providers and their managed services, especially Amazon or Google. Some experience in data or development is preferable but not essential.


For our third example, let's make things a little more interesting and attempt to classify animals at a zoo. This time, imagine you're a zoologist or a researcher, and you've been told you need to fit animals into one of seven classifications: bird, mammal, reptile, fish, amphibian, insect, or invertebrate. Every animal in the zoo must be put into one of those, and you have no information beyond what you're able to observe.

So imagine you're actually just going to the zoo and having to build your own training set. It would make a lot of sense to look at the animals and note down their attributes, and maybe look at a plaque, if there's one available, to see what they're already classified as so you can get some labels.

On screen, we're showing a lot of different attributes and features, things such as: Does the animal have hair? Does it have feathers? Does it lay eggs? Can it fly? Does it have teeth? Does it breathe? Does it have legs, or a tail? How big is it? And then, very importantly, what is its classification?

Most animals in the zoo will have an information plaque that will tell you about them. So you could very quickly, through a combination of real-world observations and some baseline research, build a really strong labeled training set.
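To make the idea of a labeled training set concrete, here is a minimal Python sketch of how the zoo observations might be written down. The field names, the `Observation` class, and the three example rows are illustrative assumptions, not data from the course, and only a subset of the attributes mentioned above is included.

```python
from dataclasses import dataclass

# The seven classes named in the lesson.
CLASSES = ["bird", "mammal", "reptile", "fish",
           "amphibian", "insect", "invertebrate"]


@dataclass
class Observation:
    """One animal as observed at the zoo (field names are hypothetical)."""
    name: str
    hair: bool
    feathers: bool
    lays_eggs: bool
    flies: bool
    teeth: bool
    breathes: bool
    legs: int
    tail: bool
    label: str  # the classification read from the information plaque

    def features(self) -> list:
        """Numeric feature vector: booleans become 0/1, legs stays a count."""
        return [int(self.hair), int(self.feathers), int(self.lays_eggs),
                int(self.flies), int(self.teeth), int(self.breathes),
                self.legs, int(self.tail)]


# Hand-written example rows, for illustration only.
training_set = [
    Observation("lion", True, False, False, False, True, True, 4, True, "mammal"),
    Observation("parrot", False, True, True, True, False, True, 2, True, "bird"),
    Observation("ladybird", False, False, True, True, False, True, 6, False, "insect"),
]

# X holds the features, y holds the labels: the two halves of a labeled set.
X = [obs.features() for obs in training_set]
y = [obs.label for obs in training_set]
```

Keeping the label in the same record as the features makes it harder to mismatch an animal with the wrong classification while collecting data by hand.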

Now, as we have a labeled training set, and we clearly understand the problem, this is a supervised learning problem. More specifically, since each animal belongs to exactly one of seven classifications, it falls into the multi-class classification type of supervised learning.
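In practice, multi-class models usually expect the class labels as integers rather than strings. The sketch below shows one simple way to do that mapping; the helper names `encode_labels` and `decode_label`, and the ordering of the classes, are assumptions made for illustration.

```python
# The seven classes from the lesson; the order here is an arbitrary choice.
CLASSES = ["bird", "mammal", "reptile", "fish",
           "amphibian", "insect", "invertebrate"]

# Map each class name to an integer index from 0 to 6.
class_to_index = {name: i for i, name in enumerate(CLASSES)}


def encode_labels(labels):
    """Turn class names into integer targets for a multi-class model."""
    return [class_to_index[label] for label in labels]


def decode_label(index):
    """Turn a predicted integer back into a readable class name."""
    return CLASSES[index]


encoded = encode_labels(["mammal", "bird", "insect"])
print(encoded)
print(decode_label(encoded[0]))
```

Having seven possible target values, rather than two, is what makes this a multi-class problem instead of a binary one.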

One model you might pick is called XGBoost, or eXtreme Gradient Boosting. Weird name, but at its core, it's an ensemble of decision trees. It is well-suited to problems that involve a small to medium amount of structured or tabular data, such as the observations we built up at the zoo.

XGBoost takes advantage of what's called a gradient boosting framework. In plain English, this means it builds a collection, or ensemble, of many decision trees, each of which takes the different attributes and makes a prediction. The trees are built one after another, with each new tree trying to correct the mistakes of the ones before it. As a simplification, we can view each decision tree's outcome as a vote. So perhaps one decision tree asks, does it have six legs? Yes or no. If yes, it's an insect.

While another one simply asks, does it have lungs? Yes or no. If no, then it's an insect. Now, neither one of those might be 100% accurate, but if they both say "insect," then we have two votes for insect, and that starts to build confidence.
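The voting intuition can be sketched in a few lines of Python. This is plain majority voting over two hand-written rules along the lines of the ones just described, not how XGBoost is actually implemented; the rule functions, the dictionary keys, and the example animals are all made up for illustration, and the leg rule here checks for exactly six legs.

```python
from collections import Counter


def legs_rule(animal):
    """A one-question 'tree': exactly six legs suggests an insect."""
    return "insect" if animal["legs"] == 6 else "not insect"


def lungs_rule(animal):
    """Another one-question 'tree': no lungs suggests an insect."""
    return "insect" if not animal["has_lungs"] else "not insect"


def vote(animal, rules):
    """Return the most common answer and the share of rules that gave it."""
    votes = Counter(rule(animal) for rule in rules)
    label, count = votes.most_common(1)[0]
    return label, count / len(rules)


rules = [legs_rule, lungs_rule]

beetle = {"legs": 6, "has_lungs": False}  # both rules say "insect"
trout = {"legs": 0, "has_lungs": False}   # the rules disagree

print(vote(beetle, rules))
print(vote(trout, rules))
```

When the rules agree, the share of votes is 1.0; when they split, it drops to 0.5, which is the sense in which agreement between imperfect rules builds confidence.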

So XGBoost uses an ensemble of decision trees, and then combines their outputs, a bit like taking a vote, into a measure of confidence for each class in order to determine the most likely classification.
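A slightly closer picture of how a boosted ensemble arrives at a confidence is that each round of trees contributes a numeric score per class, the scores are added up, and a softmax turns the totals into probabilities. XGBoost's multi-class objectives (`multi:softmax` and `multi:softprob`) work along these lines, but the sketch below is a simplified pure-Python illustration with made-up scores, not the library's implementation; the `predict` helper is hypothetical.

```python
import math

CLASSES = ["bird", "mammal", "reptile", "fish",
           "amphibian", "insect", "invertebrate"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(round_scores):
    """Sum per-class scores over all rounds, then pick the likeliest class.

    round_scores holds one list of seven per-class scores per boosting round.
    """
    totals = [sum(column) for column in zip(*round_scores)]
    probabilities = softmax(totals)
    best = max(range(len(CLASSES)), key=probabilities.__getitem__)
    return CLASSES[best], probabilities[best]


# Made-up scores from three rounds for a single animal (illustration only).
rounds = [
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.9, 0.2],
    [0.0, 0.0, 0.0, 0.1, 0.0, 0.7, 0.3],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.5, 0.1],
]

label, confidence = predict(rounds)
print(label, round(confidence, 3))
```

Unlike a simple vote count, the summed scores let a tree that is very sure of its answer count for more than one that is only slightly leaning.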



About the Author

Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.