Case Study - Labeling Houses

Welcome to Part Two of an introduction to using Artificial Intelligence and Machine Learning. As we mentioned in part one, this course starts from the ground up and focuses on giving students the tools and materials they need to navigate the topic. There are several labs directly tied to this learning path, which will provide hands-on experience to supplement the academic knowledge provided in the lectures.

In part one we looked at how you can use out-of-the-box machine learning models to meet your needs. In this course, we are going to build on that and look at how you can add your own functionality to these pre-canned models. We look at ML training concepts, release processes, and how ML services are used in a commercial setting. Finally, we take a look at a case study so that you get a feel for how these concepts play out in the real world.

For any feedback relating to this course, please contact us at

Learning Objectives

By the end of this course, you should be ready to take more advanced courses, and you will have a springboard into handling complex tasks in your day-to-day job, whether in a professional, student, or hobbyist environment.

Intended Audience

This course is part of a multi-part series ideal for those who are interested in understanding machine learning from a 101 perspective, starting from a very basic level and ramping up over time. If you already understand concepts such as how to train a model and run inference on it, you may wish to skip ahead to part two or a more advanced learning path.


It helps if you have a light data engineering or developer background, as several parts of this class, particularly the labs, involve hands-on work and manipulating basic data structures and scripts. The labs all have highly detailed notes to help novice users understand them, but you will be able to expand more easily at your own pace with a good baseline understanding. Although we explain the core concepts, there are some prerequisites for this course.

It is recommended that you have a basic familiarity with one of the cloud providers, especially AWS or GCP. Azure, Oracle, and other providers also have machine learning suites but these two are the focus for this class.

If you have an interest in completing the labs for hands-on work, Python is a helpful language to understand.  



To begin to put a real-world scenario around this, imagine that you work for a real estate company and that you're tasked with verifying user-submitted listings. In this example, we'll be discussing how images can be inspected with respect to real estate. But in practice, this type of pattern is very important with anything user-submitted.

In this scenario, imagine you're a professional data scientist, and you have been hired to handle the fact that your users are submitting garbage data to your database. This is a massive problem in that the houses' descriptions do not match the pictures provided. Maybe the listing says it's a rural house when it's very clearly in the city. Or maybe it's labeled a single-family home while the exterior shot is that of an apartment complex.

This type of data integrity issue is extremely common in the real world, and it affects both users of the platform and internal metric tracking. A very common example, and a great one for starting to understand machine learning, is people confusing Tudor and Victorian-style homes. So you might ask yourself: how can I begin to handle this at scale? How can I take advantage of my knowledge of training models to create a helper tool that allows me to quickly analyze all of the user-submitted listings? You would want to start thinking about how to train the model, and then how to run inference on the model, which is where this goes beyond what a data scientist alone might do: you might also need a data engineer or an application engineer.

On top of this conclusion, you might realize that whatever approach you take will have to be semi-autonomous in order to continuously verify all the images. So if we start to discuss how your model and application work, perhaps you feed it a picture and it answers: is this a Tudor house? Is this a Victorian house? Or is it just a skyscraper, completely off the mark and something we don't recognize? How can you match this to the user-submitted description? In practice, this type of problem is multi-pronged: which picture are we using? What does the picture show us? How can we automate this job completely?

You might actually need multiple models. Perhaps you need one model that determines whether the image is an external shot, while another model determines the type of house. Perhaps you need a natural language processing model that's capable of extracting keywords from the user description as a first pass, so that those extracted words can then be matched against the outputs of the image model.
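To make the multi-model idea concrete, here is a minimal sketch of how the pieces could be chained together. Everything in it is an assumption for illustration: `is_exterior` and `classify_style` stand in for trained image models, and the keyword lookup in `extract_claimed_style` is a stand-in for a real natural language processing model.

```python
import re
from typing import Any, Callable, Optional

# Hypothetical keyword-to-label table; a real NLP model would replace this.
STYLE_KEYWORDS = {"tudor": "Tudor", "victorian": "Victorian"}


def extract_claimed_style(description: str) -> Optional[str]:
    """Return the first house style named in the description, if any."""
    for word in re.findall(r"[a-z]+", description.lower()):
        if word in STYLE_KEYWORDS:
            return STYLE_KEYWORDS[word]
    return None


def check_listing(
    image: Any,
    description: str,
    is_exterior: Callable[[Any], bool],
    classify_style: Callable[[Any], str],
) -> str:
    """Chain the models: exterior check, then text claim, then image label."""
    if not is_exterior(image):
        return "skipped: not an exterior shot"
    claimed = extract_claimed_style(description)
    if claimed is None:
        return "skipped: no style claimed"
    predicted = classify_style(image)
    return "match" if predicted == claimed else "mismatch"


# Stub models (assumptions) so the sketch runs without any trained model.
result = check_listing(
    image="listing_123.jpg",
    description="Charming Tudor home near the park",
    is_exterior=lambda img: True,
    classify_style=lambda img: "Victorian",
)
print(result)  # mismatch
```

Passing the models in as plain functions keeps the matching logic separate from whichever cloud service or library ends up doing the actual prediction.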

To simplify this example, let's just focus on how we can make a model that automatically identifies the home from an exterior shot. How can we create some labeled data to train the machine learning algorithm with, so that we can feed it an external shot and it kicks back a label of Tudor, Victorian, or other?
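As a rough illustration of what labeled data can look like, the sketch below loads image paths and labels from a small CSV and rejects any label outside the three we care about. The column names `image_path` and `label`, the file paths, and the helper name `load_labeled_examples` are all hypothetical; the managed services on AWS and GCP each define their own import formats.

```python
import csv
import io
from typing import List, Tuple

# The three labels from the scenario.
ALLOWED_LABELS = {"Tudor", "Victorian", "Other"}

# Hypothetical sample data; in practice this would be a file of real listings.
SAMPLE_CSV = """image_path,label
images/house_001.jpg,Tudor
images/house_002.jpg,Victorian
images/tower_003.jpg,Other
"""


def load_labeled_examples(csv_text: str) -> List[Tuple[str, str]]:
    """Parse (image_path, label) pairs and validate every label."""
    examples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row["label"].strip()
        if label not in ALLOWED_LABELS:
            raise ValueError(f"Unknown label {label!r} for {row['image_path']}")
        examples.append((row["image_path"].strip(), label))
    return examples


examples = load_labeled_examples(SAMPLE_CSV)
print(len(examples))  # 3
```

Validating labels up front catches typos such as "Tudor " or "victorian" before they quietly become extra classes in the training data.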

Now, if you remember the importance of creating good training data, you might realize you have to take multiple shots from multiple angles and at multiple times of day. Perhaps you should take a picture from a slightly elevated position looking up or a slightly close-in position looking up the side of the house.

Basically, you need to have a good feel for how the users of your particular application are submitting images, and make sure you select a representative set of training data that captures all the variability. Now, it's important to remember that you don't necessarily need all of that variability.
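One simple way to get a feel for how representative a training set is: tag each image with the conditions it was taken under and list the combinations that have no examples yet. This is only a sketch, and the metadata fields `angle` and `time_of_day`, along with the helper `coverage_gaps`, are assumptions for illustration.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence, Tuple


def coverage_gaps(
    examples: List[Dict[str, str]],
    labels: Sequence[str],
    angles: Sequence[str],
    times: Sequence[str],
) -> List[Tuple[str, str, str]]:
    """List every (label, angle, time_of_day) combination with no examples."""
    seen = Counter(
        (e["label"], e["angle"], e["time_of_day"]) for e in examples
    )
    return [combo for combo in product(labels, angles, times) if seen[combo] == 0]


# Hypothetical metadata for a tiny training set.
training_set = [
    {"label": "Tudor", "angle": "front", "time_of_day": "day"},
    {"label": "Victorian", "angle": "front", "time_of_day": "day"},
]

gaps = coverage_gaps(
    training_set,
    labels=["Tudor", "Victorian"],
    angles=["front", "side"],
    times=["day"],
)
print(gaps)  # [('Tudor', 'side', 'day'), ('Victorian', 'side', 'day')]
```

A gap is not automatically a problem. It is a prompt to decide whether your users actually submit that kind of shot often enough to matter.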

Maybe you can get by with a good chunk of it, and then detect the conditions under which your model fails to correctly identify the house. So get started with a limited training set covering a number of houses, train an identifier, and measure model accuracy. That is how you can get started in creating a functional machine learning model to help you automate your job. Once this model is created, you can start to put it in production and see how many false positives, false negatives, accurate matches, and errors are detected in the users' flow.
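Measuring accuracy and counting false positives and false negatives can be sketched in a few lines. The example below treats one label (here "Tudor") as the positive class; the function name `evaluate` and the sample labels are made up for illustration, and a real project would more likely lean on an existing metrics library.

```python
from collections import Counter
from typing import Sequence, Tuple


def evaluate(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    positive: str = "Tudor",
) -> Tuple[float, Counter]:
    """Return overall accuracy plus outcome counts for one positive label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("Label lists must be non-empty and the same length")
    counts = Counter()
    correct = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == predicted:
            correct += 1
        if predicted == positive and truth == positive:
            counts["true_positive"] += 1
        elif predicted == positive:
            counts["false_positive"] += 1  # model said Tudor, but it was not
        elif truth == positive:
            counts["false_negative"] += 1  # model missed a real Tudor
        else:
            counts["true_negative"] += 1
    return correct / len(true_labels), counts


# Made-up labels for four held-out images.
truth = ["Tudor", "Victorian", "Other", "Tudor"]
predictions = ["Tudor", "Tudor", "Other", "Victorian"]

accuracy, counts = evaluate(truth, predictions)
print(accuracy)                 # 0.5
print(counts["false_positive"]) # 1
print(counts["false_negative"]) # 1
```

Running the same function once per label, and once per shooting condition, is a quick way to see where the model struggles.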


About the Author

Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.