Custom Vision Overview

This course explores the Azure Custom Vision service and how you can use it to create and customize vision recognition solutions. You'll get some background info on what the service is before looking at the various steps for creating image classification and object detection models, uploading and tagging images, and then training and deploying your models.

Learning Objectives

  • Use the Custom Vision portal to create new Vision models
  • Create both Classification and Object Detection models
  • Upload and tag images according to your requirements
  • Train and deploy the configured models

Intended Audience

  • Developers or architects who want to learn how to use Azure Custom Vision to tailor a vision recognition solution to their needs


Prerequisites

To get the most out of this course, you should have:

  • Basic Azure experience, at least with topics such as subscriptions and resource groups
  • Basic knowledge of Azure Cognitive Services, especially vision-related services
  • Some developer experience, including familiarity with terms such as REST API and SDK

Let us start with a general overview of what Custom Vision is and how it can help you create and customize vision recognition solutions. Azure Custom Vision is an Artificial Intelligence service, so I'd like to start by defining the AI framework that it belongs to.

Let's start with AI. Artificial intelligence allows computers to mimic capabilities of the human brain, such as learning, understanding, and recognizing patterns, without being explicitly coded for each task as we generally do with algorithms. Machine learning, which is a subset of AI, is the process of teaching or training a computer system to make predictions based on the available data. Using computer vision as an example, if you send enough pictures of cats and dogs (the training data), an ML model will eventually be able to differentiate between them just as we humans do.

Finally, there's deep learning, which is a subset of ML based on artificial neural networks. Deep learning is commonly used in applications such as computer vision, natural language processing, and speech recognition. Modern vision models tend to use a class of neural networks called convolutional neural networks, or CNNs. The problem, however, is that creating these models from scratch might require a considerable amount of data, time, computing power, and expertise. And this is the beauty of Azure Cognitive Services. It allows you to leverage models already created by Microsoft at a fraction of what it would cost you to build them yourself, and with almost no AI or ML knowledge required.

There are several cognitive services available in areas such as vision, speech, language, and decision making. For the purpose of this course, however, we're going to focus on Computer Vision and Custom Vision. Both technologies can essentially do two things.

First, we have image classification, which categorizes pictures based on specific classes that you define. Basically, it will tell you what is in the picture. For example, you could create a model for a hotel that identifies whether the pictures taken by a guest are from a bathroom, bedroom, reception, or pool area. Each class detected will also have a probability score attached to it, which goes from 0 to 1, according to how confident the model is in the class identified.
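As a rough illustration of how an app might consume those probability scores, here is a minimal Python sketch. The result list, its "tag" and "probability" keys, the values, and the 0.5 threshold are all invented for this example; they are assumptions, not the exact shape returned by the service.

```python
# Hypothetical classification results for one guest photo.
# The tags and probabilities below are made up for illustration.
predictions = [
    {"tag": "bathroom", "probability": 0.03},
    {"tag": "bedroom", "probability": 0.91},
    {"tag": "reception", "probability": 0.04},
    {"tag": "pool", "probability": 0.02},
]


def top_class(predictions, threshold=0.5):
    """Return the most probable tag, or None if no score reaches the threshold."""
    best = max(predictions, key=lambda p: p["probability"])
    return best["tag"] if best["probability"] >= threshold else None


print(top_class(predictions))
```

Returning None below the threshold gives the calling app a way to route low-confidence pictures to a person instead of trusting the model blindly.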

The other alternative is object detection. The main difference between this option and image classification is that object detection adds a bounding box with the coordinates of each object found in the picture. What this means is that not only will object detection tell you what is in the picture, but where each element is. Using the same hotel example, you could identify objects such as sinks, beds, lamps, or reception desks in a picture.
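To make the bounding box idea concrete, here is a small Python sketch that converts a box into pixel coordinates. It assumes the box is expressed as left, top, width, and height values normalized between 0 and 1 relative to the image size; the detection itself and the 800 by 600 image size are hypothetical.

```python
def to_pixels(box, image_width, image_height):
    """Convert a normalized box into (left, top, right, bottom) pixel coordinates."""
    left = round(box["left"] * image_width)
    top = round(box["top"] * image_height)
    right = round((box["left"] + box["width"]) * image_width)
    bottom = round((box["top"] + box["height"]) * image_height)
    return left, top, right, bottom


# Hypothetical detection of a bed in an 800x600 hotel room photo.
detection = {
    "tag": "bed",
    "probability": 0.88,
    "box": {"left": 0.25, "top": 0.5, "width": 0.5, "height": 0.25},
}

print(detection["tag"], to_pixels(detection["box"], 800, 600))
```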

These technologies are being used here and now in areas such as self-driving cars, medical imaging, building safety, product identification, and much more. Whenever your company has staff manually inspecting visual content, there might be a use case for vision cognitive services. And the more images that need to be classified, the greater the benefit of using an AI Vision solution to automate this task.

Okay, but wait a second. If these two technologies can perform both classification and object detection, what's the difference between them? Well, Computer Vision is Microsoft's general-purpose vision technology. It can recognize a variety of elements in an image, such as people, animals, landmarks, objects, handwriting, brands, colors, and much, much more. It can even generate captions that describe the image, which is great if you want to produce image metadata.

Notice that in both pictures, computer vision was able to detect the bird. Custom Vision, on the other hand, allows you to interact with the model and train it for your specific purposes instead of just consuming the features that are available out-of-the-box. For example, if you want to train the model to spot a cockatoo, or to differentiate between a cockatoo and a parrot, you need to use Custom Vision.

Custom Vision uses a pretty interesting neural network technique called transfer learning, which applies knowledge gained from solving one problem to a different, but related situation. This can substantially decrease the time needed for creating the models. Let's see how this works.

As I mentioned on the previous slides, creating models from scratch requires a substantial amount of data, as well as time, computing power, and data science expertise. In theory, you'd need a similar amount of effort when developing every new model. However, with transfer learning, you start from a model that was already created and add another layer with the specific needs of the new model. For example, if you have already created a model to recognize cars, you can use transfer learning to repurpose the original one so that it also recognizes trucks, motorcycles, and other kinds of vehicles. This whole process is transparent to you, so you don't need to configure any extra steps to use transfer learning. But it does mean that creating a Custom Vision solution is considerably faster and simpler than writing your own model training code.
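The general idea of reusing an existing model and training only a small new layer can be sketched in a few lines of Python. This is a toy illustration only, not how Custom Vision works internally: base_features is a stand-in for a pretrained network, the new layer is a simple nearest-centroid classifier, and the vehicle measurements are invented.

```python
def base_features(sample):
    """Stand-in for a pretrained model: a fixed feature extractor that is never retrained."""
    length_m, height_m = sample
    return (length_m, height_m, length_m * height_m)


def train_head(samples, labels):
    """Train only the new layer: one average feature vector (centroid) per label."""
    sums, counts = {}, {}
    for sample, label in zip(samples, labels):
        feats = base_features(sample)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(head, sample):
    """Pick the label whose centroid is closest to the sample's features."""
    feats = base_features(sample)

    def distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(feats, centroid))

    return min(head, key=lambda label: distance(head[label]))


# Invented (length, height) measurements in meters for the new classes.
samples = [(4.2, 1.4), (4.5, 1.5), (7.5, 3.0), (8.0, 3.2), (2.1, 1.1), (2.2, 1.2)]
labels = ["car", "car", "truck", "truck", "motorcycle", "motorcycle"]

head = train_head(samples, labels)
print(predict(head, (7.8, 3.1)))
```

Only train_head looks at the new examples, which is why so few of them are needed here; the expensive part, represented by base_features, is reused unchanged.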

Custom Vision, as we already discussed, allows you to interact with the model and train it according to your needs. As is common with custom cognitive services, this model training is done through a dedicated web portal. The portal is your main way of configuring and interacting with Custom Vision, allowing you to do things such as create and configure projects, configure the labels that you want to identify, upload and tag your images, and train, evaluate, and deploy your models. You can also perform these same tasks programmatically, either by using a REST API endpoint or by using SDKs available for several languages such as .NET, Python, Go, Java, and Node.js.

Okay, so now that we have a general overview of the technologies behind Custom Vision, it's time to cover the creation of the Azure resources needed to use the service. There are two main endpoints that you need to create. The first one is the training endpoint, which allows you to perform the same tasks that you can do on the Custom Vision portal, such as creating new projects, adding labels, and uploading and tagging images. In a way, you can consider this to be your development endpoint. The other one is the prediction endpoint, which will be used by your apps to make predictions from the images that you send. In a way, you can consider this to be your production endpoint.
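As a rough sketch of how an app might address the prediction endpoint over REST, the Python snippet below builds a request without sending it. The URL path (including the v3.0 version segment) and the Prediction-Key header reflect my understanding of the prediction REST API and should be checked against the official reference. The endpoint, project ID, published model name, and key are hypothetical placeholders.

```python
import urllib.request


def build_prediction_request(endpoint, project_id, published_name,
                             prediction_key, image_bytes, task="classify"):
    """Build (but do not send) a POST request for a published Custom Vision model.

    task is assumed to be "classify" for image classification projects
    or "detect" for object detection projects.
    """
    url = (
        f"{endpoint.rstrip('/')}/customvision/v3.0/Prediction/"
        f"{project_id}/{task}/iterations/{published_name}/image"
    )
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="POST",
        headers={
            "Prediction-Key": prediction_key,
            "Content-Type": "application/octet-stream",
        },
    )


# All values below are placeholders, not real resources or credentials.
request = build_prediction_request(
    endpoint="https://example-prediction.cognitiveservices.azure.com/",
    project_id="00000000-0000-0000-0000-000000000000",
    published_name="hotel-rooms-v1",
    prediction_key="<your-prediction-key>",
    image_bytes=b"raw image bytes go here",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. Keeping the key of the prediction resource separate from the training key means a production app never needs credentials that can modify the project.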

In the wizard that creates the Custom Vision resource, you can select the options Training, Prediction, or Both. The Both option still creates two endpoints, one for training and one for prediction, so it's just a more practical way to create both of them in a single pass through the wizard. The prediction endpoint will have the -prediction suffix at the end of its name.

Alternatively, if you have several cognitive services in your company, you can consolidate all of them under a single Cognitive Services resource that covers the vision, speech, language, and decision APIs. This will considerably decrease the number of endpoints to be managed on Azure, but keep in mind the following drawbacks. It might be more difficult for you to analyze costs or change settings individually, as all services are under the same umbrella. This consolidated resource does not offer the free tier, called F0, which allows a limited number of operations and projects at no cost. It also does not distinguish between the training and the prediction endpoints.

Now, let's jump to a demo and see how we can create these resources on Azure.

About the Author

Emilio Melo has been involved in IT projects in over 15 countries, with roles ranging across support, consultancy, teaching, project and department management, and sales, mostly focused on Microsoft software. After 15 years of on-premises experience in infrastructure, data, and collaboration, he became fascinated by Cloud technologies and the incredible transformation potential they bring. His passion outside work is to travel and discover the wonderful things this world has to offer.