Introduction & Overview
Cognitive Services Features
Course Summary

Artificial Intelligence is not a future or distant concept; it is here and now, and being used by many companies of various sizes and industries. The foundational theory for AI was actually developed several decades ago, but recent advancements in big data, computing power, cloud, and algorithms have made it affordable and widespread today. With AI and Machine Learning, computers are now able to start reasoning, understanding, and interacting in ways that were never possible before.

Microsoft has created a predefined set of AI models available for companies of all sizes to start with called Cognitive Services, and best of all, they require little to no knowledge of data science. In this course, you will learn how to infuse your apps—on an architectural level—with the intelligence that Cognitive Services provide. We will cover what Cognitive Services are and how to use the various solutions they provide, including Vision, Speech, Language, Decision, and Web Search.

Learning Objectives

  • Understand the functionality provided by Azure Cognitive Services
  • Learn how to incorporate these services into your apps

Intended Audience

  • People who want to learn more about Azure Cognitive Services

Prerequisites

  • Knowledge of Azure
  • Knowledge of at least one programming language
  • Experience using REST APIs

Vision is actually a category of several different APIs, each of which performs one or more functions. The main APIs in the Vision category are Computer Vision, Custom Vision, the Face API, and the Video Indexer.

Let's start with Computer Vision. This is Microsoft's general-purpose vision technology, and you can pass it several types of images, such as JPG, GIF, PNG, or BMP files, or even URLs to those files. It can do a lot of things. For example, it can give one or several possible descriptions of a picture, along with a confidence level for each description. By the way, confidence level is a term you'll hear on a recurring basis in this course. It's an index from zero to one that tells you how sure the model is about something: the closer to one, the more confident the model is.
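To make this concrete, here's a minimal sketch of picking the most confident description from such a response. The JSON shape and field names below are assumptions for illustration, not the exact API contract.

```python
# Hypothetical "describe" response; the field names are assumptions
# modeled on the description above, not the exact API contract.
sample_response = {
    "description": {
        "captions": [
            {"text": "a dog running on grass", "confidence": 0.92},
            {"text": "a small brown animal outdoors", "confidence": 0.54},
        ]
    }
}

def best_caption(response: dict) -> tuple:
    """Return the caption text with the highest confidence (0..1)."""
    captions = response["description"]["captions"]
    top = max(captions, key=lambda c: c["confidence"])
    return top["text"], top["confidence"]

text, confidence = best_caption(sample_response)
```

In real code, you'd decide a minimum confidence below which you fall back to a generic description or ask a human to review.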

It also gives tags and categories for the picture, which provide a bit more context about what the image is about. This is quite useful when you're uploading lots of images and want to add some metadata to them. The main difference between tags and categories is that while you have over 2,000 tags available, there are only 86 categories, which helps you create a structured and well-defined taxonomy.

Recent developments in Computer Vision allow it to also identify objects or brands in a picture. Object identification works similarly to tagging, except that it also gives the coordinates of each object found, returned in the JSON response as a bounding box element.
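As a sketch, extracting those bounding boxes from a hypothetical object-detection response might look like this; the field names are assumptions modeled on the description above.

```python
# Hypothetical object-detection response fragment; field names are
# assumptions for illustration.
detection = {
    "objects": [
        {"object": "laptop",
         "confidence": 0.87,
         "rectangle": {"x": 38, "y": 22, "w": 250, "h": 180}},
    ]
}

def object_boxes(response):
    """Yield (label, confidence, (x, y, w, h)) for each detected object."""
    for obj in response["objects"]:
        r = obj["rectangle"]
        yield obj["object"], obj["confidence"], (r["x"], r["y"], r["w"], r["h"])

boxes = list(object_boxes(detection))
```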

Another handy feature of this API is that it can also detect faces in the picture and give bounding box elements with each face's coordinates. It can even guess the gender and age of the person; however, the Face API can give far more information, as we'll see next.

The Computer Vision API can also identify colors in a picture and even tell whether the picture is clip art or a line drawing. Line drawing detection returns a Boolean true or false, whereas clip art detection returns a number from zero to three based on how sure the model is that the image is clip art.

The API also has a thumbnail-generating feature, which identifies the focal point of a picture as a bounding box. You can then generate a thumbnail, a small representation of the full-sized image, based on that focal point, giving you more meaningful thumbnails.

Another really cool feature of the API is the ability to do OCR, optical character recognition, to extract text from images, including handwriting. It supports 25 different languages with automatic language detection. The recognized text is, as usual, returned with a bounding box. You can also enable rotation correction in case the image is not in the right orientation.
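A typical OCR result groups words into lines and regions, so you usually reassemble the text yourself. Here's a minimal sketch of that; the region/line/word layout below is an assumption for illustration.

```python
# Hypothetical OCR result; the region/line/word layout is an assumption
# based on the bounding-box description above.
ocr_result = {
    "regions": [
        {"lines": [
            {"boundingBox": "28,34,200,20",
             "words": [{"text": "Hello"}, {"text": "world"}]},
        ]}
    ]
}

def extract_text(result):
    """Join recognized words into lines, and lines into a single string."""
    lines = []
    for region in result["regions"]:
        for line in region["lines"]:
            lines.append(" ".join(w["text"] for w in line["words"]))
    return "\n".join(lines)

recognized = extract_text(ocr_result)
```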

Last, but not least, it can detect three types of adult content in an image: adult, such as full nudity or sexual acts; racy, which is less sexual but possibly inappropriate for certain cultures or ages, such as someone in a bathing suit; and gory, which is more about violence. Keep in mind that Microsoft has an entire service for content moderation with a lot more features, which you'll see in the Decision pillar later in this course.

It's worth mentioning that Computer Vision has two additional, more specialized models that you can call: celebrities and landmarks. The celebrities model allows you to recognize over 200,000 famous people, whereas the landmarks model will make it very easy to recognize the Eiffel Tower in your family vacation album. You can still get celebrities and landmarks by calling the generic Computer Vision model, but your chances are better if you call the specialized ones.

That was a lot, wasn't it? Don't worry, Computer Vision's probably the most feature-rich of the Cognitive Services APIs, so you won't have to remember so many things for the other services.

Now, let's talk about the Face API.

The Face API is a more specialized vision model, built specifically for faces. It essentially does two things: face detection and face recognition.

Let's talk about face detection first. This function builds a set of attributes about people's faces, such as age, facial hair, gender, glasses, et cetera, as well as a bounding box with the coordinates of the face in the picture.

Face detection can also identify emotions. The emotions detected are neutral, anger, contempt, disgust, fear, happiness, sadness, and surprise, and each one of them has a corresponding confidence level. The closer to one, the stronger the detected emotion.

Finally, the detect function generates a set of 27 face landmarks and an associated FaceID, which is a unique identifier. This ID can be used later on in your code to identify that person. For privacy reasons, though, the picture sent is discarded after the landmarks are generated and the FaceID expires 24 hours after being generated unless you decide to add it to a persons group, which you'll see next.

The other function of this API is face recognition, which allows you to either identify people in pictures or use your face for authentication. One good example of this kind of technology is Windows Hello, which allows you to sign into Windows 10 using your camera instead of typing a password. The process for Face Recognition works like this:

  • First, you create a persons group, which you can think of as a faces database. The standard persons group can hold up to 10,000 persons. If you believe you'll need more than that, you can create a large persons group, which increases the capacity to a million persons. There's also a small storage cost for persons groups. It's very cheap, though: at the time of this recording, one cent per thousand faces. Still, you might want to delete persons groups when you no longer need them.
  • Next, you create the persons you want to identify inside the persons group.
  • After that, you start adding faces for each person by uploading pictures. For every person, you can upload up to 248 faces. Make sure that the pictures are front-facing, not profile angles. As with anything in machine learning, the more faces you add, the more accurate the model will be.
  • Then, you just need to train the model, which in this case means letting the service process the images to start recognizing the faces.
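The steps above map to a handful of REST endpoints. Here's a minimal sketch of the URL structure involved; the region ("westus") and the IDs are placeholder assumptions, nothing is sent to the service, and real calls would also need your subscription key header.

```python
# Sketch of the persons-group workflow as Face API REST paths.
# Region and IDs are placeholders; no request is actually sent here.
BASE = "https://westus.api.cognitive.microsoft.com/face/v1.0"

def persons_group_url(group_id):
    """PUT here to create the persons group (the 'faces database')."""
    return f"{BASE}/persongroups/{group_id}"

def persons_url(group_id):
    """POST here to create a person inside the group."""
    return f"{BASE}/persongroups/{group_id}/persons"

def faces_url(group_id, person_id):
    """POST face images here, up to 248 faces per person."""
    return f"{BASE}/persongroups/{group_id}/persons/{person_id}/persistedFaces"

def train_url(group_id):
    """POST here to train the model on the uploaded faces."""
    return f"{BASE}/persongroups/{group_id}/train"

urls = [persons_group_url("staff"),
        persons_url("staff"),
        faces_url("staff", "person-123"),
        train_url("staff")]
```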

Once this process is completed, you can call the API for several operations, such as authenticating users, finding them in pictures, grouping together pictures that belong to the same person, and so on. As with the Computer Vision API, it can accept either images or URLs, and the JSON results also give a confidence level for each face detected.

The focus of both of these APIs, though, is on images. For video files, we need to use the Video Indexer instead.

The Video Indexer API applies several of the same functions of the Computer Vision and Face APIs, but works on video files instead, creating powerful indexes for them. As an example, it can identify people and highlight the parts of the video where the person appears.

It also inherits several of the capabilities from the language category, which you'll see later on, such as creating subtitles and translating them to different languages, as well as detecting sentiment. It can also generate keywords that allow you to jump directly to a certain topic.

All this enrichment can be used to give your users a better experience and even help you better monetize your video channels by placing ads that are more relevant to the video being played.

One important note about this API, though: most Cognitive Services work as synchronous operations, which means that you immediately get the results back. Because video files take much longer to process, this API works asynchronously. You send the video to the service and then wait for the processing to be completed.
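The asynchronous pattern usually means submitting the video and then polling for its processing state. Here's a generic sketch of that loop; the state names and the stand-in status function are assumptions for illustration, not the real Video Indexer API.

```python
import time

def wait_for_completion(check_status, interval=0.0, max_polls=100):
    """Poll check_status() until it reports a state other than 'Processing'."""
    for _ in range(max_polls):
        state = check_status()
        if state != "Processing":
            return state
        time.sleep(interval)  # back off between polls
    raise TimeoutError("processing did not finish in time")

# Simulated service that finishes after two polls.
states = iter(["Processing", "Processing", "Processed"])
result = wait_for_completion(lambda: next(states))
```

In production you'd use a longer polling interval, or a callback URL if the service supports one, rather than tight polling.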

The last service in the Vision category is Custom Vision. This service allows you to interact with the Vision model and train it to your purposes instead of just consuming the features that are available out of the box. 

For example, the Computer Vision API can probably detect birds in a picture, but if you want to train the model to spot a cockatoo or to differentiate between a cockatoo and a parrot, you need to use Custom Vision instead.

The beauty of Custom Vision, or most of the custom Cognitive Services for that matter, is that you need very little knowledge of machine learning beyond concepts such as training and evaluating a model.

The process is roughly as follows:

  • You go to the Custom Vision portal and create a new project. You can select either Object Detection, to find a cockatoo, or Classification, to differentiate between cockatoos and parrots. You can also use the APIs to perform the same process through code if you want.
  • Then, you need to upload the pictures, tagging them appropriately. You can start with as few as 10 pictures, but Microsoft recommends at least 50 images per tag, for example, 50 parrots. The more diverse the pictures are in terms of background, size, illumination, angle, style, et cetera, and the more balanced the number of pictures per tag, the higher the accuracy of the model. You might even want to add birds that are neither parrots nor cockatoos and tag them as negative, which improves the classification.
  • After that, you just need to train the model and publish it as a web service. From that moment on, you can call it in exactly the same way as the other services, although you can still interact with the model through the portal.
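Once published, a classification call returns a probability per tag, and your code picks the winner. Here's a minimal sketch of that last step; the response shape, tag names, and threshold are assumptions for illustration.

```python
# Hypothetical Custom Vision classification response; the JSON shape
# and tag names are assumptions for illustration.
prediction_response = {
    "predictions": [
        {"tagName": "cockatoo", "probability": 0.91},
        {"tagName": "parrot", "probability": 0.08},
        {"tagName": "negative", "probability": 0.01},
    ]
}

def top_prediction(response, threshold=0.5):
    """Return the most probable tag, or None if nothing clears the threshold."""
    best = max(response["predictions"], key=lambda p: p["probability"])
    return best["tagName"] if best["probability"] >= threshold else None

label = top_prediction(prediction_response)
```

Returning None below a threshold is one way to handle the "neither a cockatoo nor a parrot" case that the negative tag helps the model learn.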

To make sure your model is relevant and improves over time, it's important to periodically retrain it with new data. Using the Custom Vision Portal, you can interact with the predictions made and correct them as necessary, which improves future predictions.

Let's now switch to a demo and see these technologies in action!

About the Author

Emilio Melo has been involved in IT projects in over 15 countries, with roles ranging across support, consultancy, teaching, project and department management, and sales—mostly focused on Microsoft software. After 15 years of on-premises experience in infrastructure, data, and collaboration, he became fascinated by Cloud technologies and the incredible transformation potential they bring. His passion outside work is to travel and discover the wonderful things this world has to offer.