Evaluating Models


Building a Recommendation Engine on Azure

Building a Recommendation Engine on Azure is a course designed for teams interested in using artificial intelligence to add product recommendations to their websites.

A product recommendation engine is a valuable feature that helps drive sales on e-commerce sites. In this course, you will learn the essentials of building, deploying, and testing a recommendation engine on Microsoft Azure. You will also build skills to fine-tune a recommendation model and evaluate its effectiveness.

This course is made up of five lectures covering deploying, testing, configuring, evaluating models, and making API requests. This is an intermediate-level course, and prior Azure and API experience is recommended.

Learning Objectives

  • Deploy a recommendation engine on Microsoft Azure
  • Test and evaluate different recommendation models
  • Make API calls to the Microsoft Product Recommendations Solution

Intended Audience

  • People who are interested in artificial intelligence services on Microsoft Azure, especially recommendation engines

Prerequisites

  • Experience using Microsoft Azure
  • Experience using APIs


The GitHub repository for this course is at https://github.com/cloudacademy/azure-recommendation-engine.



- [Instructor] So there are a lot of parameters you can set, but how can you figure out if they'll make your recommendations better or worse? That's where the evaluation capability comes in. As with other machine learning methods, the way it works is you divide your data into two files, training data and evaluation data. After the model gets trained on the training data, it runs the evaluation data through the model to see if people would have actually purchased the items recommended by the model.

I already split Microsoft's sample data into two files for you. The training file is called demoUsage.csv and it has about three-quarters of the data in it. The other quarter is in the evaluation file, which is called demoEval.csv. This is a fairly typical ratio between training and evaluation data. Note that I randomized the order of the records before splitting them. Otherwise, the model would likely not perform as well, due to a problem called overfitting.

Let's go back and train a new model. This time, fill in the evaluation file. Leave everything else at the defaults and click Train. Alright, it's done, so click on it. Then go to the Evaluation tab.
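The shuffle-then-split step mentioned above could look something like the following. This is only a minimal sketch, not the script used to produce demoUsage.csv and demoEval.csv: the function name `split_usage`, the fixed seed, and the two-column sample rows are all assumptions made for illustration.

```python
import random

def split_usage(rows, eval_fraction=0.25, seed=42):
    """Shuffle usage rows, then split them into (training, evaluation) lists."""
    shuffled = list(rows)                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # randomize order before splitting
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]

# Hypothetical usage records: (user ID, item ID)
rows = [("user%d" % (i % 10), "item%d" % i) for i in range(100)]
train, evaluation = split_usage(rows)
print(len(train), len(evaluation))  # 75 25
```

Each row ends up in exactly one of the two lists, so the evaluation data is never seen during training.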

The first section shows the diversity of recommendations. That is, it shows how many different items were recommended and how often. Here it says that there were 101 unique items in the training set. Of those, it recommended 73. This graph shows the number of recommendations by popularity level. Items that are purchased rarely are in a lower percentile and the most popular items are in the 99th percentile.
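As a rough illustration of the first diversity figure, the share of training-set items that were recommended at least once can be computed as below. The function name `catalog_coverage` and the item IDs are hypothetical; only the counts 101 and 73 come from the evaluation shown in the video.

```python
def catalog_coverage(training_items, recommended_items):
    """Fraction of distinct training-set items that appear in at least one recommendation."""
    catalog = set(training_items)
    if not catalog:
        return 0.0
    return len(catalog & set(recommended_items)) / len(catalog)

# Hypothetical IDs sized to match the figures quoted above.
training_items = [f"item{i}" for i in range(101)]
recommended_items = [f"item{i}" for i in range(73)] * 2  # repeats do not count twice
coverage = catalog_coverage(training_items, recommended_items)
print(f"{coverage:.1%}")  # 72.3%
```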

So this bar shows what percent of the recommended items were in the 99th percentile. That is, in the top one percent of most purchased items. Remember the Similarity Function parameter that we set to Jaccard? If we had set it to Cooccurrence instead, then it would have recommended more of the popular items. I tried it and it recommended the most popular items about 18% of the time versus 17% with Jaccard, so it didn't make a huge difference in this case, but sometimes it does.

I also tried it with Lift and it dropped to about 11%. That's already quite a change, but look what happened with the other two buckets. The 90 to 99th percentile bucket shows recommendations for items that are very popular, but that aren't quite the most popular.

The number of recommendations for those items dropped dramatically and there were far more recommendations for less popular items. So the option you choose for the Similarity Function can have a profound effect on the model's recommendations. This is all quite interesting, but the part you'll probably care about the most is whether or not customers would have purchased the items that the model recommended. 
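To see why the three options can behave so differently, here is a small sketch using common textbook-style definitions of the measures. These formulas are an assumption: the course does not spell out how the solution computes them, and the count-based Lift below is unnormalized. The function name `similarity_tables` and the sample baskets are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def similarity_tables(baskets):
    """Score every item pair with three measures (assumed definitions, see text)."""
    item_count = defaultdict(int)   # number of baskets containing each item
    pair_count = defaultdict(int)   # number of baskets containing both items
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_count[item] += 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] += 1

    scores = {}
    for (a, b), both in pair_count.items():
        scores[(a, b)] = {
            "cooccurrence": both,
            "jaccard": both / (item_count[a] + item_count[b] - both),
            "lift": both / (item_count[a] * item_count[b]),
        }
    return scores

# Hypothetical purchase baskets: one very popular item, one common, one niche.
baskets = [
    {"popular", "niche"},
    {"popular", "common"},
    {"popular", "common"},
    {"popular"},
]
scores = similarity_tables(baskets)
for pair, values in sorted(scores.items()):
    print(pair, values)
```

In this toy data, Cooccurrence and Jaccard both rank the ("common", "popular") pair above ("niche", "popular"), while Lift scores the two pairs equally because it divides by the popularity of both items. That is consistent with Lift shifting recommendations toward less popular items.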

That's what the next section shows. It shows how successful the recommendations would be, depending on the number of recommended items. For example, if you were to show only one recommended item to a customer, then they would purchase that item about 18% of the time, but if you recommended five items, then they would purchase at least one of those five items about 37% of the time. As you'd expect, the more items that are recommended, the more likely the customer is to purchase at least one of them.

Let's see which of the three models I tested gives the best recommendations. Well, there's no clear answer. It depends on what you're looking for. The Cooccurrence model does the best with one or two recommendations, whereas the Jaccard model does the best with four or five recommendations, although the differences are pretty small in both cases. They both did far better than the Lift model. That's only for this particular dataset though. It could very well be different for your organization's products. There are also lots of other parameters you could tweak to see what effect they would have on the success rates. You could also add something to the data to change the way the evaluation works.
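The success rate discussed above, the share of customers who would have bought at least one of the top k recommended items, can be sketched as follows. This is an illustrative calculation under assumed data structures, not the solution's own evaluation code; `hit_rate_at_k` and the sample users and items are hypothetical.

```python
def hit_rate_at_k(recommendations, purchases, k):
    """Share of evaluated users who bought at least one of their top-k recommendations.

    recommendations: user -> ranked list of recommended item IDs
    purchases:       user -> set of item IDs bought in the evaluation data
    """
    users = [u for u in purchases if u in recommendations]
    if not users:
        return 0.0
    hits = sum(1 for u in users if set(recommendations[u][:k]) & purchases[u])
    return hits / len(users)

# Hypothetical model output and held-out purchases.
recommendations = {
    "u1": ["a", "b", "c"],
    "u2": ["d", "a", "b"],
    "u3": ["c", "d", "e"],
    "u4": ["e", "c", "a"],
}
purchases = {"u1": {"a"}, "u2": {"a"}, "u3": {"z"}, "u4": {"a"}}

for k in (1, 2, 3):
    print(k, hit_rate_at_k(recommendations, purchases, k))
# 1 0.25
# 2 0.5
# 3 0.75
```

Because the top k items always include the top k-1, this rate can only stay the same or rise as k grows.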

Recall that in the schema for the usage data, there's an optional field for the transaction type. This lets you tell the model about more than just purchases. For example, you can tell it when a customer clicked on a recommended item or when they added an item to their shopping cart. Each of these types of actions has a different weight, with a click being worth only one point and a purchase being worth four.
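One way to picture the weighting is to add up points per customer and item. The sketch below uses only the two weights stated above (click = 1, purchase = 4). The event-type strings, the `affinity_scores` name, and the choice to sum the weights are assumptions for illustration; the solution may label and combine events differently.

```python
from collections import defaultdict

# Only the two weights mentioned in the lecture; the labels are assumed.
EVENT_WEIGHTS = {"Click": 1, "Purchase": 4}

def affinity_scores(events):
    """Sum event weights per (user, item) pair (summing is an assumption)."""
    scores = defaultdict(int)
    for user, item, event_type in events:
        scores[(user, item)] += EVENT_WEIGHTS[event_type]
    return dict(scores)

# Hypothetical usage events: (user ID, item ID, event type)
events = [
    ("u1", "itemA", "Click"),
    ("u1", "itemA", "Purchase"),
    ("u2", "itemA", "Click"),
    ("u2", "itemB", "Click"),
    ("u2", "itemB", "Click"),
]
print(affinity_scores(events))
# {('u1', 'itemA'): 5, ('u2', 'itemA'): 1, ('u2', 'itemB'): 2}
```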

This can improve your recommendations even more, but you might have noticed a potential problem. You need to have a recommendation system in place already so you can track how your customers interact with your system's recommendations. What if you don't have one? Well, you can just start off by implementing a simpler model based only on historical purchases, and then start gathering more data that you can feed into a more sophisticated model later.

And that's it for evaluating models.

About the Author

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).