How we use AWS for Machine Learning and Data Collection

Author: Alex Casalboni, Roberto Turrin, Luca Baroffio

May 5, 2016

What You’re Going to Learn From This On-Demand Webinar

Follow along with Alex Casalboni, Roberto Turrin and Luca Baroffio, a dedicated team inside our Engineering group at Cloud Academy, and learn how they use AWS to manage daily challenges and build a robust machine learning system. Here are the most important skills and knowledge we want you to gain from this webinar:

How to deploy machine learning models in the Cloud.
Tips on serverless computing on AWS.
How machine learning can improve your learning experience.

Introduction: Meet the Team

Hi everyone, I am Alex from Cloud Academy. I’m here with my colleagues Luca and Roberto to share with you our ideas and experience about machine learning technologies – especially how to deploy them in the cloud, specifically on Amazon Web Services (AWS).

So just a very quick introduction about us: we each have a computer science background. I am personally more experienced in web technologies and software engineering, while Roberto and Luca recently completed their PhDs. We are now part of the internal team at Cloud Academy that works with machine learning and AI to improve the user experience and to make our application smarter.

Now You’ve Met Us – Let’s Meet You!

I would bet that a considerable part of you are data scientists as well, but in order to improve our presentation today, I would like you to answer just a few questions, about what you do, about your job, about how you use machine learning in your team or in your company. I will open a very quick poll, and the first question will be “What is your job role currently?”, so we’d like to know if you’re a data scientist, an engineer, a DevOps, or a manager, let’s see, data is coming.

A lot of engineers today I’m happy to see that. Okay, great. I will close it in just a few seconds. Okay, we are good. Another super quick question about your experience with machine learning. We would like to know if you’ve ever used machine learning for your products, for your projects, if you have used it on AWS. So if you have never used it, if you’ve used it but only for prototyping or experimenting, you have deployed something to production, we’re just curious. Okay, data is coming, thank you very much. I see that many of you have never used machine learning – at least on AWS – and I’m glad to hear that, so we have a lot to talk about.

Awesome. Let’s close this question too, just one last question. I’m assuming you work in some company or research team and maybe you have a data science team and I would like to know how large is your team at the moment. Many of you don’t even have one. Okay, data is coming, thank you again. So many of you don’t have a team and let’s get started, for real this time.

What We’re Going to Cover Today

As I mentioned, we would like to share with you our ideas about machine learning because that’s what we are doing at Cloud Academy. And as an outline, we are starting with just a few concepts about machine learning, for the ones of you that are not data scientists, we’ll go through just a few basic concepts and what machine learning can do for you. And then we’ll dive deeper into how you get your models and deploy them into the cloud by using Amazon Machine Learning. Okay, so I would like to introduce you to Roberto, my colleague and he’ll talk to you about machine learning technologies.

An Introduction to Machine Learning from Roberto Turrin, Senior Data Scientist at Cloud Academy

[Roberto] Thank you, Alex, and hi to everybody. I would like to give you a very quick introduction to machine learning, especially for those of you who are not used to working in the data science field.

What is machine learning? One of the main definitions, from back in 1959, is “the capability of a computer to learn from data without being explicitly programmed.” Actually, the definition I prefer is the one that states that machine learning supports decision problems that can be modeled from data. According to this definition, we can see machine learning as an alternative to experience for making the right decisions.

Say you have a problem and you want to make a decision. For example, you have a new product and you have to decide which of your customers to email to advertise it. You can either use your experience and say, “Let’s send an email to all women aged over 40 in my customer base,” or you can use machine learning, which starts from data, derives new information, and works out which users it is best to email.

The Anatomy of a Machine Learning Pipeline

All machine learning pipelines start from data and extract information. Actually, the main step in this pipeline, in my opinion, is the first one, which is feature extraction. It means that from the data – for example, a soundtrack – you extract the correct features that can be used to model your data. So you extract, for example, the enrichment of your soundtrack. From a text or document you extract the words of this document. And this is the most important step because without these features you cannot extract any information at all, regardless of which model you are going to use.
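As a minimal sketch of the text case described above, extracting the words of a document as features could look like the following. The function name `extract_word_features` and the sample sentence are assumptions made for illustration; a real pipeline would likely use a library vectorizer instead.

```python
import re
from collections import Counter


def extract_word_features(text):
    """Turn raw text into a bag of words: lowercase word -> number of occurrences."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Hypothetical input sentence, used only to show the shape of the output.
features = extract_word_features("I really love this presentation, really!")
print(features["really"])
print(features["love"])
```

The resulting counts are the kind of numeric representation a model can be trained on; the raw string itself is not.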

Once features have been extracted, you can train a machine learning model, and these phases are usually performed in a batch stage. Once you have the model, you can use it in real time to extract the right information. Let me spend a few minutes giving you a high-level taxonomy of machine learning, starting from supervised learning, which applies to all problems where you have labeled data and you want to predict an attribute of this data. Suppose you have these six users, four green and two red, and you know some of their attributes, for example their age. The machine learning algorithm will learn from this data and will try to predict the class of the unknown user, the one with the question mark. So, given for example the age of this user, it will predict which class the user belongs to. The algorithm that I just described is called a classification algorithm. In the same supervised learning family we find the regression algorithms, which, instead of trying to predict the class, red or green, try to predict the numerical value of an attribute, for example, in this case, the age of the user.

On the other hand, we find the unsupervised learning algorithms, which do not use labeled data but just try to find hidden patterns within your attributes. For example, you have these six users, and the algorithm will find out that four users belong to the same group because they have similar properties and the other two users belong to another group. This family is called clustering, and the goal of clustering is to discover groups within your data. Another technique in the unsupervised learning family is rule extraction, where you want to find out that A and B give C. For example, you know that your user is 20 years old and 1.70 m tall, and you can predict that the user is male.

I will conclude this part by giving you some examples of problems that you can solve using the four families of algorithms that I just mentioned.
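To make the six-user example concrete, here is a small sketch of a classifier that predicts the class of an unknown user from age alone. It uses a nearest-neighbor vote, which is just one possible classification algorithm, chosen here for brevity; the ages are invented for illustration and do not come from the slides.

```python
def predict_class(labeled_users, age, k=3):
    """Classify by majority vote among the k labeled users closest in age."""
    nearest = sorted(labeled_users, key=lambda user: abs(user[0] - age))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


# Hypothetical (age, class) pairs: four green users and two red users.
users = [(22, "green"), (25, "green"), (27, "green"), (31, "green"),
         (52, "red"), (58, "red")]

print(predict_class(users, 55))
print(predict_class(users, 24))
```

A regression variant would return a number instead of a label, for example the average of a numeric attribute over the same nearest neighbors.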

Starting from classification, a typical example is fraud detection: you want to understand whether a transaction, for example one made by a customer on your e-commerce site, is fraudulent or legitimate. Or you can try to classify documents: you have different topics and you want to decide which topic each document belongs to. Or you have an email and you want to find out whether it is spam or not. These are all cases of classification problems.

Regression, for example, you want to predict the price of a stock over time, or you want to predict the temperature of any instrument you are going to use or whatever other problem you have with a numerical goal.

Clustering is usually applied to segment your customer base. For example, you want to automatically identify groups of users that behave in the same way and then email the users in the groups discovered by the clustering algorithm.
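A rough sketch of the clustering idea, assuming a single numeric attribute and exactly two groups: the function below is a tiny k-means with k fixed to 2. The name `two_means` and the sample ages are illustrative assumptions, not something shown in the webinar.

```python
def two_means(values, iterations=10):
    """Split 1-D values into two groups with a tiny k-means (k = 2)."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for value in values:
            # Assign each value to the closest center.
            index = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            groups[index].append(value)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Hypothetical ages of six users; no labels are provided.
ages = [22, 25, 27, 31, 52, 58]
print(two_means(ages))
```

No class labels are passed in: the two groups emerge only from how close the values are to each other, which is what distinguishes this from the supervised case.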

Finally, rule extraction is typically used in finding out that for example, when the users buy product A and product B, they usually also buy product C. And so you can estimate the purchase likelihood of the users.
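One simple way to quantify a rule such as “A and B, then C” is its confidence: among the baskets that contain A and B, the fraction that also contain C. The sketch below assumes purchases are available as sets of product names; the data is made up for illustration.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Estimate P(consequent | antecedent) from a list of baskets (sets of items)."""
    antecedent = set(antecedent)
    matching = [basket for basket in baskets if antecedent <= basket]
    if not matching:
        return 0.0
    return sum(1 for basket in matching if consequent in basket) / len(matching)


# Hypothetical purchase history: each set is one customer's basket.
baskets = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B"},
]
print(rule_confidence(baskets, {"A", "B"}, "C"))
```

Here three baskets contain both A and B and two of them also contain C, so the estimated purchase likelihood of C is two thirds.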

This concludes my introduction to machine learning. I will let Luca continue by presenting the challenges in the data used in these problems.

How Big Data and Machine Learning Work Together: Big Opportunities and Big Challenges

[Luca] Thank you, Roberto. We talked about machine learning algorithms and how they can be divided into two main categories, namely supervised learning and unsupervised learning. But we are still missing a key ingredient: data. Even if you have the best machine learning algorithm, without the right data and the right amount of data you can’t really train it and you can’t extract anything from it.

In the last few years, we have been using a lot of buzzwords like Big Data, Deep Learning, or even Internet of Things, and all of these buzzwords are related and based on one big concept: data. Okay, so let’s see in more detail how data is changing our world. Basically, with Big Data we identify very large data sets that can’t be analyzed or sorted with traditional algorithms and methods. Let’s try to find out the reasons behind Big Data. First of all, let’s think about the cost of data storage. In the 1980s, 10 megabytes of storage could have cost up to a few thousand dollars, which sounds crazy, even funny, if you think about it now, because today we can easily store gigabytes or even terabytes of data. And we don’t even know how big data is. We have just some estimates, and they tell us that a few thousand billion gigabytes of data are generated each year. This is amazing, and it is enabling a whole new kind of application.

Another big revolution that has happened in the last few years is a paradigm shift in content creation: a decade ago, for instance, you had just a few content creators on the web, but now the situation is very different because basically everyone is a content creator. With platforms like YouTube, Facebook, and Twitter, there is an amount of data that is beyond our imagination.

And of course, Big Data brings bigger opportunities but also big challenges. The first one is that you have so much data that it is impossible to inspect it manually. You can’t just draw a chart, for instance, or construct a few KPIs and get a hint of what’s going on. You need more complex algorithms and methods to analyze your data. The second one is even bigger and more philosophical, I would say, because it’s a real paradigm shift. In the past, decisions were driven by experts and by intuition. Now you have data-driven decisions, and sometimes these data-driven decisions are counterintuitive. But you have to accept them and let data drive change and innovation.

And the third one is about algorithms: you have a lot of computational power in your hands. Just think that your smartphone is way more powerful than the computer that helped man reach the moon. But you have to make good use of these resources. For instance, with GPUs you have to use distributed and parallel computing algorithms to make wise use of resources.

And the last one is the curse of dimensionality; let me explain it with an example. Take a look at the chart on the right. There are two variables: the Internet Explorer market share and the murder rate in the U.S. The two variables are clearly correlated, and if you have used Internet Explorer in the past, you might think that, well, sometimes it does make you want to commit a murder. But this is a clear example of a spurious relation. It happens by chance; there is no reason behind it. The more data and the more data sources you have, the more likely you are to encounter this kind of situation, and you have to pay attention to it.

Now we have all the ingredients. We have the data on the one hand and the algorithms on the other, so you can just create a data set and feed it to the machine learning algorithms to obtain very good predictions, and that’s it, right? You can just take your machine learning model and deploy it to the production environment and you’re done, right? Well, not so fast. Deploying a machine learning model into a production environment is not so easy, and Alex will explain why and how to tackle this kind of issue.
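The effect is easy to reproduce: any two series that happen to trend in the same direction will show a high correlation coefficient, whether or not they have anything to do with each other. The sketch below computes the Pearson correlation by hand; the two series are invented numbers, not the figures from the chart.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up numbers: two unrelated quantities that both decline over six years.
browser_share = [70.0, 62.0, 55.0, 47.0, 38.0, 30.0]
crime_rate = [5.8, 5.6, 5.3, 5.1, 4.9, 4.6]

r = pearson(browser_share, crime_rate)
print(round(r, 3))
```

A coefficient close to 1 here says only that both series go down together; it says nothing about one causing the other.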

How to Deploy a Machine Learning Model in the Cloud

[Alex] Let’s keep going and see where the problem is, where the challenge is. You don’t simply deploy a machine learning model into production. I would like to show you what the challenge is and why the cloud can be a better solution than traditional ones.

So I think the first challenge is that most data scientists, or even developers and software engineers, start with prototyping. And a prototype is not really production-ready at all. Most of the time you train your model, you test it, it has great accuracy and great performance, but then you can’t just take it and deploy it somewhere, right? It’s more complicated than that. Another reason is that machine learning is not a simple computational problem. When I say you need elasticity, of course you need elasticity for everything, for your website, for every kind of application. But machine learning has different needs, also because GPUs aren’t available everywhere and every time you need them. And if your prototype is using a specific framework, maybe it’s not that easy to distribute or parallelize the computation. So elasticity is a little bit trickier than in other scenarios.

And another reason, in my opinion, is that you need a lot of features that you never want to implement yourself. For example, you don’t want to deploy one model to the cloud, you want to deploy N models, so you want to design and build a multi-model infrastructure and not only that, maybe you want to A/B test and optimize your models, so you need to find a way to keep them updated and A/B test them in real time and decide which one works better, at some point. And maybe you want to add a custom authentication layer, features can go on and on and on. And your prototype is not ready for that. So you will need to have an environment or a platform, or an ecosystem, that gives you all this stuff for free or at least ready to be used, right?

And the last point, I think, is that since your prototype is not ready, you need to give it to someone who will put it into production one day. You get far from your code, and this generates a lack of ownership. Why is that? Well, in a team there are different types of skills. You are a data scientist: you work with data, you do statistical analysis, you build your models. And then there are other very skilled people, like DevOps engineers or system administrators, who deal with operations, do code reviews, and design your software infrastructure. These are very different skill sets. And the more you depend on DevOps or operations in general, the more this lack of ownership increases.

So why can the cloud help you with this? Let’s see, for example, what Amazon Web Services can offer. On the left-hand side we can see a lot of services related to infrastructure management: you can launch your EC2 instances, you can put them behind an Elastic Load Balancer, you can design an Auto Scaling solution, and then you can deploy everything you want. You have an operating system, that’s your abstraction layer; you can do everything you want and deploy it. But then you need a lot of operations there. You could use containers, like Docker containers, or you could build your Elastic Beanstalk stack and just deploy it, but it’s still not a trivial solution. I have a machine learning prototype in my hands, so if I want to avoid operations, that’s not the best option. If you have an operations team, that’s great. But let’s see how we can reduce this lack of ownership.

On the right-hand side I have put some higher-level services, application layer services: Elastic MapReduce, Amazon Machine Learning, API Gateway, and AWS Lambda. These are all services that simplify your life on the application layer, because you don’t have to manage the service itself and you can just focus on your business logic. Today I would like to stress how AWS Lambda can help you in this particular machine learning scenario: I’m talking about serverless computing. It’s kind of a new thing; AWS Lambda was released almost two years ago, in 2014.

The Power of Serverless Computing

What is serverless computing? I’m not sure how many of you are familiar with the concept, but basically you design, develop, and deploy your Lambda functions, that’s what we call them. Your abstraction layer is no longer the hardware or the operating system; you just talk about computing units, single functions that execute their business logic and then die. Why is this great? Because I can focus on my code. There are almost no operations, and it’s a very developer-friendly environment: I can do stuff like versioning, I can alias my Lambda functions, and I can deploy them in just a few clicks, even in a browser. That’s interesting. You get scalability for free because AWS takes care of everything: you’re elastic and, of course, high availability is guaranteed. You can also change how you think about designing your framework or your application, because you can be event-driven. And most of all, if you are a manager, you will never pay for resources you don’t use. Never pay for idle: it’s like a mantra. You will not pay for a server if you don’t invoke the function. That’s a great point, I think.

How do we use Lambda for machine learning? Well, you can think of a machine learning model as a computing unit, like a function, and if you want to design a more complicated system you can build, for example, A/B testing by composing different Lambda functions together and orchestrating a complex system of functions that either call each other or interact with each other in some way.
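As a rough sketch of the composition idea, the router below assigns each user to one of two model variants and delegates the prediction to it. Everything here is an assumption made for illustration: the two “models” are trivial keyword rules, the event fields `user_id` and `text` are invented, and in a real deployment each variant could be its own Lambda function rather than a local Python function.

```python
import hashlib


def model_a(text):
    """Hypothetical model A: stand-in for a real trained model."""
    return {"model": "A", "label": "positive" if "love" in text.lower() else "negative"}


def model_b(text):
    """Hypothetical model B: the variant being compared against A."""
    return {"model": "B", "label": "negative" if "hate" in text.lower() else "positive"}


def choose_variant(user_id, share_b=0.5):
    """Deterministically assign a user to A or B so they always get the same model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "B" if bucket < share_b else "A"


def ab_handler(event, context=None):
    """Router function: picks a variant and delegates to the matching model."""
    variant = choose_variant(event["user_id"])
    model = model_b if variant == "B" else model_a
    return model(event["text"])


print(ab_handler({"user_id": "user-42", "text": "I really love this presentation"}))
```

Hashing the user identifier instead of drawing a random number keeps the assignment stable across invocations, which matters because each invocation is independent and keeps no memory of earlier ones.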

Just a little clarification: what is serverless? Why is serverless? Well, the point is of course there is a server. There must always be a server somewhere. The deal here is you don’t have to worry about it. I don’t want to worry about servers, I’m a data scientist, that’s the deal.

I didn’t mention Amazon API Gateway so far, but one of the main nice to have features I mentioned is that you want to expose your model and your Lambda function by means of a RESTful interface. You want to build an API and I think most of you are software engineers, I’m sure you understand what I’m saying, that’s just the best way to provide a service.

Interestingly, you also get for free a completely independent authentication layer, a global content delivery network, and caching, and you can stage, version, and even mock your API and your Lambda functions. And if that’s not enough, if at some point you get tired of Lambda or it doesn’t suit your needs anymore, maybe because you want to develop something more complicated like a real-time online learning solution, then what can you do? Well, you can decouple the logic, because you can bind your API Gateway resource to anything you want: to your own backend or to other external services. So you build the interface and then you connect the logic you want. Lambda here is a great solution to start building a machine learning model.

With this presentation we wanted to show you how easy this can be, so we developed a very quick example. We trained a model in Python, with the scikit-learn framework, using a sentiment analysis data set provided by Stanford University. Then I built a simple HTML page that calls an API Gateway endpoint, which is connected to a Lambda function. This Lambda function predicts whether your text is positive or negative. There are only two classes, so it’s a simple binary classifier. You can give it a try, and maybe I can show you something here. Let’s get out of the presentation.

I can show you the example, let’s say something. “I really love this presentation.” Well, this is positive. “I really hate it.” This is negative. Well, it works. And this solution that we developed is completely serverless. We didn’t have to deploy or maintain or connect to any server.

You can find the source code on GitHub; you can install the requirements and deploy everything into AWS. So there is a Lambda function: you can see that it’s about 20 or 30 lines of code. And there is also the training part, which is a few lines of code, complete with comments and documentation, and you can download the data set and everything. So just have fun with it. I can show you it’s real: it’s an API, and here is the Lambda function, sentiment analysis. Let’s see if someone’s calling it. Interesting. We have got monitoring for free: you can monitor your code. I’ll keep this online for a few hours, so just have fun with it. Let’s get back.

So what happened? We built a machine learning model. It was a prototype at the beginning, but then we made it scale out with a serverless solution just by uploading our model into AWS Lambda. But this is a very simple example.

Let’s talk about something more complicated. We use the very same solution at Cloud Academy, and here you can see a logical representation of our infrastructure. On the left you can see our clients: the website and the mobile applications. We deployed our models into Lambda functions, which are invoked by API Gateway. We also had the problem of keeping our models updated, and we didn’t want to redeploy the Lambda function each time because the code doesn’t change much. So we decided to put the model into a file, because that’s what you can do, at least with Python. We upload each model into an S3 bucket, and the Lambda function downloads the model; every so often it polls S3 to see if there is an update, and you can version files and do other things really easily. Then what do we do? Well, we collect a lot of data, as you can see at the bottom, and we store all this data in a relational database on Amazon RDS. Periodically, when we need it, we turn on an EC2 instance and rebuild our models. We update them, we upload them to S3, and the Lambda function will automatically pick up the new model at the next invocation. That’s pretty simple.
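The refresh logic can be sketched as a small cache that reloads the model only when the stored version changes. This is not the team’s actual code: the class name `ModelCache` is invented, the model is assumed to be serialized with `pickle`, and the bucket is replaced by an in-memory dictionary so the example runs without AWS access.

```python
import pickle


class ModelCache:
    """Keeps a model in memory and reloads it only when the stored version changes."""

    def __init__(self, get_version, get_bytes):
        self._get_version = get_version  # returns a version marker, e.g. an S3 ETag
        self._get_bytes = get_bytes      # returns the serialized model as bytes
        self._version = None
        self._model = None

    def get(self):
        current = self._get_version()
        if current != self._version:
            # First call or new version: fetch and deserialize the model again.
            self._model = pickle.loads(self._get_bytes())
            self._version = current
        return self._model


# In-memory stand-in for the bucket (assumption, for illustration only).
store = {"version": "v1", "body": pickle.dumps({"threshold": 0.5})}
cache = ModelCache(lambda: store["version"], lambda: store["body"])

print(cache.get())
store.update(version="v2", body=pickle.dumps({"threshold": 0.7}))
print(cache.get())
```

With boto3, `get_version` could wrap `head_object(Bucket=..., Key=...)["ETag"]` and `get_bytes` could wrap `get_object(Bucket=..., Key=...)["Body"].read()`. Creating the cache at module level lets a warm Lambda container reuse the loaded model between invocations. Only unpickle files from a bucket you control.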

Of course there are some limitations. For example, you can’t really easily develop and deploy real-time models, meaning that if you want to do online training and have a stream of events and use them to serve the next prediction that’s not really doable as it is right now with AWS Lambda, because basically your Lambda function is independent and immutable. It’s not easy to modify it until you retrain your model, reupload it and there you go.

Another strong limitation so far is the management of the deployment package that you need to upload in order to deploy your Lambda function. You have some limits on its size and on the operating system libraries you can use, so there are some hacks and workarounds, which I think you can find in the repository. For example, in our case we were using scikit-learn. Well, it turns out that scikit-learn uses a lot of operating system libraries based on C and Fortran code, and the problem is that you can’t install them. You can’t tell AWS Lambda, “Please install these packages on the machines you will use for my Lambda functions.” So a little hack we developed is to include the .so files in your Lambda deployment package and load them into memory dynamically at run time. It works. It’s a little detail that I hope will be fixed soon, and I’m really looking forward to simply defining your Python, JavaScript, or Java requirements and letting Lambda deal with them.
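One way such a hack could look, sketched with the standard `ctypes` module: load every shared object found in a folder shipped with the package before importing the library that depends on them. The folder name `lib` and the function name are assumptions, and the code in the repository may well differ.

```python
import ctypes
import glob
import os


def preload_shared_libraries(lib_dir):
    """Load every shared object in lib_dir into the process and return the loaded paths."""
    loaded = []
    for path in sorted(glob.glob(os.path.join(lib_dir, "*.so*"))):
        # RTLD_GLOBAL makes the symbols visible to libraries loaded afterwards.
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
        loaded.append(path)
    return loaded


# Hypothetical layout: a "lib" folder shipped inside the deployment package.
LIB_DIR = os.path.join(os.getcwd(), "lib")
if os.path.isdir(LIB_DIR):
    preload_shared_libraries(LIB_DIR)

# Only after this point would the handler import the library that needs them.
```

Alphabetical order is used here for simplicity; if the libraries depend on each other, the load order may need to be adjusted by hand.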

Another problem is that you can’t really execute forever. So currently there is a limit of 5 minutes of max execution so this is not really suitable for training. If you have to train a big model with a big data set, well five minutes execution is not a great idea. It’s not enough, probably. If your model is small you do not have a lot of data, it’s great: you can even do the training phase with AWS Lambda.

I haven’t mentioned the AWS machine learning solution so far, which is Amazon Machine Learning. I want to mention it now because I think it’s a great product. We gave it a try last year, and it has improved a lot since then. I think it was one of the first machine-learning-as-a-service solutions. It’s really great if you need classification models or regression models, and it also offers textual feature extraction. You can do a lot of stuff, but there are some limitations as well. If you are a data scientist you can’t do whatever you want; you have to find some workarounds. For example, you can’t build nonlinear models directly, because they basically use only logistic regression and other linear algorithms, so you need to find a way around that by creating new features, like quadratic features. And more advanced scenarios like recommendation systems, multimedia processing, or online training are not supported yet. So I’m looking forward to this, and I’m quite sure Amazon will move fast on this product because that’s where the market is going.

I would like to recap a little bit the takeaways you should take home after this presentation. Data is crucial in your company, in your research team, wherever you are: you must be able to make data-driven decisions. And especially if you do user-centered machine learning, you want to make products smarter. It’s not just data analysis; you want to build smarter applications, and machine learning is the way to go. That’s why you need, first of all, to maximize ownership: as I said, you need to remove every obstacle between the prototype and the production code, and that’s crucial. This also implies that we need to eliminate every trade-off between the scalability and availability you want to obtain and the features you need, provided, of course, that your models work well as a prototype; that’s a requirement for sure.
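The quadratic-feature workaround can be sketched in a few lines: each row is extended with all pairwise products of its values, so that a linear model trained on the expanded row can represent curved decision boundaries in the original space. The function name is an assumption, and this would run as a preprocessing step before the data is handed to the service.

```python
from itertools import combinations_with_replacement


def add_quadratic_features(row):
    """Return the original features followed by all pairwise products, squares included."""
    products = [a * b for a, b in combinations_with_replacement(row, 2)]
    return list(row) + products


# For features (x1, x2) the result is (x1, x2, x1*x1, x1*x2, x2*x2).
print(add_quadratic_features([2.0, 3.0]))
```

The number of added columns grows quadratically with the number of original features, so this is practical mainly for small feature sets.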

So our own solution here was serverless, it gives you great flexibility, and if you are a developer or data scientist I personally believe it’s the best way to go right now, because you can just stop worrying about operations. If you like operations please do it, but I don’t. And in the end, well, machine learning as a service will of course make your life easier, but if you’re a scientist and you need more control you still want to go for something more custom, like building your own machine learning model, deploying it into Lambda or even Docker or EC2, you will let us know what is your favorite solution.

Q&A

Thank you everyone for attending and now we can maybe go through a few questions.

“What does your data collection pipeline look like?” Well, that’s a good question; let’s go back to the infrastructure. At the moment the whole platform is on Amazon Web Services and, as you can see from the chart, the data ends up in RDS. We use Python internally, and I can tell you that we are also using another service that is itself based on AWS. It’s called Segment, you can find it at segment.com, and I think it’s a great product. Basically, you trigger your events and send your data to Segment, and they will, in a very transparent way, forward all your events to external services like Mixpanel or Google Analytics; you can connect anything to it. An interesting feature is that you can also connect webhooks to Segment. So that’s what we are doing: we use a webhook bound to a public API Gateway resource, which goes into a Lambda function, and we do some processing and validation there. Then, before writing into RDS, you want to be flexible enough to do maintenance, to shut down your database for a minute, and never lose any events, right? So we needed a buffer, and at the moment we are using Amazon SQS (Simple Queue Service). Again, data goes into Segment, API Gateway, Lambda, and then into an SQS queue. Then another process, a scalable fleet of servers or, in the future, even a Lambda function (we’re thinking about this), takes the data, does some more processing and transformation, and writes into RDS. From there you can build your models, do data analytics, and make data-driven decisions.

“Could you run Lambda functions if you have very large models?” This is another interesting question. If you have models at the gigabyte scale: no. One limit I didn’t mention in my slides is the limit on your deployment package size. If you ship your model within your deployment package, the model counts toward that size limit. But then you can say, “Well, let’s put the model into S3 and download it.” The problem is that there is no real installation procedure, so even if downloading a gigabyte from S3 is fairly fast, it happens at the first invocation, and you don’t want to make your first user wait a minute for the download. You also have the other limitation on max execution time. And I think, I need to check it and I will check it for you, that there is also a limit on the local storage of a Lambda function, so you can’t really download 30 gigabytes of model even if you have enough time. So that’s a significant limitation, I think. As long as your model is a few megabytes, maybe even up to 100 megabytes, you won’t have much trouble.

“Can we use AWS Lambda for making RPC calls? If so, how does data transfer happen? Does it need to go to S3?” Well, no. What you do basically is connect your Lambda function to an API Gateway resource, and you have a template where you can transform your HTTP POST body into the AWS Lambda input, so yes, you can pass input like that. RPC could be a scenario, sure: you have the function and you are invoking it, so whatever your remote procedure call is, you can do it with Lambda. And you don’t always need to go through S3, no. You can let your input go directly into AWS Lambda. I can show you a little practical example: if you go into the repository I linked, inside the Lambda folder you can find the Lambda function. In practice, you have this Lambda handler here and, this is Python, inside the event variable you have every input that’s coming from API Gateway in this case. So you can do anything you want with that.
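The buffering step can be sketched with in-memory stand-ins: a `queue.Queue` plays the role of the SQS queue and a list plays the role of the RDS table. The event fields `user_id` and `type`, the status codes, and the function names are assumptions for illustration, not the actual Cloud Academy schema.

```python
import json
import queue

buffer = queue.Queue()  # stand-in for the SQS queue
database = []           # stand-in for the RDS table


def webhook_handler(event, context=None):
    """Validate an incoming event and enqueue it; the database is never written here."""
    payload = event.get("body") or {}
    if "user_id" not in payload or "type" not in payload:
        return {"statusCode": 400, "body": "missing user_id or type"}
    buffer.put(json.dumps(payload))
    return {"statusCode": 202, "body": "queued"}


def drain_buffer(max_messages=10):
    """Worker step: read queued events, transform them, and store them."""
    written = 0
    while written < max_messages and not buffer.empty():
        record = json.loads(buffer.get())
        record["type"] = record["type"].lower()  # example transformation
        database.append(record)
        written += 1
    return written
```

Because the handler only writes to the queue, the worker can be stopped while the database is under maintenance and events simply accumulate until `drain_buffer` runs again.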
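A minimal sketch of such a handler is shown below. It is not the code from the repository: the scikit-learn model is replaced by a trivial word-count rule so the example is self-contained, and the `text` field is an assumed name for whatever the API Gateway mapping template passes through.

```python
POSITIVE_WORDS = {"love", "great", "good", "awesome"}
NEGATIVE_WORDS = {"hate", "bad", "awful", "terrible"}


def predict_sentiment(text):
    """Stand-in for a trained model: compares counts of positive and negative words."""
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    return "positive" if score >= 0 else "negative"


def lambda_handler(event, context):
    """Entry point: `event` holds the input mapped from the HTTP request body."""
    text = event.get("text", "")
    if not text:
        return {"error": "the 'text' field is required"}
    return {"text": text, "sentiment": predict_sentiment(text)}


print(lambda_handler({"text": "I really love this presentation"}, None))
print(lambda_handler({"text": "I really hate it"}, None))
```

The input travels inside `event` directly, so nothing has to be written to S3 first; the returned dictionary is what API Gateway would serialize back to the caller.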

“In your chart, where do you do the simple feature engineering stuff like feature pairing?” Well, that phase is actually an offline phase most of the time so in my chart this happens inside the EC2 spot instance on the right-hand side. What we do is we extract the feature from the raw data, reading from an RDS database and we manipulate it, you can apply some transformations, you can clean it, whatever you need to do. And then with that you train the model and you upload it to S3. So that’s where the feature extraction and training phases happen. That’s offline, so it doesn’t need to be on a Lambda function, it doesn’t need to have a RESTful interface. – And it is in a Lambda function for prediction. If you give some text as an input for a prediction, we extract features inside – Yes, of course, one little part of it if you have textual feature extraction will also need to be deployed into your Lambda function. Because of course your model is trained on that particular shape let’s say, of data, and the same shape must be provided to the model at run time.

“Does AWS Lambda support Go? Do you use Go internally?” Well, no, we do not use Go internally. Wait, I think we have something in Go, it’s not crucial, but not for machine learning at the moment. Can you use Go? Yes. Some people started hacking on this with AWS Lambda long ago. A nice feature is that you can run any executable binary file from either Node.js or Python, it doesn’t matter. So yeah, people started using Go: when you compile it, it’s an executable, and you can execute it. And since Lambda supports Node.js, Python, and Java, you can in practice also run anything that runs on the JVM, even Ruby. Go is not natively supported yet, so you need to do a little bit of hacking, but it works. I’m not sure how many people have put Go code inside AWS Lambda into production, but that would be interesting to know. We currently don’t do it at all. We use Python internally for almost everything.

Do we have any more questions? I think we are good. So thank you again, everyone, for attending. Just a little reminder, this webinar is recorded so you will receive a little follow-up email with the recording on YouTube and you’ll be able to watch it again. You can contact us at support@cloudacademy.com if you need any help. And well, you can also let us know what you think about machine learning on AWS. We are really happy to discuss with you, we have a great community at cloudacademy.com and you can ask questions and give us your feedback either by email or on the community. So again, thank you very much for attending. We are glad you are here today. Goodbye. – Thank you, bye bye. – Thank you.
