Module 1 – Python for Machine Learning

Python is one of the tools you can use to program machine learning. In this module, you’ll learn the basics of Python as it’s used for machine learning, how to use loops to compute total loss, regression and classification, and how to set up machine learning in Python.


In this session, we're going to review Python for machine learning. That means two things. First of all, a basic review of Python syntax: a refresher on the things that are going to be important for this course. We expect you to know Python already, or to have taken one of the other courses available on Python. The second goal of this session is the machine learning bit: putting the ideas we learned in the introductory sessions in a more Pythonic, more programmatic way. So we're taking that setup of machine learning and phrasing it as a program, and we're going to do both at the same time.
So let's look at the setup of machine learning and recall what we're talking about. Let's call this section "Machine learning setup", and increase the size a little, there we are. As I've said many times, when we say machine learning we tend to mean supervised learning, in the sense that we're doing some prediction, so this is the supervised machine learning setup.
What kinds of problems do we have in supervised learning? Well, if you remember, we've got regression, which is where our target variable Y, the thing we're trying to predict, is a real number. When I execute this little cell, you get this mathematical notation. To give you some background, this syntax is known as LaTeX, and the dollar signs indicate that we're using it. You can think of it as a kind of equation notation. I'm inclined to put this in, rather than leave it as a whiteboarding kind of thing, to give you a realistic picture of how to actually use this notation within a Jupyter notebook.
And also so that, when we compare the mathematics to the programming, there is actually some notation that you can see in this notebook. So we've got Y there being a real number. In general, the setup is that we have Y, which is the target, the thing we're trying to predict. I'll put a little note here: the prediction target, the variable we are trying to predict.
Then, as part of the general setup, we have X, which is the feature or features. A little x is a single feature; a capital X, by convention, is multiple features. You can think of a variable as a column, so one feature is one column and multiple features are multiple columns.
What else have we got in the general setup? Well, we've got the idea that X is, in truth, related to Y somehow. How is X related to Y? You've got Y, and that's going to be some function F applied to X. That tells you that you can calculate Y from X.
And then, of course, what we're trying to do in the learning problem is find some approximation to that relationship. So we're trying to find a Y hat. If I put a little hat here, Y hat is the guess, the estimate for Y. We're trying to find the relationship that allows us to make that estimate, so I put a hat on the F, meaning the function used to estimate, and then we're good to go. So you've got Y as the target, X as the feature, F as the relationship, Y hat as the estimate for Y, and F hat as the estimate for F. Right, so do I want to do something here now?
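As a rough summary, the general setup described above could be typed into a notebook Markdown cell along the following lines. The session only names these symbols aloud, so the exact LaTeX here is an assumption about what appears on screen.

```latex
$$ y \in \mathbb{R} \quad \text{(regression: the target is a real number)} $$
$$ y = f(x) \quad \text{(the true relationship between feature and target)} $$
$$ \hat{y} = \hat{f}(x) \quad \text{(the estimate for } y \text{, from the estimated relationship)} $$
```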
I think maybe I do. I just want to put some code against this before we talk about specific problems. I'll put a little cell here, run it for you, and then we can choose something. Let's start by defining the real relationship, F. So let's choose an example problem. We've had several now, so let's stick to one that's somewhat familiar. This is where our Y is going to be a film rating. Sometimes in mathematics we use the colon as a definitional thing, so I'll choose that as a convention. X will be a user's age, and we're going to suppose that there is some real relationship which, if we knew it, would let us calculate what that user would rate a film.
In this case, let's call it F rating. That's our real F. It takes our user's age and produces a rating out of 10. If a person's age goes between zero and 100, and a film rating goes between zero and 10, then at the very least we'll need to divide their age by 10, or multiply by 0.1, which gives us a range between zero and 10. Let's do something a bit more interesting and go for 0.08, and then add 0.5. So we imagine this is the real relationship.
Let's try out that relationship. If I ask F rating how much a newborn baby will rate this film, they only give it half a point out of 10. If I ask how much a person who's 80 years old would rate it, they only rate it at 6.9, so maybe it's quite a poor film. And let's try rating it for a 10-year-old.
They're going to give it a lower number, 1.3. Okay, so rather than have these across lots of different cells like this, what should I do here? Maybe if I put a little comma here, that would be one way of tidying it up. Let's have a look, what am I doing here? Missing a comma, good. That gives us some numbers. So could I put this in a dictionary? Maybe I could. I create a dictionary and say: they're 10 years old, and that's what they've rated; they're zero years old, and that's what they've rated; they're 80 years old, and that's what they've rated. If we output that, it gives us all the information on one line: how much a 10-year-old rates it, a newborn baby, an 80-year-old.
Of course, some of these are inputs that we will never see. You'll never see a newborn baby rating a film. But it tells us something about what this real relationship expects. So it's not a very good real relationship, because a real relationship would presumably give you zero for a newborn baby, although it's hard to even say what the truth should be for a newborn baby.
Right, so we've got a little function here. Using the def keyword, we've named the function. The function has an input x and, when it runs, it returns the result of this formula, which is this number multiplied by x, plus 0.5. So that's a little definition of a function. Then here we've got a data structure called a dictionary, which pairs up information, so you can think of each of these elements as a pair: you've got the key here and the value there.
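The function and dictionary just described might look something like the sketch below. The spelling `f_rating`, the variable name `ratings`, and the coefficient 0.08 are reconstructed from the spoken description and the quoted results (0.5, 6.9 and 1.3), so treat them as assumptions rather than the exact notebook code.

```python
def f_rating(x):
    """Assumed 'true' relationship: a user's age -> film rating out of 10."""
    return 0.08 * x + 0.5


# Each entry is a pair: the key tracks the age (x),
# the value tracks the true rating for that age (y).
ratings = {
    10: f_rating(10),
    0: f_rating(0),
    80: f_rating(80),
}

# Values are roughly 1.3, 0.5 and 6.9 (floating-point noise aside).
print(ratings)
```

Keeping the three calls in one dictionary puts every age next to its rating in a single output, instead of spreading them over several cells.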
As for the role the key is serving, well, we'll come to dictionaries in a bit more detail in just a second, but you can see that we're getting these pairs of values. The role of the key is to track the age that we're using for our calculations, and the role of the value is to track the Y. So this is a dictionary of X and Y: X and Y, X and Y, X and Y. This is our prediction target, Y. To be nice and clear about that: if I want to know someone's film rating for, say, a 21-year-old person, that would be the Y, the true rating for that person. Now, of course, in general we don't have this function. Maybe we never have it. I'm not sure if you ever have it, but it's certainly rare to have it.
What do we need now? We need the estimator. What are we doing when we solve the machine learning problem? We're coming up with some function that performs in roughly the same way, so we could call it F hat. F hat rating, the hat there meaning estimator. We could say F S or F predict or F something. I'm going to use the word hat here, only because the notation has a hat in it. I think there's some value in, first of all, learning the notation a bit, and also in having that correspondence between the terminology in the code and the terminology in the setup and the mathematics.
So, F hat rating: let's take the age, and let's say that somehow the machine tells us that a good estimate is going to be 0.07 times x plus 0.6. We can see that's probably going to be quite close. If I try F hat rating for a 21-year-old person, that's going to be 2.07. If I take away the real one, F rating for 21, we can see that we're out by about 0.1.
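A minimal sketch of the estimator and the comparison described above. The names `f_rating` and `f_hat_rating` are assumed spellings of the spoken names, and 0.07 and 0.6 are the values the session imagines the machine producing; nothing is actually fitted here.

```python
def f_rating(x):
    """Assumed true relationship (normally unknown): age -> rating out of 10."""
    return 0.08 * x + 0.5


def f_hat_rating(x):
    """Estimated relationship; coefficients are stated in the session, not learned."""
    return 0.07 * x + 0.6


age = 21
y = f_rating(age)          # true rating, about 2.18
y_hat = f_hat_rating(age)  # estimated rating, about 2.07

# Signed difference: negative because the estimate sits below the truth.
difference = y_hat - y
print(difference)  # about -0.11
```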
And if I didn't care about which way we were out, one way or the other, I could take the absolute value of that, and that would just ignore the sign. Another thing I can do there, of course, is square it; in Python, that's raising it to the power of two. That again would change the number, but it would mean that all negative numbers became positive. So that's just a bit of notation there. You can see it without abs if you want, so it gives you a sense that we're off from what the truth is, which is what we'd expect in general.
We're not quite finished yet, because what I've just done there is introduce the notion of loss. In the general setup there's still something missing, and that's how we're going to measure the distance between the truth, the data set that we suppose has come from some real environment, and our estimation function.
So what's that going to look like? I say loss of, comparing our prediction Y hat to the known Y. You might also just give it a little l for loss, depending on preference. So there's the loss, and then there's a total loss: a capital L, maybe, for total loss, now taking all of the Y hats and Ys, and that would just be the sum of the loss at every point. That's right for now.
Let me give you some terminology here. The loss is how wrong we are at each point, that is, how bad an estimated point is, and the keyword there is point: the loss is defined per point. There's one point, there's another point, there's another one, and this is how far out we are at each. And then there's the total loss.
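One way the two quantities just introduced might be written in the notebook, using a lowercase letter for the per-point loss and a capital L for the total. The choice of symbols is an assumption, and the summation index is left implicit, as it is at this stage of the session.

```latex
$$ \ell(\hat{y}, y) \quad \text{(loss: how wrong one estimated point is)} $$
$$ L(\hat{y}, y) = \sum \ell(\hat{y}, y) \quad \text{(total loss: summed over every point)} $$
```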
The total loss is how wrong our entire model is, in the sense that it's looking at every point: how wrong every point is. And remember, the goal of machine learning, or of supervised machine learning anyway, is to minimize this: to find some F hat, some prediction function, which minimizes the error that we get from all of the points.
So let's define loss. The notation just above is a perfectly reasonable definition, and in code we write it with def. Let's create the loss we're going to use for the rating problem. It takes two things, a Y hat and a Y, and then returns something. We'll do the square loss, why not? Y hat minus Y, squared, there you go. If I use this loss, I get how wrong each point is, so I'll put "point" in the note here.
So if I have my Y here, this is the Y for 21. Let me define a Y hat, and Y hat is F hat rating for 21. If I show you that, it's going to be a little bit less than the actual result we've got above: it's 2.07. You can see just visually that the difference is about 0.1, so the squared difference will be lower than that. If you do Y hat minus Y, we get some number.
Now there are a few things going on here, not just in the notation but in terms of how we're naming variables. So let's take a step back, and then we'll come to some problem setups in a little moment. I'll just put a little bar there. I'm putting 21 into F hat rating and naming the resulting variable Y hat. Compared to the notation up here, that's a little bit suspicious, because what Y and Y hat mean up there is the entire column, the whole set of data, whereas here each means a particular point. So this is one point and that's the whole column.
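The square loss for a single point, as described above, might be sketched like this. The names `loss_rating`, `f_rating` and `f_hat_rating` are assumed spellings, and `abs_loss_rating` is a hypothetical helper added only to compare with the earlier remark about taking the absolute value.

```python
def f_rating(x):
    """Assumed true relationship: age -> rating out of 10."""
    return 0.08 * x + 0.5


def f_hat_rating(x):
    """Estimated relationship, with the coefficients quoted in the session."""
    return 0.07 * x + 0.6


def loss_rating(y_hat, y):
    """Square loss for ONE point: how wrong a single estimate is."""
    return (y_hat - y) ** 2


def abs_loss_rating(y_hat, y):
    """Hypothetical alternative that also ignores the sign, for comparison."""
    return abs(y_hat - y)


y = f_rating(21)          # the true rating for a 21-year-old
y_hat = f_hat_rating(21)  # the estimated rating for a 21-year-old

print(loss_rating(y_hat, y))      # roughly 0.0121
print(abs_loss_rating(y_hat, y))  # roughly 0.11
```

Both versions turn a negative difference into a positive number; squaring also shrinks differences smaller than one, which is why the square loss here is lower than the raw difference.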
In mathematical notation, if we were actually talking about points, we would put a little index on these Ys. Let's do that, because I think there is some value in seeing it. The loss isn't actually defined for the whole column Y; it's defined for each point. In machine learning we sometimes make the index of the row a superscript, which is a little odd. The reason is that the subscript is often used to select the column. So X nought would be the first column, X one the second column, X two the third column. Then what looks like a power is actually just a superscript, and that selects the row. So the loss is actually per row. This superscript here is the row. And when we leave off the index, it just means the whole column, that is, every row.
So here, this is a particular row, so this is Y nought. It's a particular Y. And Y hat here, well, we could label it by the age, 21, or we could just say it's for the first point; it doesn't really matter. This is our first point of data, and then this will be another point. It's a point, not a column, so it's one value, not many. And again, we could either leave these named Y hat and Y, or rename them to show that each is a point. If I put in this point and that point, then we get the value coming out.
Another subtlety is that the parameter names, Y hat and Y, are defined for this function. And then we're using the same kind of terminology for the points as for the parameters.
So Y hat zero becomes the first parameter and Y zero becomes the second parameter. That's just a convention in programming that I'll highlight as part of the refresher: oftentimes when we define functions like this, the parameter names will be the same as, or very similar to, the variables that we put into the functions to perform the calculations. But they're actually distinct. So this is a piece of data, 2.07, and this is a parameter that will be 2.07 here and could take lots of other values when we run the function again.
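To tie the indexing and naming points together, here is a hedged sketch that names one point with an index, passes it to the loss function, and then uses a loop to add the per-point losses into a total loss. The ages, the variable names and the helper `total_loss` are illustrative assumptions, not the session's own code.

```python
def f_rating(x):
    """Assumed true relationship: age -> rating out of 10."""
    return 0.08 * x + 0.5


def f_hat_rating(x):
    """Estimated relationship, with the coefficients quoted in the session."""
    return 0.07 * x + 0.6


def loss(y_hat, y):
    # y_hat and y are PARAMETERS: placeholders that take new values on every call.
    return (y_hat - y) ** 2


# y_0 and y_hat_0 are DATA: one row (a point), not the whole column.
y_0 = f_rating(21)
y_hat_0 = f_hat_rating(21)

# On this call, y_hat_0 fills the parameter y_hat and y_0 fills the parameter y.
first_point_loss = loss(y_hat_0, y_0)


def total_loss(ages):
    """Sum the per-point losses over every age in `ages`."""
    total = 0.0
    for age in ages:
        total += loss(f_hat_rating(age), f_rating(age))
    return total


print(first_point_loss)             # roughly 0.0121
print(total_loss([0, 10, 21, 80]))  # roughly 0.5121
```

The loop calls `loss` once per point, so the same two parameters receive different data on each pass, which is the distinction between parameter and variable drawn above.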


About the Author

Michael began programming as a young child, and after freelancing as a teenager, he joined and ran a web start-up during university. Around studying physics and after graduating, he worked as an IT contractor: first in telecoms in 2011 on a cloud digital transformation project; then variously as an interim CTO, Technical Project Manager, Technical Architect and Developer for agile start-ups and multinationals.

His academic work on Machine Learning and Quantum Computation furthered an interest he now pursues as QA's Principal Technologist for Machine Learning. Joining QA in 2015, he authors and teaches programmes on computer science, mathematics and artificial intelligence; and co-owns the data science curriculum at QA.