This course delves into the theory behind the topics of statistics, distributions, and standardization, all of which give you a solid foundation upon which the field of data science is built. We look at a variety of aspects of the field of statistics and how to use statistical tools to analyze and interpret data. You will then be walked through the NumPy library and how it can be used in a variety of real-world situations.
- Understand the different types of data and the relationships between them
- Understand the different ways of finding the average of a set of data
- Know which statistical tools are available for analyzing data
- Grasp the impact that the distributions of data have on data analysis
- Learn about standardization and its use cases
- Explore the NumPy library and its computational and statistical tools
This course is intended for IT professionals looking to learn more about data analytics and the NumPy library.
To get the most from this course, you should already have some basic statistics knowledge as well as some programming experience.
Hello, and welcome back. What we're going to have a look at now is the NumPy library. NumPy is a numerical computing library written in C and Fortran. It allows us to put data into arrays and calculate things using that data. It vectorizes most operations, it allows us to vectorize functions that we have written, and it is mathematically consistent in the way that it functions. First and foremost, what we want to do is import NumPy. We're going to start off by having a look at the features of NumPy, and then we'll do some exercises where we have to compute some things. The main workhorse of NumPy is something called an ndarray, standing for N-dimensional array. In reality, you can think of it as a matrix, if you like. And how are we going to make an array? Well, we'll start off by making a list of ages, for example. I'm going to copy and paste the values, just because I'll make mistakes otherwise, I'm sure. So, just a simple list of ages, and I can use this to create a one-dimensional array object. There are three ways that we can import things from NumPy. I can write import numpy as np, I can write import numpy on its own, and I could write from numpy import pi, something like that. All of these are different ways of importing from a library. The first statement uses something called aliasing, which says import NumPy and give it the nickname np, which is the standard alias. It's convention for people to nickname NumPy np. Then whenever I want to use something from NumPy, I write np dot, and if I hit Tab, I should get a list of everything that NumPy has to offer me. And it seems that NumPy has quite a lot of functions available. So this is the standard way of importing NumPy. If I wanted to reference NumPy directly by name every time I used it, then I can import it using the second syntax, which means every time I want to use something from NumPy, I have to say numpy dot.
And then the third way of importing things will import just pi from NumPy. So what is pi? Pi is literally the value of pi. One of the nice things about NumPy is that it has pi just there waiting for you to use. If I import like this, I don't have to reference the NumPy library at all when I use pi. I could equally have used np.pi or numpy.pi. We're going to stick to the aliased route, np, for importing. There are no inherent advantages to importing in different ways: you will always have to load the entire module in order to search for the various pieces that you want to use. It's merely a matter of style as to how you want to import your libraries. These libraries are just collections of Python code, containing classes, numbers, functions, loads of stuff that we want to use and that has been defined for us already. So, if I want to create an array, I'm going to call it ages. I'm going to create a numpy.array object, and I'm going to pass in my ages list. This is one way of creating an array. If I print out my ages array, I get this thing. If I print my ages list, then I get this. What's the difference? The difference here is the commas. If we print out a data structure and we get something that looks like a list without commas in it, that's an array. Arrays, when you print them out, structure themselves in a more mathematical way. In maths, we don't use commas to separate things out, because we just assume that we know what they are. We don't need syntax to separate our numbers, we just leave enough space. Whereas when we're looking at lists, we have commas between our values. Now, I am going to add an extra set of brackets around this ages list, so I'm ending up with a list of lists structure, and I'm going to talk about why I'm doing that in a second.
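The three import styles and the printed difference between a list and an array can be sketched as follows. The age values here are purely illustrative (the values pasted in the lesson are not shown in the transcript), and the variable names are assumptions.

```python
import numpy as np      # aliased import, the conventional route
import numpy            # plain import: everything is reached via numpy.<name>
from numpy import pi    # import a single name from the library

# All three routes reach the same constant.
same_pi = (np.pi == numpy.pi == pi)

# Illustrative values only; the lesson's actual ages are not in the transcript.
ages_list = [10, 20, 30, 40, 50, 60]
ages = np.array(ages_list)

print(ages_list)  # a list prints with commas between the values
print(ages)       # an array prints with spaces and no commas

# An extra set of brackets makes a list of lists, giving a 2-D array.
nested_ages = np.array([ages_list])
print(nested_ages)
```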
The only difference, when we have a look at the array that we've created, is that we've got an extra set of brackets around it. Why have I done this? It's because it makes more sense when I describe the shape, size, and data type of the thing. If I want to have a look at the shape, size, and data type of my array, I can access these as attributes of my object. So if I want to know what size my array is, I ask for ages.size. It is not a method, it is not a function, it is a value that this type of object has. It knows how big it is, it knows what shape it is, and it knows what data type it is. I run all of these, and then I get this information out. Now, if we construct our ages from a one-dimensional list instead, we get this out. We still have the same size, but because this is a one-dimensional NumPy array, we've only defined a single dimension for its shape. So it only knows about one direction. This is a vector. It doesn't have a second dimension. It's not six by one, it's just six. There are six things, that's all there is. So it just says, I have a shape of six and nothing else. Whereas if we embed it within a list of lists, then we can think of it as being a row from a table or something from a greater dimension. It is one by six: it has one row and six columns. It's easier to think about, and it's very strictly mathematical. Now, as a matter of interest, we can reshape our data using a method called reshape. I can use reshape to specify some new dimensions I want my structure to follow, so I can have a two by three array, with two rows and three columns. It's relatively simple, we've just flipped one part of it around. And if I add 90 and 100 to these ages here and have a look at the shapes, nothing should change too dramatically. But what we'll get an idea of now is that the dimensions have to match the size.
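A minimal sketch of the attributes and of reshape, again using illustrative age values rather than the ones from the lesson:

```python
import numpy as np

ages_list = [10, 20, 30, 40, 50, 60]  # illustrative values (assumption)

flat = np.array(ages_list)       # built from a plain list: 1-D
nested = np.array([ages_list])   # built from a list of lists: 2-D

# size, shape and dtype are attributes, so there are no parentheses.
flat_info = (flat.size, flat.shape)        # (6, (6,))
nested_info = (nested.size, nested.shape)  # (6, (1, 6))
print(flat_info, nested_info, flat.dtype)

# reshape takes rows first, then columns; rows * columns must equal size.
two_by_three = flat.reshape(2, 3)
print(two_by_three)
```

The 1-D array reports a shape of `(6,)`, while the nested version reports `(1, 6)`: one row and six columns.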
The dimensions, the rows and columns, always have to multiply up to the overall size of the array. So I can in fact now change this so that it is a two by two by two array structure. This now has rows, columns, and depth as well. You can work with as many dimensions as you want, so when you start diving into neural networks and things like working with tensors, NumPy is fantastic for that. It's sort of like a cubed matrix structure. It's a 3D array. And we can have as many of these dimensions as we want, so we can define one by one by one, and so on. I have now transformed this into a seven-dimensional space. This is a representation of a seven-dimensional matrix. But we're going to stick with two dimensions. It's just good to know about this sort of extensibility. So, we'll stick with it being two by four. We can reshape, and the shapes we can choose are dictated by the overall size. There are a number of different shapes that we can have our arrays in. Let's do an exercise now. I want you to take the data that has been generated below and transform it so that it has three columns and however many rows is appropriate. First of all, generate a one-dimensional NumPy array with these values, then generate a matrix of the values, so a two-dimensional array. Change the dimensions, and then we'll have a look at running it. Run all these bits of code, and then have a go at using the reshape method to change the dimensions to three columns and however many rows. The dimensions given to reshape are always rows and then columns. So now let's have a look at the solution. For the first two exercises, we should have been fine with a rough idea of generating a 2D array of the values. We could do this for just the numbers that we have on the slide, giving a three by three list of lists. For the reshape, I know that I need three columns.
So we need three columns, and then however many rows. The number of rows is going to be dictated by data.size divided by three for this example. What about the columns that were generated individually? If I have a look at c1, c2, and c3, what is c1? I've asked for 50 random numbers from a uniform distribution between zero and 110. How do I know that? Well, the documentation says from low to high, from a discrete uniform distribution, blah blah blah. This is how it's getting random data. If I then have a look at c2, c2 is different in that everything seems to be zero point something. Now, what is np.random.rand doing? It's picking random numbers between zero and one in a given shape. We can specify the shape, or the number of values, that we want to get here. It's always going to be uniform between zero and one, not including one. Now, what about the last one here? numpy.random.choice, with a list containing French and English, and then 50. Here we get 50 values at random from a choice of French and English. So it's essentially like a coin toss. We could have gone with heads and tails, and I would just be doing a coin toss between heads and tails however many times. It is completely random. It does it purely by probability values. So I could specify p, for which I need a tuple with 0.1 and 0.9, for example. As we can see, what I get is a lot more English than French. Even so, any single run could still end up with more French than English, because it is all based on the probabilities here. Every time I run it, I get different quantities. Then, if I type np.random in here, I get told about the random module. So I get random sampling and distributions, I can get random integers, I can randomly permute things, I can set random seeds. I can get random data from all of these distributions. There's an entire library for generating random data.
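The generation calls described above might look like the sketch below. The exact arguments are assumptions reconstructed from the description (for instance `np.random.randint(0, 110, 50)` for `c1`), the seed is there only to make the sketch repeatable, and `data` is a hypothetical stand-in for the exercise data, which is not shown in the transcript.

```python
import numpy as np

np.random.seed(0)  # assumption: seeded only so the sketch is repeatable

# 50 integers from a discrete uniform distribution on [0, 110).
c1 = np.random.randint(0, 110, 50)

# 50 floats drawn uniformly from [0, 1).
c2 = np.random.rand(50)

# 50 draws from two labels, like a fair coin toss.
c3 = np.random.choice(["French", "English"], 50)

# Weighted draws: p lists one probability per label, in the same order.
c3_weighted = np.random.choice(["French", "English"], 50, p=(0.1, 0.9))

# Hypothetical stand-in for the exercise data: 150 values in one dimension.
data = np.arange(150)
rows = data.size // 3            # the size dictates the number of rows
reshaped = data.reshape(rows, 3)
print(reshaped.shape)
```

As an alternative to computing the row count by hand, `data.reshape(-1, 3)` asks NumPy to infer the number of rows from the size.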
If I hit Tab here, then I can see I've got lots of functions available to me for generating data. Then, with my final entry here, where I call np.column_stack, what is it doing? I generated my integers, my random numbers, and my random choice between French and English. What is column_stack doing with these arrays? It's stacking them column-wise. So c1, c2, and c3 are column one, column two, and column three, and it puts them together. If you call flatten on this, that removes the structure from the array and turns it into one long one-dimensional array. You could have just gotten it to flatten, and that's an equally valid solution. So there are all sorts of functions available to us. Now, you may have noticed something a little bit weird about the formatting of this output array: everything has come out as a string. What we've actually done here is a little bit of an abuse of the NumPy array. NumPy arrays are designed to be single-type data structures. What NumPy does is find the easiest common data type it can transform everything into, which here is a string. So let's take a look at some data generation methods that we have with NumPy. Then we'll have a look at one of the main reasons why people tend to use NumPy. So, data generation. There are a few methods I want to look at, and then we'll have a look at vectorized operations. First, numpy.arange. What numpy.arange does is generate a range of data between a start point and a stop point. I want to go between zero and 55 in steps of five to obtain a five times table, so I can use numpy.arange. It works in pretty much the same way as Python's built-in range function. I want it to go zero, 55, in steps of five. The difference is that range creates a range object, while arange creates a NumPy array.
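The stacking and the string conversion can be seen on a small scale. The three short arrays below are illustrative stand-ins for `c1`, `c2`, and `c3`, not the lesson's data.

```python
import numpy as np

# Small illustrative stand-ins for the three generated columns.
c1 = np.array([23, 45, 67])
c2 = np.array([0.25, 0.5, 0.75])
c3 = np.array(["French", "English", "French"])

# Stack the 1-D arrays side by side as the columns of a 2-D array.
table = np.column_stack((c1, c2, c3))
print(table.shape)   # three rows, three columns
print(table.dtype)   # a Unicode string type: numbers were converted to text

# flatten removes the structure and returns a 1-D array.
flat = table.flatten()
print(flat.shape)

# arange: start, stop (exclusive), step.
five_times = np.arange(0, 55, 5)
print(five_times)
```

Because an array holds a single data type, mixing integers, floats, and strings leaves every element stored as a string, so `table[0, 0]` is the text `'23'` rather than the number 23.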
An interesting fact about range: range is what we call a lazy list, in that it only ever stores its start point, its end point, and its step in memory. It hasn't actually got the list of numbers between zero and 55 in memory. All it knows is where it starts, where it ends, and how big its steps are, and this saves memory space. If I turn it into a list, then I store the whole thing in memory. But it doesn't make sense to store every collection of values in memory. That's why it prints out as range, because it's not a list of numbers. It's something of a lazy list. It only gives you numbers when you ask for them. Arange, by contrast, generates the data between zero and 55 and tucks it all away in memory. Next, we have something called linspace, numpy.linspace. Linspace is very good when it comes to graphs with continuous data. If you want to plot a graph on some axes and you need some, say, X data to throw into it, linspace is fantastic for that. We can ask for data between zero and 50, so a start point and an end point, zero and 50. This has given me a five times table again, but in a slightly different format. What linspace does is generate linearly spaced data between the start point and the stop point. I could ask for 1100 data points between zero and 50, and I'm going to get lots of small, equally spaced points of data. So the difference between this and arange is that arange is steps, it's only steps, whereas linspace gives me X number of points between two values. It's good for filling out graphs and things like that. Now let's quickly look at repeat, np.repeat. We don't need to talk about repeat too much. If I want 10 two times, then I can get 10 two times. If I want two 10 times, then I can get two 10 times. It does what it says on the tin: it just repeats the value however many times you want it. We can also generate identity matrices with ease. It uses eye, spelled like a human eye.
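These generation helpers can be compared side by side in a short sketch. The point count of 11 for linspace and the size of the identity matrix are assumptions chosen for illustration.

```python
import numpy as np

# range is lazy: it stores start, stop and step, not the numbers themselves.
lazy = range(0, 55, 5)
print(lazy)               # prints range(0, 55, 5), not a list of numbers
as_list = list(lazy)      # converting to a list puts every value in memory

# arange builds the whole array in memory straight away.
eager = np.arange(0, 55, 5)

# linspace: start, stop (inclusive), number of points.
# 11 points from 0 to 50 is an assumption that reproduces a five times table.
eleven_points = np.linspace(0, 50, 11)
print(eleven_points)      # floats rather than integers

# repeat: the value first, then how many times to repeat it.
ten_twice = np.repeat(10, 2)
two_ten_times = np.repeat(2, 10)

# eye builds an identity matrix; 3 is an arbitrary size for illustration.
identity = np.eye(3)
print(identity)
```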
As we know, the symbol for the identity matrix tends to be a capital I, but NumPy goes for eye, as in the human eye. So here you can see we have an identity matrix. Now let's look at an example of how we can generate, using normal Python, let's say the numbers between zero and 50, where we've multiplied each number by 0.75, hypothetically. How could we create a list of numbers like that? We could use a while loop, for example. But in this example, let's use a for loop, because it makes life a bit easier. If I want a list of data, I start with an empty list, and then for each number in range 50, I call list.append with the number times 0.75. This is one way of doing that. This is me generating some data where I've got multiples of 0.75. I have a look at that list, and we can see I do indeed have all these various numbers. But that is a bit of a hassle. Ideally, I would want a better way of doing this. Another way of doing it would be using a comprehension, for example. I do something along the lines of i times 0.75 for i in range 50. The two are equivalent. Both of them are going to create a list of values here. In reality, this is about the best way of applying an operation across a data structure in plain Python. If I had a range of numbers and I wanted to perform an operation on them, this is pretty much as good as it's going to get. In an ideal world, what I would like to be able to do is take something like this, and here you can see I've created a five times table list, and then just carry out operations on it as if I was working on an individual number. I would like to treat the data structure as if it was one thing that I'm performing a transformation on. I would like to be able to do something like multiply my list by two. But what that is going to do is stick two copies of the list together.
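The loop, the comprehension, and the list multiplication described above can be sketched in plain Python. The variable names are assumptions.

```python
# For loop: build the list one value at a time.
scaled_loop = []
for number in range(50):
    scaled_loop.append(number * 0.75)

# List comprehension: the same result in a single expression.
scaled_comp = [i * 0.75 for i in range(50)]

print(scaled_loop == scaled_comp)  # the two approaches are equivalent

# Multiplying a list by two does not double the values;
# it joins two copies of the list end to end.
five_times_list = list(range(0, 55, 5))
doubled = five_times_list * 2
print(len(five_times_list), len(doubled))
```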
This is what Python knows how to do for a list. If I try to add 10 to a list, I can't, because I can't add an integer to a list. If I want to take something away from it, again I can't. If I want to divide the list by five, I can't, and so on and so forth. And I can't ask a Boolean question of my list, because again, this operator isn't supported between lists and integers. If I want to do this with a NumPy array, thankfully, NumPy allows us to compute in the way that we want. I create this five times table, which is exactly the same as the one above, but this time it is an array. So if I print this out, I've got this array. And if I carry out the exact same operations on this five times table array, magic. Every operation I've carried out has been performed element-wise. The operations are what we call vectorized, or broadcast, over the array. Some of these are quite cool. Take the power of two: we can map the power of two over our array. We can map a Boolean expression over it, and that becomes very useful. I can say five times table less than 20, and what I end up with is an array of Booleans, true for the elements where the expression holds and false for the ones where it doesn't. Now I can do something called masking, where I take out only the elements for which this expression has evaluated as true. We can look at slicing and things like that later. Using this syntax, we've chosen to only return the things which are less than 20. If we try putting in a tilde here, oh no. Put the tilde over the whole expression and it should flip the Booleans, yes. So the not operator for things like NumPy and pandas is the tilde, or squiggle. There we have it: operators broadcast across our collection. Similarly, going back to when we wanted to perform an operation over our data and wrote a loop to create it, I can now create a range of numbers.
And then I can just map multiplying by 0.75 across these numbers to easily generate what I had to use a loop to generate up here.
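A sketch of the contrast between list and array behaviour, including masking and the tilde operator. The variable names are assumptions, and the five times table is rebuilt here so that the block stands alone.

```python
import numpy as np

# A plain list does not support arithmetic with a number.
five_times_list = list(range(0, 55, 5))
list_error = None
try:
    five_times_list + 10
except TypeError as exc:
    list_error = exc  # Python refuses to add an int to a list

# The same data as an array: every operation is applied element-wise.
five_times = np.arange(0, 55, 5)
plus_ten = five_times + 10
squared = five_times ** 2

# A Boolean expression gives one True/False per element.
is_small = five_times < 20

# Masking keeps only the elements where the mask is True.
small_only = five_times[is_small]

# The tilde flips the Booleans, so this keeps everything else.
not_small = five_times[~is_small]

# The earlier loop becomes a single vectorized expression.
scaled = np.arange(50) * 0.75

print(small_only, not_small)
```

Note that the tilde is applied to the whole Boolean array, which is why the comparison is stored in `is_small` (or wrapped in parentheses) before it is negated.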
About the Author
Delivering training and developing courseware for multiple aspects of the Data Science curriculum, constantly updating and adapting to new trends and methods.