Practical Data Science With Python
What is a Data Scientist?

This course offers an introduction to data science and looks at what a data scientist does. It then moves on to data science in Python and, through a range of guided walkthroughs, shows you how to use Python and its features. You will learn how to set up Anaconda and Jupyter Notebook and, using real-world examples, how to write Python code in Jupyter, with useful tips within the context of data science.

The course also looks at object-oriented programming, as well as Python variables and Python functions, and finally it takes a look at Python data types.

Learning Objectives

  • Understand data science and the role of a data scientist
  • Set up Anaconda and Jupyter notebooks
  • Improve your knowledge of coding with Python
  • Understand how to work with Python variables, functions, and data

Intended Audience

This course is intended for:

  • Individuals looking for an introduction to data science
  • Those looking to enhance their knowledge of Python and its features


To get the most from this course, you should already have some knowledge of Python and programming languages in general.


Okay, let's take a look at data scientists. What is a data scientist? Here we have a good definition: a data scientist is a better statistician than a computer scientist, and a better computer scientist than a statistician. I like this definition. It gives you a good idea of the overlap of skill sets and a general sense of what data scientists are and what they do: extracting information from specific sets of data. Quite often you'll see a definition of a data scientist that involves machine learning, but that's not necessarily always the case. And what skills do we need? We've got computer programming, maths and statistics, and domain knowledge. These all fall into the data science skill set, and you see this sort of Venn diagram in every introduction. It simplifies things quite a lot, but it gives a broad idea of the skills you do need. There is an equally valid Venn diagram of skills that data scientists are often asked, or required, to know. For this reason there is another definition of a data scientist: a unicorn, because they don't actually exist. It's very hard to be in the centre of this Venn diagram, and it's for that reason that these skills usually end up being covered by a team rather than a single role.


But you also have what are called horizontal and vertical data scientists. A vertical data scientist is an expert in a specific area; a horizontal data scientist knows enough about most of these areas to get by, and perhaps a lot about one or two of them. All of these are perfectly valid skills for a data scientist, and we could add more; there isn't enough space on this Venn diagram, but there could be more. Curiosity is something that always pops up. Whether it's a skill or not, I don't know, but it's the most important thing in the world according to a quote from a very important data scientist, so it's a nice idea if it gets you motivated.


Machine learning is very much a sub-field of data science, or at least a tool that gets used within data science. What is learning? What do we mean when we say something has learned? It means finding out something you didn't know before; for humans, some people call it the quest for knowledge, and that's always a good one. Can computers learn? If you feed them information they can learn, right? How do children learn? They run around, they put things in their mouths, they do things wrong, they hit their heads and they bash into things constantly, until eventually they're somewhat competent and stop doing that. Who tells a child to do all of that? Who tells a child they're supposed to run around and do things wrong? Nobody. They have to explore the environment of their own volition; they run around doing all these things because they feel like they have to. They want to do it, right?


What does a fully trained computer look like? What do trained machines look like? Is the machine ready to learn? Let's look at a linear equation, the equation of a straight line. This is a machine learning algorithm ready to go: the computer is getting ready to learn. It's going to understand the relationship between, let's say, happiness and money, and it's going to do that with a simple linear equation, we're going to assume. Once the machine has learned, the difference is that we'll have an equation that looks like, purely for example, y = 0.2x + 77. If y is equal to the overall happiness of someone and x is equal to money, then this equation describes the relationship between happiness and money. Is this what children look like when they've learnt? Do they look like an equation with some numbers in it? No, they don't. So, when we talk about learning in a machine learning context, all we're talking about is a computer doing exactly what it's been told to do, many times over, until it eventually arrives at some numbers.
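As a sketch of this idea, the "trained" line from the example above is nothing more than a function you can evaluate. The coefficients 0.2 and 77 are just the illustrative values from the text, not real findings:

```python
# A hypothetical "trained" linear model relating money (x) to happiness (y).
# The coefficients 0.2 and 77 are the example values from the text.

def predict_happiness(money):
    """Evaluate the learned straight line y = 0.2x + 77."""
    return 0.2 * money + 77

print(predict_happiness(100))  # → 97.0
print(predict_happiness(0))    # → 77.0 (the baseline happiness)
```

All the "knowledge" the machine has gained lives in those two numbers; there is nothing else to it.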


That's machine learning. Numbers get bumped up and bumped down: the coefficients of whatever machine learning equation we have are adjusted until eventually we have an equation with some numbers in it, but it's all mathematical, all numeric. Computers have no idea what they're doing, right? You program a computer to perform a sequence of steps and it will do it. It's not doing it because it's quite interested in Spanish this year; it's doing it because it's been told to. So, I would argue that computers don't learn. Machines can be programmed to get better at tasks, but whether that's learning I'm not so sure. It's not necessarily a poor name, and it does describe what goes on to some degree, but it's just a tool, a program that does something. In machine learning we have models designed to capture some aspect of reality. An untrained model is a general mathematical formula with many free parameters. In the equation of a straight line, the free parameters are m and c, and the data we're going to throw in are x and y. We're trying to train m and c so we get some very good numbers. Now, this is potentially oversimplified: we would also need an algorithm that allows us to minimise the error in this equation, but that's learning, that's machine learning. When a machine learns, it's just the automatic adjustment of parameters up and down. Once we've fitted the model, once we've got the numbers, we say it's trained.
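To make this concrete, here is a minimal sketch of training m and c by minimising squared error, using NumPy's `polyfit` on made-up toy data (the "true" values 2 and 5 are assumptions for the example):

```python
import numpy as np

# Toy data generated from y = 2x + 5 with a little noise;
# m and c are the free parameters that training recovers.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 5 + rng.normal(0, 0.1, size=x.shape)

# polyfit finds the m and c that minimise the squared error.
m, c = np.polyfit(x, y, deg=1)
print(round(m, 2), round(c, 2))  # close to 2 and 5
```

The "learning" is entirely inside `polyfit`: it nudges the two parameters until the error between the line and the data is as small as possible.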


So, we have different types of learning. Speaking very much in big-picture terms, we can boil it down to two main types. If I'm trying to find the best possible line, I need data that tells me what a good line and a bad line look like; I need an answer to the question, 'What is the best line I could fit to this data?' So we have supervised learning, and then another type called unsupervised learning. Supervised simply means we have the answers: our training data has correct answers, and once we've pushed the model around with those stabilisers on, we'll eventually put it into the real world, where it will be making predictions on data for which we don't necessarily have the answers. Then we have unsupervised learning algorithms where, as you can probably guess, the data doesn't come with correct answers, so we'll dive a bit more into the distinction between the two. What could we do with data if we can't know whether what we've done is correct or not? Whether we've helped or not helped? Fitting a line is a classic example of a supervised learning problem, and predicting labels is another. The generation of labels is a very common unsupervised task: seeing whether we can group data by some aspect that we haven't got within our data set.


So, it's good to think of these in terms of the questions they can answer. Every time you do any machine learning you have data, and your goal is to pose a question to your data that can be answered by a machine learning algorithm; they all answer very specific questions. So, regression, the fitting of a trend line: the training part is learning some line over seen x and y, where x is our data, y is our answers, and we have a set of data with answers. We're trying to relate x to y, and the ultimate goal is to predict y from some unseen x. Given a data point, can we predict a numerical value? Regression always deals with continuous, numeric data, and this is the classic idea of what machine learning is. And we have something very similar: classification. Classification is essentially regression but for labels, though we have to use different techniques for it. Instead of generating a trend line, we are drawing a line that discriminates between known groups within the data. In a retail scenario, we might be trying to distinguish between high-value and low-value shoppers based upon historical transactions. We want to apply a label to them which dictates whether we want to retain them as customers or whether we don't mind losing them, something along those lines.
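The retail example can be sketched with scikit-learn's `LogisticRegression`. The transaction figures and the two-feature layout here are purely illustrative assumptions, not real customer data:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical historical transactions: [average spend, visits per month]
X = [[10, 1], [12, 2], [15, 1], [80, 8], [95, 6], [120, 10]]
# Known labels from history: 0 = low-value shopper, 1 = high-value shopper
y = [0, 0, 0, 1, 1, 1]

# Supervised learning: the model sees both the data and the answers.
clf = LogisticRegression().fit(X, y)

# Predict a label for a new, unseen customer.
print(clf.predict([[100, 9]]))  # → [1], i.e. high value
```

Because the training data carries the correct labels, this is supervised learning; regression would look almost identical but predict a number (say, average spend) instead of a label.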


Regression would be predicting the amount of money a customer spends on an average visit. Regression is numbers and classification is labels, and they both come under supervised learning because we know the labels and we have data with actual numeric values. Another big use for unsupervised learning is dimensionality reduction, i.e. getting rid of columns from a data set. The name sounds very fancy, but that is all it is. There isn't necessarily a correct number of columns, or a correct set of columns, to remove from your data, but an algorithm will do the best possible job it can to remove them in a smart way. You're always going to lose information if you delete data; an unsupervised learning algorithm just has a specific process by which it deletes it. There is no right or wrong answer. Similarly, with grouping of data there are no right or wrong groups. The algorithm will run and find the best groups it possibly can, even if your data is a mess. A clustering algorithm will find groups in your data if it seems like there are groups.


If we have some data that looks like this, for example, it will usually cluster: it will find whatever groupings it's supposed to be looking for, or whatever you want it to pick up on. But even if your data is a horrible mess with no actual groupings in it, and you've told your algorithm to find three groups, it's still going to find three groups, because it has nothing to compare itself against. All it knows is that you want three groups, so it's going to give you three groups no matter what. Likewise, when you reduce the dimensions of a data set you're going to get the number of dimensions you asked for, whether or not that got rid of critical information. So, that's unsupervised learning, and all of these are really different types of questions working on different types of data. Regression asks, 'What number?' Classification asks, 'What label?' Clustering asks, 'Are there labels?' or, 'Give me labels for this data set according to this method.' We can think of the learning process here as learning a distribution: we're learning information about the spread and overall structure of the data and, based upon that, trying to separate it out into groups. These are common applications of machine learning algorithms and some general types of machine learning algorithms; we've spoken about supervised and unsupervised learning, and these are the two main branches.
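This "three groups no matter what" behaviour is easy to demonstrate with scikit-learn's `KMeans`. The points below are uniform random noise with no real group structure at all, yet asking for three clusters yields exactly three:

```python
import numpy as np
from sklearn.cluster import KMeans

# Uniform random points: there is no genuine group structure here.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 2))

# Ask for three clusters and three clusters is what you get,
# because the algorithm has no notion of a "correct" answer.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(len(set(km.labels_)))  # → 3
```

The algorithm dutifully partitions the noise; nothing in the output tells you whether those groups were ever real.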


We're not really going to speak about reinforcement learning in this course, but it's definitely something worth looking at. Anywhere you have something trying to navigate an environment, reinforcement learning is massively useful. This is when a machine becomes more like a child than in any of these other situations: instead of having correct and incorrect answers, you nudge it by saying, 'That was a bad thing to do, and that was a good thing to do,' as you train it, rather than, 'You got this wrong, you got this right.' So, having a look at supervised learning, we'll start with regression, because that's the first one we look at. Looking at population growth, life expectancy, market forecasting, weather and advertising popularity, what's the common trait between all of these regression examples? What is the thing we're trying to predict? It's always numeric data: population growth is a number of people, life expectancy a number of years, market forecasting an amount of money, and so on. These are all numeric attributes that we're trying to predict. Then, for classification, what's the common trait? We have identity fraud, image classification, customer retention and diagnostics. These are all labels, aren't they? Good, bad; yes, no; sick, not sick; will leave, will not leave. Image classification labels might be 'is cat', 'is not cat'. Identity fraud labels might be 'fraud', 'not fraud'. On the whole these are binary, but not always; there can be many labels that you might want to try and predict as well as possible. So, that's supervised learning: we have historic data on which we base our prediction or classification.


So then, looking at our unsupervised learning, clustering is slightly harder to think about, but once you get it you realise it's fairly simple. We have recommender systems, targeted marketing and customer segmentation. What are we trying to get when we cluster? We're looking for groups, groups that relate to each other. With customer segmentation we're trying to identify a certain type of customer, a common customer type; with targeted marketing we're after something quite similar, a common target; and recommender systems group people by, generally, what they might like to watch. In something like Netflix, it will group and then attempt to recommend. Now, we also looked at dimensionality reduction. Dimensions are just what we call columns in tables when we want to sound fancy; in machine learning they often get called features as well.


The reason we call them dimensions is that, if you think about a table of data, when you try to plot your data on a graph, each column represents an individual dimension. We use this for big data visualisation: you've got big data, you want it smaller, so you get rid of some of the columns. Meaningful compression is an interesting one, and later in the course we will talk a bit about common methods for dimensionality reduction that go beyond looking at a column of data and thinking, 'That's no good, let's get rid of it.' At this point you might be wondering whether getting rid of columns gets rid of the data, and the answer would be yes, if we didn't perform some mathematical transformation on it first. So, we're going to have a look at something called 'principal components analysis'. What this does is take your data and generate a new set of columns, each made up of little pieces of all the columns you had before, capturing as much information about the old data set as possible within the new one while reducing the overall number of columns. How it does this mathematically is by taking the space your data is in and rotating it all around so as to maximise information along certain axes; you then reduce the number of columns based on how much information is contained within each axis (column, dimension, whatever you want to call it). So, it's an information-maximising transformation, and you then drop everything that's no good. That's principal components analysis.
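The transformation just described can be sketched in a few lines with scikit-learn's `PCA`. The data here is synthetic, with four columns that really only contain two dimensions of information (an assumption made purely so the reduction loses almost nothing):

```python
import numpy as np
from sklearn.decomposition import PCA

# Four columns, but columns 3 and 4 are near-copies of columns 1 and 2,
# so only about two dimensions of real information exist.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base + rng.normal(0, 0.05, size=(100, 2))])

# Rotate the space to maximise information along new axes,
# then keep only the two most informative ones.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # → (100, 2)
print(pca.explained_variance_ratio_.sum())    # close to 1.0
```

`explained_variance_ratio_` is how you check how much information each new axis carries before deciding what to drop.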


Now, the maths behind that is pretty cool, and you can do it in just a few lines of code in Python. Part of why it's also used for structure discovery is that, when you perform a transformation like this that you can reverse, you can get an idea of which columns were kept, or which columns actually play a part in your final analysis. Then there's feature elicitation: finding which features in your data are predictors for some problem you have. The reason this is unsupervised learning is that you are always going to lose some amount of information when you delete a column of data; there is no correct answer, only the best possible answer you can obtain. So, the questions we ask with regression are things such as: how much, how many, with what frequency, what score, what number? Classification asks: what group do I put this in, which category, which class, which this, which that? Clustering, then, is about ascribing common attributes to your data; there's not really an inferential step within clustering.

About the Author

The author delivers training and develops courseware across the data science curriculum, constantly updating and adapting to new trends and methods.
