Data Science - Core Skills
Difficulty
Intermediate
Duration
57m
Students
1919
Ratings
3.8/5
Description

This course offers an introduction to data science and looks at what a data scientist does. It then moves on to data science in Python and, through a range of guided walkthroughs, shows you how to use Python and its features. You will learn how to set up Anaconda and Jupyter Notebook and learn, using real-world examples, how to write Python code in Jupyter, with useful tips within the context of data science.

The course also looks at object-oriented programming, as well as Python variables and Python functions, and finally, it takes a look at Python data types and functions.

Learning Objectives

• Understand data science and the role of a data scientist
• Set up Anaconda and Jupyter notebooks
• Improve your knowledge of coding with Python
• Understand how to work with Python variables, functions, and data

Intended Audience

This course is intended for:

• Individuals looking for an introduction to data science
• Those looking to enhance their knowledge of Python and its features

Prerequisites

To get the most from this course, you should already have some knowledge of Python and programming languages in general.

Transcript

Hello, and welcome back. We'll speak a little bit about what skills we need from specific areas to be able to perform all this machine learning and data science. First, hypothesis testing. We can utilise hypothesis testing to do things like check whether data is from a known distribution. So, is the data normally distributed? Then there are accuracy metrics. We've got other things like probability, how likely we think something is to occur, and things like Bayes' rule, so we can talk about Bayesian inference briefly. It gets used a lot in document classification. And then there are ideas of just how to describe data statistically: we want to be able to use the correct language and techniques to match the type of data. Mathematics is a language, and the reason it crops up in data science is that it allows us to describe rather complicated analytical techniques in a concise and, therefore, universal manner. That's what mathematics allows us to do. Linear algebra is a specific area of mathematics which describes how to logically combine certain types of objects. Most machine learning is weighted sums, just lots and lots of weighted sums. That's a massive simplification, but it's broadly true. Calculus is the area of maths that tells us whether we're going uphill or downhill. How do you compute the gradient? By differentiation. So, calculus is what informs us as to how to best update coefficients in machine learning algorithms. And probability theory gives us distributions and things like that.
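As a small illustration of the Bayes' rule idea mentioned above, here is a worked example in plain Python. The scenario and all the probabilities are made up for illustration; this is not taken from the course itself.

```python
# Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word)
# All numbers here are hypothetical, purely for illustration.
p_spam = 0.2                 # prior: 20% of documents are spam
p_word_given_spam = 0.6      # the word appears in 60% of spam documents
p_word_given_ham = 0.05      # and in 5% of non-spam documents

# Total probability of seeing the word at all (law of total probability)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: how likely is the document spam, given that it contains the word?
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # → 0.75
```

Seeing the word raises the probability of spam from the 20% prior to 75%, which is exactly the kind of update a naive Bayes document classifier performs.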

A lot of algorithms use something called gradient descent as their optimisation method, which is simply computing the differential of your error function to decide whether you're getting better or worse given your current coefficients. Python. What do we need to know about Python? Python is slow-ish. It's an interpreted scripting language, and what do we mean by interpreted? That basically just means that, instead of having to compile the entire thing into machine code and then run it, Python will just run through your program, and then, when it encounters an error, it will go, 'Oh, no, what just happened there?' It will only find your mistake when it encounters your mistake. There's no compilation step which will check your syntax and then decide whether it's going to run or not. That's why it's interpreted. It's very popular. A lot of people use it, which means that there's a lot of documentation online. It's rather simple to use, and it's very easy to extend. So, part of Python's mass appeal is its libraries. These libraries are a lot faster than core Python. They're things such as NumPy, pandas, etc. NumPy is basically MATLAB for Python. It allows for fast matrix computation. Underneath it all, it is written in Fortran and C, but because nobody in their right mind would ever want to do data science in Fortran and C, what we have is this lovely wrapper around it called NumPy. We can work with it as if it's Python, while still getting the speed of the compiled code underneath.
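To make the gradient-descent idea above concrete, here is a minimal sketch in plain Python. The data and learning rate are made up for illustration; it fits a single coefficient w in y = w * x by repeatedly stepping against the differential of a squared-error function.

```python
# Minimal gradient-descent sketch (illustrative data, not from the course):
# fit w in y = w * x by differentiating the squared-error function.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated with a true coefficient of w = 2

w = 0.0              # starting coefficient
learning_rate = 0.01

for _ in range(1000):
    # dE/dw for E = sum((w*x - y)^2) is sum(2 * (w*x - y) * x)
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    w -= learning_rate * gradient   # step "downhill" on the error surface

print(round(w, 3))  # → 2.0
```

Each iteration asks "uphill or downhill?" via the gradient and nudges the coefficient accordingly, which is exactly the update rule the transcript describes.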

Some of the libraries we're going to have a look at are pandas, Matplotlib, Seaborn, and scikit-learn, and there are others that are probably worth having a look at in your spare time. So, NumPy is computation. Pandas is data manipulation. Matplotlib and Seaborn are both visualisation. Then, scikit-learn is a very good introductory machine learning library. Projects. I want to take a broader view of how we run a project now. If, hypothetically, we've been contracted to run a data project for a public transport authority predicting daily passenger volume, how do we go about structuring this kind of project? We'll have a look at something called CRISP-DM, and then we'll have a look at a machine learning workflow which fleshes this out much more. What is the first thing we tend to want to do when jumping into a project? We acknowledge that we have no understanding of the public transport authority whatsoever. So, understanding the domain, understanding what everything is, figuring out what is a train, what is a passenger. How do we quantify these in our data? What data do we have available to us? Everything like that, and gaining an understanding of where there may be extra data available. So, the first step is always going to be understanding whatever business we're trying to do this for. Once we've done that, where are we getting the data from? What data is available? Where does the data come from? Issues surrounding the data, like data governance, formats, types, security.
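To illustrate NumPy's role as the computation layer of that stack, here is a tiny example of the "weighted sum" operation discussed earlier, written as a vectorised dot product. The feature and weight values are made up for illustration.

```python
import numpy as np

# NumPy's role in the stack: fast, vectorised matrix computation.
# A weighted sum (the core operation of most ML models) as a dot product.
features = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -1.0, 2.0])

# Hypothetical model output: 0.5*1.0 + (-1.0)*2.0 + 2.0*3.0 = 4.5
prediction = features @ weights
print(prediction)
```

The `@` operator dispatches to compiled code underneath, which is why this stays fast even when the arrays hold millions of values.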

All these various things, like identifying the data sources. In CRISP-DM terms, those first steps are categorised as domain understanding and data understanding. We've done the first few steps there by understanding what's going on and understanding what the data looks like. Then we move on to a step called data preparation, which involves structuring, filtering, and reducing data, data cleaning steps, selecting and obtaining the data that is actually relevant to our problem, and potentially linking in other datasets that may add context. Following that is when we move on to modelling, evaluation, and deployment, which we'll get to later in the course. This gives us an idea of how to run a data project, a generic data analysis project. We'll do a deeper dive into a more machine-learning-tailored project later on.
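The data preparation step described above can be sketched with pandas. The passenger-volume tables, column names, and values here are all hypothetical, invented to mirror the transport-authority example.

```python
import pandas as pd

# Hypothetical data-preparation step for the passenger-volume project:
# clean one made-up dataset and link in another that adds context.
trips = pd.DataFrame({
    "station_id": [1, 1, 2, 2, 3],
    "passengers": [120.0, None, 95.0, 110.0, 80.0],  # one missing count
})
stations = pd.DataFrame({
    "station_id": [1, 2, 3],
    "line": ["red", "blue", "red"],
})

# Cleaning: drop rows where the passenger count is missing
trips = trips.dropna(subset=["passengers"])

# Linking: merge in the station dataset, which adds context (the line)
enriched = trips.merge(stations, on="station_id")
print(enriched)
```

This is the filter/clean/select/link cycle of CRISP-DM's data preparation phase in miniature, before any modelling happens.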