Continue your journey into data and machine learning with this course from Cloud Academy.
Previous courses covered the core principles and foundations of data and machine learning and explained best practices.
This course gives an informative introduction to deep learning and introduces neural networks.
This course is made up of 12 expertly instructed lectures along with 4 exercises and their respective solutions.
Please note: the Pima Indians Diabetes dataset can be found at this GitHub repository or at the Kaggle page mentioned throughout the course.
Learning Objectives
- Understand the core principles of deep learning
- Be able to work with all the building blocks of a neural network framework
Intended Audience
- It would be advisable to complete the Intro to Data and Machine Learning course before starting.
Transcript
Hey guys. Welcome to the exercise solution. So let's go through exercise one. In exercise one, we were tasked to predict the occurrence of a disease: we have a population of people of Pima Indian heritage, and the objective is to predict whether a patient has diabetes or not, based on some diagnostic measurements. So we have a bunch of features, like number of pregnancies, glucose, blood pressure, skin thickness, and so on, and the last column is the outcome, which is a binary variable. What we had to do is load the data, draw a histogram for each feature, and explore the correlations between features using a pairplot (that was one suggestion; there are other ways of doing it). Then we ask the usual questions and do the usual checks: do we need standardization, and if so, what standardization technique are we going to use? And finally, we prepare X and y and feed them to a model. So let's do it.
We load the data, so we have all our numbers and the outcome. We call df.hist on the data frame itself, and this plots all the histograms at once, which is pretty nice and easy. We see that some features have a nice bell-shaped curve, although we probably need to rescale them; some others have a decaying type of distribution, like age; and then we have binary variables as well. So let's look at correlations. We import Seaborn and we run a pairplot with the color determined by the outcome variable, so we can actually see the two classes separately.
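A minimal sketch of this step (the local file name diabetes.csv and the column name Outcome are assumptions, based on the Kaggle version of the dataset):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Pima Indians Diabetes data; use the CSV from the GitHub
# repository or Kaggle page mentioned above (file name is an assumption)
df = pd.read_csv('diabetes.csv')

# One histogram per column, all in a single figure
df.hist(figsize=(12, 10))
plt.tight_layout()
plt.show()

# Pairwise scatter plots, colored by the binary Outcome column so the two
# classes can be compared feature by feature (this takes a while to render)
sns.pairplot(df, hue='Outcome')
plt.show()
```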
It's going to take a little while, because there are a lot of plots, but this is what our data looks like. And we see that in many features, in many dimensions (this is a data set with a lot of features), the two classes pretty much overlap. For blood pressure, for example, there's not really much difference between the two groups: if you look at the distributions, they're pretty much identical. In some other cases there is a slight difference, although not much. So it's legitimate to ask the question: will our model be able to separate anything, or will it essentially be impossible to separate the two classes? Okay, another way of checking the correlation is to actually calculate the correlation matrix and build a heat map. There we see that, for example, age and number of pregnancies are correlated, which is expected: the longer you live, probably the more babies you've had. I don't know if there is a medical reason for a correlation between skin thickness and insulin (I'm not a doctor), but there seems to be one.
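For reference, the heat map check might look something like this, reusing the df from the sketch above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix rendered as a heat map; annot=True prints the
# coefficients in each cell so strong relationships stand out
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```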
So the heat map of the correlation matrix is another way of checking for correlations between variables. These are useful pre-checks that you should always get into the habit of doing when you're dealing with a machine learning problem, if your data set allows it: it's good to formulate some intuition about which variables may be correlated, and since we care about the outcome, which variables are correlated with the outcome. We see that glucose is definitely very correlated, and then a couple of others, BMI (Body Mass Index) and pregnancies, have higher correlation coefficients, as does age. So we expect that these four will probably be the features that drive most of the predictive power, but let's actually see in the next exercise. Then we run info and we check: there are no missing values and all the columns are numerical, great. And then we check with describe the minimum and maximum values, and we see that the maximum values are kind of all over the place.
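Those two checks amount to roughly the following:

```python
# Column types and non-null counts: confirms there are no missing values
# and that every column is numeric
df.info()

# Summary statistics; the min and max rows make the very different
# feature scales obvious
print(df.describe().loc[['min', 'max']])
```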
Some, like insulin, go up to 846, almost a thousand; age goes up to 81; but the pedigree function ranges from about 0.07 to about two, so they pretty much all have different scales. We had seen that in the histogram plot, but it's very clear even from just the minimum and the maximum. So what we're going to do is rescale all of them with the StandardScaler. Remember, what the StandardScaler does is take the mean of a column and the standard deviation of that column, subtract the mean from all the values, and then divide the result by the standard deviation. So it rescales everything to mean zero and standard deviation one.
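A small sketch of that rescaling, showing that StandardScaler matches the by-hand formula (the Outcome column name is again an assumption from the Kaggle version):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

features = df.drop('Outcome', axis=1)

# By hand: subtract the column mean, divide by the column standard deviation
# (ddof=0 because StandardScaler uses the population standard deviation)
manual = (features - features.mean()) / features.std(ddof=0)

# Same thing with scikit-learn
scaled = StandardScaler().fit_transform(features)

print(np.allclose(manual.values, scaled))  # True: both give mean 0, std 1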
So everything will be in the same kind of range. We do that on the features only, so we drop the outcome column, and for the other columns we fit and transform in one step and store the result in X. Then we take the values of the outcome, store them in y, and apply to_categorical. So let's check what we've done: X is actually a NumPy array, and we can check its shape, which is 768 rows and eight columns, and y_cat is a zero/one type array whose shape is 768 rows and two columns; a quick sketch of this preparation follows below. Thank you for watching and see you in the next video.
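Roughly, the preparation described above comes down to this (the import path for to_categorical depends on your Keras version; tensorflow.keras is assumed here):

```python
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

# Rescale the eight feature columns and one-hot encode the binary outcome
X = StandardScaler().fit_transform(df.drop('Outcome', axis=1))
y = df['Outcome'].values
y_cat = to_categorical(y)

print(X.shape)      # (768, 8): 768 rows, eight feature columns
print(y_cat.shape)  # (768, 2): one column per class
```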
I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning, and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator, and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at the University of Padua and Université de Paris VI and graduated from the Singularity University summer program in 2011.