Build DecisionTree Model using Zeppelin



In part 4 we show you how to use Apache Zeppelin to manage and run machine learning notebooks. We import our DecisionTree notebook and walk you through the script. The DecisionTree notebook is implemented using Scala and MLlib. We train our DecisionTree model against a training subset of the "Census Income" dataset stored in the AWS Glue Data Catalog.


- Welcome back to part 4 of our demonstration. In this part we'll use Zeppelin to run a notebook which contains machine learning code to build and train a decision tree model. But first, let's quickly review what we accomplished in part 3 of our demonstration. In part 3 we spun up an EMR cluster. The cluster contained 1 master node and 1 core node. We configured it at launch with Apache Spark and Zeppelin. And importantly, we configured it to have access to the Glue Data Catalog.

Finally, in part 3 we set up port forwarding to allow us to open a browser on our desktop and connect to the Zeppelin application which runs on the master node. Before we start using Zeppelin, we're going to go to GitHub and take a clone of the machine learning repository, which contains some scripts, including a Zeppelin notebook which we will use to implement a machine learning decision tree model. Within the repository, click on the green Clone or download button and take a copy of the URL.

We'll then jump over to our terminal session and perform a git clone of this repository. Next, list the contents of the directory. Here we can see it contains just the one folder. Let's change directories into this folder and list the contents again. From within this directory we'll start Visual Studio Code. This will allow us to look at the contents and in particular the notebook files. On the left-hand side is the directory structure of our current directory.

If we navigate down into the notebooks directory and then down into the spark-mllib directory, we can select the last file. This is our notebook that we'll import into Zeppelin. We don't have to worry too much about the contents as shown here in the editor. Let's now bring up our browser and navigate to the Zeppelin application, which is running on the master node within our EMR cluster. Recall that we've set up an SSH port forwarding rule which forwards localhost port 8819 to localhost port 8819 on the master node within our EMR cluster.

Once in the Zeppelin application, the first thing we'll do is import our notebook. We'll give the new notebook a name. Here, we'll call it census-ml, for machine learning. Next, we need to navigate to the location of the notebook file. Here, we select our Spark MLlib decision tree income JSON file. This is the notebook that implements our decision tree. Back within the Zeppelin homepage we can now see that our census-ml notebook is listed. Let's click on it to open it up within Zeppelin.

Okay, so this is our machine learning script, written in Scala using the MLlib framework. As you can see, it's composed of several paragraphs. Before we do anything else, the first thing we'll do is clear the output from the last run of this notebook. That's better and more convenient, as it will allow us to focus on the script itself. Let's now explain how the script works, starting from the top. The first paragraph within our notebook has a bunch of import statements. We're importing a bunch of classes that we intend to use later on in the notebook.

To execute this paragraph, use the Shift+Enter key sequence or click the triangle button in the top right-hand corner of the paragraph. This will execute just this block of code. A blue progress bar is displayed at the bottom of the paragraph. This indicates how much of the execution has completed and how much is left to go.

Okay, the execution looks like it has completed successfully, as per the output that has rendered back into the paragraph, and as per the FINISHED status in the top right-hand corner. In the next paragraph we set the database that we'll connect to in the current Spark session. This will be the database that we configured and set up in the Glue Data Catalog.

If we swap over into the Athena console we can quickly review what the name of our database was. Here, we can see that it's censusdb. Let's copy this and paste it back into our notebook. Again we need to run this statement by pressing Shift+Enter. Here we can see that it's again in progress, and it has just completed. In the next paragraph we'll simply list the tables in the current database. And as expected we have our 2 database tables, addon_data and addon_data_clean.

Let's copy the last table name. In the next paragraph we'll update our SQL statement to select from this database table. Executing this will construct a data frame. Next we'll display the schema of the current data frame. We do so by calling printSchema. Here we can see we have our expected column names: age, work class, education, relationship, occupation, and country. These are all features, and the last one, income_cat, is the label. Next we'll show the first ten rows in our data frame. Here, we can see the values for each of the ten rows.
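The database and table steps just described can be sketched in Scala roughly as follows. This is a minimal sketch rather than the notebook's actual code: the `CensusData` object and its `load` helper are hypothetical names introduced for illustration, and it assumes the Spark session can reach the Glue Data Catalog as configured in part 3.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not taken from the course notebook.
object CensusData {
  // Selects a database, lists its tables, and loads one table as a DataFrame.
  def load(spark: SparkSession, database: String, table: String): DataFrame = {
    spark.sql(s"USE $database")       // set the current database for this session
    spark.sql("SHOW TABLES").show()   // list the tables in the current database
    val df = spark.sql(s"SELECT * FROM $table")
    df.printSchema()                  // print column names and types
    df.show(10)                       // print the first ten rows
    df
  }
}
```

In a Zeppelin paragraph, where the `spark` session is already provided, this could be called as `val df = CensusData.load(spark, "censusdb", "addon_data_clean")`, using the database and table names mentioned in the walkthrough.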

Again, the first 6 columns are all features, and the last column is our label. Here the label tracks whether a user earns less than 50,000 per annum or more than 50,000 per annum. We can rerun the same statement, but instead of showing the first 10 rows, let's try the first 20. Next we begin to set up the parameters for our decision tree model. The first 2 variables are training size and test size. Training size indicates that we will train our decision tree model with 80% of the data.

Test size indicates that we will then validate the model with the remaining 20% of the data. We then set up a number of indexers, where each indexer converts a string value into a numerical value for the particular column in question. We do the same again for the income label. Next we create a vector assembler, where we combine a number of columns into a single column called features, which contains the feature vector.

In this case, this will include the unmodified age column and the four indexed columns. Following on, we call the randomSplit method on our data frame and split it into our training data and test data, as per the training size and test size variables respectively. Next we create our decision tree classifier, and on this we set the label column and the features column. We then create a label converter. The job of the label converter is to convert back from a numerical value to a string value.

It does so by using IndexToString, which is the reverse of the StringIndexer we used earlier. Finally we set up a pipeline to orchestrate the machine learning training. The stages within our pipeline are the string indexers, the vector assembler, the decision tree classifier, and the label converter. And this completes the configuration of our decision tree. Let's now execute this code block by again hitting the Shift+Enter key sequence. Here we can see that it's in progress.
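The configuration walked through above can be sketched with Spark ML roughly as follows. This is an illustrative sketch under stated assumptions, not the notebook's actual code: the `CensusPipeline` object, the exact column names (`workclass`, `education`, `relationship`, `occupation`), the `_idx` suffix, the fixed seed, and the `handleInvalid` setting are all assumptions introduced here. Only the 80/20 split, the `income_cat` label, and the order of the stages come from the walkthrough.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not taken from the course notebook.
object CensusPipeline {
  val trainingSize = 0.8 // train with 80% of the data
  val testSize = 0.2     // validate with the remaining 20%

  // Assumed names for the four string columns that get indexed.
  val categoricalCols: Seq[String] = Seq("workclass", "education", "relationship", "occupation")
  val labelCol = "income_cat"

  // Splits the data frame into (trainingData, testData).
  def split(df: DataFrame, seed: Long = 42L): (DataFrame, DataFrame) = {
    val Array(trainingData, testData) = df.randomSplit(Array(trainingSize, testSize), seed)
    (trainingData, testData)
  }

  // Builds the unfitted pipeline: indexers, assembler, classifier, label converter.
  def build(df: DataFrame): Pipeline = {
    // One StringIndexer per string column. "keep" (an assumption) gives values
    // unseen during training their own index instead of raising an error.
    val featureIndexers = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx").setHandleInvalid("keep")
    }
    // The label indexer is fitted up front so its labels can be reused below.
    val labelIndexer = new StringIndexer().setInputCol(labelCol).setOutputCol("label").fit(df)

    // Combine the unmodified age column and the indexed columns into one vector.
    val assembler = new VectorAssembler()
      .setInputCols(("age" +: categoricalCols.map(c => s"${c}_idx")).toArray)
      .setOutputCol("features")

    val tree = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("features")

    // Reverse of the label StringIndexer: numeric prediction back to a string.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    val stages: Array[PipelineStage] =
      featureIndexers.toArray[PipelineStage] ++
        Array[PipelineStage](labelIndexer, assembler, tree, labelConverter)
    new Pipeline().setStages(stages)
  }
}
```

Fitting the label indexer before building the pipeline is one way to make its labels available to `IndexToString`. The other stages are only fitted later, when the pipeline itself is trained.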

Okay, the execution of our configuration has completed. One thing to highlight here is the training data and test data splits. We can see here that the training data consists of 26,030 records and the test data consists of 6,531 records. The next part is where the magic happens. We kick off the following piece of code to actually train our decision tree model. Because we're running this training on our cluster, the training phase happens fairly quickly.

You can see here that it's already completed. Okay, the next thing to do is to take our decision tree model and test it with our test data. This is what we do here. Let's kick this off now and show the first 20 records, in which we'll see the predicted label on each record. Here we can see our test results coming through. If we take a close look at the results, we can see that the predicted label for each of the first 20 records shows that the person in question has been predicted to have an income of less than 50,000. If we look at the age, the education, and the relationship, we can see that this is quite feasible.
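The train-then-test step can be sketched as below. This is a minimal sketch, assuming a `Pipeline` whose final stages produce prediction columns; the `CensusTraining` object and the `trainAndPredict` name are hypothetical and not taken from the notebook.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not taken from the course notebook.
object CensusTraining {
  // Fits the pipeline on the training data, then scores the test data.
  def trainAndPredict(
      pipeline: Pipeline,
      trainingData: DataFrame,
      testData: DataFrame): (PipelineModel, DataFrame) = {
    val model = pipeline.fit(trainingData)      // runs each stage and trains the tree
    val predictions = model.transform(testData) // appends the prediction columns
    (model, predictions)
  }
}
```

The first 20 scored records could then be displayed with something like `predictions.select("age", "education", "relationship", "predictedLabel").show(20)`, where the column names are assumptions based on the walkthrough.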

That's because the person in question is young, has never been married, and their highest education qualification is fairly low. Let's move on and do a search within our predictions to find those users who earn more than 50,000. So we've kicked this off and now we're waiting for our results to come back. This filter is searching for a predicted label that contains greater than 50K. And here we can see the first 20 results that have come back.
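A filter like the one described could look roughly like this. It is a sketch only: the `PredictionFilter` object is hypothetical, and the `predictedLabel` column name and the `>50K` label text are assumptions about how the data is labeled.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, not taken from the course notebook.
object PredictionFilter {
  // Keeps only the rows whose predicted label contains ">50K".
  def highEarners(predictions: DataFrame, labelCol: String = "predictedLabel"): DataFrame =
    predictions.filter(col(labelCol).contains(">50K"))
}
```

Calling `PredictionFilter.highEarners(predictions).show(20)` would then print the first 20 matching records.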

Now the interesting thing here again is if we look at the attributes of age, work class, education and relationship we can see they're quite different to the previous records. Here we can see that each person is much older and the education qualification is much higher. And so the predicted label of them earning more than 50K is again quite feasible. In the final section of our decision tree notebook we're simply going to dump out the decision tree model to see how the splits were done.

By doing so we can get an understanding of which features within our dataset are being used to do the splits within the decision tree model. Here you can see the decision tree model that we've trained is fairly extensive and fairly detailed. There's a lot of branching going on against various features within our dataset. And again, keep in mind that this hasn't been explicitly programmed; it has been learned during the machine learning training phase.
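One way to dump the learned splits from a fitted pipeline is sketched below. The `TreeDump` object is a hypothetical name; the sketch assumes the fitted pipeline contains a decision tree classification stage and looks it up by type rather than by position.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.DecisionTreeClassificationModel

// Hypothetical helper, not taken from the course notebook.
object TreeDump {
  // Returns the textual if/else description of the first decision tree stage, if any.
  def describe(model: PipelineModel): Option[String] =
    model.stages.collectFirst {
      case tree: DecisionTreeClassificationModel => tree.toDebugString
    }
}
```

Printing the result, for example with `TreeDump.describe(model).foreach(println)`, shows each split as a condition on a feature index, so the indices need to be mapped back to the columns passed to the vector assembler.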

Okay, that's the end of running our notebook. Let's jump into our EMR service console and click on the Application history tab. Here we'll be able to look at the applications that are running within our Spark cluster. Clicking on the one application here allows us to drill down into the details of the application. Here we can get a list of the jobs that have been submitted as part of this particular application. Next we can click on the Stages tab and see the individual stages that have been running within our EMR cluster.

If we click on the Executors tab, we can see the individual executors running within our EMR cluster. And finally, let's click on the Monitoring tab, where we can see various statistics being collected and aggregated regarding our EMR cluster. If you followed along and spun up your own EMR cluster, don't forget to terminate it at the end of the demonstration. We do so by changing the termination protection to off and then clicking the Terminate button.

Okay, that concludes our demonstration on distributed machine learning using Elastic MapReduce, with Apache Spark and Zeppelin notebooks, building a decision tree machine learning model. Go ahead and close this lecture and we'll see you shortly in the last one.

About the Author

Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.

He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, Azure, GCP), Security, Kubernetes, and Machine Learning.

Jeremy holds professional certifications for AWS, Azure, GCP, Terraform, Kubernetes (CKA, CKAD, CKS).