Dealing with Categorical Variables



Bokeh is an interactive visualization library in Python that provides visual artefacts for modern web browsers. In this course, we're going to have a look at the fundamental tools that are necessary to build interactive plots in Python using Bokeh.

Bokeh exposes two interface levels to users: bokeh.plotting and bokeh.models, and this course will focus mainly on the bokeh.plotting interface. 

We'll start things off by exploring two key concepts in Bokeh: Column Data Source and Glyphs. Then we'll move on to looking at different aspects related to the customization of a bokeh plot, as well as focusing on how to introduce interactivity into a Bokeh object.

You'll also learn about using inspectors to report information about the plot and we'll also investigate different ways to plot multiple Bokeh objects in one figure. We'll round off the course by looking at plot methods for categorical variables.

Learning Objectives

  • Learn about Column Data Sources and Glyphs in Bokeh and how they are used
  • Learn how to customize your plots and add interactivity to them
  • Understand how inspectors can be added to plots to provide additional information
  • Learn how to plot multiple Bokeh objects in one figure
  • Understand the plot methods available for categorical variables

Intended Audience

  • Data scientists
  • Anyone looking to build interactive plots in Python using Bokeh


To get the most out of this course, you should have a good understanding of Python. Before taking this course, we also recommend taking our Data Visualization with Python using Matplotlib course.


The GitHub repo for this course can be found here: 


Welcome back. To illustrate this scenario, we will use a different dataset that is well-known as the cars dataset. Bokeh comes with a submodule called sampledata, which contains different datasets that can be used to play with the bokeh interfaces. For further details, please check the official documentation.

We make the necessary imports: from bokeh plotting, we import the figure, show and the output_notebook functions. We call output_notebook as usual. We then import from the bokeh sampledata autompg submodule the autompg_clean dataset and we store that dataset as cars.

We can inspect the first five rows using the pandas head method, as follows: this data frame is characterised by the following columns, and each observation in this dataset identifies a model of car with respect to its production origin and a set of features highlighted here.
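As a rough sketch of the loading step, the snippet below imports the dataset and prints its first rows. The fallback data frame is a hypothetical stand-in (not part of the lecture) so the sketch still runs when Bokeh's sample data is not available; in a notebook you would also call output_notebook() as described above.

```python
import pandas as pd

try:
    # Needs Bokeh and its sample data to be available locally.
    from bokeh.sampledata.autompg import autompg_clean as cars
except Exception:
    # Hypothetical stand-in with a few made-up rows, used only so the
    # sketch still runs when the sample data cannot be imported.
    cars = pd.DataFrame({
        "mpg": [18.0, 24.0, 26.0],
        "cyl": [8, 4, 4],
        "origin": ["North America", "Asia", "Europe"],
        "name": ["car a", "car b", "car c"],
    })

# Inspect the first five rows (fewer if the stand-in is in use).
print(cars.head())
```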

As we have seen in Lecture 2, the process of creating basic bar charts with bokeh is very simple using the vbar() or hbar() glyph methods. For instance, suppose we want to group the cars dataset by origin. We can do the following: we group the dataset by the column “origin” and we count the number of occurrences. This has been done for you.

Let’s check the result of this “cars by origin” dataset: we see that in this dataset the most popular region is North America, followed by Asia and Europe. 
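The grouping step might look something like the following. This is a minimal sketch on a tiny made-up data frame rather than the real cars dataset, and the use of size() with reset_index() is an assumption about how the count was produced in the lecture notebook.

```python
import pandas as pd

# Hypothetical stand-in for the cars dataset (only the column we need).
cars = pd.DataFrame({
    "origin": ["North America", "North America", "North America",
               "Asia", "Asia", "Europe"],
})

# Count the rows for each origin and keep the result as a
# two-column data frame with columns "origin" and "count".
cars_by_origin = cars.groupby("origin").size().reset_index(name="count")

print(cars_by_origin)
```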

We’ll now create a ColumnDataSource. So from bokeh models, we import ColumnDataSource and we create an object based on the `cars_by_origin` dataframe, and we store this in the variable `cars_by_origin_source`.

We now want to embed that result into an appropriate graphical dimension. To do so, we proceed as follows: firstly we extract the distinct `origin` values using the pandas method `drop_duplicates()` and we store the result in a list called `factors`. So we access the origin column, drop the duplicate values, and convert the result to a list. A simple inspection of this confirms that we have a list with three regions inside it.
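The two steps just described can be sketched as follows. The counts are made up for illustration, and the exact variable names in the lecture notebook may differ.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Hypothetical grouped counts standing in for the real result.
cars_by_origin = pd.DataFrame({
    "origin": ["Asia", "Europe", "North America"],
    "count": [80, 70, 250],
})

# Wrap the data frame in a ColumnDataSource for use by glyph methods.
cars_by_origin_source = ColumnDataSource(cars_by_origin)

# Distinct origin values, as a plain Python list of categorical factors.
factors = cars_by_origin["origin"].drop_duplicates().tolist()

print(factors)
```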

We then initialise a figure object by setting the x_range argument equal to the list of factors we have just created, and the title equal to "Cars by Origin". Also, we want to set the plot width equal to 800 and plot height equal to 300. We also set the toolbar_location to None and set tools as an empty string.

We then apply vbar() to the plot with the following arguments: for the x-coordinate we set the `origin` column from the CDS, and as the top argument, we pass count. We also need to specify the CDS here. And finally, we set the width of each bar to be equal to 0.9.

We then set the title font size directly by setting the text_font_size attribute equal to "12pt" and then we show the plot.

We can improve this plot by setting the y_range.start to zero: this means that the bar will be completely adjacent to the x-axis whereas now the plot has a gap between the bar itself and the factor in the x-axis. We can also remove the vertical line that matches each single factor by setting xgrid.grid_line_color equal to None.
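Putting the bar chart steps together, a sketch could look like this. The counts are made up, and the sketch uses the `width` and `height` arguments accepted by current Bokeh releases; older releases, such as the one the lecture appears to use, call these `plot_width` and `plot_height`.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show

# Hypothetical counts per origin (not the real dataset values).
source = ColumnDataSource(data={
    "origin": ["Asia", "Europe", "North America"],
    "count": [80, 70, 250],
})
factors = list(source.data["origin"])

# A categorical x-axis is created by passing the factors as x_range.
p = figure(
    x_range=factors,
    title="Cars by Origin",
    width=800,    # plot_width in older Bokeh releases
    height=300,   # plot_height in older Bokeh releases
    toolbar_location=None,
    tools="",
)

# One vertical bar per origin, with height taken from the count column.
p.vbar(x="origin", top="count", source=source, width=0.9)

p.title.text_font_size = "12pt"
p.y_range.start = 0              # bars sit directly on the x-axis
p.xgrid.grid_line_color = None   # remove the vertical grid lines

# show(p)  # uncomment to render the plot
```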

There are situations in which we may want to have bars that are shaded in a color. This can be accomplished in different ways. In this lecture, we investigate a very important Bokeh model called CategoricalColorMapper, which maps each factor to a given color so that the bars in the plot can be identified by their color.

There is a function `factor_cmap()` that makes this simple: so from bokeh.transform we import the factor_cmap function. We also import the Colorblind3 palette from bokeh palettes.

The factor_cmap function creates a dict that applies a CategoricalColorMapper transformation to a ColumnDataSource column. It requires the following arguments: the field_name, in our case `'origin'`; a palette, that is, a list of colors to use for color mapping, in our case `Colorblind3`; and finally the factors, a sequence of categorical factors corresponding to the palette (in our case the list of factors).

We store this into the cat_color_map object. To create an implicit legend inside the plot, inside the bar glyph we need to specify legend_field equal to the origin column. And just so that it looks nicer, we specify line_color to be equal to `'white'`, since by default the bars are outlined in blue.

We then set fill_color equal to the cat_color_map we just created. We also set the legend orientation as `"horizontal"` and its location as `"top_center"` so that the legend will be shown in the top center of the plot. To avoid a possible overlap between bars and legend, we also set y_range.end to be equal to 400. We see that now we have a one-to-one mapping between an origin and a color, and this is shown in the legend.
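A sketch of the color-mapped bar chart is given below. The counts are made up, and `width` and `height` are used in place of the older `plot_width` and `plot_height` names, so treat it as an illustration under those assumptions rather than the lecture's exact code.

```python
from bokeh.models import ColumnDataSource
from bokeh.palettes import Colorblind3
from bokeh.plotting import figure, show
from bokeh.transform import factor_cmap

# Hypothetical counts per origin (not the real dataset values).
factors = ["Asia", "Europe", "North America"]
source = ColumnDataSource(data={"origin": factors, "count": [80, 70, 250]})

# Map each factor in the "origin" column to one color of the palette.
cat_color_map = factor_cmap("origin", palette=Colorblind3, factors=factors)

p = figure(x_range=factors, title="Cars by Origin", width=800, height=300,
           toolbar_location=None, tools="")

p.vbar(
    x="origin",
    top="count",
    source=source,
    width=0.9,
    fill_color=cat_color_map,   # color chosen per factor
    line_color="white",         # white outline around each bar
    legend_field="origin",      # one legend entry per distinct origin
)

p.y_range.start = 0
p.y_range.end = 400             # head room so bars do not overlap the legend
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

# show(p)  # uncomment to render the plot
```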

Another common operation on bar charts is to stack bars on top of one another. Bokeh makes this easy to do with the specialized `hbar_stack()` and `vbar_stack()` functions. The example we are going to look at now will show the bars for each origin type stacked instead of grouped.

Unlike the previous example, we now group the cars dataset with respect to the `origin` and `cyl` columns to get the total number of statistical observations in our dataset. This has been done for you below here.

Please note that we also need the cylinder to be of type string since we want to stack observations with respect to both origin and cyl. Hence, we need to convert this feature into an object using the `astype('str')` method.

To stack bars, we also need to reshape the data a little bit: we therefore create a pivot table from the cars_by_origin_cyl dataframe with index set to be the origin and columns the cylinders. We also require fill_value to be equal to 0 to be sure possible nan values are filled properly. This has been done for you again here, and it is stored in the variable pivoted_table. We get the following table.
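The grouping and reshaping could be done along these lines. The data frame is a small made-up stand-in, and the choices of size(), aggfunc="sum" and reset_index() are assumptions, since the lecture only says that this step has been prepared in advance.

```python
import pandas as pd

# Hypothetical stand-in for the cars dataset (only the columns we need).
cars = pd.DataFrame({
    "origin": ["North America", "North America", "North America",
               "Asia", "Asia", "Europe"],
    "cyl": [8, 8, 4, 4, 4, 4],
})

# Count observations for each (origin, cyl) pair.
cars_by_origin_cyl = (
    cars.groupby(["origin", "cyl"]).size().reset_index(name="count")
)

# Cylinder values become column names later, so store them as strings.
cars_by_origin_cyl["cyl"] = cars_by_origin_cyl["cyl"].astype("str")

# One row per origin, one column per cylinder value, missing pairs as 0.
pivoted_table = cars_by_origin_cyl.pivot_table(
    index="origin", columns="cyl", values="count",
    aggfunc="sum", fill_value=0,
).reset_index()
pivoted_table.columns.name = None

print(pivoted_table)
```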

Now, we create three variables: data, which is the dictionary version of the pivoted_table; origin_list, namely the values under the `origin` key of the data dict; and cyl_list, the list of cylinders from the pivoted_table, which is nothing more than the columns from index 1 to the last column.
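These three variables might be built as in the sketch below. The pivoted table is made up, and the use of to_dict(orient="list") is an assumption about what "the dictionary version" means here.

```python
import pandas as pd

# Hypothetical pivoted table: one row per origin, one column per
# cylinder value (column names are strings).
pivoted_table = pd.DataFrame({
    "origin": ["Asia", "Europe", "North America"],
    "4": [2, 1, 1],
    "8": [0, 0, 2],
})

# Column name -> list of values, a shape Bokeh accepts as a data source.
data = pivoted_table.to_dict(orient="list")

# The categorical factors for the x-axis.
origin_list = data["origin"]

# Every column after the first one is a cylinder value to stack.
cyl_list = pivoted_table.columns[1:].tolist()

print(origin_list, cyl_list)
```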

We are now ready to create a stacked plot: we first import a particular palette for this kind of figure, since we want to distinguish each particular stack that forms the bar associated to each factor of the origin list. In particular, we use the Category20_5 palette since we have 5 different types of cylinder values.

We then proceed as follows. We initialize a figure object by setting x_range equal to the origin_list, then we set the plot_height equal to 300 and plot width equal to 450 and then we need a title, in this case it’s going to be `"Cars by Origin and Cylinder"`. And as before, we set the toolbar_location to None and tools equal to an empty string.

We then apply the vbar_stack method to the new figure object by specifying the following arguments. Stackers, which is the `cyl_list`: these are the columns used to create the bar charts based on each different cylinder value. x, which is `'origin'`, the column from the `data` source. We then set color equal to the Category20_5 palette to distinguish each single stack. We then set the legend_label equal to the cyl_list to display a legend inside the plot. And finally, we set the argument width equal to 0.9 to control the width of the bars.

We finally customize the figure in the following fashion. We set the y_range.start attribute equal to 0: this controls the y range starting value. We set the x_range.range_padding attribute equal to 0.1: this controls the padding added around the computed data bounds.

We also add the xgrid.grid_line_color attribute, set to None: this disables the vertical line associated to each single x coordinate. We also set the axis.minor_tick_line_color attribute to None: this disables the line color of the minor ticks.

We then set the outline_line_color attribute to None: this disables the line color for the plot border outline. We set the legend.location attribute as "top_left": this controls the legend position. And we also set legend.orientation equal to "horizontal": this controls the legend orientation. And finally, we show the plot.
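The whole stacked bar chart can be sketched as follows. The counts are made up (only the five cylinder values 3, 4, 5, 6 and 8 follow the lecture), and `width` and `height` are used in place of the older `plot_width` and `plot_height` names.

```python
from bokeh.palettes import Category20_5
from bokeh.plotting import figure, show

# Hypothetical counts: one list per cylinder value, aligned with "origin".
data = {
    "origin": ["Asia", "Europe", "North America"],
    "3": [4, 0, 0],
    "4": [60, 55, 70],
    "5": [0, 3, 0],
    "6": [6, 4, 75],
    "8": [0, 0, 100],
}
origin_list = data["origin"]
cyl_list = [key for key in data if key != "origin"]

p = figure(x_range=origin_list, height=300, width=450,
           title="Cars by Origin and Cylinder",
           toolbar_location=None, tools="")

# One stacked segment per cylinder column, each with its own color
# and its own legend entry.
renderers = p.vbar_stack(
    cyl_list,
    x="origin",
    source=data,
    color=Category20_5,
    legend_label=cyl_list,
    width=0.9,
)

p.y_range.start = 0                    # bars start at zero
p.x_range.range_padding = 0.1          # padding around the factors
p.xgrid.grid_line_color = None         # no vertical grid lines
p.axis.minor_tick_line_color = None    # hide minor ticks
p.outline_line_color = None            # no plot border
p.legend.location = "top_left"
p.legend.orientation = "horizontal"

# show(p)  # uncomment to render the plot
```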

In the output, we see that each bar is now broken down by the number of cylinders. This concludes the lecture on categorical variables with the bokeh plotting interface.


About the Author

Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.

He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.