This Course explores how to interpret your data so that you can decide which chart type best visualizes and conveys your data analytics. Using the correct visualization techniques allows you to gain the most from your data. In this Course, we will look at the importance of data visualization, and then move on to the relationships, comparisons, distribution, and composition of data.
If you have any feedback relating to this Course, feel free to get in touch with us at email@example.com.
- Get an overview of what data visualization is and why it's important
- Learn how to visualize relationships within data
- Learn about comparisons, distribution, and composition of data
This Course has been designed for those working with big data or data analytics who need to interpret their results effectively.
As a prerequisite to this Course, you should have a very basic understanding of the terminology used in relation to tables and graphs.
Hello and welcome to this lecture which will be looking at how you can present the distribution of your data through column histograms. But firstly, we need to understand what data distribution is.
Put simply, a data distribution shows us all the values of our data set and how often each value occurs, allowing you to easily visualize data that could contain thousands of data points. We need to create a frequency distribution before we can create a histogram, so let’s take a look at a scenario of how to get our frequency distribution first before creating our histogram.
Scenario: A fishery has taken stock of all Tench in one of its lakes and measures the length of each fish in centimeters. 50 Tench are caught, and the results are recorded and displayed as follows.
By creating a frequency distribution of this data we will be able to see how the data is distributed or grouped across the whole data set.
To find this, we must first find the upper and lower bounds of the data. Looking at our table we can see that the largest value is 80 and the smallest is 15. To get our range, we subtract the smallest from the largest:
80 - 15 = 65
So 65 is our range.
We now need to define how many classes we would like in our distribution. These classes can be considered ‘intervals’ of data entries. Each class will have a count of the number of values that fit within that interval.
So to make it clearer, let’s work through it. We have a range of 65, and I estimate that somewhere between 12 and 14 classes would be a good fit. This should be enough to give a spread of values that lets us see the significant shape of our data set.
To get our class width, we take our range and divide it by the number of classes we would like: 65 divided by 14 gives 4.64. I then rounded this up to the nearest whole number, which is 5. This is our class width.
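As a quick sketch, the range and class width calculation can be expressed in a few lines of Python (the values 80, 15, and 14 are the ones from our scenario):

```python
import math

largest = 80          # longest Tench recorded, in cm
smallest = 15         # shortest Tench recorded, in cm
num_classes = 14      # the number of classes we chose

data_range = largest - smallest                     # 80 - 15 = 65
class_width = math.ceil(data_range / num_classes)   # 4.64... rounded up to 5

print(data_range, class_width)  # 65 5
```

Note the use of `math.ceil` to always round up, so every value in the data set is guaranteed to fall inside one of the classes.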
We can now start to build our data distribution, and the clearest way to show you this is to create another table.
We start by building a simple table of 3 columns, with the fields ‘Class Limits’, ‘Tally’ and ‘Frequency’.
Our class limits start with our lowest value, 15. Using our class width of 5, the first class runs up to 19, covering the values 15, 16, 17, 18, and 19, a total of 5 values. The next class then runs from 20 - 24, and so on, until we reach the class containing our highest value, 80, making the final class 80 - 84.
Now that we have our class limits, we can populate the tally column with the number of values from our original data set that fall into each class. For example, there are 5 values in the range 15 - 19.
Once the tally column is populated, it's then easy to fill out the frequency column, which is simply the numerical count of each tally.
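The tally-and-count step can also be sketched in Python. The 50 measurements from the lecture's table aren't reproduced here, so the list below is a small made-up sample of lengths purely for illustration:

```python
from collections import Counter

# Hypothetical sample of fish lengths in cm (not the lecture's real data).
lengths = [15, 17, 19, 22, 23, 41, 42, 44, 46, 47, 48, 63, 80]

class_width = 5
lower_bound = 15  # smallest value in the data set

def class_of(value):
    """Return the lower limit of the class a value falls into (15-19, 20-24, ...)."""
    return lower_bound + ((value - lower_bound) // class_width) * class_width

# Counter does the tallying for us: class lower limit -> frequency.
frequency = Counter(class_of(v) for v in lengths)

for low in sorted(frequency):
    print(f"{low} - {low + class_width - 1}: {frequency[low]}")
```

Each value is mapped to the lower limit of its class with integer division, which is exactly the tallying we did by hand above.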
We have now created our frequency distribution from the original data set. As I explained earlier, a data distribution shows us all the values of our data set and how often each value occurs, and that’s exactly what we have done here.
Now that we have our frequency distribution, we can create a histogram. A histogram is effectively the same as a bar or column chart, but it shows values based on the frequency distribution. With that in mind, our column histogram would look as shown here.
So from our initial data set of 50 fish sizes, we have created a frequency distribution that grouped the data into different classes.
From this data, we can clearly see which size range of fish occupies the lake the most through the frequency of occurrence in each class. With that knowledge, we can assume that if you were to catch a Tench in this lake, it would most likely be between 40 and 49 cm in length, as the 40 - 44 and 45 - 49 classes have the two highest frequencies across the histogram.
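Finding those peak classes programmatically is straightforward once you have the frequency distribution. The counts below are invented for illustration, shaped to match the scenario: 50 fish in total, with the 40 - 44 and 45 - 49 classes tied for the highest frequency:

```python
# A frequency distribution as a dict of class lower limit -> count.
# These counts are hypothetical, not the lecture's actual table.
frequency = {15: 5, 20: 3, 25: 4, 30: 3, 35: 4, 40: 8, 45: 8,
             50: 4, 55: 3, 60: 3, 65: 2, 70: 1, 75: 1, 80: 1}

peak = max(frequency.values())
modal_classes = [f"{low} - {low + 4}" for low, count in sorted(frequency.items())
                 if count == peak]

print(modal_classes)  # the class(es) a caught fish is most likely to fall in
```

The classes sharing the highest frequency are the modal classes, which is the reasoning we just applied by eye to the histogram.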
Histograms offer the benefit of presenting huge data sets in a simple and readable manner by distributing the data values into frequencies. We could have easily scaled this up to thousands of fish using the same method, keeping the same class widths and assigning each fish to a class.
That now brings me to the end of this lecture covering data distribution.
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.