Distributions help us summarise how observations are spread across their possible values and present this pattern graphically.
A frequency distribution is a representation, either in a graphical (histogram) or tabular format, that displays the number of observations within a given interval. The interval size depends on the data being analysed. The intervals must be mutually exclusive and exhaustive.
| Image: example of frequency distribution |
The height of each bar shows the number of observations with values from the interval at the base of the bar.
While a frequency distribution gives the exact frequency or the number of times a data point occurs, a probability distribution gives the probability of occurrence of the given data point.
When the number of data points is large, the frequency distribution and the probability distributions are similar in shape.
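The relationship between the two can be sketched in a few lines of Python. The ages below are a hypothetical sample, not real data; counting occurrences gives the frequency distribution, and dividing each count by the total turns it into a probability distribution:

```python
from collections import Counter

# Hypothetical sample: ages of children observed in a playground
ages = [5, 6, 6, 7, 7, 7, 7, 8, 8, 9]

# Frequency distribution: how many times each value occurs
freq = Counter(ages)

# Probability distribution: divide each frequency by the total count
n = len(ages)
prob = {age: count / n for age, count in freq.items()}

print(freq[7])   # 4 observations of age 7
print(prob[7])   # 0.4 -> age 7 is the most probable
```

Note that the probabilities sum to 1, which is what distinguishes a probability distribution from a raw frequency count.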
The probability density function (PDF) is the continuous-variable counterpart of the probability distribution. Rather than a probability, it gives the relative likelihood that a continuous random variable takes a value near a given point; actual probabilities correspond to areas under the curve.
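As a concrete sketch, the normal PDF can be written out directly from its formula. The function below evaluates the density at a point; the comment flags the key caveat that for continuous variables this is a density, not a probability:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x.

    For a continuous variable this is a density, not a probability:
    probabilities come from areas under the curve, not single points.
    """
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The density peaks at the mean and falls off in the tails:
print(normal_pdf(0.0))  # about 0.3989 for the standard normal
print(normal_pdf(2.0))  # much smaller, two standard deviations out
```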
The graph below is an example of a probability distribution. The horizontal axis shows the ages of children in a playground; the vertical axis shows the probability of finding a child of each age there. The probability is highest for age 7, at between 20% and 25%.
| Image: Example of probability distribution |
The most important type of distribution is the Normal (Gaussian) distribution, named after the famous German mathematician Carl Friedrich Gauss (1777-1855).
In practice, many biological measurements are approximately normally distributed.
The first application of the normal distribution was the analysis of errors of measurement made in astronomical observations (Galileo). The conclusions were:
- errors are due to imperfect instruments and imperfect observers.
- errors are distributed symmetrically around the true value.
- small errors occur more frequently than large errors.
- the hypothesised distribution of errors follows what we now call the normal distribution.
Normal distributions are:
- Bell-shaped, symmetrical about the mean.
- The mean, median and mode coincide.
Their parameters are:
- μ - mean
- σ - standard deviation
| Image: example Normal distribution |
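The two parameters can be illustrated by drawing a large sample and estimating them back from the data. This is a minimal sketch using the standard library's `random.gauss`; the values μ = 4 and σ = 2 are arbitrary choices for the demonstration:

```python
import random
import math

random.seed(42)  # reproducible sketch

mu, sigma = 4.0, 2.0  # assumed population parameters
sample = [random.gauss(mu, sigma) for _ in range(100_000)]

# Estimate the parameters back from the sample
mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)
std = math.sqrt(var)

print(round(mean, 2), round(std, 2))  # close to 4.0 and 2.0
```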
Sample Size ↑ ⇒ Standard Error of the Mean ↓
| Image: Example of normal distribution |
The larger the sample size, the smaller the spread of sample means around the population mean: the standard error of the mean is σ/√n, so it shrinks as n grows.
For a normal frequency distribution:
- about 68% of all observations are within one standard deviation of the mean.
- about 95% of all observations are within two standard deviations of the mean.
- about 99.7% of all observations are within three standard deviations of the mean.
This is called the 68-95-99.7 empirical rule. The rule provides a useful tool for cleaning data (flagging and removing likely outliers).
| Image: The 68-95-99 empirical rule |
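The rule is easy to verify empirically. The sketch below draws a large standard normal sample and counts the fraction of observations within one, two, and three standard deviations of the mean:

```python
import random

def fraction_within(data, mu, sigma, k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(1 for x in data if abs(x - mu) <= k * sigma) / len(data)

random.seed(0)  # reproducible sketch
data = [random.gauss(0.0, 1.0) for _ in range(100_000)]

for k in (1, 2, 3):
    print(k, round(fraction_within(data, 0.0, 1.0, k), 3))
# roughly 0.683, 0.954 and 0.997, matching the empirical rule
```

Filtering out points with `abs(x - mu) > 3 * sigma` is the data-cleaning use mentioned above.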
When a distribution is skewed, the mode remains the most commonly occurring value and the median remains the middle value in the distribution, but the mean is generally ‘pulled’ in the direction of the tails.
Negative skew is a left skew, and a positive skew is a right skew.
When most of the data are on the right, the tail is to the left and when most of the data are on the left, the tail is to the right.
The histogram below on the left is negatively (left) skewed: most data are to the right (see the heights of the bars), and the tail stretches to the left, towards smaller values.
The histogram on the right is positively (right) skewed: most data are to the left, and the tail stretches to the right, towards bigger values. In both cases, the skew is named after the direction in which the tail points.
| Image: Skewed Distributions |
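Skewness can also be computed numerically. The sketch below uses the standard moment-based sample skewness formula on two small hypothetical datasets, one with a long right tail and its mirror image:

```python
def skewness(data):
    """Sample skewness: positive -> right tail, negative -> left tail."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # variance (second moment)
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 10]     # long tail of big values
left_skewed = [-x for x in right_skewed]  # mirror image: long left tail

print(skewness(right_skewed) > 0)  # True
print(skewness(left_skewed) < 0)   # True
```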
Kurtosis refers to the degree of presence of outliers in a distribution: it measures whether the data are heavy-tailed or light-tailed relative to a normal distribution.
In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high level of risk for an investment because it indicates that there are high probabilities of extremely large and extremely small returns. A small kurtosis points to a moderate level of risk because the probabilities of extreme returns are relatively low.
| Image: Kurtosis |
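A minimal sketch of the moment-based measure follows. This computes excess kurtosis (the normal distribution is subtracted out as 3), so positive values indicate heavy tails and negative values light tails; the two datasets are contrived examples:

```python
def excess_kurtosis(data):
    """Sample excess kurtosis: ~0 for normal data,
    > 0 for heavy tails (outlier-prone), < 0 for light tails."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0

heavy_tailed = [-10] + [0] * 20 + [10]  # rare extreme values
light_tailed = [-1, 1] * 10             # no tails at all

print(excess_kurtosis(heavy_tailed))  # positive: outlier-prone
print(excess_kurtosis(light_tailed))  # -2.0, the minimum possible value
```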
Standard Normal distribution
A variable that follows a normal distribution with:
Mean = 0
Standard deviation = 1
is said to follow the standard normal distribution. Its values are known as 'standardised values' or 'z-scores'.
| Image: normal distribution vs standard normal distribution |
Benefits of a standard normal distribution:
- Transforming data to comparable scales can prevent variables with larger ranges outweighing those with smaller ranges.
- Standardised scores are unit-free, because the units cancel when we divide by the standard deviation.
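Both benefits can be seen in a short sketch. The heights below are hypothetical; after standardising, the values have mean 0 and standard deviation 1 and no longer carry centimetre units:

```python
import math

def standardise(data):
    """Map each value to its z-score: (x - mean) / standard deviation."""
    mean = sum(data) / len(data)
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
    return [(x - mean) / std for x in data]

# Hypothetical heights in centimetres -- the units cancel out below
heights_cm = [150, 160, 170, 180, 190]
z = standardise(heights_cm)

print([round(v, 3) for v in z])  # mean 0, standard deviation 1, unit-free
```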
The z score can be interpreted in units of the standard deviation.
e.g. 𝑥 ~ 𝑁( 𝜇=4 , 𝜎=2)
This is a normally distributed data set with mean 4 and standard deviation 2.
To find the z-score of x = 10, we find how many standard deviations it lies from the mean:
x = μ + z·σ
10 = 4 + z·2
6 = z·2
z = 3
So x = 10 lies three standard deviations above the mean.
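Rearranging x = μ + z·σ gives z = (x − μ)/σ, which is a one-line function:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

# The worked example above: N(mu=4, sigma=2), x = 10
print(z_score(10, 4, 2))  # 3.0 -> three standard deviations above the mean
```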
Next, let’s trial our understanding with a quiz on probability and statistics concepts that were covered in this section.
When you’re ready, select the vertical Learning Path button to continue.