## Population and Sampling

Let’s first explain the terms population and sample.

As we explained, a **population** consists of all the members of a group, whether that group is made up of people or things. The group can be large or small, depending on what you’re interested in.

A **sample** is a subset of the population you’re interested in. Statistical studies typically work with samples because collecting data on the whole population is often time-consuming, expensive, or impossible. The sampling strategy you use needs to ensure that the sample chosen is representative of the population.

A **population parameter** is a value that describes the population in some way, for example the population mean. It is not a statistic, because it refers to the whole population rather than to a sample.

A **sample statistic** is a value that describes a sample drawn from a population, for example the sample mean. In a well-designed study, the sample statistic should be an accurate estimate of the corresponding population parameter.
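The difference between a parameter and a statistic can be illustrated with a minimal sketch. The population here is made up (hypothetical weekly running distances drawn from a normal distribution), and the names `population`, `sample`, and the sample size of 200 are assumptions chosen only for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: weekly running distance (km) for 10,000 people.
population = [rng.gauss(20, 5) for _ in range(10_000)]

# Population parameter: describes every member of the population.
population_mean = statistics.mean(population)

# Sample statistic: describes a simple random sample of 200 members.
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
```

In practice the population mean is usually unknown, which is why the sample mean is used to estimate it.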

Let’s look at an example study, based on an article on running, which you can access here.

Read the article and decide what the population and sample of the research are.

We can extract the following details:

**Research Question**

Can people become better, more efficient runners on their own, merely by running?

**Population of Interest**

All people.

**Sample**

Group of adult women who recently joined a running group.

**Population to which results can be generalised**

Adult women, if the data are randomly sampled.

Let’s look at another example, this time contrasting research with anecdotal evidence on smoking, taken from Brandt’s book, The Cigarette Century (2009, Basic Books).

Anti-smoking research started in the 1930s and 1940s when cigarette smoking became increasingly popular. While some smokers seemed to be sensitive to cigarette smoke, others were completely unaffected.

Anti-smoking research faced resistance based on anecdotal evidence such as "My uncle smokes three packs a day and he's in perfectly good health". Evidence of this kind rests on a very small sample that might not be representative of the population.

It was concluded that "smoking is a complex human behaviour, by its nature is difficult to study, confounded by human variability".

In time, researchers were able to examine larger samples of cases (smokers), and trends showing that smoking has negative health impacts became much clearer.

So, wouldn't it be better to just include everyone and "sample" the entire population?

This is called a **census**, and it can be difficult to complete:

- There always seem to be some individuals who are hard to locate or hard to measure. These difficult-to-find people may have certain characteristics that distinguish them from the rest of the population.
- Populations rarely stand still. Even if you could take a census, the population changes constantly, so it's never possible to get a perfect measure.
- Taking a census may be more complex than sampling.

**Exploratory Analysis and Inference**

Sampling is natural. Think about sampling something you are cooking - you taste (examine) a small part of what you're cooking to get an idea about the dish as a whole.

When you taste a spoonful of soup and decide the spoonful you tasted isn't salty enough, that's **exploratory analysis**.

If you generalise and conclude that your entire soup needs salt, that's an **inference**. For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population). If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot. If you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.
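The soup analogy can be sketched in a few lines. Everything here is an assumption made for illustration: the pot is modelled as 1,000 "drops" ordered from surface to bottom, and saltiness is assumed to rise linearly with depth because the salt has settled.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical pot: index 0 is the surface, index 999 is the bottom.
# Assumed saltiness runs from 0.0 at the surface to 1.0 at the bottom.
pot = [depth / 999 for depth in range(1000)]

# Saltiness of the whole pot (the population).
true_saltiness = statistics.mean(pot)

# Unstirred: the spoon only reaches the top 50 drops.
surface_spoonful = pot[:50]

# Stirred: every drop is equally likely to end up in the spoon.
stirred_spoonful = rng.sample(pot, 50)

print(f"Whole pot:        {true_saltiness:.3f}")
print(f"Surface spoonful: {statistics.mean(surface_spoonful):.3f}")
print(f"Stirred spoonful: {statistics.mean(stirred_spoonful):.3f}")
```

The surface spoonful badly underestimates the saltiness of the pot, while the stirred spoonful should land much closer to it.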

**Sampling Methods and Bias**

We are now going to examine a few sampling methods:

*Image: Sampling methods and bias*

**Simple Random Samples:** Cases are selected randomly from the population, with no implied connection between the cases that are selected.

Examples

- Tossing a fair coin
- Pulling names out of a hat
- Random number generators

*Image: Simple Random Sample*
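As a minimal sketch, a simple random sample can be drawn with a random number generator from Python's standard library. The population of 20 named members is hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population of 20 members.
population = [f"person_{i:02d}" for i in range(1, 21)]

# random.sample draws without replacement, so every member has the
# same chance of selection and nobody is picked twice.
simple_random_sample = rng.sample(population, k=5)

print(simple_random_sample)
```

This is the programmatic equivalent of pulling five names out of a hat.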

**Stratified Samples:** The population is divided into groups called strata, each made up of similar observations. We then take a simple random sample from each stratum.

Examples

- Gender
- Age groups
- Geography

*Image: Stratified sample*
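A stratified sample might be sketched as follows, using age group as the stratifying variable. The population, the three age bands, and the choice of 10 people per stratum are all assumptions for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Hypothetical population: (person_id, age_group) pairs.
age_groups = ["18-34", "35-54", "55+"]
population = [(person_id, age_groups[person_id % 3]) for person_id in range(300)]

# Step 1: split the population into strata of similar observations.
strata = defaultdict(list)
for person_id, group in population:
    strata[group].append(person_id)

# Step 2: take a simple random sample from each stratum.
per_stratum = 10
stratified_sample = {
    group: rng.sample(members, per_stratum)
    for group, members in strata.items()
}

for group, members in stratified_sample.items():
    print(group, sorted(members))
```

Every stratum is guaranteed to appear in the sample, which a single simple random sample of the whole population does not ensure.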

**Cluster Samples:** The population is first divided into groups (clusters), and then a fixed number of clusters is chosen randomly from among all the clusters. The resulting sample includes all members of the clusters that were chosen.

Examples

- Classrooms in a school

- Properties within a postcode

*Image: Cluster sample*
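Cluster sampling can be sketched with the classroom example. The school with eight classrooms of 25 pupils each is hypothetical, as is the choice of three clusters.

```python
import random

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical school: 8 classrooms (clusters) of 25 pupils each.
classrooms = {
    f"class_{c}": [f"class_{c}_pupil_{p}" for p in range(25)]
    for c in "ABCDEFGH"
}

# Step 1: randomly choose a fixed number of clusters.
chosen_classrooms = rng.sample(sorted(classrooms), k=3)

# Step 2: include every member of each chosen cluster.
cluster_sample = [
    pupil for name in chosen_classrooms for pupil in classrooms[name]
]

print(chosen_classrooms)
print(len(cluster_sample), "pupils sampled")
```

Only the clusters are chosen at random; within a chosen classroom no further sampling takes place.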

**Multistage Cluster Samples:** These are similar to cluster samples, except that a random sample is taken from each chosen cluster rather than including all members of the cluster.

*Image: Multistage cluster sample*
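A multistage cluster sample adds one step to cluster sampling, as in this sketch. The school, the three chosen classrooms, and the five pupils drawn from each are assumptions for illustration.

```python
import random

rng = random.Random(4)  # fixed seed so the sketch is repeatable

# Hypothetical school: 8 classrooms (clusters) of 25 pupils each.
classrooms = {
    f"class_{c}": [f"class_{c}_pupil_{p}" for p in range(25)]
    for c in "ABCDEFGH"
}

# Stage 1: randomly choose a fixed number of clusters.
chosen_classrooms = rng.sample(sorted(classrooms), k=3)

# Stage 2: take a simple random sample within each chosen cluster,
# rather than including every member.
pupils_per_classroom = 5
multistage_sample = {
    name: rng.sample(classrooms[name], pupils_per_classroom)
    for name in chosen_classrooms
}

total = sum(len(pupils) for pupils in multistage_sample.values())
print(total, "pupils sampled from", len(multistage_sample), "classrooms")
```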

Have you understood the difference between these methods? Answer the practice question below to find out:

**A city council has requested a household survey be conducted in a suburban area of their city. The area is broken into many distinct and unique neighbourhoods - some including large homes; some with only apartments. Which approach would likely be the least effective?**

(a) Simple random sampling

(b) Cluster sampling

(c) Stratified sampling

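The correct answer here is **(b) Cluster sampling**. Because the neighbourhoods differ so much from one another, a few randomly chosen clusters are unlikely to represent the whole area.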

We also talk about **non-random sampling**, in which the sample is selected based on the convenience, experience or judgment of the researcher and therefore carries bias.

Non-random sampling can be:

**Voluntary**: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue. Such a sample will not be representative of the population.

**Convenience**: Individuals who are easily accessible are more likely to be included in the sample.

**Non-response**: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population.

Now, let’s practise some more:

**A school district is considering if it will no longer allow high school students to park at school after two recent accidents where students were severely injured. As a first step, they survey parents by mail, asking them whether or not the parents would object to this policy change. Of 6,000 surveys that went out, 1,200 were returned. Of these 1,200 surveys that were completed, 960 agreed with the policy change and 240 disagreed. Which of the following statements are true?**

1. Some of the mailings may never have reached the parents.
2. The school district has strong support from parents to move forward with the policy approval.
3. It is possible that the majority of the parents of high school students disagree with the policy change.
4. The survey results are unlikely to be biased because all parents were mailed a survey.

(a) Only 1 (b) 1 and 2 (c) 1 and 3 (d) 3 and 4 (e) Only 4

The correct answer here is **(c) 1 and 3**.
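The arithmetic behind statement 3 can be checked with a short sketch using the figures from the question. The worst-case scenario, in which every parent who did not respond disagrees, is an assumption used only to show what is possible, not a claim about what the non-respondents actually think.

```python
# Figures taken from the question.
mailed = 6000
returned = 1200
agreed = 960
disagreed = 240

response_rate = returned / mailed            # share of parents who replied
agree_among_respondents = agreed / returned  # support among those who replied

# Worst case (assumed): every non-respondent disagrees with the change.
non_respondents = mailed - returned
worst_case_disagree = (disagreed + non_respondents) / mailed

print(f"Response rate:            {response_rate:.0%}")            # 20%
print(f"Agreement (respondents):  {agree_among_respondents:.0%}")  # 80%
print(f"Worst-case disagreement:  {worst_case_disagree:.0%}")      # 84%
```

Because only 20% of parents responded, the 80% agreement among respondents says little about the other 4,800 parents, so a majority of all parents could still disagree.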

Analytics is an embodiment of the scientific approach. We state what we expect to find (a hypothesis) and then test it against data. Are our results correct? What is the chance that what we discovered is random? You will find out more in the next lecture.
