4. Analyze


Lean Six Sigma Yellow Belt
Analyze Part 1

After completing this course, you will be able to:

  • Understand the key principles of Lean Six Sigma
  • Identify improvement opportunities in your organization (projects)
  • Understand and use the Define, Measure, Analyze, Improve, Control (DMAIC) model and key activities
  • Use the basic tools and techniques
  • Understand the role of Yellow Belts in Lean Six Sigma projects
  • Run small improvements in your day-to-day work processes

The modules covered in this learning:

  • Lean Six Sigma Overview
  • Define Phase
  • Measure Phase
  • Analyze Phase
  • Improve Phase
  • Control Phase

The recommended study time for this course is approximately 5 hours.

Please note: this content was produced in the UK and may include the use of British English.

Key exam information

There is no exam at the end of the LSSYB. A Yellow Belt certificate is issued upon completion of the training. 


Welcome, I'm Stephen Halliday, and I'm going to take you through the analyze phase of Lean Six Sigma. The analyze phase of Lean Six Sigma is about taking the data and the process and looking for the root causes of the problem that you're investigating. In this module, we will identify the critical factors which are affecting the required situation, find the root cause, and verify the root cause.

There are two approaches in analyze: the process door (or system door) and the data door. We will look at each one. In this section, exploratory data analysis, we will consider what to do if we've collected data in the measure phase. When we've collected data, we can display it using various graphical techniques. These are known as the simple quality tools. The ones we'll look at in this course are histograms, bar charts, Pareto charts, box plots, run charts, and scatter diagrams.

First, the histogram. The histogram is a type of bar chart, and you can see an example here. The difference between a histogram and a bar chart is that on a histogram, the X axis, the bottom axis, is made up of a continuous variable. For a bar chart, the X axis is made up of a discrete variable. So what's the difference between continuous variables and discrete variables?

Discrete variables have data in distinct categories, and usually it's count data. If you're watching traffic on a road, you could count how many green cars, blue cars, black cars, and silver cars go past; these are distinct categories. Continuous data is a little more difficult to understand, in that continuous data is defined as endlessly divisible. What this means is that you can measure to a smaller and smaller unit, provided you have the equipment to do so. For example, if you were measuring heights, you could measure in centimeters, then in millimeters, and then go to smaller and smaller units as you require, but you need a piece of equipment to tell you what the value is.

So data like weight, height, mass, time, and temperature are all continuous variables. The histogram itself shows you the shape of the data. In this example, we're looking at resolution time. You can see most of the queries are resolved very quickly, but some take quite a long time, and so there is a long tail on the shape of the data. It's important to know the shape of the data because it tells you whether you have an issue with the data or with the process. If the data looks like the shape you were expecting, then you can take various measurements from it and move on. However, if the shape is not what you were expecting, then you should be asking, why is the data not giving me the shape that I expected? This in itself may give you some indication of the root causes of problems in the process.
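As a rough illustration of how a histogram groups a continuous variable, the sketch below counts values in equal-width bins and prints a text version of the bars. The resolution times, the two-hour bin width, and the function name `histogram_counts` are all invented for illustration; they are not the data shown in the course.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall in each equal-width bin [low, high)."""
    counts = Counter(int(v // bin_width) for v in values)
    first, last = min(counts), max(counts)
    # Keep empty bins so that gaps in the shape stay visible.
    return [(k * bin_width, (k + 1) * bin_width, counts.get(k, 0))
            for k in range(first, last + 1)]

# Hypothetical resolution times in hours (illustrative only).
resolution_hours = [0.5, 1.2, 0.8, 2.5, 1.1, 0.3, 3.7,
                    1.9, 0.6, 9.4, 2.2, 0.9, 6.1, 1.4]

bins = histogram_counts(resolution_hours, bin_width=2)
for low, high, count in bins:
    print(f"{low:>2}-{high:<2} h | {'#' * count}")
```

With this made-up data, most values land in the first bin and a few land far to the right, which is the long-tail shape described above.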

Let's look at an example. This example is from financial services and relates to suspected fraud on financial accounts. In the process, a customer would ring up the organization to say that they suspected fraud on their account. Some details were taken and put into a document, and this document was then put into a queue for someone else to do further work and collect further data. Once that document was complete, it was sent to an external company to inform them of the suspected fraud. The external company had set a deadline of three months: they wanted the information about the suspected fraud within three months of the initial call. If they did not get that data within three months, they fined the processing organization.

When we look at this data, we can see quite a lot of it happened after the three-month deadline. In fact, 40% of the documents were sent to the external company after the three-month deadline, potentially a large fine. Someone took a look at this and started to assess what was going on in this process and why they took so long to send the information to the external company.

When you look at the shape of this data, various things stand out, but probably the biggest one is at three to four months. Here, there is a spike in the amount of activity and data being sent to the external company. This occurs just after the deadline. Now, it would seem reasonable that in most organizations this peak of activity would occur just before the deadline rather than just after. So the question is, why were they getting into a lot of activity just after the deadline rather than before?

When someone looked back in the process, they discovered that after the initial call, a date stamp was put on the document that went into the queue. This date stamp was three months from the original call. This meant that if that document had not been worked on and sent to the external company within three months and a day, it would appear on a report saying please deal with me. Clearly, being told one day after the three-month deadline that you are late is in itself too late. How was this resolved? Very easily: the date stamp was changed from three months to two months. Items that had not been worked on were now brought to people's attention while there was still a month in which to get the data and send it to the external company.
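The effect of moving the date stamp can be sketched with simple date arithmetic. This is only an illustration under stated assumptions: "three months" is approximated as 90 days, "two months" as 60 days, and the call date and the function name `reminder_schedule` are hypothetical.

```python
from datetime import date, timedelta

DEADLINE_DAYS = 90  # assumption: "three months" approximated as 90 days

def reminder_schedule(call_date, reminder_days):
    """Return (reminder_date, deadline_date, days_left_when_flagged)."""
    reminder = call_date + timedelta(days=reminder_days)
    deadline = call_date + timedelta(days=DEADLINE_DAYS)
    return reminder, deadline, (deadline - reminder).days

call = date(2024, 1, 10)  # hypothetical date of the customer's call

# Original set-up: the item appears on the report a day after the deadline.
old = reminder_schedule(call, reminder_days=DEADLINE_DAYS + 1)
# Revised set-up: the item is flagged at two months (assumed 60 days).
new = reminder_schedule(call, reminder_days=60)

print("old: days left when flagged =", old[2])
print("new: days left when flagged =", new[2])
```

A negative number of days left means the reminder arrives after the deadline has already passed, which is the situation the histogram exposed.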

Using a histogram identified this issue and allowed the improvement.

Here is a bar chart, and this has been included simply to show the difference between a bar chart and a histogram. You will see there are distinct categories on this chart. In fact, this is about looking at a priority system, and you can see most of the items are being put into high priority. This in itself might raise some questions: why are most of the items going into high priority and not the others?

A special type of bar chart is known as a Pareto chart. In a Pareto chart, the bars are arranged in descending order. In this Pareto chart, you can also see a red line, which is known as the cumulative line. A Pareto chart has two axes: the left-hand axis shows count, how many items are in each bar, and the right-hand axis shows cumulative count. The reason for the cumulative count comes from the work of Vilfredo Pareto, an Italian economist, which led people to talk about the 80/20 rule. He was looking at the distribution of wealth in Italy, and he found that 80% of the wealth in Italy was held by 20% of the people.

In process improvement, we often find that roughly 80% of the failures are created by 20% of the causes. So if we can identify this 20% of the causes, known as the vital few, and fix them, we will get rid of roughly 80% of the failures. This is the principle behind Pareto. You look for the cumulative 80% of the failures and identify the key causes that are creating them.
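The Pareto calculation described above can be sketched in a few lines: sort the categories in descending order, accumulate a running percentage, and read off the categories needed to reach 80%. The complaint categories and counts below are invented for illustration, and the function names `pareto_table` and `vital_few` are hypothetical.

```python
def pareto_table(counts):
    """Sort categories by count (descending) and add cumulative percentages."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    table, running = [], 0
    for category, count in ordered:
        running += count
        table.append((category, count, round(100 * running / total, 1)))
    return table

def vital_few(table, threshold=80.0):
    """Return the leading categories needed to reach the cumulative threshold."""
    selected = []
    for category, _, cumulative in table:
        selected.append(category)
        if cumulative >= threshold:
            break
    return selected

# Hypothetical complaint counts (illustrative only, not course data).
complaints = {
    "Wrong order": 25,
    "Food took too long": 35,
    "Billing error": 8,
    "Cold food": 20,
    "Missing cutlery": 12,
}

table = pareto_table(complaints)
for category, count, cumulative in table:
    print(f"{category:<20} {count:>3}  {cumulative:>5}%")
print("Vital few:", vital_few(table))
```

The 80% threshold is a rule of thumb rather than a fixed law, so it is exposed here as a parameter.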

Here is an example of complaints about a hotel's room service. As we can see, the largest category is that the food took too long to be delivered. Now, we could identify the vital few, but I would always recommend when using a Pareto chart, especially at the yellow belt level, to focus on the largest bar. In this case, if we can find out why the food is taking too long and fix it, we will get rid of approximately 35% of the complaints, which is a good improvement at a yellow belt level.

The next graphical technique is a box plot, or box and whisker plot. A box plot is a way of representing a group of data in a similar way to a histogram. In the box plot, the middle line is the median, or middle value. The box itself represents the middle 50% of the data. From the box extend lines known as whiskers. Depending on the software you're using, these whiskers will go either to the maximum and minimum values or to the furthest data points within 1.5 times the interquartile range (the length of the box) beyond each end of the box, which for normally distributed data is roughly plus or minus three standard deviations from the mean. Any data outside the whiskers is shown as a dot and is known as an outlier.
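The numbers behind a box plot can be sketched with the standard library. This example assumes one widely used whisker convention, Tukey's rule, in which the whiskers stop at the furthest points within 1.5 times the interquartile range of the box; your software may use a different rule. The delay values and the function name `box_plot_summary` are invented for illustration.

```python
import statistics

def box_plot_summary(values):
    """Median, quartiles, whisker ends and outliers using the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # the box covers the middle 50% of the data
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }

# Hypothetical schedule delays in days (illustrative only).
delays = [8, 9, 10, 10, 11, 11, 12, 12, 13, 14, 110]

summary = box_plot_summary(delays)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Different packages also compute quartiles in slightly different ways, so the box edges from this sketch may not match a given tool exactly.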

On its own, a box plot is not very useful for representing data, and a histogram would be preferred. However, a box plot is very useful for comparing several sets of data, and we'll see an example next. This example shows a comparison of four sets of data from four different countries. It's looking at schedule delays. The middle line, or median, is very similar for each of them. However, if we look at this graph, we can see that for India, there's a lot more variation in the data. We need to investigate why India is so different from the other three countries. If we look at the UK data, we can see that there is an outlier a long way above the main body of the data. Again, we need to investigate this to see what is happening. In fact, if you look at the data itself, it appears to have a value of around 110. Was this a mistaken entry? Should it have been 11 or 10? Or is it a real value that came from the process? Either way, we need to look at why it occurred and take action to make sure it doesn't happen again. So a box plot is very good for comparing several sets of data.

The next graph is a time series or run chart. A run chart shows data through time. The bottom X axis shows the progression of time, and the data is plotted against this. You could see a run chart in many organizations where they're plotting monthly values against each month. The run chart will be developed further in the control phase when we introduce the control charts. Here is an example of a run chart. In this case, we're looking at fixed income investments into an organization. And as you can see, the amount of fixed income is slowly increasing through time.

The next graphical method is a scatter diagram. The scatter diagram shows the relationship between two variables. In this example, the bottom axis shows the amount of effort put into fixing bugs in an IT system. The Y axis shows the amount of total effort for that project. Clearly, we can see here as the amount of time needed to fix bugs increases, the amount of time for the total project increases as well. We have a positive relationship. We can see different types of relationship with a scatter diagram. In this example, we have some more scatter plots. In the left-hand scatter plot, we can see a positive relationship in that as the X axis increases, so does the Y axis.

We saw an example of this with the bug fixing and the amount of time in development. The middle scatter plot shows a negative relationship, in that as the X axis increases, the Y axis decreases. An example of this could be that as the number of errors in an organization increases, customer satisfaction decreases.

In the right-hand scatter plot, we see no relationship between the two variables. The strength of a relationship can be represented by a number, written as a small r, which is called the correlation coefficient. Perfect positive correlation is plus one, perfect negative correlation is minus one, and no correlation is zero. So from this number we can find the strength of the relationship between the two variables.
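For readers who want to see where r comes from, here is a minimal sketch of the usual Pearson correlation coefficient computed by hand. The three small data sets are invented so that they give a perfect positive, a perfect negative, and a zero correlation; real data will fall somewhere in between.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # How x and y vary together, and how much each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(var_x * var_y)

# Hypothetical data (illustrative only).
x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])   # y rises as x rises
r_negative = pearson_r(x, [10, 8, 6, 4, 2])   # y falls as x rises
r_none = pearson_r(x, [3, 1, 4, 1, 3])        # no linear relationship

print(round(r_positive, 3), round(r_negative, 3), round(r_none, 3))
```

Note that r measures only the strength of a straight-line relationship, so it is best read alongside the scatter plot rather than instead of it.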

So how might we use a scatter plot? In this example, we are looking at the number of files processed against the number of errors. Now, as we would expect, the more files we process, the more errors we will find, and here we see a positive relationship. In this case, that's not what we're interested in. What we're interested in is the point labeled June-02, which stands out from the main body of data. At this point, 3,700 files were processed. Under normal circumstances, we would have expected around 400 errors. In this case, we've got nearly 500, a big increase. So this plot prompts the question, why has the number of errors increased in this case? We should investigate and take action to prevent it happening in the future.

That is the end of part one of analyze. In part two, we will be looking at two tools from the analyze phase: the five whys and the cause and effect diagram.
