Advanced Data Structures in R
Factors in R

Course Description 

This module looks at more complex data structures, building on what was covered in the intermediate data structures module.  

Learning Objectives 

The objectives of this module are to provide you with an understanding of: 

  • How to construct a factor in R  
  • How to construct a data frame in R  
  • How to modify a data frame  
  • How to subset a data frame  
  • How data frames automatically factorise data in R  

Intended Audience 

Aimed at all who wish to learn the R programming language. 


No prior knowledge of R is assumed. Delegates should already be familiar with basic programming concepts such as variables, scope, and functions. Experience of another scripting language such as Python or Perl would be an advantage. Understanding mathematical concepts will be beneficial. 


We welcome all feedback and suggestions - please contact us to let us know what you think. 


- [Instructor] Factors are special vectors that represent categorical data. They can be ordered or unordered. An example of unordered categorical data is gender, male and female. An example of ordered categorical data would be the rankings low, medium, and high. For example, I could use the factor constructor to create a factor from a series of categorical answers to questions. I could store it using the assignment operator and ask for it to be returned to the screen. And I can see that my answers of yes, no, no, yes, yes have been stored, and the levels have been noted as no and yes. 
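The construction described above might look like the following minimal sketch. The name hope_answers and the five answers come from the narration; the exact call shown on screen is not in the transcript, so the character-string input is an assumption. The rankings vector is a hypothetical illustration of the ordered case.

```r
# Build an unordered factor from five yes/no answers (values from the narration).
# Assumption: the answers were supplied as character strings.
hope_answers <- factor(c("yes", "no", "no", "yes", "yes"))

# Printing shows the stored values followed by the levels.
print(hope_answers)

# Hypothetical ordered factor for the low / medium / high example.
rankings <- factor(c("low", "high", "medium"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)

# Ordered factors support comparisons between levels.
rankings[1] < rankings[2]   # low < high
```

Passing ordered = TRUE is what tells R that the levels have a meaningful rank, which is why the comparison on the last line is allowed.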

I can ask for the levels to be returned to the screen separately by using the levels function on the factor that we have stored as hope_answers. By default, the levels are sorted alphabetically. These levels are the predefined values of our categorical data. The factors on the screen might look like character vectors, but under the hood, they're stored as integers. For example, if I was to ask for the first entry of levels hope_answers, we'd see that R has assigned the integer one to the level no. I can repeat the same logic for the second entry of levels hope_answers to see that the integer two has been assigned to the level yes. How many levels do we have? We can tell from the screen above that we have two in total. We can use the function nlevels if, for example, we had too many to count on the screen. Can I set the order of the levels in advance? For example, I might want to prioritise yes before no, so that yes is my first level when I ask for the first entry of levels hope_answers. I can do this by setting the order of the levels when I create my factor. And now, if I call levels hope_answers to the screen, I can see that yes comes before no. This is very useful in linear modelling because the first level is usually treated as the baseline level. 
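A short sketch of the steps in this paragraph, assuming the same five answers as before. The calls on screen are not in the transcript, so this is one plausible way to reproduce them.

```r
# Same five answers as in the narration (assumed to be character strings).
hope_answers <- factor(c("yes", "no", "no", "yes", "yes"))

levels(hope_answers)       # "no" "yes" (alphabetical by default)
levels(hope_answers)[1]    # "no"  -> stored as integer code 1
levels(hope_answers)[2]    # "yes" -> stored as integer code 2
nlevels(hope_answers)      # 2
as.integer(hope_answers)   # 2 1 1 2 2 (the codes under the hood)

# Set the level order explicitly when creating the factor.
hope_answers <- factor(c("yes", "no", "no", "yes", "yes"),
                       levels = c("yes", "no"))
levels(hope_answers)       # "yes" "no"
```

With the explicit levels argument, yes becomes the first level, so a model fitted on this factor would typically treat yes as its baseline.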

What happens if I have an observation outside of the levels that we are defining? Let's say we created a factor where the data contained a series of answers including yes and no, plus a red herring of maybe, but the questionnaire at hand only allowed the levels yes and no. Observations outside of the levels that we have predefined are recorded as NA. And here, we can see the NA. Let's take, for example, a family of two girls and four boys. We could create a factor from a vector of the two girls and the four boys, where the girls are coded as the number one and the boys as the number zero. These are just codes that I've chosen to distinguish between girls and boys. I'm setting the order of the levels in advance, so zero, for boys, comes first and one, for girls, comes second. I am also adding labels, which the computer will output whenever it sees one of these levels. We now have a label to help us understand each value, rather than relying on the number that we assigned. 
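Both ideas can be sketched as follows. The name survey and its four answers are hypothetical, since the transcript only says the data included yes, no, and maybe. The kids vector follows the order implied later in the narration (one, zero, one, zero, zero, zero).

```r
# Hypothetical answers: "maybe" is not among the declared levels.
survey <- factor(c("yes", "no", "maybe", "yes"),
                 levels = c("yes", "no"))
print(survey)    # the third entry is shown as <NA>
is.na(survey)    # FALSE FALSE TRUE FALSE

# Two girls (coded 1) and four boys (coded 0), with labels attached.
kids <- factor(c(1, 0, 1, 0, 0, 0),
               levels = c(0, 1),
               labels = c("boy", "girl"))
print(kids)      # values are displayed as girl / boy, not 1 / 0
```

The labels are matched to the levels by position, so the first label (boy) goes with the first level (0) and the second label (girl) with the second level (1).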

The output will now use the labels girl and boy. So, if I now call the factor to the screen after creating it, you can see that the labels are shown and the levels are known, even though we created it with zeros and ones. What is the class of this? It's a factor. We might wonder now what the low-level data looks like. Underneath the kids factor of boys and girls, there is a series of integers. Whether or not we had supplied levels here, the point made on the previous screen, that a factor is stored as an integer under the hood, is demonstrated by the typeof function. If they are integers, then can I add or multiply these numbers? It's not meaningful with a factor. The fact that this has a class of factor tells R that the plus operator, the addition operator, is not meaningful here. 
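The checks described here can be reproduced with a sketch like the one below, assuming kids was built as in the narration. The name result is a hypothetical helper, and suppressWarnings is used only to keep the output quiet.

```r
# Recreate the kids factor: 1 = girl, 0 = boy.
kids <- factor(c(1, 0, 1, 0, 0, 0),
               levels = c(0, 1),
               labels = c("boy", "girl"))

class(kids)    # "factor"
typeof(kids)   # "integer"

# Arithmetic on a factor is not meaningful: R warns and returns NA values.
result <- suppressWarnings(kids + 1)
result         # NA NA NA NA NA NA
```

Without suppressWarnings, R would also print a warning stating that the operator is not meaningful for factors.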

If I would've preferred different numbers instead of the zeros and ones, how could I have changed that? Well, I can convert kids to a number and then add one, and that returns a different series of numbers. Instead of the one, zero, one, zero, zero, zero that we supplied, we get back three, two, three, two, two, two, because the conversion returns the internal codes, which start at one rather than zero. Instead of creating the factor through the lengthier process of zeros and ones, defining the levels and the labels in advance, I could've simply created a factor from the data that we have, which is two girls and four boys in their respective order. That returns the same output as before, but here we've created our factor in a much simpler fashion without worrying too much about the underlying levels. And I can prove that this is the same by using the equality comparison to check that each and every one of our data points matches.
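A minimal sketch of the conversion and the simpler construction, under the same assumptions about kids as above. The name kids_simple is hypothetical.

```r
# Recreate the kids factor: 1 = girl, 0 = boy.
kids <- factor(c(1, 0, 1, 0, 0, 0),
               levels = c(0, 1),
               labels = c("boy", "girl"))

as.numeric(kids)        # 2 1 2 1 1 1 (internal codes start at 1)
as.numeric(kids) + 1    # 3 2 3 2 2 2

# Simpler route: build the factor straight from the labels.
# Levels default to alphabetical order: boy, girl.
kids_simple <- factor(c("girl", "boy", "girl", "boy", "boy", "boy"))

# Element-wise equality check between the two factors.
kids == kids_simple       # TRUE for every position
all(kids == kids_simple)  # TRUE
```

The simpler route matches only because the alphabetical default (boy, girl) happens to coincide with the level order chosen explicitly in the longer version.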

About the Author

Kunal has worked with data for most of his career, ranging from diffusion Markov chain processes to migrating reporting platforms.  

Kunal has helped clients with early-stage engagement and developed multi-week training programme curricula. 

Kunal has a passion for statistics and data; he has delivered training relating to Hypothesis Testing, Exploring Data, Machine Learning Algorithms, and the Theory of Visualisation. 

Data Scientist at a credit management company; applied statistical analysis to distressed portfolios. 

Business Data Analyst at an investment bank; project to overhaul the legacy reporting and analytics platform. 

Statistician within the Government Statistical Service; quantitative analysis and publishing statistical findings of emerging levels of council tax data. 

Structured Credit Product Control at an investment bank; developing, maintaining, and deploying a PnL platform for the CVA Hedging trading desk. 
