# Subsetting Data Frames in R

**Course Description**

This module looks at more complex data structures, building on what was covered in the intermediate data structures module.

**Learning Objectives**

The objectives of this module are to provide you with an understanding of:

- How to construct a factor in R
- How to construct a data frame in R
- How to modify a data frame
- How to subset a data frame
- How data frames automatically factorise data in R

**Intended Audience**

Aimed at all who wish to learn the R programming language.

**Pre-requisites**

- No prior knowledge of R is assumed.
- Delegates should already be familiar with basic programming concepts such as variables, scope and functions.
- Experience of another scripting language such as Python or Perl would be an advantage.
- Understanding mathematical concepts will be beneficial.

**Feedback**

We welcome all feedback and suggestions - please contact us at qa.elearningadmin@qa.com to let us know what you think.

- [Instructor] In order to subset a data frame, let me first create a data frame of weights and emotions. Again, this data is random. I would like to pull the observations onto the screen so you can see what we have just created.

How can I access the elements within this data frame? I can use index notation: square brackets containing the rows and the columns that I would like to return. For example, I can leave both indices out. By leaving them blank, I ask for all rows and all columns. If I'd like to grab the weight in the first row, I can ask for observations one, one. I can ask for the first row but all columns by leaving out the column index, and hence I receive back two columns' worth of information. I can repeat the same logic to ask for the first column by leaving out the first index, meaning the rows. I am asking for all rows, but I'd like to see only the first column. Here I receive back the weights column.

I can use negative indexing. So I can ask for all rows but without the first column. Hence I receive back the second column in this case, because we have a two-column data frame. I can repeat the same with minus one on the rows: I remove the first row and ask for all columns to be returned.

Let me introduce you to a function which helps us with sampling before we try to take a random subset of our data frame. If I create, for example, a vector ranging from one to 10, I can use the sample function around this vector and receive back the same values that I put in, but in a random order. I can also use a parameter called size to say how many values I'd like to receive back. Each time I run this, I receive back a different random order, with as many values as I choose to return. If I would like to select a random subset utilizing this sample function, I can double check how many rows I have.
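The index notation and the sample function described so far can be sketched roughly as follows. The transcript does not show the instructor's actual data, so the `observations` values and the column names `weight` and `emotion` here are made-up stand-ins, and `set.seed` is added only so repeated runs give the same draw.

```r
# Hypothetical stand-in for the instructor's data frame; values are invented.
observations <- data.frame(
  weight  = c(61.2, 74.5, 58.9, 80.1, 67.3),
  emotion = c("happy", "sad", "calm", "angry", "happy")
)

observations[, ]     # blank indices: all rows, all columns
observations[1, 1]   # first row, first column: 61.2
observations[1, ]    # first row, all columns
observations[, 1]    # all rows, first column (the weights)
observations[, -1]   # all rows, without the first column
observations[-1, ]   # without the first row, all columns

# sample() returns the elements of a vector in random order;
# size controls how many of them come back.
set.seed(42)                       # assumption: only for reproducibility
shuffled <- sample(1:10)           # all ten values, shuffled
three    <- sample(1:10, size = 3) # three values drawn without replacement
```

Note that selecting a single column with `[, 1]` hands back a plain vector rather than a one-column data frame, which is the default behaviour of data frame indexing.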
I have five inside my observations data frame. I can run a vector from one to five, or I can use the nrow function on observations to generalize this to any size of data frame. Then I can ask for a sample of this vector, ranging from one to five, and receive back a random order. By default, the size parameter is five because that is the length of the vector that I have. I could change this to ask for a sample of four numbers and, as a final example, three numbers at random.

Now let us use this as part of our index notation to take a subset of our data frame. So I'm taking a subset now with the sample function. First, if I use just an arbitrary number, four, I receive back the fourth row of our data frame. If I pull the data frame up onto the screen, I've picked the fourth row by using the number four as our row index. I could sample from the first four rows with a size of four, which would return the first four rows in a random order. I could sample from all of the rows, one to five, but ask for only four to be returned. If I run that again, we can see that sometimes it will include the fifth row and sometimes it will drop it. I can generalize that: rather than using five as the end of my vector, I can use the number of rows of my data frame. I don't have to name the size argument; I can remove the name because it is the second parameter.

We note here that there is a trailing comma at the end of each row index into the observations data frame, because we are selecting all columns for the given rows. So selecting all columns relies on the trailing comma. I can also ask for all rows of a given column, for example by naming it. This is just reiterating element access and index notation. Can I use a logical vector to help me subset my observations?
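A minimal sketch of the random row subsets just described, again using an invented `observations` data frame (the values and the column names `weight` and `emotion` are assumptions, not the instructor's data):

```r
# Hypothetical stand-in for the instructor's data frame; values are invented.
observations <- data.frame(
  weight  = c(61.2, 74.5, 58.9, 80.1, 67.3),
  emotion = c("happy", "sad", "calm", "angry", "happy")
)

set.seed(1)  # assumption: only so repeated runs give the same draw

# A fixed row index: the fourth row, all columns (note the trailing comma).
row_four <- observations[4, ]

# The first four rows, returned in a random order.
first_four_shuffled <- observations[sample(1:4, size = 4), ]

# Four rows chosen at random from all five.
four_of_five <- observations[sample(1:5, size = 4), ]

# Generalised with nrow(); size is the second parameter of sample(),
# so it can be passed without its name.
four_general <- observations[sample(1:nrow(observations), 4), ]

# All rows of one column, selected by name.
emotions <- observations[, "emotion"]
```

Leaving the column position empty after the comma is what keeps every column; without the comma, `observations[4]` would be read as a request for a fourth column instead.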
If I call observations to the screen again, just to show you what we created, it's a two-column data frame. I can use a vector of trues and falses for my row index. Now remember, in R we have vector recycling. So here R is being smart: the observations data frame is five rows long and I have supplied only true, false, so R repeats this true, false combination to the length of the data frame and returns the first, the third and the fifth rows, which is what we see on the screen up above. Here we're asking to keep the odd rows.

If I create a data frame called super, I can run a logical test on, for example, age. If I ask for anything that is less than 50, that returns false, true, true, one value for each element of this column. I can then use this logical vector as part of my index notation to filter, or subset, the super data frame. I'm asking for every age under 50, for all columns, and it brings back all three columns for the two rows that are under the age of 50.

I can also take the same concept and ask for just the names instead of all columns. And I can make this more complicated by updating my row index to a series of false, false, true that comes from a more complicated logical test. If I pull each part of this out to show you on the screen, what we have is an or statement for those under 40 and those located in US. To reiterate index notation, we can remove columns or rows using the negative index. So this would be without the second column.
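The logical subsetting above might look something like the sketch below. Both data frames are invented: the transcript only says that `super` has three columns including an age and names, so the column names `name`, `age` and `location` and all values are assumptions, chosen so that the tests give the same true and false patterns the instructor reads out. The combined test uses `|` because the instructor calls it an or statement.

```r
# Hypothetical stand-ins for the instructor's data frames; values are invented.
observations <- data.frame(
  weight  = c(61.2, 74.5, 58.9, 80.1, 67.3),
  emotion = c("happy", "sad", "calm", "angry", "happy")
)
super <- data.frame(
  name     = c("Ann", "Bob", "Cat"),
  age      = c(62, 45, 35),
  location = c("UK", "UK", "US")
)

# c(TRUE, FALSE) is recycled to TRUE FALSE TRUE FALSE TRUE over five rows,
# so the odd rows (1, 3 and 5) are kept.
odd_rows <- observations[c(TRUE, FALSE), ]

# A logical test on one column gives one TRUE/FALSE per row.
under_50 <- super$age < 50          # FALSE TRUE TRUE
super[under_50, ]                   # matching rows, all three columns
super[under_50, "name"]             # matching rows, names only

# A compound test: under 40, or located in the US.
either <- super$age < 40 | super$location == "US"   # FALSE FALSE TRUE
super[either, "name"]

# Negative index: every row, without the second column.
without_age <- super[, -2]
```

Storing the logical vector in a variable such as `under_50` before indexing makes it easy to print and check the test on its own, which mirrors how the instructor pulls each part out onto the screen.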

Kunal has worked with data for most of his career, ranging from diffusion Markov chain processes to migrating reporting platforms.

Kunal has helped clients with early-stage engagement and has developed curricula for multi-week training programmes.

Kunal has a passion for statistics and data; he has delivered training relating to Hypothesis Testing, Exploring Data, Machine Learning Algorithms, and the Theory of Visualisation.

Data Scientist at a credit management company; applied statistical analysis to distressed portfolios.

Business Data Analyst at an investment bank; project to overhaul the legacy reporting and analytics platform.

Statistician within the Government Statistical Service; quantitative analysis and publishing statistical findings of emerging levels of council tax data.

Structured Credit Product Control at an investment bank; developing, maintaining, and deploying a PnL platform for the CVA Hedging trading desk.