Fundamentals of R
This module looks at more complex data structures, building on what was covered in the intermediate data structures module.
The objectives of this module are to provide you with an understanding of:
- How to construct a factor in R
- How to construct a data frame in R
- How to modify a data frame
- How to subset a data frame
- How data frames automatically factorise data in R
Aimed at all who wish to learn the R programming language.
No prior knowledge of R is assumed.
Delegates should already be familiar with basic programming concepts such as variables, scope and functions.
Experience of another scripting language such as Python or Perl would be an advantage.
Understanding mathematical concepts will be beneficial.
- [Instructor] Data frames in R can be created using the data.frame constructor. Its inputs are vectors of equal length, which become named columns. As an example, I will use a column called name and set it equal to a vector of three names. I can then add a second vector named age and set that equal to a vector of numbers. I can then add a third vector, location, but this time I pass the vector through the factor constructor. I then close off my data frame constructor, and on the screen I see the data frame that I have just created. I can store it in a variable by using the assignment operator, naming the data frame super.
Are there any restrictions on the vectors when creating a data frame? We already know that the inputs to a data frame have to be vectors, but the vectors must also be of equal length so that the data frame is rectangular. The main benefit is that the vectors can be heterogeneous with respect to one another, meaning that they can differ in type. For example, in super we had characters for the names, numbers for the age and a factor for the location, so we have various different types.
How can we indicate a missing entry? We use the keyword NA. Here I will update my super definition: I go back up to it using the up arrow on the keyboard and, instead of the name Kara, I put NA in its place. When I run this and look at what I have created, the screen now shows NA for the missing value.
To understand, at a high level, what we have created, we can ask for its class. For the internal structure, at a low level, we can ask for its type with typeof. Underneath the data frame we have a list; that is the internal storage structure.
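The steps described so far can be sketched as follows. This is a minimal illustration rather than the exact code typed in the video: the column names (name, age, location) and the variable name super come from the walkthrough, while the values in each vector and the helper names super_na and mismatch are assumptions made up for the example.

```r
# Hypothetical values: only the column names and the name `super`
# are taken from the walkthrough.
super <- data.frame(
  name     = c("Clark", "Kara", "Bruce"),
  age      = c(35, 24, 40),
  location = factor(c("NY", "KS", "NJ"))
)
super

# Columns must have compatible lengths; 3 rows against 2 rows is an error.
mismatch <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) e
)
inherits(mismatch, "error")   # TRUE

# A missing entry is written as NA (here in place of the second name).
super_na <- data.frame(
  name     = c("Clark", NA, "Bruce"),
  age      = c(35, 24, 40),
  location = factor(c("NY", "KS", "NJ"))
)
is.na(super_na$name)          # FALSE  TRUE FALSE

class(super)    # "data.frame": the high-level view
typeof(super)   # "list": the internal storage structure
```

Whether the name column is stored as a character vector or as a factor here depends on the R version, because the constructor's default for stringsAsFactors changed in R 4.0.0.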
How can we access elements within the data frame that we have created? I will call super back to the screen to show you what we have. We can use index access, meaning square brackets containing an index, for example 1, which returns the first column. That is perhaps not so readable if your data frame contains many columns and rows: instead of three rows, imagine a data frame containing three million observations of many different people. You would be better off using the dollar notation, which is used for member referencing and might be called best practice. If I write super, a dollar sign and then name, that brings up the names that I have. I can ask for the second name, and that returns the second name in my super data frame. I can repeat the same logic for location to find the second location, which is KS.
What's interesting to note is that levels are returned for each of my string, or character, columns. What's happening is that the data frame is automatically factorizing the data. Here I've recreated the data frame, and I'm noting that I can choose how strings are handled by the data frame constructor through the stringsAsFactors argument. In our initial definition of the data frame, stringsAsFactors was TRUE by default, which meant that any string vector passed into the data frame came through as a factor. That is why calling the name column, which was not defined as a factor, unlike location, which was, still returned levels and implied that we had a factor. Let's say I didn't want my character vectors converted to factors and would like the names to stay as plain strings.
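A short sketch of the access patterns and the automatic factorizing described above. The data values and the helper names (first_col, names_col, second_name, second_location) are hypothetical; only the second location, KS, is taken from the walkthrough. The stringsAsFactors argument is set to TRUE explicitly because that was the default only before R 4.0.0; from R 4.0.0 onward the default is FALSE.

```r
# Hypothetical values; stringsAsFactors = TRUE is spelled out so the
# behaviour described in the walkthrough is reproduced on any R version.
super <- data.frame(
  name     = c("Clark", "Kara", "Bruce"),
  age      = c(35, 24, 40),
  location = factor(c("NY", "KS", "NJ")),
  stringsAsFactors = TRUE
)

first_col <- super[1]       # index access: a one-column data frame
names_col <- super$name     # dollar notation: the column itself, by name

second_name     <- super$name[2]       # second entry of the name column
second_location <- super$location[2]   # second entry of the location column

# The name column was given as plain strings, yet it is now a factor,
# so printing it also prints its levels.
is.factor(super$name)   # TRUE
levels(super$name)
```

Referring to a column by name keeps working if columns are later reordered, which is one reason the dollar notation tends to be more readable than a positional index.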
I could pass FALSE for this argument, and now if I call super$name to the screen I see just the names, as a normal character vector rather than the special kind of vector that a factor is.
If I define the same information as a list, we can see that named vectors are used to define the list. But if I call super list to the screen, we can see that it differs from super, which was defined as a data frame: a data frame looks more like a table than a list does, so it is more user friendly.
Can I name objects? I can ask for the names of the super data frame, which returns the names of the vectors that have been defined. I can ask for the names of my super list in exactly the same way. What about column names? For the data frame they are exactly the same as its names. But does a list contain column names? No. Row names are a useful identifier for annotating data, so we can ask for those too. Again, that makes sense for the tabular, rectangular shape that a data frame takes; it does not make sense for a list.
To round off the topic, I'd like to mention a few useful functions. The number of rows of super: knowing that each vector inside my data frame must be the same length, I can see that each vector is of length three. The number of columns tells me how wide my data frame is. I can also ask for both values in one go by using dim, for dimensions, on the super data frame.
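The comparison between the data frame and the list, together with the helper functions mentioned above, might look like this. The values are hypothetical, and super_list is an assumed spelling of the "super list" object named in the walkthrough.

```r
# Hypothetical values; strings are kept as plain character vectors.
super <- data.frame(
  name     = c("Clark", "Kara", "Bruce"),
  age      = c(35, 24, 40),
  location = factor(c("NY", "KS", "NJ")),
  stringsAsFactors = FALSE
)
is.character(super$name)    # TRUE: no automatic factorizing
is.factor(super$location)   # TRUE: an explicit factor is left alone

# The same information stored as a list of named vectors.
super_list <- list(
  name     = c("Clark", "Kara", "Bruce"),
  age      = c(35, 24, 40),
  location = factor(c("NY", "KS", "NJ"))
)

names(super)           # "name" "age" "location"
names(super_list)      # "name" "age" "location"

colnames(super)        # same as names(super)
colnames(super_list)   # NULL: a list has no columns

rownames(super)        # "1" "2" "3"
rownames(super_list)   # NULL: a list has no rows

nrow(super)   # 3
ncol(super)   # 3
dim(super)    # 3 3  (rows, then columns)
```

Printing super and super_list one after the other shows the difference in layout: the data frame prints as a table, the list prints one element at a time.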
About the Author
Kunal has worked with data for most of his career, ranging from diffusion Markov chain processes to migrating reporting platforms.
Kunal has helped clients with early-stage engagement and has developed multi-week training programme curricula.
Kunal has a passion for statistics and data; he has delivered training relating to Hypothesis Testing, Exploring Data, Machine Learning Algorithms, and the Theory of Visualisation.
Data Scientist at a credit management company; applied statistical analysis to distressed portfolios.
Business Data Analyst at an investment bank; project to overhaul the legacy reporting and analytics platform.
Statistician within the Government Statistical Service; quantitative analysis and publishing statistical findings of emerging levels of council tax data.
Structured Credit Product Control at an investment bank; developing, maintaining, and deploying a PnL platform for the CVA Hedging trading desk.