Implicitly Repeating Operations in R
Fundamentals of R
This module looks at control-flow constructs in R, such as for loops, and at how to repeat functions.
The objectives of this module are to provide you with an understanding of:
- When to use a for loop in R
- How to nest a for loop
- How built-in functions are vectorized
- How to apply functions
- How to use the family of apply functions
Aimed at anyone who wishes to learn the R programming language.
No prior knowledge of R is assumed. You should already be familiar with basic programming concepts such as variables, scope, and functions. Experience of another scripting language such as Python or Perl would be an advantage. An understanding of mathematical concepts would be beneficial.
Built-in operations in R are vectorized, meaning they operate on vectors. Take, for example, the sum function. It returns the sum of all the values present in its arguments. I can also pass a vector to this function, and it will add all the elements within that vector together. I can pass vectors into several arguments as well, and it will return the sum of everything it receives.
I could use a matrix instead of a vector. Say, for example, discounts. I can take the sum of this matrix, and what it is doing in the background is a vectorized summation across the entire contents of the matrix: it treats the matrix as a vector and then sums it. To prove that my comment makes sense, I'll convert the matrix into a vector first and then add it up. And as we can see, the result is the same.
Imagine that, instead of summing everything, we prefer to sum over certain columns or certain rows. And, moving on from a matrix, I'd like to consider using a data frame. Here I have replicated 10 columns, where each column has 10 elements drawn from the normal distribution, so we have normal random numbers. That gives us a data frame, and we take a copy of it to work with. So I'm going to use this one here, DF_KH, and edit it.
I'd like to add some noise to each element of it, where the noise is defined by the cos function as follows. This is purely for the sake of argument, but I would like to show you two ways to do this. One, I could use a loop: a nested for loop running across every single element of the data frame, using i and j notation, adding in the noise. The fact that this is a bad loop is part of my demonstration, because the work it does grows with the data. And I have now updated the data frame to include the noise.
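What is being typed on screen at this point might look roughly like the sketch below. The transcript does not show the actual code, so the values in `discounts`, the seed, the lower-case names `df` and `df_kh`, and the exact noise term (here the cosine of each element) are all assumptions made for illustration.

```r
# sum() adds up everything it receives, whether separate values or vectors
sum(1, 2, 3)               # 6
sum(c(1, 2, 3))            # 6
sum(c(1, 2, 3), c(4, 5))   # 15

# A matrix is summed across its entire contents
# (the values in `discounts` are made up for this sketch)
discounts <- matrix(c(5, 10, 15, 20, 25, 30), nrow = 2)
sum(discounts)             # 105
sum(as.vector(discounts))  # 105, same result after converting to a vector

# A data frame of 10 columns, each holding 10 normal random numbers
set.seed(42)               # assumed seed, only so the sketch is repeatable
df <- as.data.frame(replicate(10, rnorm(10)))
df_kh <- df                # work on a copy

# Nested for loop: visit every element by row i and column j
# and add the noise (assumed here to be the cosine of the element)
for (i in seq_len(nrow(df_kh))) {
  for (j in seq_len(ncol(df_kh))) {
    df_kh[i, j] <- df_kh[i, j] + cos(df_kh[i, j])
  }
}
```

The loop touches one cell at a time, so a 10 by 10 data frame needs 100 separate assignments.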
Loops in general are slow, because the memory allocation and definitions occur on every single iteration, element by element. A better way to do this is to utilize the power of vectorization (here I am recreating my data frame, because I updated it in my previous nested for loop). Vectorization takes on the constructs of the loop implicitly. It says: for every element in the data frame, add the noise, the signal that I'd like to see. The key word is implicit, meaning that the implementation occurs at a lower level. So the loop constructs that we see here still occur as part of vectorization, but they happen at a lower level than we need to concern ourselves with. Remember that R is a high-level programming language.
In the same vein as my stating that a loop is slow, I'd like to note that the vectorized version of this addition is fast, because the definitions of anything that we require for our loop happen within the interpreter, and only once, irrespective of size. I could have used a 10 by 10 in this case, or 20 by 20, or 30 by 30. As this grows, we only have to worry about pushing from the interpreter to the low-level language once. With a nested for loop, by contrast, the element-by-element operation has to be repeated again and again as the size of the data frame grows.
To understand that a little better, we might ask ourselves: how can we time this? Can we prove that what I've just said makes sense? Instead of a 10 by 10, I can change it to a 100 by 100. In the same way, I will use a copy of the data frame, but this time I'll wrap my nested for loop in the system.time function to return the time that it takes. I can show you how long it takes, and if I repeat the same logic for the vectorized version, we can see that it takes significantly less time. I can scale this up from 100 to, say, 500.
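A minimal sketch of the timing comparison on a 100 by 100 data frame follows. The names `df_loop` and `df_vec` and the cosine noise term are assumptions, since the transcript does not show the code. Actual timings depend on the machine, so none are quoted here.

```r
set.seed(42)   # assumed seed, only so the sketch is repeatable
n <- 100
df <- as.data.frame(replicate(n, rnorm(n)))

# Explicit version: nested for loop wrapped in system.time()
df_loop <- df
loop_time <- system.time({
  for (i in seq_len(nrow(df_loop))) {
    for (j in seq_len(ncol(df_loop))) {
      df_loop[i, j] <- df_loop[i, j] + cos(df_loop[i, j])
    }
  }
})

# Implicit version: one vectorized expression over the whole data frame
df_vec <- df
vec_time <- system.time({
  df_vec <- df_vec + cos(df_vec)
})

loop_time["elapsed"]   # seconds taken by the nested loop
vec_time["elapsed"]    # seconds taken by the vectorized version

# Both approaches should produce the same values
isTRUE(all.equal(as.matrix(df_loop), as.matrix(df_vec)))
```

The vectorized line contains no visible loop; the iteration over elements still happens, but inside R's compiled internals rather than in interpreted code.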
So now I'm using a 500 by 500 data frame with a nested for loop, which is 500 by 500 iterations long, meaning 250,000 calculations are occurring. As we can see, it takes a lot longer. What we're comparing this against is the vectorized version: if I recreate the data frame and measure the vectorized calculation with system.time, we see that, as the data frame has grown, the system time has not grown, while the overall time has grown relative to where we had a smaller data frame of size 100 or 10.
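To see how the two approaches scale, the comparison can be wrapped in a small helper and run at several sizes. The helper `time_both` is hypothetical and the cosine noise term is an assumption. The sizes below stop at 100 so that the sketch runs quickly; adding 500 to `sizes` reproduces the 500 by 500 case, at the cost of a much longer run.

```r
# Hypothetical helper: time the nested loop and the vectorized version
# on an n-by-n data frame of normal random numbers
time_both <- function(n) {
  df <- as.data.frame(replicate(n, rnorm(n)))

  df_loop <- df
  loop_time <- system.time({
    for (i in seq_len(n)) {
      for (j in seq_len(n)) {
        df_loop[i, j] <- df_loop[i, j] + cos(df_loop[i, j])
      }
    }
  })

  vec_time <- system.time({
    df_vec <- df + cos(df)
  })

  data.frame(
    n = n,
    iterations = n^2,   # number of element updates the loop performs
    loop_elapsed = loop_time[["elapsed"]],
    vectorized_elapsed = vec_time[["elapsed"]]
  )
}

sizes <- c(10, 50, 100)   # add 500 to reproduce the larger case
timings <- do.call(rbind, lapply(sizes, time_both))
timings

500^2   # 250000 element updates for a 500 by 500 data frame
```

The `iterations` column shows why the loop slows down: doubling the side length quadruples the number of interpreted assignments, while the vectorized version still makes a single call.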
Kunal has worked with data for most of his career, ranging from diffusion Markov chain processes to migrating reporting platforms.
Kunal has helped clients with early-stage engagement and has designed multi-week training programme curricula.
Kunal has a passion for statistics and data; he has delivered training relating to Hypothesis Testing, Exploring Data, Machine Learning Algorithms, and the Theory of Visualisation.
Data Scientist at a credit management company; applied statistical analysis to distressed portfolios.
Business Data Analyst at an investment bank; project to overhaul the legacy reporting and analytics platform.
Statistician within the Government Statistical Service; quantitative analysis and publishing statistical findings of emerging levels of council tax data.
Structured Credit Product Control at an investment bank; developing, maintaining, and deploying a PnL platform for the CVA Hedging trading desk.