7.1 Abstraction

For me, the biggest change in the quality and efficiency of my code was when I began learning about the concept of abstraction. Abstraction is essentially the process of breaking down complex processes and objects into their base function or quality. From here, we can abstract away levels of complexity, thus creating cleaner looking code.

Abstraction is relevant to every aspect of your projects, from how they are structured to your functions and code and so we’re going to spend some time understanding abstraction and how it can be applied.

7.1.1 Definition

Abstraction is the idea of removing levels of complexity. For example, when you press a key on your keyboard and a letter appears on the screen, you don’t need to know how the keyboard interfaces with the computer, or how that stroke is eventually turned into coloured pixels on a screen. That degree of complexity has been abstracted away.

Another example is a calculator. You type in the numbers and decide what you want to do, and your general goal (say, adding two numbers together) is translated into the practicality of performing that action. Your general goal is translated into lots of little more specific ones.

The idea of abstraction is a very prevalent one in computer science. R itself is an abstraction; it lets you interface with the CPU without having to know everything about it. Understanding abstraction and particularly how it relates to functional programming and R can greatly improve the efficiency of your code. Understanding and applying abstraction is more of an art than a science. By abstracting away complexity, you make things easier for the user but you will take away some of the flexability, and so applying the concept of abstraction to your projects will always be a balancing act.

7.1.2 Abstraction in R

Finding examples of abstraction in existing R code is easy. For instance, in the teacheR book, we looked at how R used method dispatch to find the appropriate method for a particular type of object. If you print a dataframe, for instance, then R will find the appropriate method to print that type of object (a dataframe), and will eventually call the print.data.frame() function to do so. But that’s not what you have to type in. You just type print(my_dataframe) and R takes care of the rest. And that’s a good example of how R has abstracted away that complexity, instead focusing on the core goal - printing something.

Applying abstraction to your own R code can greatly improve your code cleanliness and efficiency. One of our main tools for implementing abstraction into our projects are functions. We want to break down our steps into their smallest constituent parts, and aim to create functions that are as simple as possible. We can then use these functions together to solve the overarching issue. The best way to demonstrate how powerful abstraction can be in R is probably through an example. First we’ll look at a simple example and then we’ll move onto something more complex.

7.1.2.1 Example 1 - plotly labels

This is a real example of abstraction that I used very recently. The plotly package is a great library for producing interactive graphs for Shiny applications and RMarkdown documents. I often use the ggplotly() function, which takes a ggplot object and converts it to a plotly one.

One of the features of the package is that you can create a tooltip, such that the individual looking at the graph can hover their cursor over the line or bar or whatever and a little box will pop up showing them the value that they’re hovering over.

The tooltips are generated automatically from the aesthetics you define in your ggplot2 call. So if you create a graph with an x called x_val, then the tooltip will say something like “x_val: 100” when you hover over it. This is fine, but it can look a bit messy when you don’t have very nice variable names, which I often don’t. Instead, you can create a new aesthetic (like text) and provide your values to that aesthetic with nicer names. For example:

gplot <- ggplot(data, aes(x = x_val, y = y_val, text = paste0("Better X Label:" = x)))...
plotly_plot <- ggplotly(gplot, tooltip = "text")

Now, the tooltip will show “Better X Label: 100” instead of “x: 100”. Much better.

But what about when you want to have more than one variable in the label? You could try adding more aesthetics and including them to the tooltip parameter (tooltip = c("text", "label", etc.)), but that could be tough if you’ve got more than a couple of aesthetics.

So let’s think less about the here and now and try and break this into the simplest possible terms. We want a function that will need to accept a new name for our variable, the variable values, and it will need to be able to accept any number of variables. After a bit of trial and error, this is the function I created:

plotly_label <- function(labels) {
  raw_lists <- purrr::map2(names(labels), labels, function(name, values) {
    purrr::map(values, ~paste0(name, ": ", .x))
  })
  purrr::pmap(raw_lists, function(...) paste(..., sep = "<br>"))
}

This function accepts a list of name-value pairs, with the name representing the new name of the variable. This function then returns new labels (separated by a break) that can be used by plotly. So for instance, if I wanted to have labels for both my x and y variables, I could just do this:

gplot <- ggplot(data, aes(x = x_val, y = y_val, text = plotly_label(list("Better X Label:" = x, "Better Y Label:" = y))))...
plotly_plot <- ggplotly(gplot, tooltip = "text")

We haven’t changed the world here, but we have created a function that can be used in multiple situations. We haven’t hard-coded anything in, meaning that theoretically we could create a tooltip with thousands of variables.

7.1.2.2 Example 2 - Plotting

For our first example, let’s imagine you want to create a full suite of graphics for a reporting project you’re working on. You’re going to want to create 10 different line charts and 5 different bar charts using different variables from 3 different datasets. Let’s look at some different ways we could go about doing this, applying different levels of abstraction:

7.1.2.2.1 Approach 1

You could just write out the code needed for each plot. For example:

library(ggplot2)
linechart1 <- ggplot(dataset1, aes(x = x, y = y, colour = groups)) +
  geom_line()
linechart2 <- ggplot(dataset2, aes(x = x, y = y, colour = second_grouping)) +
  geom_line()
.
.
.

After writing our all our code, we’d have 15 different graphs, but a lot of code and a lot of repitition. There’s lots of code here that’s being shared and every time the same call or section is duplicated, that’s twice the code that we might potentially have to debug.

7.1.2.2.2 Approach 2

A different approach may be to create a different function for each plot. The function would take the dataset and spit out the graphic, a bit like this:

library(ggplot2)
create_first_linechart <- function(data) {
  ggplot(data, aes(x = column1, y = column2, colour = column3)) +
    geom_line()
}
.
.
.

We would eventually have 15 different functions - one for each plot - that we could call to produce all of our graphics. Like this:

create_first_linechart(dataset1)
create_second_linechart(dataset2)
create_third_linechart(dataset3)
.
.

This is a bit cleaner than our first approach because we’ve separated out our graph-generation logic and it’s certainly a step in the right direction, but there’s still just as much repetition. Similarly, if there’s an error, we’re still going to have to debug each function separately.

7.1.2.2.3 Approach 3

Building on the use of functions, we could create a function that will create all of the line charts and then a separate function that will create all of our bar charts. Our function to create our line charts might look like this:

create_linechart <- function(data, x, y, colour) {
  ggplot(data, aes(x = {{ x }}, y = {{ y }}, colour = {{ colour }})) +
    geom_line()
}
create_barchart <- function(data, x, y, colour) {
  ggplot(data, aes(x = {{ x }}, y = {{ y }}, fill = {{ colour }})) +
    geom_col()
}

The {{ }} brackets let us pass column names from our function to the aes() function in ggplot2.

Then we’d still have to make 15 separate calls to create our graphics, but we’d have much less code to debug if something went wrong because we’re only relying on two functions, not the 15 we were before. Similarly, if we wanted to change something, we’d just have to change the two functions we created and that change would be propogated to all the 15 graphics.

7.1.2.3 Approach 3.5

The eagle-eyed amongst you may have noticed that we still have some repetition. Basically, everything other than our geom and one of our aes mappings is exactly the same between the two functions. So should we not modify our functions so that we can define the type of graph we want as an input parameter and then we’ll only need one function?

Perhaps, but I would say that this shows a good example of how abstraction is more of an art than a science. If we did combine our two functions in attempt to create a single function that create multiple types of graphs, we’d run into issues if we then wanted to make changes to our graphs that are specific to a single type of graphic.

Let’s look at this through an example. Here’s what a combined function could look like:

create_chart <- function(data, x, y, colour, fill, geom) {
  ggplot(data, aes(x = {{ x }}, y = {{ y }}, colour = {{ colour }}, fill = {{ fill }})) +
    do.call(geom, args = list())
}

Using this function, we can provide our geom_...() function via the geom parameter, allowing us to create multiple types of graph using the same function. But say now I want to change the bars from being side by side to being stacked. To do that, I just need to set the position parameter in the geom_col() call to stack. But currently I don’t have a way of passing that value to the geom_col() function. I could add ellipses (...) to my function and pass parameters that way, but then what would happen if I wanted to only add a theme to my line graphs and not to the bar charts? Well, our function would end up looking something like this:

create_chart <- function(data, x, y, colour, fill, geom, ...) {
  plot <- ggplot(data, aes(x = {{ x }}, y = {{ y }}, colour = {{ colour }}, fill = {{ fill }})) +
    do.call(geom, args = list(...))
  if (geom == "geom_line") {
    plot +
      theme_classic()
  } else {
    plot
  }
}

But I’d say that this is much more confusing and error-prone than our third approach, and now we’ve restricted ourselves to providing a character string of the geom function instead of being able to provide a function object. We may have more code using Approach 3, but making changes and understanding what’s going on is much easier.

Ultimately, although we’ve reduced the amount of code we have, we’ve attempted to abstract away too much of the complexity. And now we’ve ended up hamstrung with our new function. Hopefully this example has highlighted that striking the right balance when applying abstraction can be tricky; you want to remove duplication and complexity as much as possible, but without inadvertently adding more complexity further down the line. Essentially, you don’t want to deviate from the core issue that you’re trying to solve with your function.