2.2 Replicability

The key to a good, robust analysis is replicability. When we say ‘replicability’, we mean two things:

  • That your results can be replicated in other populations and in other settings (i.e. typical ‘scientific’ replicability)
  • That you can easily share your code and method with others who can verify your outputs

If your project is replicable, then it’s likely to have fewer issues, make fewer dodgy assumptions, and rely less on the idiosyncrasies of your environment or coding practice. That doesn’t mean that everything you do that can be replicated is automatically correct, but it’s a useful credential to have.

For now, we’re going to focus on how you can make your code more replicable. By that I mean how you can share your results and your code with others, whether in your business or institution or in the open source community in general.

2.2.1 Good practice

There are a few things you can do to make your work immediately more readable for others:

2.2.1.1 Use a consistent naming convention

Take your pick. Everyone has their preference and that’s okay. If you like camelCase, then go with camelCase. If you prefer snake_case, with words separated by underscores, then go for that. The most important part of your naming convention, however, is that it’s consistent. Don’t name some of your functions like_this and the rest likeThis. Not only does it make your code harder to read, but every time you do it, a kitten dies. So be consistent.

2.2.1.2 Name your variables as nouns and your functions as verbs

Functions do; variables and objects are. Almost all languages distinguish verbs from nouns, so use that natural divide to improve how you name your functions and variables. Aim to give your functions names that suggest that they do something, with bonus points if the name is a verb that also gives a decent description of what the function does. When you name your variables, give them meaningful noun-like names. Say you’re doing some climate analysis: a good name for the variable that holds the average rainfall in a year might be avg_yearly_rainfall, whilst a function that converts a temperature in Celsius to Fahrenheit might be convert_degrees().
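As a minimal sketch of the noun/verb split, the snippet below pairs a noun-named variable with a verb-named function. The rainfall figures are made up, and convert_celsius_to_fahrenheit() is a hypothetical, more explicit alternative to convert_degrees() that says which conversion is being done.

```r
# Variable: a noun describing what the value is.
# The figures are invented for illustration (millimetres per year).
yearly_rainfall_mm <- c(812, 790, 845)
avg_yearly_rainfall <- mean(yearly_rainfall_mm)

# Function: a verb phrase describing what it does.
# The name is hypothetical; it spells out the direction of the conversion.
convert_celsius_to_fahrenheit <- function(temp_celsius) {
  temp_celsius * 9 / 5 + 32
}

boiling_point_f <- convert_celsius_to_fahrenheit(100)  # 212
```

Reading the last line aloud already tells you what it does, which is the point of the convention.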

2.2.1.3 Use functions where you can

R is a functional language, so using and creating functions is at the very heart of programming in R. Rather than relying on R scripts that need to be run in their entirety to produce your output, you can create functions and use them in your analysis, which makes your project much easier to scale. For example, imagine your scope is initially to produce a report for the year. Then, after you’re done, your manager is so impressed with your report that they want you to do the same thing for the last 10 years. If you’ve used a couple of R scripts and you haven’t written any functions, then this is likely going to involve copying and pasting a whole lot of code and changing the years. If you use functions to perform your analysis instead, then creating reports for multiple years can be as simple as changing your dataset.

Later on we’ll look at the concept of abstraction - removing levels of complexity to focus on the core operation - and this tip is heavily related to that concept. For now though, keep this thought in mind: If you find yourself copying and pasting your R code more than twice to do very similar things, you should probably be using a function.
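To make the yearly report scenario concrete, here is a rough sketch using only base R. The rainfall data frame and the summarise_year() function are hypothetical stand-ins for whatever your real analysis does; the point is only that the year becomes an argument instead of something hard-coded in a script.

```r
# Hypothetical data: three monthly readings for each of two years.
rainfall <- data.frame(
  year = rep(c(2022, 2023), each = 3),
  rainfall_mm = c(80, 95, 110, 70, 85, 120)
)

# Summarise one year of data. The year is passed in as an argument,
# so the same code works for any year present in the data.
summarise_year <- function(data, report_year) {
  year_rows <- data[data$year == report_year, ]
  data.frame(
    year = report_year,
    total_rainfall_mm = sum(year_rows$rainfall_mm),
    avg_rainfall_mm = mean(year_rows$rainfall_mm)
  )
}

# A report for a single year ...
report_2023 <- summarise_year(rainfall, 2023)

# ... or for every year in the data, with no copying and pasting.
yearly_reports <- do.call(
  rbind,
  lapply(unique(rainfall$year), summarise_year, data = rainfall)
)
```

Going from one year to ten then means handing the function a bigger dataset, not editing the script ten times.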

2.2.1.4 State your dependencies

One of the great things about R is the number of packages that are available that let you do all sorts of weird and wonderful things. Unfortunately, because there are so many great packages, it’s unlikely that the person you’re sharing your code with will have them all installed. If you’re sending someone a script, then best practice is to include all the packages you use at the top of your script like this:

library(ggplot2)
library(dplyr)
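If you want to go one step further, a script can also check up front whether the packages it needs are installed. The helper below is only a sketch: find_missing_packages() is a hypothetical name, and it relies on base R’s requireNamespace(), which returns FALSE rather than throwing an error when a package is not available.

```r
# Return the names of any packages in `packages` that are not installed.
# find_missing_packages() is a hypothetical helper, not part of any package.
find_missing_packages <- function(packages) {
  installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  packages[!installed]
}

# "stats" ships with R, so this returns an empty character vector.
missing_packages <- find_missing_packages("stats")

if (length(missing_packages) > 0) {
  message("Please install: ", paste(missing_packages, collapse = ", "))
}
```

In a real script you would pass in the same package names you load with library(), for example c("ggplot2", "dplyr").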

An extra step which few people take, but which can be very helpful, is to prefix each function with the package it came from, like this: dplyr::mutate(). This avoids namespace conflicts, where two packages have functions with the same name and, because of the order of your library() calls, you end up using a different one from the one you intended. It also makes it much easier for anyone reading your code to find the package that a function comes from. Admittedly, this is overkill in a sense because we’ve already told R which packages we’re using with our library() calls, but this practice can really improve the readability of your code.
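A well-known instance of this kind of conflict is filter(): once dplyr is attached, a bare filter() call refers to dplyr::filter() rather than stats::filter(). The sketch below imitates the same effect using only base R, so it runs without any extra packages. The locally defined sd() is a deliberately silly stand-in for a package that exports a function with the same name as stats::sd().

```r
# A stand-in for a package function that happens to share a name with
# stats::sd(). Defining it here masks the original for unqualified calls.
sd <- function(x) "not the standard deviation"

values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Unqualified call: picks up whichever sd() is found first (the one above).
masked_result <- sd(values)

# Qualified call: always uses the function from the stats package.
explicit_result <- stats::sd(values)
```

With the package prefix, neither you nor your reader has to work out which sd() is being called.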

Later we’ll look at designing our project as a package and packages have a different way of stating dependencies for the user, so this is primarily for the case where you’re just sending a script or two to someone.

2.2.2 RMarkdown

2.2.3 Shiny