17 Workflows

Often the best way to start a data analysis project is to decide what you’re workflow should be, breaking your project into distinct stages that are relevant for your particular project.

For instance, if you know that the data you’re going to be working on is likely to be littered with mistakes and errors, then you should preemptively allocate a decent amount of your time to the “importing” and “cleaning” steps of your workflow. Similarly, if your end goal is to produce a predictive model at the end, then work in a feedback loop where you inspect and evaluate your model before improving it in the next iteration.

The tidyverse packages we discussed previously are designed to make data science easier, and the workflows we’ll look at fit into this philosophy quite nicely. Each stage of these workflows is usually handled by a different package but with a common syntax and data structure underpinning them all.

17.1 Basic Workflow

For now, let’s look at a basic workflow and some likely additions or changes you might make depending on your goals.

For these basic workflows, we’re only going to look at a subset of all of the stages of analysis that you might identify. They are going to be:

  • Importing
  • Cleaning
  • Tidying
  • Collating
  • Summarising
  • Plotting
  • Modelling
  • Evaluation

17.1.1 Example 1: Reporting

In this example, imagine someone has come to you and they want a bit more insight into the data they have. It’s currently in a messy spreadsheet, and they want you to load it into R and produce some nice looking graphics to help them understand trends and correlations in their data. They’re not bothered about any modelling for now, they just want some pretty graphs.

If we mapped out an appropriate workflow for this project, it might look something like this:

We import the data and clean it and tidy it (more on that later), then we summarise it and create our plots. We can then use feedback from our colleague to change how we’re summarising or how we’re plotting the data to improve our plots until we’re happy.

17.1.2 Example 2: Modelling

For this example, imagine a colleague has come to you about a modelling project. They have a number of relatively clean datasets that they want to combine and use to produce a predictive model. They’re not too worried about plots or graphics, they’re more interested in the model itself. If we mapped out a workflow for this project, it might look a bit more like this:

As you can see, we still end with the loop of modelling and then evaluating our model and using the output to update the model, but the steps we take to get there are a bit different.

17.2 Separating Stages

Now we’ve segmented our project into separate stages, this makes organising our code a little bit easier. We want to separate different and distinct actions into separate code chunks - this makes understanding, debugging and testing the code a lot easier, because we can test each bit separately.

17.2.1 Functions

Functions can provide two useful features in a data analysis project:

  1. They allow you to reuse code
  2. They can help you tell a story

By this, I mean that by appropriately segmenting your code into separate functions and naming those functions appropriately, you can communicate what’s going on to your audience more clearly. For example, say you want to load and clean your data, you could write some functions and use them like this:

my_fun_1() %>% my_fun_2() %>% my_fun_3() %>% ...

Or like this:

load_data() %>% clean_dates() %>% clean_factors() %>% ...

We’re doing the same thing each time, but by appropriately naming our functions like in the latter example, we can clearly communicate what’s happening at each step.

If you have a more complicated workflow with more cleaning steps, you can utilise the concept of abstraction to keep your code clean:

clean_data <- function(data) {
  data %>% clean_dates() %>% clean_factors() %>% remove_nas()

load_data() %>% clean_data()

Here we’ve rolled up the three stages of our cleaning stage into a single function (clean_data()). When someone now reads our code, they know exactly what our function is doing - and if they then look at the source, then can clearly see that our clean_data() function is made up of three sub-stages.

17.2.2 Files

Once you’ve separated your stages into distinct functions, saving those functions in separate scripts can help keep things clean. For example, if you have a clean_data() function that calls 5 or 6 functions, then place all of those functions in the same script and call it clean.R. Bonus points if you put the functions in the order they are called in the clean_data() function. Then, if someone wants to know anything about how you cleaned your data, they’re not going to struggle to find the code.

17.3 Summary

Hopefully this section has given you an idea of what a typical project workflow could look like. The benefit of splitting your project into these distinct steps is that it can often help you compartmentalise your functions and scripts into distinct stages. We’ll look later at structuring your project like an R package, and so splitting your analysis into separate stages can help you construct your project in a portable and replicable way.