Often the best way to start a data analysis project is to decide what your workflow should be and roughly how long each stage is going to take.
For instance, if you know that the data you’re going to be working on is likely to be littered with mistakes and errors, then you should preemptively allocate a decent amount of your time to the “importing” and “cleaning” steps of your workflow. Similarly, if your end goal is to produce a predictive model, then work in a feedback loop where you inspect and evaluate your model before improving it in the next iteration.
2.1.1 Basic Workflows
For now, let’s look at a basic workflow and some likely additions or changes you might make depending on your goals.
For these basic workflows, we’re only going to look at a subset of all of the stages of analysis that you might identify. They are going to be:
* The difference between data cleaning and data tidying, to me, is that cleaning refers more to data-type conversion, removing NAs and the like. Basically, without cleaning, your analysis isn’t going to happen because there’s too much noise. When I say ‘tidying’, I mean getting the data into a format that is amenable to your analysis. You could get by without changing it, but your analysis would likely take much longer or be much less efficient.
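To make the distinction concrete, here is a minimal sketch using an invented data frame: cleaning fixes types and drops missing values, while tidying reshapes the data into a format that suits the analysis. The column names and values are purely illustrative.

```r
library(tidyr)

# An invented "messy" data frame: everything read in as character,
# with a missing value lurking in one column
raw <- data.frame(
  id     = c("1", "2", "3"),
  x_2020 = c("10", NA, "30"),
  x_2021 = c("15", "25", "35"),
  stringsAsFactors = FALSE
)

# Cleaning: convert columns to the right types and remove rows with NAs
cleaned <- raw
cleaned$id     <- as.integer(cleaned$id)
cleaned$x_2020 <- as.numeric(cleaned$x_2020)
cleaned$x_2021 <- as.numeric(cleaned$x_2021)
cleaned <- na.omit(cleaned)

# Tidying: reshape so each row is one id-year observation
tidied <- pivot_longer(
  cleaned,
  cols = c(x_2020, x_2021),
  names_to = "year",
  names_prefix = "x_",
  values_to = "value"
)
```

Without the cleaning step the analysis couldn’t run at all; without the tidying step it could, but working with one observation per row is usually far easier downstream.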
2.1.1.1 Example 1: Reporting
In this example, imagine someone has come to you wanting a bit more insight into the data they have. It’s currently in a messy spreadsheet, and they want you to load it into R and produce some nice-looking graphics to help them understand trends and correlations in their data. They’re not bothered about any modelling for now; they just want some pretty graphs.
If we mapped out an appropriate workflow for this project, it might look something like this:
We import the data and clean it, then we summarise it and create our plots. We can then use feedback from our colleague to change how we’re summarising or how we’re plotting the data to improve our plots until we’re happy.
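That workflow can be sketched in code. This is only an illustration, assuming a hypothetical CSV with `group` and `score` columns (supplied inline here via `I()` so the example is self-contained); the package choices (readr, dplyr, ggplot2) are one common option, not a requirement.

```r
library(readr)
library(dplyr)
library(ggplot2)

# Import: read_csv() would normally take a file path; a literal string
# stands in for the messy spreadsheet export here
raw <- read_csv(I("group,score\nA,10\nB,\nA,20\nB,40"),
                show_col_types = FALSE)

# Clean: standardise column names, fix types, drop incomplete rows
cleaned <- raw |>
  rename_with(tolower) |>
  mutate(score = as.numeric(score)) |>
  filter(!is.na(score))

# Summarise: one row per group
summary_tbl <- cleaned |>
  group_by(group) |>
  summarise(mean_score = mean(score), n = n())

# Plot: the result goes to our colleague, and their feedback loops back
# into the summarise and plot steps above
p <- ggplot(summary_tbl, aes(x = group, y = mean_score)) +
  geom_col()
```

Each stage feeds the next, which is what makes the feedback loop cheap: changing how you summarise or plot doesn’t require touching the import and cleaning code.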
2.1.1.2 Example 2: Modelling
For this example, imagine a colleague has come to you about a modelling project. They have a number of relatively clean datasets that they want to combine and use to produce a predictive model. They’re not too worried about plots or graphics; they’re more interested in the model itself. Our workflow here would be similar to Example 1, but with some crucial differences. First, we know that we’re going to combine multiple datasets into one, which we will then use for our model. This is likely going to involve some data manipulation or ‘tidying’. I don’t really like using the ‘tidying’ description, because it suggests that it’s the same as ‘cleaning’, but there’s a very useful package for data manipulation and reshaping called tidyr, so we’ll stick with ‘tidying’ for now.
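As a rough sketch of what that combining-and-tidying step might look like, here are some invented datasets joined on a shared key and then reshaped into a one-row-per-observation table ready for modelling. The column names are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Invented "relatively clean" datasets sharing an id column
customers <- data.frame(id = 1:3, region = c("north", "south", "north"))
sales_q1  <- data.frame(id = 1:3, q1 = c(100, 200, 150))
sales_q2  <- data.frame(id = 1:3, q2 = c(110, 190, 160))

# Combine the datasets on their shared key...
combined <- customers |>
  left_join(sales_q1, by = "id") |>
  left_join(sales_q2, by = "id")

# ...then tidy into one row per id-quarter observation for the model
model_data <- combined |>
  pivot_longer(cols = c(q1, q2),
               names_to = "quarter",
               values_to = "sales")
```

Note that nothing here is “cleaning” in the earlier sense: the inputs were already usable, and we’re only rearranging them into a shape the model can consume.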
Back to Example 2, if we mapped out a workflow for this project, it might look a bit more like this:
As you can see, we still end with the loop of modelling and then evaluating our model and using the output to update the model, but the steps we take to get there are a bit different.
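The model-evaluate-update loop at the end can itself be sketched with a simple linear model on invented data. The “update” step here is just trying a richer formula after evaluating the first fit; in a real project it stands in for whatever improvement your evaluation suggests.

```r
# Invented data with a quadratic relationship plus noise
set.seed(42)
d <- data.frame(x = runif(100, 0, 10))
d$y <- 2 * d$x + d$x^2 + rnorm(100)

# A simple evaluation metric: root mean squared error
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, data))^2))
}

# Iteration 1: fit a first model, then evaluate it
m1 <- lm(y ~ x, data = d)
rmse(m1, d)

# Iteration 2: use what the evaluation told us to update the model,
# then re-evaluate
m2 <- lm(y ~ x + I(x^2), data = d)
rmse(m2, d)
```

The loop continues for as long as each iteration’s evaluation keeps suggesting worthwhile improvements.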
Hopefully this section has given you an idea of what a typical project workflow could look like. The benefit of splitting your project into these distinct steps is that it can often help you compartmentalise your functions and scripts. We’ll look later at structuring your project like an R package, and splitting your analysis into separate stages will help you construct your project in a portable and replicable way.