4.1 Loading data

The first step in any data analysis project you’ll undertake is getting at least one dataset. Oftentimes, we have less control over the data we use than we would like; receiving odd Excel spreadsheets or text files or proprietary files or whatever. In this chapter, we’ll focus on the more typical data formats (csv and Excel), but we’ll also look at how we might extract data from a web API, which is an increasingly common method for data loading.

4.1.1 csv

If I have any say in the data format of the files I need to load in, I usually ask for them to be in csv format. CSV stands for “comma-separated values” and essentially means that the data is stored as one long text string, with each different value or cell separated by a comma. So for example, a really simple csv file may look, in its most base format, like this:

name,age,
Dave,35,
Simon,60,
Anna,24,
Patricia,75

Benefits of the csv file over something like an Excel file are largely based around simplicity. csv files are typically smaller and can only have one sheet, meaning that you won’t get confused with multiple spreadsheets. Furthermore, values in csv files are essentially what you see is what you get. With Excel files, sometimes the value that you see in Excel isn’t the value that ends up in R. For these reasons, I would suggest using a separated-value file over an Excel file when you can.

4.1.1.1 Loading .csv files

Loading csv files in R is relatively simple. There are base* functions that come with R to load csv files but there’s also a popular package called readr which can be used so I’ll cover both.

* They are technically from the utils package which comes bundled with R so we’ll call it base R.

4.1.1.1.1 Base

To load a csv file using base R, we’ll use the read.csv() function:

read.csv(file = "path/to/your/file", header = TRUE, ...)

The file parameters needs the path to your file as a character string. The header parameter is used to tell R whether or not your file has column headers.

There are lots of other parameters that can be tweaked for the read.csv() function, but we won’t go through them here.

4.1.1.1.2 readr

The readr package comes with a similar function: read_csv(). With the exception of a couple of extra parameters in the read_csv() function and potentially some better efficiency, there isn’t a massive difference between the two.

Using the read_csv() function is simple:

readr::read_csv(file = "path/to/your/file", col_names = TRUE)

In this function, the header parameter is replaced with the col_names parameter. The col_names parameter is very similar, you can say whether your dataset has column headings, or you can provide a character vector of names to be used as column headers.

There are also some extra parameters in the read_csv() function that can be useful. The col_types parameter lets you specify what datatype each column should be treated as. This can either be provided using the cols() helper function like this:

readr::read_csv(file = "path/to/file",
                col_names = TRUE,
                col_types = readr::cols(
                  readr::col_character(), readr::col_double()
                ),
                ...
)

Or, you can provide a compact string with different letters representing different datatypes:

readr::read_csv(file = "path/to/file",
                col_names = TRUE,
                col_types = "cd",
                ...
)

The codes for the different datatypes can be found on the documentation page for the read_csv() function (type ?read_csv()).

The trim_ws parameter can also be helpful if you have a dataset with lots of trailing whitespace around your values. When set to true, the read_csv() function will automatically trim each field before loading it in.

Overall, both functions will give you the same result, so just choose whichever function makes most sense to you and has the parameters you need.

4.1.2 Excel files

R doesn’t have any built-in functions to load Excel files. Instead, you’ll need to use a package. One of the more popular packages used to read Excel files is the readxl package.

Once you’ve installed and loaded the readxl package. You can use the read_excel() function:

readxl::read_excel(path = "path/to/file", sheet = NULL, range = NULL, ...)

Because Excel files are a little bit more complicated than csv files, you’ll notice that there are some extra parameters. Most notably, the sheet and range parameters can be used to define a subset of the entire Excel file to be loaded. By default, both are set to NULL, which will mean that R will load the entirety of the first sheet.

Like the readr::read_csv() function, you can specify column names and types using the col_names and col_types parameters respectively, and also trim your values using trim_ws.