4.1 Loading data
The first step in any data analysis project you’ll undertake is getting at least one dataset. Oftentimes, we have less control over the data we use than we would like; receiving odd Excel spreadsheets or text files or proprietary files or whatever. In this chapter, we’ll focus on the more typical data formats (csv and Excel), but we’ll also look at how we might extract data from a web API, which is an increasingly common method for data loading.
If I have any say in the data format of the files I need to load in, I usually ask for them to be in csv format. CSV stands for “comma-separated values” and essentially means that the data is stored as one long text string, with each different value or cell separated by a comma. So for example, a really simple csv file may look, in its most base format, like this:
name,age, Dave,35, Simon,60, Anna,24, Patricia,75
Benefits of the csv file over something like an Excel file are largely based around simplicity. csv files are typically smaller and can only have one sheet, meaning that you won’t get confused with multiple spreadsheets. Furthermore, values in csv files are essentially what you see is what you get. With Excel files, sometimes the value that you see in Excel isn’t the value that ends up in R. For these reasons, I would suggest using a separated-value file over an Excel file when you can.
220.127.116.11 Loading .csv files
Loading csv files in R is relatively simple. There are base* functions that come with R to load csv files but there’s also a popular package called
readr which can be used so I’ll cover both.
* They are technically from the
utils package which comes bundled with R so we’ll call it base R.
To load a csv file using base R, we’ll use the
read.csv(file = "path/to/your/file", header = TRUE, ...)
file parameters needs the path to your file as a character string. The
header parameter is used to tell R whether or not your file has column headers.
There are lots of other parameters that can be tweaked for the
read.csv() function, but we won’t go through them here.
readr package comes with a similar function:
read_csv(). With the exception of a couple of extra parameters in the
read_csv() function and potentially some better efficiency, there isn’t a massive difference between the two.
read_csv() function is simple:
::read_csv(file = "path/to/your/file", col_names = TRUE)readr
In this function, the
header parameter is replaced with the
col_names parameter. The
col_names parameter is very similar, you can say whether your dataset has column headings, or you can provide a character vector of names to be used as column headers.
There are also some extra parameters in the
read_csv() function that can be useful. The
col_types parameter lets you specify what datatype each column should be treated as. This can either be provided using the
cols() helper function like this:
::read_csv(file = "path/to/file", readrcol_names = TRUE, col_types = readr::cols( ::col_character(), readr::col_double() readr ), ...)
Or, you can provide a compact string with different letters representing different datatypes:
::read_csv(file = "path/to/file", readrcol_names = TRUE, col_types = "cd", ...)
The codes for the different datatypes can be found on the documentation page for the
read_csv() function (type
trim_ws parameter can also be helpful if you have a dataset with lots of trailing whitespace around your values. When set to true, the
read_csv() function will automatically trim each field before loading it in.
Overall, both functions will give you the same result, so just choose whichever function makes most sense to you and has the parameters you need.
4.1.2 Excel files
R doesn’t have any built-in functions to load Excel files. Instead, you’ll need to use a package. One of the more popular packages used to read Excel files is the
Once you’ve installed and loaded the
readxl package. You can use the
::read_excel(path = "path/to/file", sheet = NULL, range = NULL, ...)readxl
Because Excel files are a little bit more complicated than csv files, you’ll notice that there are some extra parameters. Most notably, the sheet and range parameters can be used to define a subset of the entire Excel file to be loaded. By default, both are set to NULL, which will mean that R will load the entirety of the first sheet.
readr::read_csv() function, you can specify column names and types using the
col_types parameters respectively, and also trim your values using