7 Loading

The first step in any data analysis project you’ll undertake is getting at least one dataset. Oftentimes, we have less control over the data we use than we would like; receiving odd Excel spreadsheets or text files or files in a proprietary format or whatever. In this chapter, we’ll focus on the more typical data formats (rds, CSV and Excel), but we’ll also look at how we might extract data from a web API, which is an increasingly common method for data loading. We’ll also look briefly at how you can extract data from a SQL database.

7.1 RDS

If you need to share an R-specific object (like a linear model created with the lm() function) or you’re certain that the data never needs to be readable in another program, then you utilise the rds format to save and read data.

To load in RDS files, we use the readRDS() function, providing the file path of the file we want to read:

my_data <- readRDS(file = "path/to/file")

To save an object to an .rds file, you just need to provide the object you want to save and a file path to save it to:

saveRDS(my_data, file = "path/to/file")

7.1.1 Advantages

RDS files are R specific files that contain a serialized version of a specific object. The benefit of R objects is that the object will be preserved in its entirety - you won’t lose the column data types when you save it and then load it in again.

7.1.2 Disadvantages

RDS files cannot be read natively by other programs. This means that if you’re trying to share your dataset and someone wants to open it in, say, Excel, they’re going to need to convert it to a different format before they can load it. The RDS format therefore isn’t ideal for sharing data outside of the R ecosystem.

7.2 CSV

If I have any say in the data format of the files I need to load in, I usually ask for them to be in CSV format. CSV stands for “comma-separated values” and essentially means that the data is stored as one long text string, with each different value or cell separated by a comma (although you will see CSV files with different separators). So for example, a really simple CSV file may look, in its most base format, like this:


Benefits of the CSV file over something like an Excel file are largely based around simplicity. CSV files are typically smaller and can only have one sheet, meaning that you won’t get confused with multiple spreadsheets. Furthermore, values in CSV files are essentially what you see is what you get. With Excel files, sometimes the value that you see in Excel isn’t the value that ends up in R (looking at you dates and datetimes). For these reasons, I would suggest using a separated-value file over an Excel file when you can.

7.2.1 Loading CSV files

Loading CSV files in R is relatively simple. There are base* functions that come with R to load CSV files but there’s also a popular package called readr which can be used so I’ll cover both.

* They are technically from the utils package which comes bundled with R so we’ll call it base R. Base R

To load a CSV file using base R, we’ll use the read.csv() function:

read.csv(file = "path/to/file", header = TRUE, ...)

The file parameters needs the path to your file as a character string. The header parameter is used to tell R whether or not your file has column headers. Our dataset does have headers (i.e. the first row is the column names) so we set that to TRUE.

There are lots of other parameters that can be tweaked for the read.csv() function, but we won’t go through them here. readr

The readr package comes with a similar function: read_csv(). With the exception of a couple of extra parameters in the read_csv() function and potentially some better efficiency, there isn’t a massive difference between the two.

Using the read_csv() function is simple:

readr::read_csv(file = "path/to/file", col_names = TRUE)

In this function, the header parameter is replaced with the col_names parameter. The col_names parameter is very similar, you can say whether your dataset has column headings, or you can provide a character vector of names to be used as column headers.

Let’s load our Kaggle dataset in using the readr::read_csv() function:

print(readr::read_csv("./data/vgsales.csv"), n = 5)
## Rows: 16598 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Platform, Year, Genre, Publisher
## dbl (6): Rank, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 16,598 × 11
##    Rank Name         Platform Year  Genre   Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr>        <chr>    <chr> <chr>   <chr>        <dbl>    <dbl>    <dbl>
## 1     1 Wii Sports   Wii      2006  Sports  Nintendo      41.5    29.0      3.77
## 2     2 Super Mario… NES      1985  Platfo… Nintendo      29.1     3.58     6.81
## 3     3 Mario Kart … Wii      2008  Racing  Nintendo      15.8    12.9      3.79
## 4     4 Wii Sports … Wii      2009  Sports  Nintendo      15.8    11.0      3.28
## 5     5 Pokemon Red… GB       1996  Role-P… Nintendo      11.3     8.89    10.2 
## # … with 16,593 more rows, and 2 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>

You can see that when we load in a file using {readr} without specifying column types, we get an output showing us exactly how each column has been parsed. This is because CSV is a typeless format, and so data isn’t always imported as the type you intended it to be.

To override the default types that {readr} has assigned, we can use the col_types parameter. This can either be provided using the cols() helper function like this:

readr::read_csv(file = "path/to/file",
                col_names = TRUE,
                col_types = readr::cols(
                  readr::col_character(), readr::col_double(), ... # You need to provide a type for each column

Or, you can provide a compact string with different letters representing different datatypes:

readr::read_csv(file = "path/to/file",
                col_names = TRUE,
                col_types = "cd",

The codes for the different datatypes can be found on the documentation page for the read_csv() function (type ?read_csv()).

Let’s use the compact string format to load in the video game dataset again but this time, we want the year to be imported as a number:

vg_sales <- readr::read_csv("./data/vgsales.csv", col_types = "dccdccddddd")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)

We’ve got some warnings here because {readr} wasn’t able to parse some of the values in the Year column as numeric, giving us some NAs. We’ll clean up these values in the Cleaning chapter so we can ignore them for now.

7.2.2 Advantages

CSV files can be opened in a number of different software packages, making the CSV format a good candidate for sharing data with people who may not also be using R.

7.2.3 Disadvantages

In the interests of simplicity, CSV files don’t store information on the type of the data in each column, meaning that you need to be careful when you’re loading data from a CSV file that the column types you end up with are the ones you want.

The CSV format also isn’t fully standardised, meaning that you might come across some files that say they’re CSV, but generate errors when they’re parsed. It’s relatively rare to see a CSV file that is so different that it can’t be parsed at all, but it’s something worth remembering.

7.2.4 Other delimited files

Comma-separated value files are just a form of delimited files that use a comma to separate different values. In the wild you might see files separated with all different kinds of symbols, like pipes (|) or tabs. To load in these types of files, use the readr::read_delim() function and specify what’s being used to separate the values with the delim parameter. readr::read_csv() basically just wraps readr::read_delim() using delim = ',' anyway, along as you’re comfortable loading in CSV files, you should be well equipped to load in any kind of delimited file.

7.3 Excel files

R doesn’t have any built-in functions to load Excel files. Instead, you’ll need to use a package. One of the more popular packages used to read Excel files is the readxl package.

Once you’ve installed and loaded the readxl package. You can use the read_excel() function:

readxl::read_excel(path = "path/to/file", sheet = NULL, range = NULL, ...)

Because Excel files are a little bit more complicated than CSV files, you’ll notice that there are some extra parameters. Most notably, the sheet and range parameters can be used to define a subset of the entire Excel file to be loaded. By default, both are set to NULL, which will mean that R will load the entirety of the first sheet.

Like the readr::read_csv() function, you can specify column names and types using the col_names and col_types parameters respectively, and also trim your values using trim_ws.

7.3.1 Advantages

I have something of a personal vendetta against storing everything in Excel spreadsheets because of the terrible way Excel displays data, so I personally don’t think there are too many advantages in using Excel files.

You can have more than one sheet maybe? That’s all I’ve got.

7.3.2 Disadvantages

The main disadvantage of Excel files for me is that Excel aggressively formats data for the end user. That is, it’s difficult to know what value is actually being stored in a cell based on the value that the end user sees in the cell. Dates are a prime example, Excel will show you the date as a date, but will store it as a number with an origin. That alone isn’t a sin at all, but combine that with the fact that Excel has multiple origin dates depending on your Excel version and OS, that’s a strike in my book.

You could also argue that the xlsx format is a software-specific format, and it kind of is. But because Excel is so ubiquitous now, there are multiple ways of opening and converting xlsx files without ever using Excel, so I don’t think that’s really a disadvantage.

Overall, if you can send the data in CSV format instead of an Excel file, do that.

7.4 Web-based APIs

Loading static data from text and Excel files is very common. However, an emerging method of data extraction is via web-based APIs. These web-based APIs allow a user to extract datasets from larger repositories using just an internet connection. This allows for access to larger and more dynamic datasets.

7.4.1 What are APIs?

API stands for application programming interface. APIs are essentially just a set of functions for interacting with an application or service. For instance, many of the packages that you’ll use will essentially just be forms of API; they provide you with functions to interact with an underlying system or service.

For data extraction, we’re going to focus more specifically on web-based APIs. These APIs use the HTTP protocols to accept requests and then return the data requested. Whilst there are multiple methods that can be implemented in an API to perform different actions, we’re going to focus on the GET method. That is, we’re purely getting something from the API rather than trying to change anything that’s stored on the server. You can think of the GET method as being read-only.

To start with, we’re going to look at exactly how you would interact with an API, but then we’ll look at the BMRSr package, which I wrote to make interacting with the Balancing Mechanism and Reporting Service easier.

7.4.2 Accessing APIs in R

To access a web-based API in R, we’re going to need a connection to the internet, something that can use the HTTP protocol (we’re going to use the httr package) and potentially some log in credentials for the API. In this case, we’re going to just use a test API, but in reality, most APIs require that you use some kind of authentication so that they know who’s accessing their data.

As previously mentioned, to extract something from the API, you’ll be using the GET method. The httr package makes this super easy by providing a GET function. To this function, we’ll need to provide a URL. The GET function will then send a GET request to that address and return the response. A really simple GET request could be:

httr::GET(url = "http://google.com")
## Response [http://www.google.com/]
##   Date: 2023-03-21 11:39
##   Status: 200
##   Content-Type: text/html; charset=ISO-8859-1
##   Size: 16.3 kB
## <!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="...
## var f=this||self;var h,k=[];function l(a){for(var b;a&&(!a.getAttribute||!(b=...
## function n(a,b,c,d,g){var e="";c||-1!==b.search("&ei=")||(e="&ei="+l(d),-1===...
## document.documentElement.addEventListener("submit",function(b){var a;if(a=b.t...
## </style><style>body,td,a,p,.h{font-family:arial,sans-serif}body{margin:0;over...
## var h=this||self;var k,l=null!=(k=h.mei)?k:1,n,p=null!=(n=h.sdo)?n:!0,q=0,r,t...
## a.fileName;g&&(0<g.indexOf("-extension:/")&&(e=3),c+="&script="+b(g),f&&g===w...
## if (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf.q.fo...
## }
## })();</script><div id="mngb"><div id=gbar><nobr><b class=gb1>Search</b> <a cl...
## ...

That seems like a really complicated response at first, but when we look at each part, it’s quite simple.

  • Response
    • This is telling us where we got our response from. In this case, we sent a request to Google, so we got a response from Google.
  • Date
    • Fairly self-explanatory - the date and time of the response.
  • Status
    • Status codes give you an indication of how the handling of the request went. 200 means “Success”.
  • Content-Type
    • This is telling us what type the response is. In this case, the response is just a HTML page, which is exactly what we expect as that’s what you get when you type “google.com” into your browser.
  • Size
    • This is the size of the response
  • Content
    • Below the size, we see the actual response body. In this case, we’ve been given the html for the google.com page.

As simple as this example was, it didn’t really give us anything interesting back, just the Google homepage. So let’s use the GET request to get something more interesting.

We’re going to access the jsonplaceholder website, which provides fake APIs for testing. But for now, imagine that this is something like an Instagram database, holding users and their posts and comments.

The first step in accessing an API is to understand that commands the API is expecting. APIs will have what we call endpoints. These are paths that we can use to access a certain dataset. For instance, looking at the website, we can see that there are endpoints for lots of different types of data: posts, comments, albums, photos, todos and users. To access an endpoint, we just need to make sure we’re using the correct path. So let’s try getting a list of users:

httr::GET(url = "https://jsonplaceholder.typicode.com/users")
## Response [https://jsonplaceholder.typicode.com/users]
##   Date: 2023-03-21 11:39
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 5.64 kB
## [
##   {
##     "id": 1,
##     "name": "Leanne Graham",
##     "username": "Bret",
##     "email": "Sincere@april.biz",
##     "address": {
##       "street": "Kulas Light",
##       "suite": "Apt. 556",
##       "city": "Gwenborough",
## ...

Looking at the content type, we can see that unlike when we sent a request to Google.com, we’ve got a Content-Type of application/json. JSON is a data structure often used to send data across APIs. We won’t go into the structure of it now because R does most of the conversion for us, but if you’re interested, there’s more info on the JSON structure at www.json.org.

Trying to read raw JSON is hard, but httr includes functions to help us get it into a better structure for R. Using the httr::content() function, httr will automatically read the response content and convert it into the format we ask for (via the as parameter). For now, we’re going to leave the at parameter as ‘NULL’ which guesses the best format for us.

response <- httr::GET(url = "https://jsonplaceholder.typicode.com/users")
content <- httr::content(response)
head(content, 1) # we'll just look at the first entry for presentation sake
## [[1]]
## [[1]]$id
## [1] 1
## [[1]]$name
## [1] "Leanne Graham"
## [[1]]$username
## [1] "Bret"
## [[1]]$email
## [1] "Sincere@april.biz"
## [[1]]$address
## [[1]]$address$street
## [1] "Kulas Light"
## [[1]]$address$suite
## [1] "Apt. 556"
## [[1]]$address$city
## [1] "Gwenborough"
## [[1]]$address$zipcode
## [1] "92998-3874"
## [[1]]$address$geo
## [[1]]$address$geo$lat
## [1] "-37.3159"
## [[1]]$address$geo$lng
## [1] "81.1496"
## [[1]]$phone
## [1] "1-770-736-8031 x56442"
## [[1]]$website
## [1] "hildegard.org"
## [[1]]$company
## [[1]]$company$name
## [1] "Romaguera-Crona"
## [[1]]$company$catchPhrase
## [1] "Multi-layered client-server neural-net"
## [[1]]$company$bs
## [1] "harness real-time e-markets"

We can see that R has taken the response and turned it into a list for us. From here, we can then start our analysis.

In many cases however, you won’t want a complete list. Instead, you’ll want to provide some parameters to limit the data you get back from your endpoint. Most APIs will have a way of doing this. For example, reading the jsonplaceholder website, we can see that we can get all the posts for a specific user by appending the url with “?userId=x”. This section of the URL (things after a ?) are called the query part of the URL. So let’s try getting all of the posts for the user with ID 1:

response <- httr::GET(url = "https://jsonplaceholder.typicode.com/posts?userId=1")
content <- httr::content(response)
head(content, 1) # we'll just look at the first entry for presentation sake
## [[1]]
## [[1]]$userId
## [1] 1
## [[1]]$id
## [1] 1
## [[1]]$title
## [1] "sunt aut facere repellat provident occaecati excepturi optio reprehenderit"
## [[1]]$body
## [1] "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"

Whilst the parameters here are pretty simple, you will come across APIs that accept multiple parameters, making data extraction from an API a very powerful tool.

7.4.3 BMRSr

As easy as the above was, interacting with APIs that have several parameters and complicated URLs can get confusing. To this end, many people create packages in R that act as wrappers for various APIs. These packages will then provide you with functions that will automatically create the request, send it and receive and parse the content. You can kind of think about it as an API for an API!

This is what I did for the Balancing Mechanism Reporting Service (BMRS) API. BMRS provides a massive amount of energy-related data, but creating the correct URLs and dealing with the response can be tricky. The BMRSr package that I wrote was designed to help with that.

We’ll now go through a quick demo of the BMRSr package. If you’re not too bothered about this part, feel free to skip to the next section.

If you’re interested, there are a couple of things you’ll need:

  • The {BMRSr} package installed
  • A free BMRS API key that can be retrieved from the ELEXON portal.

Once you’ve got those two prerequisites, using BMRSr should be quite easy. The main function in the BMRSr package is the full_request() function, which will create your URL, send the request, and parse the response depending on your parameters. To do this however, the full_request() function needs some parameters:

  • data_item
    • A data item to retrieve. The BMRS platform holds lots of datasets, and so we need to specify which one we want to retrieve.
  • api_key
    • Our API_key that we got from the Elexon portal
  • parameters
    • Depending on which data_item you chose, you’ll need to provide some parameters to filter the data
  • service_type
    • What format you want the data returned in: values are XML or CSV.

So what parameters do I need? Well, the easiest way to find out is to use the get_parameters() function. This will return all of the parameters that can be provided to the full_request().

Let’s do an example. Say I want to return data for the B1620 data item, which shows us aggregated generation output per type. So, the first step is to know what parameters I can provide using the get_parameters() function:

## [1] "settlement_date" "period"

This tells me that I can provide two parameters in my request - the date and the settlement period. Using this information in my full_request() function…

bmrs_data <- BMRSr::full_request(data_item = "B1620",
                                 api_key = "put_your_API_key_here",
                                 service_type = "csv",
                                 settlement_date = "01/11/2019",
                                 period = "*") # From reading the API manual,
# I know that this returns all periods
head(bmrs_data, 2)
## # A tibble: 2 × 13
##   `*Document Type`    `Business Type` `Process Type` `Time Series ID`   Quantity
##   <chr>               <chr>           <chr>          <chr>                 <dbl>
## 1 Actual generation … Production      Realised       NGET-EMFIP-AGPT-T…     1636
## 2 Actual generation … Production      Realised       NGET-EMFIP-AGPT-T…        0
## # … with 8 more variables: Curve Type <chr>, Resolution <chr>,
## #   Settlement Date <date>, Settlement Period <dbl>,
## #   Power System Resource  Type <chr>, Active Flag <chr>, Document ID <chr>,
## #   Document RevNum <dbl>

And there we have it, we’ve retrieved a energy-related dataset from an API using the BMRSr package. There are roughly 101 data items available on BMRS so there’s a massive amount of data there for those who want to access it.

7.5 Databases

In the corporate world or when you’re dealing with larger systems, you’ll often have some form of database that stores all of the system data. This database can function as a great repository of data for your analyses, and by utilising a live connection to the database, we can easily update our analyses in the future.

In this section, we’ll really only cover the basics of how R can interact with an SQL database, so don’t feel as though you need to be a data engineer expert to understand this section. As long as you know the tiniest bit of SQL, you should be fine.

7.5.1 Database Management Systems and ODBC

There are lots of different types of database managements systems (DBMS) such as SQL Server, MySQL, MariaDB, and so on. These DBMSs allow us to create and maintain databases.

When we connect to our database, we do so through our DBMS, but all of them have slightly different implementations. This means that if we developed a package to connect to one type of database (like MySQL), we’d have to create an entirely different package to do the same thing with a MariaDB database. This is where the Open Database Connectivity (ODBC) standard comes in. The ODBC standard allows us to interface with databases that use different DBMS systems using a common API. This means that we could swap out our SQL Server database with a MySQL one, and we wouldn’t need to make too many changes. The ODBC standard is implemented via a driver; this driver functions as an interface layer between our application (R) and the database.

So when we connect to a database with R we go in this order:

R -> ODBC Driver -> DBMS

R tells the ODBC driver to run a query, and then the ODBC driver converts that request to one that can be interpreted by the DBMS that it was built for. This means that as long as we’ve got the appropriate ODBC driver for our DBMS, we can use (basically) the same R code to interact with any kind of ODBC-compliant database.

So to connect to our database, we’re going to need 2 things:

  • The odbc R package
    • This provides us with the R functions to create our connections and run our queries as so on.
    • Think of the odbc package as a set of tools for interacting with any ODBC database; it knows how to interface with the ODBC drivers, not the DBMS.
  • The ODBC driver for our database
    • This is the actual driver used by the odbc package that communicates with the database.
    • This is the implementation of the ODBC standard for the database we’re using.

7.5.2 Connecting with the odbc package

For this example, let’s say that we’re using an SQL Server database, and so we’ve got the ODBC Driver 17 for SQL Server installed.

First, let’s make sure we’ve got the driver detected:

##                            name   attribute
## 1 ODBC Driver 17 for SQL Server Description
## 2 ODBC Driver 17 for SQL Server      Driver
## 3 ODBC Driver 17 for SQL Server  UsageCount
##                                                       value
## 1                   Microsoft ODBC Driver 17 for SQL Server
## 2 /opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.7.so.2.1
## 3                                                         1

We can see our driver has been detected. Now we can use the odbc::dbConnect() function to create a connection to our database that we can then use to run queries:

my_connection <- odbc::dbConnect(drv = odbc::odbc(),
                                 driver = "ODBC Driver 17 for SQL Server",
                                 # You'll need to change this to your server
                                 server = "SQLDATABASESERVER",
                                 # You'll need to change this to your database
                                 database = "SQLDATABASE") 

With the drv parameter we’re specifying the type of driver that we’re using. Because we’re using an ODBC driver, we can use the odbc::odbc() function. The driver parameter expects a character string of the actual driver we’re going to use (not the type), so that’s where we enter the name we saw when we ran the odbc::odbcListDrivers() function. The server is the server that your database is being on hosted on and the database parameter is the database on the server that you want to connect to.

The odbc will use this information to create a connection string that it will then use to try and connect to the database. If there are any other values you need to include in the connection string, you can include them as named parameters and they’ll be added to the string:

my_connection <- odbc::dbConnect(drv = odbc::odbc(),
                                 driver = "ODBC Driver 17 for SQL Server",
                                 # You'll need to change this to your server
                                 server = "SQLDATABASESERVER",
                                 # You'll need to change this to your database
                                 database = "SQLDATABASE",
                                 # This would add an 'extra_parameter' entry to the string
                                 extra_parameter = "extra_parameter_value") 

Note: You’ll want to make sure you assign your connection to something because we’ll be passing the connection object to some other functions soon. Authentication

You might also have to specify some credentials when you try and connect to your database. To provide a username and password, just use the uid and pwd parameters.

If your database supports it, you can use Windows Authentication, meaning that you don’t need to provide an explicit username and password - the database will use your Windows account instead. To force the driver to try and use this type of authentication, you can add a Trusted_Connection parameter to the function call and set the value to "Yes" (not TRUE or "TRUE"). This parameter is then added to the connection string via the ... argument of the odbc::dbConnect() function. DSNs

Instead of specifying this information via the odbc::dbConnect() function, you can also create a Data Source Name (DSN) entry. This contains essentially the same information as we provided (the server location, the database, access credentials and so on) but then allows us to just use the dsn parameter of the odbc::dbConnect() function.

A DSN entry might look like this:

Driver = /opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.7.so.2.1
Port = 1234

We can then just provide the DSN name to the odbc::dbConnect() function:

my_connection <- odbc::dbConnect(drv = odbc::odbc(),
                                 dsn = "MyDatabase") 

This can be particularly useful if you’re working in more than one environment, where you might want to connect to the same database but the connection string is going to be different for each environment. Instead, you can create a Data Source Name entry with the same name but with different specifications, and then use the same odbc::dbConnect(drv = odbc::odbc(), dsn = "MyDatabase") call in both.

I won’t go into exactly how to add DSNs here because it depends on your driver and your OS, but the process is pretty simple once you’ve found the right information for your setup.

7.5.3 Querying the database

Now we’ve got our connection (my_connection), we can send queries to the database. To send queries and get back the results in one step, we can use the odbc::dbGetQuery() function, passing the connection and then the string containing the SQL we want to execute.

odbc::dbGetQuery(my_connection, "select top 1 Id from dbo.ExampleTable")
##   Id
## 1  1

We then get back the data as a normal R data.frame. R will deal with the data type conversion, but the exact conversion is dependent on the driver. Luckily, Microsoft provides a complete breakdown of how data types are mapped between on R and SQL Server on their website.