2 The Basics of the R Programming Language

Most of the work in this book is implemented in the R statistical programming language which, along with Python, is one of the two languages that I use in my day-to-day statistical analysis. Sample implementations in Python are also provided at various points in the book. For those who wish to follow the method and theory without the implementations in this book, there is no need to read this chapter. However, the style of this book is to use implementation to illustrate theory and practice, and so tolerance of many code blocks will be necessary as you read onward.

Readers who simply want to replicate this work as quickly as possible can use the code block copying feature, which appears whenever you hover over an input code block. Assuming all the required external packages have been installed, these code blocks should all be transportable and immediately usable. In some parts of the book I have used graphics to illustrate a concept but have hidden the underlying code, as I did not consider it important to the learning objectives at that point. Nevertheless, some readers will want to see it, and if you are one of them, the best place to go is the GitHub repository for this book.

This chapter is for those who wish to learn the methods in this book but do not know how to use a programming language. However, it is not intended to be a full tutorial on R. There are many more qualified individuals and existing resources that would better serve that purpose—in particular I recommend Wickham and Grolemund (2016). It is recommended that you consult these resources and become comfortable with the basics of R before proceeding into the later chapters of this book. However, acknowledging that many will want to dive in sooner rather than later, this chapter covers the absolute basics of R that will allow the uninitiated reader to proceed with at least some orientation.

2.1 What is R?

R is a programming language that was originally developed by and for statisticians, but in recent years its capabilities and the environments in which it is used have expanded greatly, with extensive use nowadays in academia and the public and private sectors. There are many advantages to using a programming language like R. Here are some:

  1. It is completely free and open source.
  2. It is faster and more efficient with memory than popular graphical user interface analytics tools.
  3. It facilitates easier replication of analysis from person to person compared with many alternatives.
  4. It has a large and growing global community of active users.
  5. It has a large and rapidly growing universe of packages, which are all free and which provide the ability to do an extremely wide range of general and highly specialized tasks, statistical and otherwise.

There is often heated debate about which tools are better for doing non-trivial statistical analysis. I personally find that R provides the widest array of resources for those interested in statistical modeling, while Python has a better general-purpose toolkit and is particularly well kitted out for machine learning applications.

2.2 How to start using R

Just like most programming languages, R itself is an interpreter which receives input and returns output. It is not very easy to use without an Integrated Development Environment (IDE), which is a convenient user interface allowing an R programmer to do all their main tasks including writing and running R code, saving files, viewing data and plots, integrating code into documents and many other things. By far the most popular IDE for R is RStudio. An example of what the RStudio IDE looks like can be seen in Figure 2.1.

Figure 2.1: The RStudio IDE

To start using R, follow these steps:

  1. Download and install the latest version of R from https://www.r-project.org/. Ensure that the version suits your operating system.
  2. Download the latest version of the RStudio IDE from https://rstudio.com/products/rstudio/ and view the video on that page to familiarize yourself with its features.
  3. Open RStudio and play around.

The initial stages of using R can be challenging, mostly due to the need to become familiar with how R understands, stores and processes data. Extensive trial and error is a learning necessity. Perseverance is important in these early stages, as well as an openness to seek help from others either in person or via online forums.

2.3 Data in R

As you start to do tasks involving data in R, you will generally want to store the things you create so that you can refer to them later. Simply calculating something does not store it in R. For example, a simple calculation like this can be performed easily:

3 + 3
## [1] 6

However, as soon as the calculation is complete, it is forgotten by R because the result hasn’t been assigned anywhere. To store something in your R session, you will assign it a name using the <- operator. So I can assign my previous calculation to an object called my_sum, and this allows me to access the value at any time.

# store the result
my_sum <- 3 + 3
# now I can work with it
my_sum + 3
## [1] 9

You will see above that you can comment your code by simply adding a # to the start of a line to ensure that the line is ignored by the interpreter.

Note that assignment to an object does not result in the value being displayed. To display the value, type the name of the object, use the print() function, or wrap the assignment in parentheses.

# show me the value of my_sum
my_sum
## [1] 6
# assign my_sum + 3 to new_sum and show its value
(new_sum <- my_sum + 3)
## [1] 9
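
The print() option mentioned above is not shown in the example, so here is a minimal sketch of it, reusing the same value as before. It behaves the same as typing the object name, but it is explicit, which can help inside longer scripts.

```r
# store a value, then display it explicitly with print()
my_sum <- 3 + 3
print(my_sum)
## [1] 6
```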

2.3.1 Data types

All data in R has an associated type, to reflect the wide range of data that R is able to work with. The typeof() function can be used to see the type of a single scalar value. Let’s look at the most common scalar data types.

Numeric data can be in integer form or double (decimal) form.

# integers can be signified by adding an 'L' to the end
my_integer <- 1L  
my_double <- 6.38
typeof(my_integer)
## [1] "integer"
typeof(my_double)
## [1] "double"

Character data is text data surrounded by single or double quotes.

my_character <- "THIS IS TEXT"
typeof(my_character)
## [1] "character"

Logical data takes the form TRUE or FALSE.

my_logical <- TRUE
typeof(my_logical)
## [1] "logical"

2.3.2 Homogeneous data structures

Vectors are one-dimensional structures containing data of the same type and are created using c(). The type of the vector can also be viewed using the typeof() function, but the str() function can be used to display both the contents of the vector and its type.

my_double_vector <- c(2.3, 6.8, 4.5, 65, 6)
str(my_double_vector)
##  num [1:5] 2.3 6.8 4.5 65 6

Categorical data—which takes only a finite number of possible values—can be stored as a factor vector to make it easier to perform grouping and manipulation.

categories <- factor(
  c("A", "B", "C", "A", "C")
)
str(categories)
##  Factor w/ 3 levels "A","B","C": 1 2 3 1 3

If needed, the factors can be given an order.

# character vector 
ranking <- c("Medium", "High", "Low")
str(ranking)
##  chr [1:3] "Medium" "High" "Low"
# turn it into an ordered factor
ranking_factors <- ordered(
  ranking, levels = c("Low", "Medium", "High")
)
str(ranking_factors)
##  Ord.factor w/ 3 levels "Low"<"Medium"<..: 2 3 1

The number of elements in a vector can be seen using the length() function.

length(categories)
## [1] 5

Simple numeric sequence vectors can be created using shorthand notation.

(my_sequence <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
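
The colon shorthand only steps by one. As a hedged illustration of what to do when a different step is needed, the base R function seq() accepts a step size (by) or a target number of values (length.out). The object names below are chosen purely for illustration.

```r
# odd numbers from 1 to 9 using a step of 2
(odd_numbers <- seq(from = 1, to = 9, by = 2))
## [1] 1 3 5 7 9
# five evenly spaced values between 0 and 1
(even_spacing <- seq(from = 0, to = 1, length.out = 5))
## [1] 0.00 0.25 0.50 0.75 1.00
```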

If you try to mix data types inside a vector, it will usually result in type coercion, where one or more of the types are forced into a different type to ensure homogeneity. Often this means the vector will become a character vector.

# numeric sequence vector
vec <- 1:5
str(vec)
##  int [1:5] 1 2 3 4 5
# create a new vector containing vec and the character "hello"
new_vec <- c(vec, "hello")
# numeric values have been coerced into their character equivalents
str(new_vec)
##  chr [1:6] "1" "2" "3" "4" "5" "hello"

But sometimes logical or factor types will be coerced to numeric.

# attempt a mixed logical and numeric
mix <- c(TRUE, 6)
# logical has been converted to binary numeric (TRUE = 1)
str(mix)
##  num [1:2] 1 6
# try to add a numeric to our previous categories factor vector
new_categories <- c(categories, 1)
# categories have been coerced to background integer representations
str(new_categories)
##  num [1:6] 1 2 3 1 3 1

Matrices are two-dimensional data structures of the same type and are built from a vector by defining the number of rows and columns. Data is read into the matrix down the columns, starting left and moving right. Matrices are rarely used for non-numeric data types.

# create a 2x2 matrix with the first four integers
(m <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2))
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
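
To see the effect of the column-wise filling described above, it can help to compare it with row-wise filling. This short sketch uses the byrow argument of matrix() with the same four values; the name m_byrow is just for illustration.

```r
# same values, but filled across the rows instead of down the columns
(m_byrow <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE))
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
# dimensions are reported as rows, then columns
dim(m_byrow)
## [1] 2 2
```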

Arrays are n-dimensional data structures with the same data type and are not used extensively by most R users.
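
Although arrays are less common, a small sketch may make the idea concrete. Here a three-dimensional array is built with the base R array() function; the name my_array and the chosen dimensions are illustrative only. As with matrices, the values are filled down the columns first.

```r
# a 2 x 3 x 2 array holding the integers 1 to 12
my_array <- array(1:12, dim = c(2, 3, 2))
dim(my_array)
## [1] 2 3 2
# element in row 1, column 2 of the second 'slice'
my_array[1, 2, 2]
## [1] 9
```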

2.3.3 Heterogeneous data structures

Lists are one-dimensional data structures that can take data of any type.

my_list <- list(6, TRUE, "hello")
str(my_list)
## List of 3
##  $ : num 6
##  $ : logi TRUE
##  $ : chr "hello"

List elements can be any data type and any dimension. Each element can be given a name.

new_list <- list(
  scalar = 6, 
  vector = c("Hello", "Goodbye"), 
  matrix = matrix(1:4, nrow = 2, ncol = 2)
)
str(new_list)
## List of 3
##  $ scalar: num 6
##  $ vector: chr [1:2] "Hello" "Goodbye"
##  $ matrix: int [1:2, 1:2] 1 2 3 4

Named list elements can be accessed by using $.

new_list$matrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
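
List elements can also be reached without $, which is useful when elements have no names. The following sketch rebuilds the same list so that it runs on its own, and shows double square brackets (which extract the element) alongside single square brackets (which return a shorter list).

```r
new_list <- list(
  scalar = 6,
  vector = c("Hello", "Goodbye"),
  matrix = matrix(1:4, nrow = 2, ncol = 2)
)
# double square brackets extract an element by position or by name
new_list[[2]]
## [1] "Hello"   "Goodbye"
new_list[["scalar"]]
## [1] 6
# single square brackets return a smaller list rather than the element itself
str(new_list[1])
## List of 1
##  $ scalar: num 6
```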

Dataframes are the most used data structure in R; they are effectively a named list of vectors of the same length, with each vector as a column. As such, a dataframe is very similar in nature to a typical database table or spreadsheet.

# two vectors of different types but same length
names <- c("John", "Ayesha")
ages <- c(31, 24)
# create a dataframe
(df <- data.frame(names, ages))
##    names ages
## 1   John   31
## 2 Ayesha   24
# get types of columns
str(df)
## 'data.frame':    2 obs. of  2 variables:
##  $ names: chr  "John" "Ayesha"
##  $ ages : num  31 24
# get dimensions of df
dim(df)
## [1] 2 2

2.4 Working with dataframes

The dataframe is the most common data structure used by analysts in R, due to its similarity to data tables found in databases and spreadsheets. We will work with dataframes a lot in this book, so let’s get to know them.

2.4.1 Loading and tidying data in dataframes

To work with data in R, you usually need to pull it in from an outside source into a dataframe [11]. R facilitates numerous ways of importing data from simple .csv files, from Excel files, from online sources or from databases. Let’s load a data set that we will use later—the chinook_employees data set, which contains information on employees of a sales company. The read.csv() function can accept a URL address of the file if it is online.

# url of data set 
url <- "https://ona-book.org/data/chinook_employees.csv"
# load the data set and store it as a dataframe called chinook_employees
chinook_employees <- read.csv(url)

We might not want to display this entire data set before knowing how big it is. We can view the dimensions, and if it is too big to display, we can use the head() function to display just the first few rows.

dim(chinook_employees)
## [1] 8 4
# eight rows, let's view first six
head(chinook_employees)
##   EmployeeId FirstName LastName ReportsTo
## 1          1    Andrew    Adams        NA
## 2          2     Nancy  Edwards         1
## 3          3      Jane  Peacock         2
## 4          4  Margaret     Park         2
## 5          5     Steve  Johnson         2
## 6          6   Michael Mitchell         1

We can view a specific column by using $, and we can use square brackets to view a specific entry. For example if we wanted to see the 6th entry of the LastName column:

chinook_employees$LastName[6]
## [1] "Mitchell"

Alternatively, we can use a [row, column] index to get a specific entry in the dataframe.

chinook_employees[4, 3]
## [1] "Park"
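
The same [row, column] notation can return whole rows or columns when one side of the comma is left empty. The sketch below uses a small made-up dataframe (the name people and its values are hypothetical) so that it runs without downloading anything.

```r
# a small illustrative dataframe (hypothetical values)
people <- data.frame(
  name = c("John", "Ayesha", "Maria"),
  age = c(31, 24, 45)
)
# second row, all columns
people[2, ]
# all rows, first column only (returned as a vector)
people[, 1]
## [1] "John"   "Ayesha" "Maria"
# rows 1 and 3, selecting the column by name
people[c(1, 3), "age"]
## [1] 31 45
```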

We can take a look at the data types using str().

str(chinook_employees)
## 'data.frame':    8 obs. of  4 variables:
##  $ EmployeeId: int  1 2 3 4 5 6 7 8
##  $ FirstName : chr  "Andrew" "Nancy" "Jane" "Margaret" ...
##  $ LastName  : chr  "Adams" "Edwards" "Peacock" "Park" ...
##  $ ReportsTo : int  NA 1 2 2 2 1 6 6

We can also see a statistical summary of each column using summary(), which tells us various statistics depending on the type of the column.

summary(chinook_employees)
##    EmployeeId    FirstName           LastName           ReportsTo    
##  Min.   :1.00   Length:8           Length:8           Min.   :1.000  
##  1st Qu.:2.75   Class :character   Class :character   1st Qu.:1.500  
##  Median :4.50   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :4.50                                         Mean   :2.857  
##  3rd Qu.:6.25                                         3rd Qu.:4.000  
##  Max.   :8.00                                         Max.   :6.000  
##                                                       NA's   :1

Missing data in R is identified by a special NA value. This should not be confused with "NA", which is simply a character string. The function is.na() will look at all values in a vector or dataframe and return TRUE or FALSE based on whether they are NA or not. Adding these up with the sum() function treats TRUE as 1 and FALSE as 0, which effectively provides a count of missing data.

sum(is.na(chinook_employees))
## [1] 1

In some cases, we might want to remove the rows of data that contain NAs. The easiest way is to use the complete.cases() function, which identifies the rows that have no NAs, and then we can select those rows from the dataframe based on that condition. Note that you can overwrite objects with the same name in R.

# remove rows containing NAs
chinook_employees <- chinook_employees[complete.cases(chinook_employees), ]

# confirm no NAs
sum(is.na(chinook_employees))
## [1] 0
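
When a dataframe has several columns, it is often useful to know where the missing values sit before removing anything. The sketch below uses a small made-up dataframe (scores is hypothetical) and two further base R functions: colSums() to count NAs per column, and na.omit() as a shorthand that I believe gives the same rows as the complete.cases() approach.

```r
# a small illustrative dataframe with two missing values (hypothetical data)
scores <- data.frame(
  id = 1:4,
  score = c(10, NA, 7, NA)
)
# count missing values in each column
colSums(is.na(scores))
# na.omit() drops incomplete rows, like subsetting with complete.cases()
scores_complete <- na.omit(scores)
nrow(scores_complete)
## [1] 2
```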

We can see the unique values of a vector or column using the unique() function.

unique(chinook_employees$FirstName)
## [1] "Nancy"    "Jane"     "Margaret" "Steve"    "Michael"  "Robert"   "Laura"

If we need to change the type of a column in a dataframe, we can use the as.numeric(), as.character(), as.logical() or as.factor() functions. For example, given that there are only seven unique values for the FirstName column in chinook_employees, we may want to convert it from its current character form to a factor.

chinook_employees$FirstName <- as.factor(chinook_employees$FirstName)
str(chinook_employees)
## 'data.frame':    7 obs. of  4 variables:
##  $ EmployeeId: int  2 3 4 5 6 7 8
##  $ FirstName : Factor w/ 7 levels "Jane","Laura",..: 5 1 3 7 4 6 2
##  $ LastName  : chr  "Edwards" "Peacock" "Park" "Johnson" ...
##  $ ReportsTo : int  1 2 2 2 1 6 6
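
Type conversion does not always succeed for every value. As a small, self-contained sketch (the vector text_numbers is made up for illustration), converting text to numeric works where the text is a valid number and produces NA elsewhere.

```r
# numbers stored as text, with one entry that is not a valid number
text_numbers <- c("10", "2.5", "seven")
# valid entries are converted; other entries become NA
# (R also issues a warning that NAs were introduced by coercion)
(converted <- as.numeric(text_numbers))
## [1] 10.0  2.5   NA
```

Checking for unexpected NAs after a conversion is a sensible habit, since the warning is easy to miss.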

2.4.2 Manipulating dataframes

Dataframes can be subsetted to contain only rows that satisfy specific conditions.

(chinook_employee_5 <- subset(chinook_employees, subset = EmployeeId == 5))
##   EmployeeId FirstName LastName ReportsTo
## 5          5     Steve  Johnson         2

Note the use of ==, which is used in many programming languages to test for precise equality. Similarly, we can select rows based on inequalities (> for ‘greater than’, < for ‘less than’, >= for ‘greater than or equal to’, <= for ‘less than or equal to’, or != for ‘not equal to’). For example:

(chinook_employees_upto5 <- subset(chinook_employees, 
                                   subset = EmployeeId <= 5))
##   EmployeeId FirstName LastName ReportsTo
## 2          2     Nancy  Edwards         1
## 3          3      Jane  Peacock         2
## 4          4  Margaret     Park         2
## 5          5     Steve  Johnson         2
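
Conditions can also be combined. The sketch below uses a small made-up dataframe (staff and its columns are hypothetical) to show the & operator for 'and' and the | operator for 'or' inside subset().

```r
# a small illustrative dataframe (hypothetical values)
staff <- data.frame(
  id = 1:5,
  team = c("A", "B", "A", "B", "A"),
  tenure = c(2, 7, 5, 1, 9)
)
# & means both conditions must hold: team A with tenure above 4
(team_a_experienced <- subset(staff, subset = team == "A" & tenure > 4))
# | means at least one condition must hold: team B, or tenure above 8
(team_b_or_senior <- subset(staff, subset = team == "B" | tenure > 8))
```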

To select specific columns use the select argument.

(chinook_employee_names <- subset(chinook_employees, 
                                 select = c("FirstName", "LastName")))
##   FirstName LastName
## 2     Nancy  Edwards
## 3      Jane  Peacock
## 4  Margaret     Park
## 5     Steve  Johnson
## 6   Michael Mitchell
## 7    Robert     King
## 8     Laura Callahan

Two dataframes with the same column names can be combined by their rows.

chinook_employee_7 <- subset(chinook_employees, subset = EmployeeId == 7)

# bind the rows to chinook_employee_5
(chinook_employee_5and7 <- rbind(chinook_employee_5, chinook_employee_7))
##   EmployeeId FirstName LastName ReportsTo
## 5          5     Steve  Johnson         2
## 7          7    Robert     King         6

Two dataframes with different column names can be combined by their columns.

chinook_reporting <- subset(chinook_employees, 
                     select = c("EmployeeId", "ReportsTo"))

# bind the columns to chinook_employee_names
(full_df <- cbind(chinook_reporting, chinook_employee_names))
##   EmployeeId ReportsTo FirstName LastName
## 2          2         1     Nancy  Edwards
## 3          3         2      Jane  Peacock
## 4          4         2  Margaret     Park
## 5          5         2     Steve  Johnson
## 6          6         1   Michael Mitchell
## 7          7         6    Robert     King
## 8          8         6     Laura Callahan

2.5 Functions, packages and libraries

In the code so far we have used a variety of functions, for example head(), subset() and rbind(). Functions are operations that take certain defined inputs and return an output. Functions exist to perform common useful operations.

2.5.1 Using functions

Functions usually take one or more arguments. Often there are a large number of arguments that a function can take, but many are optional and not required to be specified by the user. For example, the function head(), which displays the first rows of a dataframe [12], has only one required argument x: the name of the dataframe. A second argument is optional, n: the number of rows to display. If n is not entered, it is assumed to have the default value n = 6.

When running a function, you can either specify the arguments by name or you can enter them in order without their names. If you enter arguments without naming them, R expects the arguments to be entered in exactly the right order.

# see the head of chinook_employees, with the default of six rows
head(chinook_employees)
##   EmployeeId FirstName LastName ReportsTo
## 2          2     Nancy  Edwards         1
## 3          3      Jane  Peacock         2
## 4          4  Margaret     Park         2
## 5          5     Steve  Johnson         2
## 6          6   Michael Mitchell         1
## 7          7    Robert     King         6
# see fewer rows - arguments need to be in the right order if not named
head(chinook_employees, 3)
##   EmployeeId FirstName LastName ReportsTo
## 2          2     Nancy  Edwards         1
## 3          3      Jane  Peacock         2
## 4          4  Margaret     Park         2
# or if you don't know the right order, 
# name your arguments and you can put them in any order
head(n = 3, x = chinook_employees)
##   EmployeeId FirstName LastName ReportsTo
## 2          2     Nancy  Edwards         1
## 3          3      Jane  Peacock         2
## 4          4  Margaret     Park         2

2.5.2 Help with functions

Most functions in R have excellent help documentation. To get help on the head() function, type help(head) or ?head. This will display the results in the Help browser window in RStudio. Alternatively you can open the Help browser window directly in RStudio and do a search there. An example of the browser results for head() is in Figure 2.2.

Figure 2.2: Results of a search for the head() function in the RStudio Help browser

The help page normally shows the following:

  • Description of the purpose of the function
  • Usage examples, so you can quickly see how it is used
  • Arguments list so you can see the names and order of arguments
  • Details or notes on further considerations on use
  • Expected value of the output (for example head() is expected to return a similar object to its first input x)
  • Examples to help orient you further (sometimes examples can be very abstract in nature and not so helpful to users)

2.5.3 Writing your own functions

Functions are not limited to those that come packaged in R. Users can write their own functions to perform tasks that are helpful to their objectives. Experienced programmers in most languages subscribe to a principle called DRY (Don’t Repeat Yourself). Whenever a task needs to be done repeatedly, it is poor practice to write the same code numerous times. It makes more sense to write a function to do the task.

In this example, a simple function is written which generates a report on a dataframe:

# create df_report function
df_report <- function(df) {
  paste("This dataframe contains", nrow(df), "rows and", 
        ncol(df), "columns. There are", sum(is.na(df)), "NA entries.")
}

We can test our function by using the chinook_employees data set (remember that we removed a row containing an NA value earlier).

df_report(chinook_employees)
## [1] "This dataframe contains 7 rows and 4 columns. There are 0 NA entries."
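
Your own functions can have optional arguments with default values, just like head() does. The following is a minimal sketch rather than anything used elsewhere in this book: na_percent is a hypothetical helper that reports the percentage of entries in a dataframe that are NA, and its digits argument defaults to 1. It is tried here on the built-in airquality data set.

```r
# hypothetical helper: percentage of NA entries in a dataframe
# digits is optional and defaults to one decimal place
na_percent <- function(df, digits = 1) {
  round(100 * mean(is.na(df)), digits)
}
# uses the default number of decimal places
na_percent(airquality)
# overrides the default by naming the argument
na_percent(airquality, digits = 3)
```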

2.5.4 Installing packages

All the common functions that we have used so far exist in the base R installation. However, the beauty of open source languages like R is that users can write their own functions or resources and release them to others via packages. A package is an additional module that can be installed easily; it makes resources available which are not in the base R installation. In this book we will be using functions from both base R and from popular and useful packages. As an example, a fundamental package which we will use in this book is the igraph package for constructing and analyzing graphs.

Before an external package can be used, it must be installed into your package library using install.packages(). So to install igraph, type install.packages("igraph") into the console. This will send R to the main internet repository for R packages (known as CRAN). It will find the right version of igraph for your operating system and download and install it into your package library. If igraph needs other packages in order to work, it will also install these packages.

If you want to install more than one package, put the names of the packages inside a character vector—for example:

my_packages <- c("igraph", "ggraph")
install.packages(my_packages)
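
If you are not sure which packages are already in your library, you can check before installing. This is one possible sketch using the base R function requireNamespace(), which returns TRUE when a package is installed; the object names is_installed and to_install are illustrative.

```r
my_packages <- c("igraph", "ggraph")
# requireNamespace() returns TRUE if a package is already installed
is_installed <- vapply(my_packages, requireNamespace,
                       logical(1), quietly = TRUE)
# keep only the packages that still need installing
(to_install <- my_packages[!is_installed])
# if anything is listed, install it with: install.packages(to_install)
```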

Once you have installed a package, you can see what functions are available by calling for help on it, for example using help(package = igraph). One package you may wish to install now is the onadata package, which contains all the data sets used in this book. Once you have installed and loaded this package, all the data sets used in this book will be available in your R session and ready to work with. If you do this, you can ignore the read.csv() commands later in the book, which download the data from the internet.

2.5.5 Using packages

Once you have installed a package into your package library, to use it in your R session you need to load it using the library() function. For example, to load igraph after installing it, use library(igraph). Often nothing will happen when you use this command, but rest assured the package has been loaded and you can start to use the functions inside it. Sometimes when you load the package a series of messages will display, usually to make you aware of certain things that you need to keep in mind when using the package. Note that whenever you see the library() command in this book, it is assumed that you have already installed the package in that command. If you have not, the library() command will fail.

Once a package is loaded from your library, you can use any of the functions inside it. For example, the degree() function is not available before you load the igraph package but becomes available after it is loaded. In this sense, functions ‘belong’ to packages.

Problems can occur when you load packages that contain functions with the same name as functions that already exist in your R session. Often the messages you see when loading a package will alert you to this. When R is faced with a situation where a function exists in multiple packages you have loaded, R always defaults to the function in the most recently loaded package. This may not always be what you intended.

One way to completely avoid this issue is to get in the habit of namespacing your functions. To namespace, you simply use package::function(), so to safely call degree() from igraph, you use igraph::degree(). Most of the time in this book when a function is being called from a package outside base R, I use namespacing to call that function. This should help avoid confusion about which packages are being used for which functions.
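
Namespacing can be tried without installing anything, because many everyday functions belong to packages that ship with R. In this short sketch, median() is called from the stats package and head() from the utils package.

```r
# median() lives in the stats package
stats::median(c(3, 1, 7))
## [1] 3
# head() lives in the utils package
utils::head(letters, 3)
## [1] "a" "b" "c"
```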

2.5.6 The pipe operator

Even in the most elementary briefing about R, it is very difficult to ignore the pipe operator. The pipe operator makes code more natural to read and write and reduces the typical computing problem of many nested operations inside parentheses.

As an example, imagine we wanted to do the following two operations in one command:

  1. Subset chinook_employees to only the LastName values of those with EmployeeId less than 5
  2. Convert those names to all upper case characters.

Remembering that we have already removed rows with NA values from chinook_employees, one way to do this is:

toupper(subset(chinook_employees$LastName, 
               subset = chinook_employees$EmployeeId < 5))
## [1] "EDWARDS" "PEACOCK" "PARK"

This is nested and needs to be read from the inside out in order to align with the instructions. The pipe operator |> takes the command that comes before it and places it inside the function that follows it (as the first unnamed argument). This reduces complexity and allows you to follow the logic more clearly.

# use the pipe operator to lay out the steps more logically
chinook_employees$LastName |> 
  subset(subset = chinook_employees$EmployeeId < 5) |> 
  toupper() 
## [1] "EDWARDS" "PEACOCK" "PARK"
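
The same idea can be tried on a plain vector. This sketch assumes R version 4.1.0 or later, where the |> operator is available, and uses a made-up vector called values. It also shows that further named arguments can still be passed to a function that receives piped input.

```r
values <- c(4, NA, 9, NA, 16)
# nested version: read from the inside out
nested_result <- sum(is.na(values))
# piped version: read from top to bottom
piped_result <- values |>
  is.na() |>
  sum()
piped_result
## [1] 2
# extra arguments go inside the receiving function as usual
root_total <- values |>
  sqrt() |>
  sum(na.rm = TRUE)
root_total
## [1] 9
```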

The pipe operator is very widely used because it helps to make code more readable, it reduces complexity, and it helps orient around a common ‘grammar’ for the manipulation of data. The pipe operator helps you structure your code more clearly around nouns (objects), verbs (functions) and adverbs (arguments of functions). One of the most developed sets of packages in R that follows these principles is the tidyverse family of packages, which I encourage you to explore [13].

2.6 Errors, warnings and messages

As I mentioned earlier in this chapter, getting familiar with R can be frustrating at the beginning if you have never programmed before. You can expect to regularly see messages, warnings or errors in response to your commands. I encourage you to regard these as your friend rather than your enemy. It is very tempting to take the latter approach when you are starting out, but over time I hope you will appreciate some wisdom from my words.

Errors are serious problems which usually result in the halting of your code and a failure to return your requested output. They usually come with an indication of the source of the error, and these can sometimes be easy to understand and sometimes frustratingly vague and abstract. For example, an easy-to-understand error is:

subset(chinook_employees, subset = EmployeeId = 5)
Error: unexpected '=' in "subset(chinook_employees, subset = EmployeeId ="

This helps you see that you have used EmployeeId = 5 as a condition to subset your data, when you should have used EmployeeId == 5 for precise equality.

A much more challenging error to understand is:

head[chinook_employees]
Error in head[chinook_employees] : object of type 'closure' is not subsettable

When first faced with an error that you can’t understand, try not to get frustrated and proceed in the knowledge that it usually can be fixed easily and quickly. Often the problem is much more obvious than you think, and if not, there is still a 99% likelihood that others have made this error and you can read about it online. The first step is to take a look at your code to see if you can spot what you did wrong. In this case, you may see that you have used square brackets [] instead of parentheses () when calling your head() function. If you cannot see what is wrong, the next step is to ask a colleague or do an internet search with the text of the error message you receive, or to consult online forums like https://stackoverflow.com. The more experienced you become, the easier it is to interpret error messages.

Warnings are less serious and usually alert you to something that you might be overlooking and which could indicate a problem with the output. In many cases you can ignore warnings, but sometimes they are an important reminder to go back and edit your code. For example, you may run a model which doesn’t converge, and while this does not stop R from returning results, it is also very useful for you to know that it didn’t converge.

Messages are pieces of information that may or may not be useful to you at a particular point in time. Sometimes you will receive messages when you load a package from your library. Sometimes messages will keep you up to date on the progress of a process that is taking a long time to execute.
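
To make the three categories concrete, here is a small sketch that produces one of each on purpose. It uses the base R functions message(), suppressWarnings() and tryCatch(); the last of these is not covered elsewhere in this chapter and is shown only as one way to respond to an error without halting your code.

```r
# a message is informational and does not interrupt anything
message("Starting calculation")
# a warning still returns a result: here NA, because "abc" is not a number
# (suppressWarnings() hides the warning so that only the result remains)
result <- suppressWarnings(as.numeric("abc"))
result
## [1] NA
# an error halts the command; tryCatch() lets you respond instead of stopping
outcome <- tryCatch(
  log("ten"),
  error = function(e) paste("Caught an error:", conditionMessage(e))
)
outcome
```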

2.7 Plotting and graphing

As you might expect in a well-developed programming language, there are numerous ways to plot and graph information in R. If you are doing exploratory data analysis on fairly simple data and you don’t need to worry about pretty appearance or formatting, the built-in plot capabilities of base R are fine. If you need a pretty appearance, more precision, color coding or even 3D graphics or animation, there are also specialized plotting and graphing packages for these purposes. In general when working interactively in RStudio, graphical output will be rendered in the Plots pane, where you can copy it or save it as an image.

2.7.1 Plotting in base R

The simplest plot function in base R is plot(). This performs basic X-Y plotting. As an example, this code will generate a scatter plot of Ozone against Temp in the built-in airquality data set in R, with the results displayed in Figure 2.3. Note the use of the arguments main, xlab and ylab for customizing the axis labels and title for the plot.

# scatter plot of ozone against temp
plot(x = airquality$Temp, y = airquality$Ozone,
     xlab = "Temperature (F)", ylab = "Ozone",
     main = "Scatterplot of Ozone vs Temperature")

Figure 2.3: Simple scatterplot of Ozone against Temp in the airquality data set

Histograms of data can be generated using the hist() function. This command will generate a histogram of Ozone as displayed in Figure 2.4. Note the use of breaks to customize how the bars appear.

# histogram of ozone
hist(airquality$Ozone, breaks = 10,
     xlab = "Ozone levels", 
     main = "Histogram of Ozone Levels")

Figure 2.4: Simple histogram of Ozone in the airquality data set

Box and whisker plots are excellent ways to see the distribution of a variable, and can be grouped against another variable to see bivariate patterns. For example, this command will show a box and whisker plot of Ozone grouped against Month, with the output shown in Figure 2.5. Note the use of the formula and data notation here to define the variable we are interested in and how we want it grouped.

# box plot of Ozone by Month
boxplot(formula = Ozone ~ Month, data = airquality,
        xlab = "Month", ylab = "Ozone levels",
        main = "Boxplot of Ozone Levels by Month")

Figure 2.5: Simple box plot of Ozone grouped against Month in the airquality data set

These are among the most common plots used for data exploration purposes. They are examples of a wider range of plotting and graphing functions available in base R, such as line plots, bar plots and other varieties which you may see later in this book.
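As a rough illustration of two of those other varieties, the following sketch draws a bar plot and a line plot from the same airquality data set. The choice of what to plot (mean Ozone per month, and daily Temp in May) and the object names monthly_ozone and may are assumptions made for this example only.

```r
# mean Ozone per month, ignoring missing readings
monthly_ozone <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)

# bar plot of mean Ozone by month
barplot(monthly_ozone,
        xlab = "Month", ylab = "Mean Ozone",
        main = "Mean Ozone Level by Month")

# line plot of daily temperature for May (Month == 5)
may <- airquality[airquality$Month == 5, ]
plot(x = may$Day, y = may$Temp, type = "l",
     xlab = "Day of May", ylab = "Temperature (F)",
     main = "Daily Temperature in May")
```

The argument type = "l" is what turns the default scatter plot from plot() into a line plot.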

2.7.2 Specialist plotting and graphing packages

By far the most commonly used specialist plotting and graphing package in R is ggplot2. ggplot2 allows the flexible construction of a very wide range of charts and graphs, but uses a very specific command grammar which can take some getting used to. However, once learned, ggplot2 can be an extremely powerful tool. Later in this book we will make a lot of references to ggplot2 and some of its extension packages like ggraph. A great learning resource for ggplot2 is Wickham (2016). Here are some examples of how to recreate the plots from the previous section in ggplot2 using its layered graphics grammar.

To start graphing, the ggplot() function usually requires a data set. You can also define some aesthetic mappings in this initial function, which associate a feature of the chart with an element of the data. Any such aesthetic mappings are inherited by later commands in the layering. In this case, we use the airquality data set, we define our x and y aesthetics and we then use geom_point() to draw a scatter plot with some visual customization. We also use a theme command to obtain a preset look for our chart — in this case a minimal look — and we customize our title and axis labels. The result is in Figure 2.6.

library(ggplot2)

# create scatter of Ozone vs Temp in airquality data set
ggplot(data = airquality, aes(x = Temp, y = Ozone)) +
  geom_point(color = "pink", shape = "diamond", size = 3) +
  theme_minimal() +
  labs(title = "Scatterplot of Ozone vs Temperature",
       x = "Temperature (F)")

Figure 2.6: Simple scatter plot of Ozone against Temp in the airquality data set using ggplot2

To create our histogram of Ozone readings, we use a similar approach with the result in Figure 2.7.

# create histogram of Ozone
ggplot(data = airquality, aes(x = Ozone)) +
  geom_histogram(bins = 10, fill = "lightblue", color = "pink") +
  theme_minimal() + 
  labs(title = "Histogram of Ozone Levels",
       x = "Ozone levels", 
       y = "Frequency")

Figure 2.7: Simple histogram of Ozone in the airquality data set using ggplot2

And finally, we create our box and whisker plot using the same principles, with the result in Figure 2.8.

# create boxplot of Ozone by Month
ggplot(data = airquality, aes(x = as.factor(Month), y = Ozone)) +
  geom_boxplot(fill = "lightblue", color = "pink") +
  theme_minimal() +
  labs(title = "Boxplot of Ozone Levels by Month",
       x = "Month",
       y = "Ozone levels")

Figure 2.8: Simple boxplot of Ozone against Month in the airquality data set using ggplot2
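To make the idea of inherited aesthetic mappings more concrete, here is a minimal sketch in which two layers share the x and y aesthetics defined once in ggplot(). The second layer, geom_smooth() with a linear fit, and the object names ozone_complete and p are choices made for this illustration and are not taken from the examples above.

```r
library(ggplot2)

# drop rows with missing Ozone so both layers use the same points
ozone_complete <- airquality[!is.na(airquality$Ozone), ]

# x and y are mapped once in ggplot() and inherited by both layers
p <- ggplot(data = ozone_complete, aes(x = Temp, y = Ozone)) +
  geom_point(color = "pink", size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, color = "lightblue") +
  theme_minimal() +
  labs(title = "Ozone vs Temperature with Linear Trend",
       x = "Temperature (F)")

# printing the object draws the chart
p
```

Neither geom_point() nor geom_smooth() repeats the aes() call, because each layer picks up the mappings set in the initial ggplot() function.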

2.8 Randomization in R

An important element of many processes in mathematics and statistics is randomization. Being able to generate random numbers is an important part of sampling or initiating processes or algorithms. We will also see later in this book that randomization is an important element of many graph algorithms, including visualization layout algorithms. In the following code we randomly sample 3 rows from the airquality data set using the sample() function.

airquality[sample(1:nrow(airquality), 3), ]
##    Ozone Solar.R Wind Temp Month Day
## 93    39      83  6.9   81     8   1
## 86   108     223  8.0   85     7  25
## 9      8      19 20.1   61     5   9

We do this a second time and note a different output because the sampling is random.

airquality[sample(1:nrow(airquality), 3), ]
##     Ozone Solar.R Wind Temp Month Day
## 86    108     223  8.0   85     7  25
## 41     39     323 11.5   87     6  10
## 137     9      24 10.9   71     9  14

Random number generation can be problematic when you want to precisely replicate work done previously by yourself or others. If the previous work depended on random number generation, then your output will probably not match the previous output and it will be difficult to tell if this is because of an error you made or simply because of different random numbers.

You can control random number generation by manually setting a random seed. In fact, random number generation in statistical software is rarely truly random, and is actually pseudorandom. Pseudorandomization is a simulated process that is ‘seeded’ using a vector of values. By default this vector is generated from the precise system time at a given moment, but its generation can also be controlled manually by a user. A given seed will always produce the same results from randomization, and so manually setting a seed can be an effective way to ensure precise reproducibility of work.

Again, let’s generate three random rows from our airquality data set, but this time we manually set the same seed before each sampling process. You can use any number of your choice to manually set a seed.

# sample 1
set.seed(123)
airquality[sample(1:nrow(airquality), 3), ]
##     Ozone Solar.R Wind Temp Month Day
## 49     20      37  9.2   65     6  18
## 120    76     203  9.7   97     8  28
## 82     16       7  6.9   74     7  21
# sample 2
set.seed(123)
airquality[sample(1:nrow(airquality), 3), ]
##     Ozone Solar.R Wind Temp Month Day
## 49     20      37  9.2   65     6  18
## 120    76     203  9.7   97     8  28
## 82     16       7  6.9   74     7  21

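Note that setting a seed fixes only the starting point: every call that requires randomization advances the random number generator, so a later call will not repeat an earlier result unless the seed is set again¹⁴. Therefore it is important to set a random seed before every random process that you want to reproduce exactly.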
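The following sketch illustrates this behavior by sampling row numbers rather than whole rows. The object names first_draw, second_draw and third_draw are chosen for this example only.

```r
# set the seed once, then draw twice without resetting it
set.seed(123)
first_draw <- sample(1:nrow(airquality), 3)
second_draw <- sample(1:nrow(airquality), 3)

# reset the same seed and draw again
set.seed(123)
third_draw <- sample(1:nrow(airquality), 3)

# the second draw continues from where the first left off,
# so it differs from the first
identical(first_draw, second_draw)

# the third draw starts from the same seed as the first,
# so it reproduces the first
identical(first_draw, third_draw)
```

Only the draw that immediately follows set.seed(123) repeats. Any draw made afterwards without resetting the seed carries on from the generator's current state.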

2.9 Documenting your work using R Markdown

For anyone performing any sort of analysis using a statistical programming language, appropriate documentation and reproducibility of the work is essential to its success and longevity. If your code is not easily obtained or run by others, it is likely to have a very limited impact and lifetime. Learning how to create integrated documents that contain both text and code is critical to providing access to your code and narration of your work.

R Markdown is a package which allows you to create integrated documents containing both formatted text and executed code. It is, in my opinion, one of the best resources available currently for this purpose. This entire book has been created using R Markdown. You can start an R Markdown document in RStudio by installing the rmarkdown package and then opening a new R Markdown document file, which will have the suffix .Rmd.

R Markdown documents always start with a particular heading type called a YAML header, which contains overall information on the document you are creating. Care must be taken with the precise formatting of the YAML header, as it is sensitive to spacing and indentation. Usually a basic YAML header is created for you in RStudio when you start a new .Rmd file. Here is an example.

---
title: "My new document"
author: "Keith McNulty"
date: "25/01/2021"
output: html_document
---

The output part of this header has numerous options, but the most commonly used are html_document, which generates your document as a web page, and pdf_document, which generates your document as a PDF using the open source LaTeX software package. If you wish to create PDF documents you will need to have a version of LaTeX installed on your system. One R package that can do this for you easily is the tinytex package. The function install_tinytex() from this package will install a minimal version of LaTeX which is fine for most purposes.

R Markdown allows you to build a formatted document using many shorthand formatting commands. Here are a few examples of how to format headings and place web links or images in your document:

# My top heading
This section is about this general topic.

## My first sub heading 
To see more information on this sub-topic visit [here](https://my.web.link).

## My second sub heading
Here is a nice picture about this sub-topic.
![](path/to/image)

Code can be written and executed and the results displayed inline using backticks. For example, recalling our chinook_employees dataset from earlier and writing

`r nrow(chinook_employees)`

inline will display 8 in the final document¹⁵. Entire code blocks can be included and executed by using triple-backticks. The following code block:

```{r}
# show the first three rows of chinook_employees
head(chinook_employees, 3)
```

will display this output:

##   EmployeeId FirstName LastName ReportsTo
## 1          1    Andrew    Adams        NA
## 2          2     Nancy  Edwards         1
## 3          3      Jane  Peacock         2

The {} wrapping allows you to specify different languages for your code chunk. For example, if you wanted to run Python code instead of R code you can use {python}. It also allows you to set options for the code chunk display separated by commas. For example, if you want the results of your code to be displayed, but without the code itself being displayed, you can use {r, echo = FALSE}.

The process of compiling your R Markdown code to produce a document is known as ‘knitting.’ To create a knitted document, you simply need to click on the ‘Knit’ button in RStudio that appears above your R Markdown code.
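Knitting can also be started from code rather than from the button, using the render() function in the rmarkdown package. The sketch below is one hedged way to try this: it writes a small, made-up .Rmd file to a temporary location and knits it only if rmarkdown and pandoc are available on your system. The file contents and the object names fence, rmd_path, can_knit and out_path are assumptions for illustration.

```r
# build the triple-backtick chunk delimiter
fence <- strrep("`", 3)

# write a tiny R Markdown file to a temporary location (made-up content)
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"My new document\"",
  "output: html_document",
  "---",
  "",
  "The airquality data set has `r nrow(airquality)` rows.",
  "",
  paste0(fence, "{r, echo = FALSE}"),
  "head(airquality, 3)",
  fence
), rmd_path)

# knit only if rmarkdown and pandoc are available on this system
can_knit <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_knit) {
  # render() returns the path of the knitted output file
  out_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(out_path)
}
```

In everyday interactive work the 'Knit' button is usually more convenient. Calling render() from code can be handy when a document needs to be rebuilt as part of a script.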

If you are not familiar with R Markdown, I strongly encourage you to learn it alongside R and to challenge yourself to write up any practice exercises you take on in this book using R Markdown. Useful cheat sheets and reference guides for R Markdown formatting and commands are available through the Cheatsheets section of the Help menu in RStudio. I also recommend Xie, Dervieux, and Riederer (2020) for a really thorough instruction and reference guide.

2.10 Learning exercises

2.10.1 Discussion questions

  1. Describe the following data types: numeric, character, logical, factor.
  2. Why is a vector known as a homogeneous data structure?
  3. Give an example of a heterogeneous data structure in R.
  4. What is the difference between NA and "NA"?
  5. What operator is used to return named elements of a list and named columns of a dataframe?
  6. Describe some functions that are used to manipulate dataframes.
  7. What is a package and how do you install and use a new package?
  8. Describe what is meant by ‘namespacing’ and why it might be useful.
  9. What is the pipe operator, and why is it popular in R?
  10. What is the difference between an error and a warning in R?
  11. Name some simple plotting functions in base R.
  12. What is R Markdown, and why is it useful to someone performing analysis using programming languages?

2.10.2 Data exercises

  1. Create a character vector called my_names that contains all your first, middle and last names as elements. Calculate the length of my_names.
  2. Create a second numeric vector called which, which corresponds to my_names. The entries should be the position of each name in the order of your full name. Verify that it has the same length as my_names.
  3. Create a dataframe called names, which consists of the two vectors my_names and which as columns. Calculate the dimensions of names.
  4. Create a new dataframe new_names with the which column converted to character type. Verify that your command worked using str().
  5. Load the chinook_customers data set via the onadata package or download it from the internet¹⁶. Calculate the dimensions of chinook_customers and view the first three rows only.
  6. View a statistical summary of all of the columns of chinook_customers. Determine if there are any missing values.
  7. View the subset of chinook_customers for values of SupportRepId equal to 3.
  8. Install and load the package dplyr. Look up the help for the filter() function in this package and try to use it to repeat the task in the previous question.
  9. Write code to find the last name of the customer with the highest CustomerId where the SupportRepId is equal to 4. Count the number of characters in this last name.
  10. Familiarize yourself with the two functions filter() and pull() from dplyr. Use these functions to try to do the same calculation in the previous question using a single unbroken piped command. Be sure to namespace where necessary.
  11. Create a scatter plot using the built-in mtcars dataset with data from the column named mpg plotted on the y axis and data from the column named hp plotted on the x axis.
  12. Using the same mtcars dataset, convert the data in the cyl column to a factor with three levels. Plot a histogram of the count of observations at each of the three cyl levels.
  13. Create a box plot of mpg grouped by cyl.
  14. If you used base plotting functions to answer questions 11-13, try to answer them again using the ggplot2 package. Experiment with different themes and colors.
  15. Knit all of your answers to these exercises into an R Markdown document. Create one version that displays your code and answers, and another that just displays the answers.