This week, we continue with R basics, but this time using real-life datasets, which we will explore and visualise.

However, before we can explore real-life datasets, we first need to discuss some more advanced R constructs. We start with lists before we move on to control structures.

Lists are an extension of vectors and allow values of different types to be collected in a single object. Because of this, lists are often used to store various kinds of input and output data. We will not use lists much for now, but you might encounter them as containers that hold any other data structure when you look for more information on R. Let’s first create a vector of numbers from 1 to 10 with my_vector <- 1:10.

my_vector <- 1:10

Now, let’s create a matrix with numerics from 1 up to 9. Run my_matrix <- matrix(1:9, ncol = 3).

my_matrix <- matrix(1:9, ncol = 3)

R comes preloaded with many example datasets containing real life observations. One of the most commonly used ones is mtcars. As the name tells you, it contains observations about cars. You can look at the first few observations/rows by typing in head(mtcars). Contemplate a bit how cars are described here. The data frame is a good example of using features to describe observations.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Now select the first 10 rows of mtcars. Assign my_df <- mtcars[1:10,].

my_df <- mtcars[1:10,]

Finally, let’s collect the vector, the matrix and the data frame in a list with my_list <- list(my_vector, my_matrix, my_df).

my_list <- list(my_vector, my_matrix, my_df)

Print out my_list.

my_list
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
## 
## [[3]]
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

You now have one big container in my_list! If you would like to access any element, you can use double square brackets. For the first one, try my_list[[1]].

my_list[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
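
By the way, you can also give the list elements names when you create the list, and then access them by name instead of by position. A small sketch, not part of the exercise (my_named_list is just an illustrative name):

my_named_list <- list(vec = my_vector, mat = my_matrix, df = my_df)
my_named_list$vec
##  [1]  1  2  3  4  5  6  7  8  9 10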

I promised to talk about factors, too. Let’s create months <- c(“March”, “April”, “January”, “November”, “January”, “September”, “October”, “September”, “November”, “August”, “January”, “November”, “November”, “February”, “May”, “August”, “July”, “December”, “August”, “August”, “September”, “November”, “February”, “April”). Note that we use " here instead of the equivalent ’.

months <-  c("March","April","January","November","January", "September","October","September","November","August", "January","November","November","February","May","August", "July","December","August","August","September","November","February","April")

Now, create a month factor with months_f <- factor(months).

months_f <- factor(months)

Print it with months_f. As you can see, factors summarize the values in a vector into shorter, easier-to-handle indexes called levels (shown at the bottom of the result).

months_f
##  [1] March     April     January   November  January   September October  
##  [8] September November  August    January   November  November  February 
## [15] May       August    July      December  August    August    September
## [22] November  February  April    
## 11 Levels: April August December February January July March ... September
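
As you can see above, the levels are sorted alphabetically by default. If you would rather have them in calendar order, you can set the levels yourself when creating the factor, for instance with R’s built-in month.name constant. A small sketch, not part of the exercise:

months_cal <- factor(months, levels = month.name)
levels(months_cal)

levels(months_cal) should now list all twelve months in calendar order, including June, which does not occur in our data.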

A very useful function in R is called table, which counts the frequencies of values and organises these into a nice table. Try table(months_f). We will come back to this later when we discuss table as a function to help us explore data in more detail.

table(months_f)
## months_f
##     April    August  December  February   January      July     March 
##         2         4         1         2         3         1         1 
##       May  November   October September 
##         1         5         1         3

Before continuing, I recommend playing a little with the mtcars dataset using the various functions we have met.

http://www.statmethods.net/management/ is a very useful website that uses mtcars extensively. Try some of its code and run some of the examples in your R environment. You can start a play session in swirl with play(). If you would like to return to swirl after you have finished playing, type in nxt().

Next we cover loops and control structures, which are common to all programming environments. This is more for information, as we will not use them in this course. Please note that swirl is not the greatest environment for learning more complicated statements like functions, as you have to type everything more or less on a single line. Normally, you would use indentation to make the code more readable. Check out what that means by asking Wikipedia: https://en.wikipedia.org/wiki/Programming_style#Indentation.

Control structures execute a piece of code based on a condition. You can recognize them in most programming languages by the keyword if. In R they look like if (test_expression) { statement(s) }, which translates as: if the test expression is true, execute the statements.

Let’s try the if-statement. It is not so difficult to understand. Please, define medium <- ‘LinkedIn’.

medium <- 'LinkedIn'

Now, also assign num_views <- 14.

num_views <- 14

Now, type if (medium == ‘LinkedIn’) { print(‘Showing LinkedIn information’) }. What has just happened?

if (medium == 'LinkedIn') { print('Showing LinkedIn information') }
## [1] "Showing LinkedIn information"

Let’s try to confirm whether we are popular with if (num_views > 15) { print(‘You are popular!’) }.

if (num_views > 15) { print('You are popular!') }

You can also combine expressions logically: & stands for the logical and, while | stands for the logical or. Try the logical and, and type in if (num_views > 15 & medium == ‘LinkedIn’) {print(‘You are popular on LinkedIn!’)}. Explain the results!

if (num_views > 15 & medium == 'LinkedIn') { print('You are popular on LinkedIn!')}
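
For comparison, you could also try the logical or. Because medium equals ‘LinkedIn’, the condition below is true even though num_views is only 14, so the message should be printed. A small sketch, not part of the original exercise:

if (num_views > 15 | medium == 'LinkedIn') { print('You are popular or on LinkedIn!') }
## [1] "You are popular or on LinkedIn!"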

Finally, you can also tell the computer to do other work if the condition is not fulfilled. You need to use the keyword else. Run if (num_views > 20) { print(‘You are popular!’) } else { print(‘Try to be more visible!’) }.

if (num_views > 20) { print('You are popular!') } else { print('Try to be more visible!') }
## [1] "Try to be more visible!"

That’s it with respect to control structures. In R you really do not need them very often. Another common programming construct is the loop. In R you need them even less often than if statements. For now, it is enough to simply experience a for-loop that iterates over all elements in a vector. Type in for (l in linkedin) { print(l) }.
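
The linkedin vector is assumed to carry over from an earlier swirl session. If it is no longer in your environment, you can recreate it first; the values below are simply taken from the output of the loop:

linkedin <- c(16, 9, 13, 5, 2, 17, 14)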

for (l in linkedin) { print(l) }
## [1] 16
## [1] 9
## [1] 13
## [1] 5
## [1] 2
## [1] 17
## [1] 14

In R we do not use loops very often. But they can be very useful to perform repeated operations on collections of data, which is something we often want to do. So, for instance, loops could be used to get the square root of each element in a vector or we could use them to calculate the average value of a numeric column in a data frame. In R we have an easier way to do this. We can use the powerful family of apply functions, which iterate over a vector or data frame for us. Check it out on the wonderful site r-bloggers (http://www.r-bloggers.com/using-apply-sapply-lapply-in-r/). Don’t worry if you do not immediately understand it. Most people don’t. It will come with time and practice.
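
To make the comparison concrete, here is a small sketch, not part of the original exercise, that takes the square root of each element of a vector first with a for-loop and then with sapply (my_numbers is just an illustrative name):

my_numbers <- c(1, 4, 9, 16)
# the loop prints each square root one by one
for (n in my_numbers) { print(sqrt(n)) }
# sapply does the iteration for us and returns a vector of the results
sapply(my_numbers, sqrt)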

The function apply works on data frames and matrices. It takes three arguments. The first one is the data frame you would like to work on. We tell apply to traverse row-wise or column-wise with the second argument: 1 stands for row-wise traversal and 2 for column-wise traversal. The third argument is the function. So, for instance, in order to determine the average views for our linkedin and facebook week, we can run apply(my_week_df[c(‘linkedin’,‘facebook’)], 2, mean). mean is an R function to calculate the average. We will discuss it in more detail later.

apply(my_week_df[c('linkedin','facebook')], 2, mean)
## linkedin facebook 
## 10.85714 11.42857
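
If my_week_df is not available in your environment, you can try the same pattern on the preloaded mtcars data frame, for example by computing the column means of mpg and hp (this should give roughly 20.1 and 146.7):

apply(mtcars[c('mpg', 'hp')], 2, mean)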

What do you need to change in order to get the average winning for both your poker and your roulette week?

apply(my_week_df[c('poker', 'roulette')], 2, mean)
##     poker  roulette 
##  28.57143 -29.14286

You can also play with R’s built-in character functions and apply (http://www.statmethods.net/management/functions.html). Try apply(my_week_df[‘days’], 2, nchar). This counts the number of characters for each entry in the days column.

apply(my_week_df['days'], 2, nchar)
##      days
## [1,]    6
## [2,]    7
## [3,]    9
## [4,]    8
## [5,]    6
## [6,]    8
## [7,]    6

How about apply(my_week_df[‘days’], 2, toupper)?

apply(my_week_df['days'], 2, toupper)
##      days       
## [1,] "MONDAY"   
## [2,] "TUESDAY"  
## [3,] "WEDNESDAY"
## [4,] "THURSDAY" 
## [5,] "FRIDAY"   
## [6,] "SATURDAY" 
## [7,] "SUNDAY"

For our final apply example, run apply(my_week_df[1], 2, length).

apply(my_week_df[1], 2, length)
## days 
##    7

The final expression counted the number of rows. This will be of interest to you, because it is one way of finding out how many observations/rows your data contains. With the function nrow(my_week_df), you will get the same result. ncol delivers the number of columns/features.
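
For example, on the preloaded mtcars data frame you should get 32 rows and 11 columns:

nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11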

apply works on data frames. There are several other apply-family functions. You can, for instance, use sapply on a vector of data or a list. Run sapply(c(1,4,9,16), sqrt).

sapply(c(1,4,9,16), sqrt)
## [1] 1 2 3 4

sapply returns a vector. lapply is very similar, but it will return a list rather than a vector. Try lapply(c(1,4,9,16), sqrt).

lapply(c(1,4,9,16), sqrt)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4

We just mentioned another magic word with respect to functions. We said sapply would return a vector. A function does not just take values, but can also return them. In R, the return value is simply the value of the last statement in the function. Check it out under http://www.statmethods.net/management/userfunctions.html.

If you are now wondering why you need all this, you are not alone. ‘Using apply, sapply, lapply in R’ (https://www.r-bloggers.com/using-apply-sapply-lapply-in-r/) has been, for as long as I can remember, among the most viewed articles on R-bloggers. However, if you continue to work with R, you will soon find the apply functions very useful.

Much more important than loops and control structures are user-defined functions in R. Just like you can call head, subset, etc. as so-called built-in functions, in any programming language you can define your own functions. In R, use the following syntax: my_fun <- function(arg1, arg2) { body }. This means that the function my_fun takes two arguments (arg1, arg2) and executes all body statements. Again, please remember to use indentation when programming outside of swirl.

Let’s create a function to add 2 to any number with plus_two <- function(x) { x+2 }.

plus_two <- function(x) { x+2 }

Now let’s use plus_two the same way we use built-in functions. Type plus_two(12). You have just made the computer smarter! It learned how to add 2.

plus_two(12)
## [1] 14

Of course, we can also create a function with two arguments by typing in sum_x_y <- function(x, y) { return (x+y) }. What do you think it does? The return statement defines the value the function gives back.

sum_x_y <- function(x,y) { return (x+y) }

And, we can call the function with sum_x_y(-2, 3).

sum_x_y(-2, 3)
## [1] 1

That’s it with regard to the basics of R. Next, we use our knowledge to analyse a real life social dataset about death penalties and explore how easy it is to plot data with R.

We will work through a detailed example of data exploration and analyse the question of racism with regard to death penalties in the USA. We will use the catdata package, which contains the deathpenalty dataset. A package in R contains lots of functions and data you can reuse. Check out http://www.statmethods.net/interface/packages.html. deathpenalty covers judgements of defendants in cases of multiple murders in Florida between 1976 and 1987. Each case has features describing (a) whether the death penalty was handed out (0 means no, 1 means yes), (b) the race of the defendant and (c) the race of the victim (black is coded as 0, white as 1). Check out the description at http://artax.karlin.mff.cuni.cz/r-help/library/catdata/html/deathpenalty.html
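
In the swirl lesson the dataset is already loaded for you, but if you are working in your own R environment, loading the package and the dataset would look roughly like this (a minimal sketch):

install.packages('catdata')   # only needed once
library(catdata)
data(deathpenalty)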

Back to our real-life dataset on historical death penalties in Florida. deathpenalty is already loaded for you. Normally, however, the first step in data exploration is to load data from outside into R. Unfortunately, I could not find the original reference for the deathpenalty dataset anymore. Instead, I would like you to go now to https://zenodo.org/record/13103#.V8GvBpMrJp8 and download the Vagrant data set as a comma-separated file (extension csv), which is at least as interesting. An explanation of the data can be found at http://openhumanitiesdata.metajnl.com/articles/10.5334/johd.1/. You can open CSV files in Excel or any other spreadsheet programme; if you have one, try opening the file and changing a few values. You can also load spreadsheets directly into R, but it is much easier to use the CSV files directly, as they correspond more or less to the layout of data frames. Check out http://www.statmethods.net/input/importingdata.html. You can find CSV files everywhere on the web, especially in so-called data journals. Look up what they are. They are really novel and exciting.

Now all you have to do is call read.csv with the path to your vagrant file. Try it with vagrant_df <- read.csv(‘PATH-TO-VAGRANT-FILE’) in your R environment. In RStudio you can see the vagrant_df afterwards in the environment pane (top right).

If the CSV file has a url, we can also call it directly with vagrant2_df <- read.csv(file = url(‘https://zenodo.org/record/13103/files/MiddlesexVagrants1777-1786.csv’)) in your R environment.
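
For reference, here are the two calls written out as code; PATH-TO-VAGRANT-FILE is a placeholder for wherever you saved the downloaded file:

vagrant_df <- read.csv('PATH-TO-VAGRANT-FILE')
vagrant2_df <- read.csv(file = url('https://zenodo.org/record/13103/files/MiddlesexVagrants1777-1786.csv'))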

The deathpenalty data set should be pre-loaded in your environment. By typing in head(deathpenalty), we can see the first couple of cases/rows. What types of columns do you see?

head(deathpenalty)
##   DeathPenalty VictimRace DefendantRace Freq
## 1            0          0             0  139
## 2            1          0             0    4
## 3            0          1             0   37
## 4            1          1             0   11
## 5            0          0             1   16
## 6            1          0             1    0

By entering tail(deathpenalty), we can see the last couple of cases/rows.

tail(deathpenalty)
##   DeathPenalty VictimRace DefendantRace Freq
## 3            0          1             0   37
## 4            1          1             0   11
## 5            0          0             1   16
## 6            1          0             1    0
## 7            0          1             1  414
## 8            1          1             1   53

Next, we will try to ask the data a few simple questions. Sometimes the solution will be given. Otherwise, you will have to find it yourself.

With the function mean you can retrieve the average of a vector of numbers. mean(c(1, 2, 3)) would deliver 2. Hopefully, you remember that you can retrieve all values in a column of a data frame with $. Thus, with deathpenalty$Freq you retrieve the whole Freq column from deathpenalty as a vector. You can thus answer what the average frequency of judgements is by entering mean(deathpenalty$Freq). Try it.

mean(deathpenalty$Freq)
## [1] 84.25

mean is very useful to understand data. When something is deemed average, it falls somewhere between the extreme ends of the scale. An average student might have marks falling in the middle of his or her classmates; an average weight is neither unusually light nor heavy. An average item is typical, and not too unlike the others in the group. You might think of it as an exemplar.

median is another function like mean that summarizes a whole dataset by delivering a central tendency. Like mean, it identifies a value that falls in the middle of a set of data. median splits the upper 50% of the data from the lower 50%. It thus delivers the value that occurs halfway through an ordered list of values. How do you get the median frequency of judgements?

median(deathpenalty$Freq)
## [1] 26.5

What is the lowest number of judgements (min)? Assign the value to min_freq, please.

min_freq <- min(deathpenalty$Freq)

What is the highest number of judgements (max)? Assign the value to max_freq, please.

max_freq <- max(deathpenalty$Freq)

What kind of case combinations had the lowest numbers of judgements? Hopefully, you remember that you can use subset to subset the rows in a data frame. In this case, you will want to look at those rows, for which Freq has the lowest value. You can get the results you want with subset(deathpenalty, Freq==min_freq).

subset(deathpenalty, Freq==min_freq)
##   DeathPenalty VictimRace DefendantRace Freq
## 6            1          0             1    0

Ok, then. Which case combinations had the highest number of judgements? You just need to change the last subset statement a little bit.

subset(deathpenalty, Freq==max_freq)
##   DeathPenalty VictimRace DefendantRace Freq
## 7            0          1             1  414

min and max are measures of the diversity or spread of data. Knowing about the spread provides a sense of the data’s highs and lows, and whether most values are like or unlike the mean and median. The span between the minimum and maximum value is known as the range. Try range(deathpenalty$Freq).

range(deathpenalty$Freq)
## [1]   0 414

The range function returns both the minimum and maximum value. With the diff function you could get the absolute difference. Do you know how? The first and third quartiles, Q1 and Q3, refer to the value below or above which one quarter of the values are found. Along with the median (Q2), the quartiles divide a dataset into four portions, each with the same number of values. Check out quantile(deathpenalty$Freq).

quantile(deathpenalty$Freq)
##     0%    25%    50%    75%   100% 
##   0.00   9.25  26.50  74.50 414.00
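
In case you are still wondering about the diff question above: one way, shown in this small sketch, is to apply diff to the result of range, which gives the span of 414:

diff(range(deathpenalty$Freq))
## [1] 414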

There are two more really useful data exploration functions in R. The first one is summary. Try summary(deathpenalty) and describe what is returned.

summary(deathpenalty)
##   DeathPenalty   VictimRace  DefendantRace      Freq       
##  Min.   :0.0   Min.   :0.0   Min.   :0.0   Min.   :  0.00  
##  1st Qu.:0.0   1st Qu.:0.0   1st Qu.:0.0   1st Qu.:  9.25  
##  Median :0.5   Median :0.5   Median :0.5   Median : 26.50  
##  Mean   :0.5   Mean   :0.5   Mean   :0.5   Mean   : 84.25  
##  3rd Qu.:1.0   3rd Qu.:1.0   3rd Qu.:1.0   3rd Qu.: 74.50  
##  Max.   :1.0   Max.   :1.0   Max.   :1.0   Max.   :414.00

summary returns all those values you just tried to find yourself! Oh, well. There is another function, str, which returns the structure of a data frame. It is very useful to find out about the columns and features of a dataset. Try str(deathpenalty).

str(deathpenalty)
## 'data.frame':    8 obs. of  4 variables:
##  $ DeathPenalty : int  0 1 0 1 0 1 0 1
##  $ VictimRace   : int  0 0 1 1 0 0 1 1
##  $ DefendantRace: int  0 0 0 0 1 1 1 1
##  $ Freq         : int  139 4 37 11 16 0 414 53

These summary statistics work only with what in statistics is called numerical data, which is basically anything measured in numbers. Alternatively, if data is represented by a set of categories, it is called categorical or nominal. In our dataset, we do not really have exciting categorical data, because of the way the data set is constructed. In R, we can compare categorical features with the function table. Try table(deathpenalty$VictimRace) and you will see that both victim races are equally represented.

table(deathpenalty$VictimRace)
## 
## 0 1 
## 4 4

With prop.table(table(deathpenalty$VictimRace)), you could confirm the proportional representation of the values in VictimRace.
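
If you try it, the call and its result should look roughly like this, confirming the equal split (a small sketch of the expected output):

prop.table(table(deathpenalty$VictimRace))
## 
##   0   1 
## 0.5 0.5

We will now move on.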

We are only getting warm. Let’s make this all a little bit more complicated. We want to know how many black or white people received the death penalty. Remember that a black person is represented by 0 and a white person by 1. A death penalty was handed out if DeathPenalty equals 1, and 0 otherwise. Let’s first create a new data frame black_and_deathpenalty, which contains black defendants who received the death penalty. We need the subset function again. Try black_and_deathpenalty <- subset(deathpenalty, DefendantRace == 0 & DeathPenalty == 1). Remember that with & we join conditions with a logical ‘and’ - in this case black defendant and death penalty.

black_and_deathpenalty <- subset(deathpenalty, DefendantRace == 0 & DeathPenalty == 1)

Similarly, we can get the white defendants who received the death penalty. Get the subset and assign it to white_and_deathpenalty.

white_and_deathpenalty <- subset(deathpenalty, DefendantRace == 1 & DeathPenalty == 1)

Overall, we want to compare the likelihood of white and black defendants to receive the death penalty. In order to achieve this, the next step is to find out about the overall number of black people who received the death penalty. You can get this with the sum function, which we have already met. Use n_black_deathpenalty <- sum(black_and_deathpenalty$Freq).

n_black_deathpenalty <- sum(black_and_deathpenalty$Freq)

Next, find the number of white defendants who were given the death penalty. Assign it to a new variable n_white_deathpenalty.

n_white_deathpenalty <- sum(white_and_deathpenalty$Freq)

What is therefore the proportion of black people receiving the death penalty? Remember you can get this by dividing the number of black defendants with the death penalty by the total number of defendants with the death penalty.

n_black_deathpenalty / (n_black_deathpenalty + n_white_deathpenalty)
## [1] 0.2205882

That’s quite a low percentage. Do we then not need to worry about racial biases in the judgements? What other information do we need to come to such a conclusion? Let’s find out next.

Proportionally, how many of the death penalties handed out to black people were for killing a white person? The expression subset(black_and_deathpenalty, VictimRace == 1)$Freq / n_black_deathpenalty is a bit complicated but gets the right result. With $Freq we select only the Freq column from the subset. Neat, no? You should get back a number between 0 and 1.

subset(black_and_deathpenalty, VictimRace == 1)$Freq / n_black_deathpenalty
## [1] 0.7333333

Finally, how likely is it that a white person killing a black person will receive the death penalty?

subset(white_and_deathpenalty, VictimRace == 0)$Freq / n_white_deathpenalty
## [1] 0

Out of 15 death penalties handed out to black people, 73% were for killing a white person, while none of the 53 death penalties for white people in Florida were given for killing a black person. I hope you can see that there are many powerful functions to explore data directly in R. Another really good way to explore data is not to ask direct questions but to summarize it with graphs and visualisations. Visualisations and graphs are easy to do in R and are one of the main reasons why R is so popular. In fact, many people might use another, more complex programming language to analyse data but then employ R to create visualisations.

For now, we will just try out some of the built-in graphs of R. Simply type barplot(deathpenalty$Freq, xlab = ‘Combination’, ylab = ‘Frequency’, names.arg = c(1,2,3,4,5,6,7,8)) and see what happens. Can you explain the graph? What do the arguments do? Your friend the Internet knows of course and you could ask it.

barplot(deathpenalty$Freq, xlab = 'Combination', ylab = 'Frequency', names.arg = c(1,2,3,4,5,6,7,8))

Now, how about boxplot(deathpenalty$Freq, ylab = ‘Frequency’)? What do you see? Ask the Internet!

boxplot(deathpenalty$Freq, ylab = 'Frequency')

The boxplot displays the centre and spread in a format that allows you to quickly get a sense of the behaviour of the data. The median is denoted by the dark line, while the box around it stands for the spread. The boxplot shows one outlier. Can you identify it in the deathpenalty dataset?

Let’s check what we have learned so far.

What is the expression to plot a boxplot of the linkedin views of my_week_df?

  1. boxplot(deathpenalty$Freq, ylab = ‘Frequency’)
  2. plot(my_week_df$linkedin, ylab = ‘Frequency’)
  3. boxplot(my_week_df$linkedin, ylab = ‘Frequency’)
  4. boxplot(mtcars$hp)

boxplot(my_week_df$linkedin, ylab = 'Frequency')

What does the lapply function return?

  1. A vector
  2. A data frame
  3. A list
  4. Nothing

A list

Apply the median function to the linkedin and facebook columns of my_week_df. As a hint, go back and find the expression for calculating the mean of these columns and use median as the function instead.

apply(my_week_df[c('linkedin','facebook')], 2, median)
## linkedin facebook 
##       13       13

Please, select the second element of my_list.

my_list[[2]]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Finally, something for you to talk about in your working group. Explore the various operations to manage data in R, as explained in http://www.statmethods.net/management/index.html.

Let’s move on to the second part of today.