Introduction to R 1.2

The final new concept for today is the matrix. It has little to do with the movie but is simply a two-dimensional vector. Up to now we only had one dimension, but why not add one more? BTW, we get so-called arrays once we move to more than 2 dimensions but that really goes too far right now.

Let’s try this and move on from your gambling. We will now introduce an example from social analytics brought to us by dataquest.com, which we will come back to later in the course. You have a LinkedIn account and a Facebook account and want to find out which one gets more views and is more successful. You collected the views per day for a particular week in two vectors. First, type in linkedin <- c(16, 9, 13, 5, 2, 17, 14).

linkedin <- c(16, 9, 13, 5, 2, 17, 14)

And now create facebook <- c(17, 7, 5, 16, 8, 13, 14).

facebook <- c(17, 7, 5, 16, 8, 13, 14)

Now, we want to create a single view over these two vectors. We use the matrix function to create a matrix with 2 rows: one for the LinkedIn views and one for the Facebook ones. Type in views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE). The byrow = TRUE argument tells R to fill the matrix row by row, so that each vector ends up in its own row. Again, if you are interested, why not search the web for the matrix function? For the time being you can also do without further knowledge about exactly how a matrix works. Just try to remember that matrices exist.

views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE)

Print out views.

views

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]   16    9   13    5    2   17   14
## [2,]   17    7    5   16    8   13   14
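To see what byrow = TRUE actually changes, here is a small sketch that builds the matrix three ways. The names views_by_column and views_named are made up for this illustration and are not used elsewhere in the course.

```r
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)

# byrow = TRUE fills the matrix one row at a time, so each vector becomes a row
views <- matrix(c(linkedin, facebook), nrow = 2, byrow = TRUE)

# Without byrow, R fills column by column and the two accounts get mixed up
views_by_column <- matrix(c(linkedin, facebook), nrow = 2)

# rbind() stacks the vectors as rows and uses their names as row labels
views_named <- rbind(linkedin, facebook)

dim(views)             # number of rows, then number of columns
views_by_column[1, ]   # no longer the LinkedIn week
rownames(views_named)  # row labels taken from the vector names
```

If you like to see which row belongs to which account, the rbind version may be the more readable one.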

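We can then ask a couple of good questions against the matrix views without having to reference the vectors it is made of. To find out on which days we had 13 views for either LinkedIn or Facebook, we type views == 13. The == is the equality operator. It compares every element of the matrix with 13 and returns TRUE or FALSE for each one. Do not confuse it with a single =, which assigns a value.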

views == 13

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
## [1,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE

When are views less than or equal to 14? Try views <= 14.

views <= 14

##       [,1] [,2] [,3]  [,4] [,5]  [,6] [,7]
## [1,] FALSE TRUE TRUE  TRUE TRUE FALSE TRUE
## [2,] FALSE TRUE TRUE FALSE TRUE  TRUE TRUE
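A TRUE/FALSE matrix is not always easy to read. As a possible follow-up, the sketch below counts the matching cells and asks R for their positions. The names n_thirteen and positions are only for illustration.

```r
views <- matrix(c(16, 9, 13, 5, 2, 17, 14,
                  17, 7, 5, 16, 8, 13, 14), nrow = 2, byrow = TRUE)

# TRUE counts as 1 and FALSE as 0, so sum() counts the matching cells
n_thirteen <- sum(views == 13)

# which(..., arr.ind = TRUE) returns the row and column of every TRUE cell
positions <- which(views == 13, arr.ind = TRUE)

n_thirteen
positions
```

Each row of positions describes one match: the row column tells you the account (1 for LinkedIn, 2 for Facebook) and the col column tells you the day.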

How often does facebook equal or exceed linkedin times two? This is already quite an advanced expression in R. Try it with sum(facebook >= linkedin * 2). Take a moment to think about the components of this expression. You may want to take a piece of paper and a pen and write down all the components.

sum(facebook >= linkedin * 2)

## [1] 2
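If you got stuck, here is one way to take the expression apart step by step. The intermediate names doubled, comparison and count are introduced only for this sketch.

```r
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)

# Step 1: multiply every LinkedIn value by 2
doubled <- linkedin * 2

# Step 2: compare the two vectors element by element, giving TRUE or FALSE per day
comparison <- facebook >= doubled

# Step 3: sum() treats TRUE as 1 and FALSE as 0, so it counts the TRUE days
count <- sum(comparison)

doubled
comparison
count
```

Only the fourth and fifth day are TRUE, which is where the result of 2 comes from.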

Similar to vectors, we can access each element of a matrix, but this time we of course need 2 dimensions. views[1,2] will select the first row’s second element. Try it.

views[1,2]

## [1] 9

Overall, the order is row first and then column. Try views[2,5].

views[2,5]

## [1] 8
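The same [row, column] notation also returns whole rows or columns if you leave one of the two slots empty. A short sketch, with variable names chosen just for this example:

```r
views <- matrix(c(16, 9, 13, 5, 2, 17, 14,
                  17, 7, 5, 16, 8, 13, 14), nrow = 2, byrow = TRUE)

# Empty column slot: the whole first row (the LinkedIn week)
linkedin_week <- views[1, ]

# Empty row slot: the whole third column (both accounts on day 3)
day_three <- views[, 3]

# A range of columns gives a smaller matrix (here the last two days)
last_two_days <- views[, 6:7]

linkedin_week
day_three
last_two_days
```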

The most important structure in R for storing and processing data is the data frame. Just like matrices, data frames have rows and columns, but each column can hold a different type of variable. Think about it! This way we can record any observation in the world, because any observation has different attributes we associate with it. For instance, flowers can be of different types and colours. With data frames we can record each observed flower in a row and use the columns for the various features we observe, like colour, type, etc. This way the whole world is for us to record in data frames!

Let’s return to the records of our week and create a data frame to hold all its glorious details.

In order to get our week together into a single data frame, type in my_week_df <- data.frame(days, linkedin, facebook, poker, roulette, happy). This combines all our vectors into the data frame my_week_df. Each row is a day in your life.

my_week_df  <- data.frame(days, linkedin, facebook, poker, roulette, happy)
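The vectors days, poker, roulette and happy come from the earlier part of this introduction. In case they are no longer in your environment, the sketch below recreates them. Their values are reconstructed from the output printed in this section, so treat them as an assumption rather than the original definitions.

```r
# Values reconstructed from the printed output in this section (assumption)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
linkedin <- c(16, 9, 13, 5, 2, 17, 14)
facebook <- c(17, 7, 5, 16, 8, 13, 14)
poker <- c(140, -50, 20, -120, 240, -50, 20)
roulette <- c(-24, -50, 100, -350, 10, 10, 100)
happy <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)

# Each vector becomes one column; the column names are taken from the vector names
my_week_df <- data.frame(days, linkedin, facebook, poker, roulette, happy)

# str() lists every column together with its type
str(my_week_df)
```

The output of str shows what sets a data frame apart from a matrix: text, numbers and TRUE/FALSE values sit side by side in one table.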

Just as with matrices, we can select rows and columns with the operator [row, column]. Select row 1, column 2 with my_week_df[1,2].

my_week_df[1,2]

## [1] 16

You can also select entire rows/observations by simply leaving the column selector empty. Try to select the second row with my_week_df[2,].

my_week_df[2,]

##      days linkedin facebook poker roulette happy
## 2 Tuesday        9        7   -50      -50  TRUE

Do you remember how to select multiple elements in a vector? It is similar for data frames. In order to select rows 2 to 4, type in my_week_df[2:4,].

my_week_df[2:4,]

##        days linkedin facebook poker roulette happy
## 2   Tuesday        9        7   -50      -50  TRUE
## 3 Wednesday       13        5    20      100 FALSE
## 4  Thursday        5       16  -120     -350  TRUE

Any idea how to select the third column rather than a row? Try it by yourself!

my_week_df[,3]

## [1] 17  7  5 16  8 13 14

Do you know how to select the third and fourth columns?

my_week_df[,3:4]

##   facebook poker
## 1       17   140
## 2        7   -50
## 3        5    20
## 4       16  -120
## 5        8   240
## 6       13   -50
## 7       14    20
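You may have noticed that selecting one column printed a plain vector, while selecting two columns printed a small data frame. The sketch below makes that difference visible. It uses a reduced copy of my_week_df with only the first four columns, with values taken from the output above.

```r
# Reduced copy of the week, values taken from the printed output (assumption)
my_week_df <- data.frame(
  days = c("Monday", "Tuesday", "Wednesday", "Thursday",
           "Friday", "Saturday", "Sunday"),
  linkedin = c(16, 9, 13, 5, 2, 17, 14),
  facebook = c(17, 7, 5, 16, 8, 13, 14),
  poker = c(140, -50, 20, -120, 240, -50, 20)
)

# A single column is simplified to a plain vector
one_column <- my_week_df[, 3]
is.data.frame(one_column)

# Several columns stay a data frame
two_columns <- my_week_df[, 3:4]
is.data.frame(two_columns)

# drop = FALSE keeps the data frame structure even for a single column
one_column_df <- my_week_df[, 3, drop = FALSE]
is.data.frame(one_column_df)
```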

We can also select columns by name. When we created the data frame, the columns were automatically given the names of the vectors they were made of. What do you see when you type my_week_df[,'happy']? It shows your mixed emotions over a week of gambling.

my_week_df[,'happy']

## [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

Selecting columns by their name is something we will do rather frequently in R. This is why there is a shortcut using the dollar sign. Try happy_vector <- my_week_df$happy to select the happy column and assign it to happy_vector.

happy_vector <- my_week_df$happy
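To convince yourself that the different notations really return the same thing, you could compare them directly. This sketch uses a reduced two-column copy of my_week_df, and the by_* names are made up for the comparison.

```r
# Reduced copy with only the columns needed here (values from the output above)
my_week_df <- data.frame(
  days = c("Monday", "Tuesday", "Wednesday", "Thursday",
           "Friday", "Saturday", "Sunday"),
  happy = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

by_name <- my_week_df[, "happy"]     # [row, column] with a column name
by_dollar <- my_week_df$happy        # the dollar shortcut
by_brackets <- my_week_df[["happy"]] # double square brackets, another common form

# identical() returns TRUE only if both objects are exactly the same
identical(by_name, by_dollar)
identical(by_dollar, by_brackets)
```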

Say we would like to select all the days on which we were happy. This is easy in R with happy_days_df <- my_week_df[my_week_df$happy,]. Try it!

happy_days_df <- my_week_df[my_week_df$happy,]

Take some time and look at the expression happy_days_df <- my_week_df[my_week_df$happy,]. What kind of parts can you identify? How are the rows/days selected when you were happy? Finally, try and click on happy_days_df in your environment panel. What do you see?
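One way to read the expression is in two steps: first build a TRUE/FALSE vector with one entry per row, then hand it to the row slot of [row, column]. The sketch below spells this out on a reduced copy of my_week_df. The name keep is introduced only for this illustration.

```r
# Reduced copy with only the columns needed here (values from the output above)
my_week_df <- data.frame(
  days = c("Monday", "Tuesday", "Wednesday", "Thursday",
           "Friday", "Saturday", "Sunday"),
  happy = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Step 1: one TRUE/FALSE value per row
keep <- my_week_df$happy

# Step 2: rows with TRUE are kept, rows with FALSE are dropped;
# the empty column slot means "all columns"
happy_days_df <- my_week_df[keep, ]

which(keep)    # the row numbers that were kept
happy_days_df
```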

To be honest, I personally found the last expression a bit difficult to understand at first. Fortunately, R has another easy way to select rows from a data frame with the subset function. If you would like to select all the days/rows when you had more views on LinkedIn than Facebook, you can simply type small_my_week_df <- subset(my_week_df, subset = linkedin > facebook). Try it!

small_my_week_df <- subset(my_week_df, subset = linkedin > facebook)

Print out small_my_week_df.

small_my_week_df

##        days linkedin facebook poker roulette happy
## 2   Tuesday        9        7   -50      -50  TRUE
## 3 Wednesday       13        5    20      100 FALSE
## 6  Saturday       17       13   -50       10  TRUE

To select the days when my poker result was exactly the same as my roulette result, try subset(my_week_df, subset = poker == roulette).

subset(my_week_df, subset = poker == roulette)

##      days linkedin facebook poker roulette happy
## 2 Tuesday        9        7   -50      -50  TRUE
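The subset function can do a little more than shown so far. As a sketch, the example below combines two conditions with & and uses the select argument to keep only some columns. It works on a reduced copy of my_week_df, and the names popular_happy_df and same_rows_df are made up for this example.

```r
# Reduced copy with only the columns needed here (values from the output above)
my_week_df <- data.frame(
  days = c("Monday", "Tuesday", "Wednesday", "Thursday",
           "Friday", "Saturday", "Sunday"),
  linkedin = c(16, 9, 13, 5, 2, 17, 14),
  facebook = c(17, 7, 5, 16, 8, 13, 14),
  happy = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# & means "and": both conditions must be TRUE for a row to be kept;
# select picks the columns to return
popular_happy_df <- subset(my_week_df,
                           subset = linkedin > facebook & happy,
                           select = c(days, linkedin, facebook))
popular_happy_df

# The same rows written with square brackets, for comparison
same_rows_df <- my_week_df[my_week_df$linkedin > my_week_df$facebook &
                             my_week_df$happy, ]
same_rows_df
```

Inside subset you can write the column names directly, whereas the bracket version needs my_week_df$ in front of each one.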

That’s almost it for data frames. I promise they get more interesting once we start working with real datasets. One more thing you often want to do is to sort a data frame according to one or more of its columns. We have another function for that called order. First find the correct order of all your facebook views by typing in ordered_positions <- order(my_week_df$facebook, decreasing=TRUE). What do you think the decreasing argument does?

ordered_positions <- order(my_week_df$facebook, decreasing=TRUE)

Then, use the ordered_positions to sort the data frame with largest_first_df <- my_week_df[ordered_positions, ].

largest_first_df <- my_week_df[ordered_positions, ]

Now look at the first rows of the sorted data frame with head(largest_first_df). By default, head shows the first six rows.

head(largest_first_df)

##       days linkedin facebook poker roulette happy
## 1   Monday       16       17   140      -24 FALSE
## 4 Thursday        5       16  -120     -350  TRUE
## 7   Sunday       14       14    20      100 FALSE
## 6 Saturday       17       13   -50       10  TRUE
## 5   Friday        2        8   240       10 FALSE
## 2  Tuesday        9        7   -50      -50  TRUE
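To explore what order returns, the sketch below looks at the positions themselves and at sorting by more than one column. It uses a reduced copy of my_week_df, and the names ascending, descending and by_poker_then_linkedin are only for illustration.

```r
# Reduced copy with only the columns needed here (values from the output above)
my_week_df <- data.frame(
  days = c("Monday", "Tuesday", "Wednesday", "Thursday",
           "Friday", "Saturday", "Sunday"),
  linkedin = c(16, 9, 13, 5, 2, 17, 14),
  facebook = c(17, 7, 5, 16, 8, 13, 14),
  poker = c(140, -50, 20, -120, 240, -50, 20)
)

# order() returns row positions, not the sorted values themselves
ascending <- order(my_week_df$facebook)                      # smallest first
descending <- order(my_week_df$facebook, decreasing = TRUE)  # largest first

# A second column breaks ties: poker has repeated values,
# so rows with equal poker results are ordered by linkedin
by_poker_then_linkedin <- order(my_week_df$poker, my_week_df$linkedin)

ascending
descending
my_week_df[by_poker_then_linkedin, ]
```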

Let’s check what we have learned now.

If I want to find out on which days my LinkedIn profile had more views than my Facebook one, which expression do I use?

linkedin < facebook
facebook < linkedin
linkedin = facebook
linkedin == facebook

facebook < linkedin

What is a data frame in R?

A smaller vector
A movie with Brad Pitt
Data structures that can hold different types of variables in each of their columns
My secret love

Data structures that can hold different types of variables in each of their columns

Do you know how to select the second row and third column of my_week_df? Type in the answer.

my_week_df[2,3]

## [1] 7

Please, select a subset of my_week_df where you won more in poker than roulette. Type in the answer.

subset(my_week_df, subset = poker > roulette)

##       days linkedin facebook poker roulette happy
## 1   Monday       16       17   140      -24 FALSE
## 4 Thursday        5       16  -120     -350  TRUE
## 5   Friday        2        8   240       10 FALSE

Finally, something for you to talk about in your working group. Check out http://www.r-tutor.com/r-introduction and see how much you understand of the basic concepts mentioned there.

That’s it for the most important concepts around data frames in R. Next week we first cover a few things we do not really need for the time being but that are good to know nevertheless. You will need them for reading some of the code you find online. We start with yet another data type called lists. Afterwards, we look at factors.