In this session, we will learn about the grammar of graphics, which Leland Wilkinson introduced in his book The Grammar of Graphics (Springer, New York, 2005). We will roughly follow the tutorial at http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html, although that tutorial has many more examples and graphs than we will work with. Later in the session, we will apply our knowledge to data from museums.
The grammar of graphics defines independent building blocks for the things we want to do with graphs. By combining these blocks, we can build a very wide range of visualisations.
R has an excellent package called ggplot2 that follows the grammar of graphics. Compared to the base plotting functions we met earlier, it might seem quite verbose at first, but it is very flexible, supports multiple themes that define the appearance of the graphs, and overall does an excellent job of producing highly professional graphs. Load the package with library(ggplot2).
library(ggplot2)
ggplot graphs consist of three distinct elements. (1) The geometry defines the type of graph you would like to produce, such as a scatter plot or a line graph. (2) The graph’s aesthetics are the things you can see, such as the colour or the position of elements. (3) Finally, scales define how data values are translated into aesthetic values, for example which range an axis covers or which colour stands for which category.
Now, let’s compare the standard plots with the new ggplot2 package. We use the database of prisoners in the Lodz Ghetto. The ghetto (in German Ghetto Litzmannstadt) was a World War II ghetto established by the Nazi German authorities for Polish Jews and Roma following the 1939 invasion of Poland. It was the second-largest ghetto in all of German-occupied Europe after the Warsaw Ghetto. Situated in the city of Lodz, and originally intended as a preliminary step upon a more extensive plan of creating the Judenfrei province of Warthegau, the ghetto was transformed into a major industrial centre, manufacturing much needed war supplies for Nazi Germany and especially for the German Army. The number of people incarcerated in it was augmented further by the Jews deported from the Reich territories. Because of its remarkable productivity, the ghetto managed to survive until August 1944. In the first two years, it absorbed almost 20,000 Jews from liquidated ghettos in nearby Polish towns and villages, as well as 20,000 more from the rest of German-occupied Europe. After the wave of deportations to Chelmno death camp beginning in early 1942, and in spite of a stark reversal of fortune, the Germans persisted in eradicating the ghetto and transported the remaining population to Auschwitz and Chelmno extermination camps, where most were murdered upon arrival. It was the last ghetto in occupied Poland to be liquidated. A total of 204,000 Jews passed through it; but only 877 remained hidden when the Soviets arrived. About 10,000 Jewish residents of Lodz, who used to live there before the invasion of Poland, survived the Holocaust elsewhere. Check the records of prisoners in Lodz with head(lodz).
head(lodz)
## # A tibble: 6 x 22
## ID Given Surname Gender ReligionRecorded Profession SecondOccupation
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2 Frym… Zlocze… Female <NA> Handschuh… <NA>
## 2 4 Abram Zlocze… Male <NA> "Sch\xcc_… Bote
## 3 6 Chana Zlot Female <NA> Schneider… <NA>
## 4 8 Dres… Zlota Female <NA> "W\xcc_sc… <NA>
## 5 10 Fajga Zlota Female <NA> Schneider… <NA>
## 6 12 Malka Zlota Female <NA> Schneider… <NA>
## # ... with 15 more variables: BirthYear <int>, BirthCity <chr>,
## # BirthCountry <chr>, WorkingSinceYear <int>, IDCardYear <int>,
## # ResidenceCity <chr>, ResidenceCountry <chr>, DeathYear <int>,
## # Age <int>, WorkerNumber <chr>, Photo <chr>, CardNumber <chr>,
## # FactoryNumber <chr>, FactoryName <chr>, FirstNameofParents <chr>
Run a summary of lodz to get more information. That is quite a few data records.
summary(lodz)
## ID Given Surname Gender
## Min. : 2 Length:15128 Length:15128 Length:15128
## 1st Qu.: 7726 Class :character Class :character Class :character
## Median :15404 Mode :character Mode :character Mode :character
## Mean :15555
## 3rd Qu.:23154
## Max. :32760
##
## ReligionRecorded Profession SecondOccupation BirthYear
## Length:15128 Length:15128 Length:15128 Min. :1862
## Class :character Class :character Class :character 1st Qu.:1901
## Mode :character Mode :character Mode :character Median :1912
## Mean :1910
## 3rd Qu.:1922
## Max. :1944
## NA's :670
## BirthCity BirthCountry WorkingSinceYear IDCardYear
## Length:15128 Length:15128 Min. :1939 Min. :1904
## Class :character Class :character 1st Qu.:1943 1st Qu.:1942
## Mode :character Mode :character Median :1943 Median :1942
## Mean :1943 Mean :1942
## 3rd Qu.:1943 3rd Qu.:1943
## Max. :1947 Max. :1944
## NA's :10591 NA's :2883
## ResidenceCity ResidenceCountry DeathYear Age
## Length:15128 Length:15128 Min. :1941 Min. : 1.00
## Class :character Class :character 1st Qu.:1944 1st Qu.:16.00
## Mode :character Mode :character Median :1944 Median :29.00
## Mean :1944 Mean :30.89
## 3rd Qu.:1944 3rd Qu.:43.75
## Max. :1952 Max. :73.00
## NA's :13869 NA's :14858
## WorkerNumber Photo CardNumber
## Length:15128 Length:15128 Length:15128
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## FactoryNumber FactoryName FirstNameofParents
## Length:15128 Length:15128 Length:15128
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
We have already used the most basic of all plots, the histogram, which consists of rectangles whose area is proportional to the frequency of a variable. Let’s look at the birth years of prisoners. In base R, this is quickly done with hist() using lodz$BirthYear, xlab = 'Birth Year' and main = 'Lodz Ghetto Prisoners'.
hist(lodz$BirthYear, xlab = 'Birth Year', main = 'Lodz Ghetto Prisoners')
With ggplot, creating a simple histogram is slightly more complicated. Type in ggplot(lodz, aes(x = BirthYear)) + geom_histogram().
ggplot(lodz, aes(x = BirthYear)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The ggplot version seems more complicated than the base version, but the plot already looks better, and you might have noticed that the histogram uses narrower bins, which gives a more fine-grained overview of the data. We can easily plot a different variable, too, by simply changing the aesthetics. Run ggplot(lodz, aes(x = DeathYear)) + geom_histogram() to check the distribution of the year of death. Please note that many more rows are removed, as this column contains many more missing values. Also, the formatting of the x-axis can be improved. Something for later.
ggplot(lodz, aes(x = DeathYear)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
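The stat_bin() message suggests picking a binwidth, and whole years are a natural choice for this kind of data. The following is a minimal sketch with made-up death years rather than the real lodz records; the names toy and p_hist, the choice of binwidth = 1 and the axis breaks are our assumptions about what reads well, not settings from the original analysis.

```r
library(ggplot2)

# Made-up death years standing in for lodz$DeathYear (one value is missing).
toy <- data.frame(DeathYear = c(1941, 1942, 1942, 1943, 1944, 1944, 1944, NA))

p_hist <- ggplot(toy, aes(x = DeathYear)) +
  # One bar per calendar year instead of the default 30 bins;
  # na.rm = TRUE drops the missing value without a warning.
  geom_histogram(binwidth = 1, na.rm = TRUE) +
  # Whole years as axis breaks instead of fractional values.
  scale_x_continuous(breaks = 1941:1944)

p_hist
```

With real data you would replace toy with lodz and choose the breaks to match the range of years you want to show.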
How does ggplot work? The grammar of graphics uses independent building blocks that together form the graph. In our case, the blocks are separated by the plus sign. The first building block is therefore ggplot, which starts the plot. We tell R to use the lodz data frame with ggplot and to use the variable named in aes() (BirthYear or DeathYear above) for the x-axis. If you only entered ggplot(), however, R would not know what to draw and would show nothing but an empty panel. We need to add at least one geometric object (geom) to draw the plot. These objects can be points (geom_point for scatter plots, etc.), lines (geom_line for trends, etc.) or many other objects.
A plot must have at least one geom, but there is no upper limit apart from what you can sensibly represent in a single graphical space. You add geoms with + just like any other element in ggplot’s grammar of graphics. Type in help.search('geom_', package = 'ggplot2') to display all the geometric object alternatives.
help.search('geom_', package = 'ggplot2')
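If the help search returns more than you can comfortably scan, another way to get an overview is to ask R for every exported ggplot2 object whose name starts with geom_. This is only a sketch and assumes ggplot2 is installed and attached; the variable name geoms is ours.

```r
library(ggplot2)

# List every object exported by the attached ggplot2 package
# whose name starts with "geom_".
geoms <- ls("package:ggplot2", pattern = "^geom_")

length(geoms)    # how many geoms this ggplot2 version offers
head(geoms, 10)  # the first few, in alphabetical order
```

The exact number depends on the installed ggplot2 version, so do not be surprised if your list differs from a colleague's.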
As seen, histograms are used to explore a single variable and its frequency. However, we often want to compare the distributions of two or more variables. For instance, we are interested in the relationship between the birth year and the death year of the prisoners. Maybe the data shows that there is a relationship? Or maybe not. Let’s try. To this end, we define the next type of plot, called a scatterplot, based on the points geom. Enter ggplot(lodz, aes(x = BirthYear, y = DeathYear)) + geom_point(). Switch back to the Plots tab in the bottom right window of RStudio to see the output.
ggplot(lodz, aes(x = BirthYear, y = DeathYear)) + geom_point()
We have not really got an answer to the question of whether there is a relationship, as the plot contains too many points. The graph shows how difficult it is to plot even this fairly small real-world dataset. Do you remember our brief discussion of ‘bad data’ in an earlier session? In the lodz data, many values seem to be missing, which is why we get a warning from ggplot. There also seem to be some suspicious outliers. If points in a scatterplot lie this close together, there are various options to correct this. The easiest is to make the points more transparent using the alpha parameter. Furthermore, we would like to change the colour of the points in the next plot. Simply type in ggplot(lodz, aes(x = BirthYear, y = DeathYear)) + geom_point(color='red', alpha=0.3).
ggplot(lodz, aes(x = BirthYear, y = DeathYear)) + geom_point(color='red', alpha=0.3)
Better, as we can now clearly identify where points overlap. Please be aware that variables are mapped to aesthetics inside the aes() function, while fixed aesthetics are set outside the aes() call. This is why the fixed colour red and the alpha transparency parameter are set outside aes(). This is important, and I myself still confuse the two sometimes.
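The difference between mapping and setting an aesthetic is easiest to see side by side. Here is a minimal sketch with a tiny made-up data frame in place of lodz; the names toy, p_fixed, p_mapped and p_slip are ours and only serve the illustration.

```r
library(ggplot2)

# Toy data standing in for lodz (values are made up for illustration).
toy <- data.frame(
  BirthYear = c(1900, 1910, 1920, 1930),
  DeathYear = c(1942, 1943, 1944, 1944),
  Gender    = c("Female", "Male", "Female", "Male")
)

# Fixed aesthetic: set outside aes(), every point is red, no legend.
p_fixed <- ggplot(toy, aes(x = BirthYear, y = DeathYear)) +
  geom_point(color = "red", alpha = 0.3)

# Mapped aesthetic: set inside aes(), the colour depends on the Gender
# column and ggplot adds a legend.
p_mapped <- ggplot(toy, aes(x = BirthYear, y = DeathYear)) +
  geom_point(aes(color = Gender), alpha = 0.3)

# Common slip: a constant inside aes() is treated as data, so the points
# get a default palette colour and a legend entry labelled "red".
p_slip <- ggplot(toy, aes(x = BirthYear, y = DeathYear)) +
  geom_point(aes(color = "red"), alpha = 0.3)

p_fixed
p_mapped
p_slip
```

If a plot shows an unexpected legend with a single entry, the third case is usually the reason.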
Our plot of lodz shows many data points and significant overlap. Under these conditions, scatterplots become less useful. We have changed the alpha parameter to make the overlaps more visible. Another approach is to bin the points into hexagonal cells, which is less complicated than it sounds. Run the same scatter plot command as above for birth and death years, but add a stat_binhex() layer at the end (this needs the hexbin package to be installed). Much better, no? We will later learn more about ggplot’s additional stat building block.
ggplot(lodz, aes(x = BirthYear, y = DeathYear)) + geom_point(color='red', alpha=0.3) + stat_binhex()
Did you notice that we did not change the underlying ggplot in all our plots but only added a geom_point or a stat_binhex? This was enough to define the new look because of the grammar of graphics. What you will often see is that people experiment with the look of ggplot graphs by first defining a basic ggplot with, for instance, p <- ggplot(lodz, aes(x = BirthYear, y = DeathYear)). Try it.
p <- ggplot(lodz, aes(x = BirthYear, y = DeathYear))
Now, we can plot the graph (re-)using p. This time, we want the colour of the points to depend on the gender of the prisoners. Type in p + geom_point(aes(color=Gender), alpha=0.5). Some values are missing, as indicated by NA in the legend.
p + geom_point(aes(color=Gender), alpha=0.5)
However, we still do not have a clear idea of the relationship between the birth years and the death years. As said, a plot constructed with ggplot can have more than one geom. Let us add a simple smoother geom that draws a fitted regression line, which describes how one continuous variable changes with another (here death year with birth year), and a ribbon that shows how confident we can be in that line. Type p + geom_point(aes(color=Gender), alpha=0.5) + geom_smooth().
p + geom_point(aes(color=Gender), alpha=0.5) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
There is now a nice blue regression line in the graph, and it is fairly flat. If you zoom into the graph, you can see that the ribbon stays close to the line, so we can be fairly confident in it. A flat line means that the death year hardly changes with the birth year, which suggests that there is no strong relationship between the two. However, we can clearly see that most recorded deaths in Lodz fall within a narrow band of years (the summary above puts the median DeathYear at 1944), as this is where the line sits.
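The message printed by geom_smooth() shows that it chose a GAM smoother rather than a straight line. If you want an ordinary linear regression line instead, you can ask for it explicitly. The sketch below uses made-up data with a mild upward trend; toy and p_lm are hypothetical names, and the numbers are not taken from lodz.

```r
library(ggplot2)

# Made-up data with a mild upward trend plus noise.
set.seed(1)
toy <- data.frame(BirthYear = 1880:1939)
toy$DeathYear <- 1941 + 0.03 * (toy$BirthYear - 1880) +
  rnorm(nrow(toy), sd = 0.5)

p_lm <- ggplot(toy, aes(x = BirthYear, y = DeathYear)) +
  geom_point(alpha = 0.5) +
  # method = "lm" fits a straight line; se = TRUE keeps the confidence ribbon.
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)

p_lm
```

Setting se = FALSE would drop the ribbon and leave only the line.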
For completeness, we continue by introducing a few more geoms. p + geom_text(aes(label=FactoryNumber), size = 1) adds a text label based on the factory number a prisoner worked in.
p + geom_text(aes(label=FactoryNumber), size = 1)
We can now see numbers instead of the dots of the other scatter plots. To see these better, let’s zoom in on those prisoners who died before 1943. This can be done by limiting the y-axis. Type in p + geom_text(aes(label=FactoryNumber), size = 2) + ylim(1941,1942.5). Please note that we also increase the size.
p + geom_text(aes(label=FactoryNumber), size = 2) + ylim(1941,1942.5)
More useful to us is the shape aesthetic, which we can use to represent yet another variable in our graph. Type in p + geom_point(aes(color=Gender, shape=Photo), alpha=0.4) in order to represent the gender using the colour of the points and whether there is a photo of the prisoner using the shape.
p + geom_point(aes(color=Gender, shape=Photo), alpha=0.4)
Oh well, the graph is not as useful as hoped. There are simply too many points. Maybe it would have been better to use test data than real-life data, but at least you should get a feeling for the challenges we face with cultural data. In the real life of data, you often try, fail and try again until you get something nice. Let’s try faceting next with a facet_wrap layer. Faceting with ggplot could be a way to separate the gender groups, which overlap in a single panel. Try your luck with p + geom_point(color = 'red', alpha=0.4) + facet_wrap(~Gender).
p + geom_point(color = 'red', alpha=0.4) + facet_wrap(~Gender)
If you zoom in to the new graph, you will notice that the genders are fairly evenly distributed and where the NA values are mostly found. But overall the graph is still very complicated, and it is not easy to draw any conclusion from it.
Finally, we could make the graph look even nicer with one of the many themes of ggplot, which changes the look and feel of a graph. The possibilities are endless. Have a look at http://docs.ggplot2.org/current/theme.html.
As a final exercise, let us start from the beginning and define p <- ggplot(lodz). Notice how we now have a simple base plot that we can build anything on top of, as it does not contain any aesthetics.
p <- ggplot(lodz)
Run p + geom_point(aes(x = BirthYear, y = IDCardYear), color = 'red', alpha = 0.4) to quickly plot a new relationship, this time between the birth year and the year when the ID card was issued. Again, we can see many suspicious outliers.
p + geom_point(aes(x = BirthYear, y = IDCardYear), color = 'red', alpha = 0.4)
Let us concentrate on the time since 1939, the year of the invasion of Poland. Run p + geom_point(aes(x = BirthYear, y = IDCardYear), color = 'red', alpha = 0.4) + ylim(1939, 1946).
p + geom_point(aes(x = BirthYear, y = IDCardYear), color = 'red', alpha = 0.4) + ylim(1939, 1946)
Again, many values are missing. We do not like this and hence define a new data frame that contains only complete records with birth_id <- na.omit(data.frame(BirthYear=lodz$BirthYear, IDCardYear=lodz$IDCardYear)). I hope you remember that na.omit removes rows with missing values from a data frame (see our session on communities in digital culture and society). Otherwise, read http://www.statmethods.net/input/missingdata.html again. We could also have used subset to exclude NA values. Do you know how?
birth_id <- na.omit(data.frame(BirthYear=lodz$BirthYear, IDCardYear=lodz$IDCardYear))
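In case the subset question leaves you stuck, here is one possible answer, sketched on a small made-up data frame instead of lodz. The names toy, a, b and c_rows are ours; all three approaches use base R only.

```r
# Toy stand-in for lodz with some missing years (made-up values).
toy <- data.frame(
  BirthYear  = c(1920, NA, 1916, 1921, 1925),
  IDCardYear = c(1943, 1942, NA, 1942, 1943)
)

# 1. na.omit() drops every row that has at least one NA.
a <- na.omit(toy)

# 2. subset() keeps the rows for which the condition is TRUE.
b <- subset(toy, !is.na(BirthYear) & !is.na(IDCardYear))

# 3. complete.cases() returns TRUE for rows without any NA,
#    which we can use to index the data frame.
c_rows <- toy[complete.cases(toy), ]

a
b
c_rows
```

All three keep the same rows here. subset() becomes more convenient once you want to combine the NA check with further conditions.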
Let us remove the data we are not interested in with birth_id <- birth_id[birth_id$IDCardYear>=1939,]. Do you remember what this does?
birth_id <- birth_id[birth_id$IDCardYear>=1939,]
Check the new data with head(birth_id).
head(birth_id)
## BirthYear IDCardYear
## 1 1920 1943
## 2 1925 1942
## 3 1916 1943
## 4 1921 1942
## 5 1925 1943
## 6 1925 1943
Now, let us define p1 <- ggplot(birth_id).
p1 <- ggplot(birth_id)
Running p1 + geom_point(aes(x = BirthYear, y = IDCardYear), color = 'red', alpha = 0.4) should no longer produce a warning about missing values.
p1 + geom_point(aes(x = BirthYear, y = IDCardYear), color = 'red', alpha = 0.4)
Next, we would like to add a regression line to see whether we can finally find a relationship. Rather than using the smooth block, this time we calculate it ourselves. We can calculate the regression using predicted_id_year <- predict(lm(IDCardYear ~ BirthYear, data = birth_id)). lm is the function that creates a linear regression model, while predict uses the model to predict ID card years.
predicted_id_year <- predict(lm(IDCardYear ~ BirthYear, data = birth_id))
Well done! You have just calculated your first prediction model. Regressions are one way to predict the behaviour of one continuous variable based on another. In our case, we predict the year of the ID card based on the birth year.
Now run p1 + geom_point(aes(x = BirthYear, y = IDCardYear), color = 'red', alpha = 0.4) + geom_line(aes(x=BirthYear, y = predicted_id_year)) to add a line-geom object for the regression line. It shows that the year of the ID card does slightly increase with the birth year, although the approximation is not great. Nevertheless, congratulations are in order. You have just visualised your first prediction.
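To see what lm and predict do step by step, here is a small sketch on made-up numbers. The data frame toy and the names model and fitted_years are hypothetical; the values are not taken from lodz.

```r
# Made-up birth and ID card years; not the real lodz records.
toy <- data.frame(
  BirthYear  = c(1900, 1905, 1910, 1915, 1920, 1925),
  IDCardYear = c(1941, 1942, 1942, 1942, 1943, 1943)
)

# Fit a straight line: IDCardYear = intercept + slope * BirthYear.
model <- lm(IDCardYear ~ BirthYear, data = toy)
coef(model)  # intercept and slope

# Without newdata, predict() returns one fitted value per row of toy,
# in the same order, which is why it can be plotted against BirthYear.
fitted_years <- predict(model)

# With newdata, predict() estimates the outcome for other birth years.
predict(model, newdata = data.frame(BirthYear = c(1912, 1930)))
```

Because the fitted values come back in row order, they line up with the rows of the data frame that was used to fit the model, and only with that data frame.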
p1 + geom_point(aes(x = BirthYear, y = IDCardYear), color = 'red', alpha = 0.4) + geom_line(aes(x=BirthYear, y = predicted_id_year))
Now for something different. We move on from ghettos to museums and your future job as an analyst there. Several arts museums around the world publish their collections’ metadata online. Very popular in the R community is, for instance, New York’s Museum of Modern Art (MoMA) collection data (https://github.com/MuseumofModernArt/collection). Let us take a look at their artworks’ information. Please note that this data is refreshed monthly and is fairly large, with over 50 MB at the time of writing. So, if you download it now, you might see some differences to the preloaded data frame moma_artworks, which holds a small sample of all the records. Get an overview with summary(moma_artworks).
summary(moma_artworks)
## X Title
## Min. : 32 Untitled : 53
## 1st Qu.: 31299 Untitled from Favorite Objects : 5
## Median : 63406 New York City : 4
## Mean : 63796 Poster of the French National Railroad: 4
## 3rd Qu.: 95426 Untitled (page from Sump) : 4
## Max. :128326 (Untitled) : 3
## (Other) :1212
## Artist ConstituentID
## Eugène Atget : 49 229 : 49
## Louise Bourgeois : 39 710 : 39
## Ludwig Mies van der Rohe: 19 7166 : 19
## Marc Chagall : 17 1055 : 17
## Lee Friedlander : 16 2002 : 16
## (Other) :1134 (Other):1134
## NA's : 11 NA's : 11
## ArtistBio Nationality
## (French, 1857–1927) : 49 (American):536
## (American, born France. 1911–2010) : 39 (French) :247
## (American, born 1934) : 24 (German) : 78
## (American, born Germany. 1886–1969): 19 (British) : 70
## (French, born Belarus. 1887–1985) : 17 (Spanish) : 34
## (Other) :1100 (Other) :309
## NA's : 37 NA's : 11
[...]
Type in head(moma_artworks) to take a look at the first rows. For the rest of the MoMA exercises, we take inspiration from http://sebastianbarfort.github.io/sds/posts/2015/09/27/assignment-1.html. Check out his course syllabus in order to understand why R is important to digital social sciences.
head(moma_artworks)
## X Title Artist
## 1 110086 Untitled from Flare Thomas Nozkowski
## 2 95761 I Am Still Alive On Kawara
## 3 90344 Ortogonal (Collage) 8 Alejandro Otero
## 4 13738 Plate (folio 12) from (POEMS) Willem de Kooning
## 5 41430 Mansard House, South End, Boston, Massachusetts Walker Evans
## 6 97070 Mechanical for Safedoor George Maciunas
## ConstituentID ArtistBio Nationality
## 1 4344 (American, born 1944) (American)
## 2 3030 (Japanese, 1933–2014) (Japanese)
## 3 4445 (Venezuelan, 1921–1990) (Venezuelan)
## 4 3213 (American, born the Netherlands. 1904–1997) (American)
## 5 1777 (American, 1903–1975) (American)
## 6 21398 (American, born Lithuania. 1931–1978) (American)
## BeginDate EndDate Gender Date
## 1 (1944) (0) (Male) 2009
## 2 (1933) (0) (Male) 1970
## 3 (1921) (1990) (Male) 1952
## 4 (1904) (1997) (Male) 1967-1988
## 5 (1903) (1975) (Male) 1932
## 6 (1931) (1978) (Male) c. 1973
## Medium
## 1 One from an illustrated book with ten etchings
## 2 One telegram with envelope
## 3 Cut-and-pasted colored paper on colored paper
## 4 <NA>
## 5 Gelatin silver print
## 6 (CONFIRM)
[...]
In this part, let’s pretend you work for MoMA, and your manager has asked you to create a visualisation showcasing the development of the museum stock. Alongside moma_artworks, I have also created the moma_stock data frame. We would like to know how the stock has developed over the years and visualise it. Get the head of moma_stock.
head(moma_stock)
## X date supply stock
## 1 1 1929-11-01 9 9
## 2 2 1930-01-01 3 12
## 3 3 1930-04-01 2 14
## 4 4 1930-06-01 1 15
## 5 5 1930-10-01 2 17
## 6 6 1931-01-01 2 19
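Looking at these six rows, stock appears to be the running total of supply. The sketch below checks this for the rows shown above only; whether it holds for the whole of moma_stock is an assumption you would have to verify on the full data frame. The name stock_head is ours.

```r
# The first six rows of moma_stock, copied from the head() output above.
stock_head <- data.frame(
  date   = as.Date(c("1929-11-01", "1930-01-01", "1930-04-01",
                     "1930-06-01", "1930-10-01", "1931-01-01")),
  supply = c(9, 3, 2, 1, 2, 2),
  stock  = c(9, 12, 14, 15, 17, 19)
)

# Running total of the newly acquired items.
running_total <- cumsum(stock_head$supply)

# TRUE if stock is the cumulative sum of supply for these rows.
all(running_total == stock_head$stock)
```

This is why the stock line in the plot can only go up or stay level: each value adds the new supply to everything acquired before.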
Now define a new ggplot with p1 <- ggplot(moma_stock).
p1 <- ggplot(moma_stock)
Next, we draw a simple red line to represent the stock development with p1 + geom_line(aes(x = as.Date(date), y = stock/10000), color = 'red', size = 1) + labs(x = 'Date', y = 'No (10,000)', title = 'MoMA Stock'). Please note the new building block labs, which labels the axes and sets the title, as well as your old friend as.Date in geom_line. (From ggplot2 3.4.0 onwards, the width of a line is set with linewidth; size still works for lines there but triggers a deprecation warning.)
p1 + geom_line(aes(x = as.Date(date), y = stock/10000), color = 'red', size = 1) + labs(x = 'Date', y = 'No (10,000)', title = 'MoMA Stock')
## Warning: Removed 1 rows containing missing values (geom_path).
There is a definite jump in holdings from the late 1960s onwards. Or is this simply a recording issue? We cannot really know and would have to do further research. Let’s investigate the development of MoMA departments next. moma_departments contains the stocks per department. Check it out with head(moma_departments).
head(moma_departments)
## X date Department supply stock
## 1 1 1932-01-01 Architecture & Design 2 2
## 2 2 1934-01-01 Architecture & Design 2 4
## 3 3 1934-04-01 Architecture & Design 43 47
## 4 4 1934-09-01 Architecture & Design 4 51
## 5 5 1935-11-01 Architecture & Design 22 73
## 6 6 1935-12-01 Architecture & Design 1 74
Let’s create a new ggplot and set p2 to ggplot(moma_departments).
p2 <- ggplot(moma_departments)
Now we plot with p2 + geom_line(aes(x = as.Date(date), y = stock, color = Department), size = 1) + labs(x = 'Date', y = 'No', title = 'MoMA Stock per Department').
p2 + geom_line(aes(x = as.Date(date), y = stock, color = Department), size = 1) + labs(x = 'Date', y = 'No', title = 'MoMA Stock per Department')
## Warning: Removed 8 rows containing missing values (geom_path).
Well done! But you are rather nervous about showing this to your manager and would like to change the look and feel a bit. That is why you check the settings of the current ggplot theme with theme_get().
theme_get()
## List of 59
## $ line :List of 6
## ..$ colour : chr "black"
## ..$ size : num 0.5
## ..$ linetype : num 1
## ..$ lineend : chr "butt"
## ..$ arrow : logi FALSE
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_line" "element"
## $ rect :List of 5
## ..$ fill : chr "white"
## ..$ colour : chr "black"
## ..$ size : num 0.5
## ..$ linetype : num 1
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_rect" "element"
[...]
There are a lot of theme options. After some research and some good old cut-and-paste, you decide to make the background black and position the legend at the bottom of the graph. You type p2 + geom_line(aes(x = as.Date(date), y = stock, color = Department), size = 1) + labs(x = 'Date', y = 'No', title = 'MoMA Stock per Department') + theme(panel.background = element_rect(fill = 'black'), legend.position = 'bottom').
p2 + geom_line(aes(x = as.Date(date), y = stock, color = Department), size = 1) + labs(x = 'Date', y = 'No', title = 'MoMA Stock per Department') + theme(panel.background = element_rect(fill = 'black'), legend.position = 'bottom')
Your manager likes it! After your success at MoMA why not start at the TATE Galleries in Britain? Your first task is to compare visitor numbers. You have been given a data frame tate_attendance. Check it out with str(tate_attendance).
str(tate_attendance)
## Classes 'tbl_df', 'tbl' and 'data.frame': 12 obs. of 5 variables:
## $ year : int 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 ...
## $ Britain : num 108577 83305 153279 129129 118294 ...
## $ Modern : num 512807 328504 480753 485799 472907 ...
## $ Liverpool: num 66815 81898 68439 84524 225184 ...
## $ StIves : num 38381 36242 40678 40678 34807 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 5
## .. ..$ year : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Britain : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ Modern : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ Liverpool: list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## .. ..$ StIves : list()
## .. .. ..- attr(*, "class")= chr "collector_number" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
tate_attendance contains the visitor numbers since 2004 for all TATE Galleries in Britain. You are eager to plot it in order to compare visitor developments. But you soon realise that there is an issue with the format, which is quite typical in these cases. Some of the information (in this case the names of the galleries) is encoded in the titles of the columns. You would like to create a data frame that has the gallery names as values in a column. This means you create ‘tidy data’, as defined by Hadley Wickham in the Journal of Statistical Software (https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf). In tidy data, each row is an observation and each column contains the values of a variable (i.e. an attribute of what we are observing). We have already met the dplyr package to wrangle data. There is also the tidyr package to tidy data. Please load it with library(tidyr).
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
## The following object is masked from 'package:raster':
##
## extract
The package tidyr allows us to gather the data into the right shape. Check out https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf for a complete overview of data wrangling and tidying in R. It might seem confusing at first, but it is actually quite straightforward. Type in tate_attendance_gathered <- gather(tate_attendance, variable, value, -year) in order to gather all the gallery columns into two new columns: variable holds the gallery name and value holds the attendance, while year is kept as it is. This operation is very typical for preparing plots. At the end, we have one observation of TATE attendance per row and no additional information in the column titles.
tate_attendance_gathered <- gather(tate_attendance, variable, value, -year)
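To see exactly what gather does, it helps to try it on a table small enough to read in full. The sketch below uses made-up attendance figures in the same wide layout; wide, long_gather and long_pivot are hypothetical names. It also shows pivot_longer, the newer tidyr function for the same job (available from tidyr 1.0.0), in case you come across it in other tutorials.

```r
library(tidyr)

# Made-up attendance figures in the same wide layout as tate_attendance.
wide <- data.frame(
  year    = c(2004, 2005),
  Britain = c(100, 110),
  Modern  = c(500, 480)
)

# gather(): every column except year becomes a (variable, value) pair.
long_gather <- gather(wide, variable, value, -year)

# pivot_longer(): the same reshaping with the newer interface.
long_pivot <- pivot_longer(wide, cols = -year,
                           names_to = "variable", values_to = "value")

long_gather
long_pivot
```

Both results contain the same four observations; only the order of the rows differs.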
Let us check that we tidied the data. Run the command to display the structure (str) of tate_attendance_gathered.
str(tate_attendance_gathered)
## Classes 'tbl_df', 'tbl' and 'data.frame': 48 obs. of 3 variables:
## $ year : int 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 ...
## $ variable: chr "Britain" "Britain" "Britain" "Britain" ...
## $ value : num 108577 83305 153279 129129 118294 ...
Success! Because the visitor numbers together add up to the total number of visitors to the TATE Galleries in Britain, you would like to create a stacked area geom, which you liked when we worked with the Google Ngram data. Area geoms stack the numbers per year so that the top edge shows the total. You can do it! Run ggplot(tate_attendance_gathered, aes(x = year, fill = variable)) + geom_area(aes(y = value), color='black', size=0.2, alpha=0.4) + scale_fill_brewer(palette='Blues'). Observe the parameters of geom_area and look up what scale_fill_brewer does at http://www.sthda.com/english/wiki/ggplot2-area-plot-quick-start-guide-r-software-and-data-visualization.
ggplot(tate_attendance_gathered, aes(x = year, fill = variable)) + geom_area(aes(y = value), color='black', size=0.2, alpha=0.4) + scale_fill_brewer(palette='Blues')
Wonderful. Corporate blue always works and is a little less risky than your MoMA colours. Your career at a museum is guaranteed!
Let’s check what we have learned next.
Which principles is ggplot built upon?
Grammar of graphics
Using ggplot, plot the lodz data with x = BirthYear and y = DeathYear as a scatter plot with blue dots and alpha=0.3.
ggplot(lodz, aes(x = BirthYear, y = DeathYear)) + geom_point(color='blue', alpha=0.3)
Replot the tate_attendance_gathered with a new palette ‘Oranges’.
ggplot(tate_attendance_gathered, aes(x = year, fill = variable)) + geom_area(aes(y = value), color='black', size=0.2, alpha=0.4) + scale_fill_brewer(palette='Oranges')
What is a sign of tidy data?
One observation per row and no additional information in the columns
The great thing about tidyr is that you can gather only certain columns and leave the others alone. Let us try this with mtcars. Print out the head of mtcars to recall what mtcars looks like.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
If we want to gather all the columns from mpg to gear and leave the carb column as it is, we can use mpg:gear in the gather expression above instead of -year. You need to replace tate_attendance with mtcars and can leave the rest as it was. Note that the car names are row names rather than a column, so they do not appear in the result.
gather(mtcars, variable, value, mpg:gear)
## carb variable value
## 1 4 mpg 21.000
## 2 4 mpg 21.000
## 3 1 mpg 22.800
## 4 1 mpg 21.400
## 5 2 mpg 18.700
## 6 1 mpg 18.100
## 7 4 mpg 14.300
## 8 2 mpg 24.400
## 9 2 mpg 22.800
## 10 4 mpg 19.200
[...]
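As the output shows, the car names are lost because they are row names, not a column. If you want to keep them, one option is to copy them into a regular column before gathering. This is a minimal sketch; the names cars, car and cars_long are ours.

```r
library(tidyr)

# Copy the row names into a real column so they survive the reshaping.
cars <- mtcars
cars$car <- rownames(cars)

# Gather mpg to gear; carb and car stay as identifying columns.
cars_long <- gather(cars, variable, value, mpg:gear)

head(cars_long)
```

Each car now contributes one row per gathered variable, and every row still says which car it belongs to.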
That’s it. Please explore the endless possibilities of visualisation with ggplot online.