In this session, we will learn about the grammar of graphics, which Leland Wilkinson introduced in the book The Grammar of Graphics (Springer, New York, 2005). We will roughly follow the tutorial at http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html, although that tutorial has many more examples and graphs than we will work with. Later in the session, we will apply our knowledge to data from museums.

The grammar of graphics defines independent building blocks for the things we want to do with graphs. Combining these blocks in different ways gives us any number of options for creating effective visualisations.

R has an excellent package called ggplot2 that follows the grammar of graphics. Compared to the base plotting functions in R that we met earlier, it might seem quite verbose at first, but it is very flexible, supports multiple themes that define the appearance of the graphs, and overall does an excellent job of producing highly professional graphs. Load the second version of the package with library(ggplot2).

library(ggplot2)

ggplot graphs consist of three distinct elements. (1) The geometry defines the type of graph you would like to produce, such as a scatter plot or a line graph. (2) The graph’s aesthetics are the things you can see, such as the colour or the position of elements. (3) Finally, scales define how data values are translated into those visual properties, for example the range and tick marks of an axis.
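To make the three elements concrete, here is a minimal sketch that marks each one in a single plot. The data frame toy and its columns year, count and group are hypothetical and invented only for this illustration.

```r
library(ggplot2)

# Hypothetical toy data, invented only for this illustration.
toy <- data.frame(
  year  = c(1900, 1910, 1920, 1930),
  count = c(12, 30, 45, 20),
  group = c("a", "a", "b", "b")
)

p <- ggplot(toy, aes(x = year, y = count, colour = group)) +  # aesthetics: position and colour
  geom_point() +                                              # geometry: a scatter plot
  scale_x_continuous(breaks = seq(1900, 1930, by = 10))       # scale: where the x-axis ticks go
p
```

Swapping geom_point() for geom_line() changes the type of graph without touching the aesthetics or the scale, which is what is meant by independent building blocks.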

Now, let’s compare the standard plots with the new ggplot2 package. We use the database of prisoners in the Lodz Ghetto.

The ghetto (in German, Ghetto Litzmannstadt) was a World War II ghetto established by the Nazi German authorities for Polish Jews and Roma following the 1939 invasion of Poland. It was the second-largest ghetto in all of German-occupied Europe after the Warsaw Ghetto. Situated in the city of Lodz, and originally intended as a preliminary step in a more extensive plan to create the Judenfrei province of Warthegau, the ghetto was transformed into a major industrial centre, manufacturing much-needed war supplies for Nazi Germany and especially for the German Army. The number of people incarcerated in it was increased further by the Jews deported from the Reich territories. Because of its remarkable productivity, the ghetto managed to survive until August 1944. In the first two years, it absorbed almost 20,000 Jews from liquidated ghettos in nearby Polish towns and villages, as well as 20,000 more from the rest of German-occupied Europe. After the wave of deportations to the Chelmno death camp beginning in early 1942, and in spite of a stark reversal of fortune, the Germans persisted in eradicating the ghetto and transported the remaining population to the Auschwitz and Chelmno extermination camps, where most were murdered upon arrival. It was the last ghetto in occupied Poland to be liquidated. A total of 204,000 Jews passed through it, but only 877 remained hidden when the Soviets arrived. About 10,000 Jewish residents of Lodz, who used to live there before the invasion of Poland, survived the Holocaust elsewhere.

Check the records of prisoners in Lodz with head(lodz).

head(lodz)
## # A tibble: 6 x 22
##      ID Given Surname Gender ReligionRecorded Profession SecondOccupation
##   <int> <chr> <chr>   <chr>  <chr>            <chr>      <chr>
## 1     2 Frym… Zlocze… Female <NA>             Handschuh… <NA>
## 2     4 Abram Zlocze… Male   <NA>             "Sch\xcc_… Bote
## 3     6 Chana Zlot    Female <NA>             Schneider… <NA>
## 4     8 Dres… Zlota   Female <NA>             "W\xcc_sc… <NA>
## 5    10 Fajga Zlota   Female <NA>             Schneider… <NA>
## 6    12 Malka Zlota   Female <NA>             Schneider… <NA>
## # ... with 15 more variables: BirthYear <int>, BirthCity <chr>,
## #   BirthCountry <chr>, WorkingSinceYear <int>, IDCardYear <int>,
## #   ResidenceCity <chr>, ResidenceCountry <chr>, DeathYear <int>,
## #   Age <int>, WorkerNumber <chr>, Photo <chr>, CardNumber <chr>,
## #   FactoryNumber <chr>, FactoryName <chr>, FirstNameofParents <chr>

Run a summary of lodz to get more information. With 15,128 rows, that is quite a few data records.

summary(lodz)
##        ID           Given             Surname             Gender
##  Min.   :    2   Length:15128       Length:15128       Length:15128
##  1st Qu.: 7726   Class :character   Class :character   Class :character
##  Median :15404   Mode  :character   Mode  :character   Mode  :character
##  Mean   :15555
##  3rd Qu.:23154
##  Max.   :32760
##
##  ReligionRecorded    Profession        SecondOccupation     BirthYear
##  Length:15128       Length:15128       Length:15128       Min.   :1862
##  Class :character   Class :character   Class :character   1st Qu.:1901
##  Mode  :character   Mode  :character   Mode  :character   Median :1912
##                                                           Mean   :1910
##                                                           3rd Qu.:1922
##                                                           Max.   :1944
##                                                           NA's   :670
##   BirthCity         BirthCountry       WorkingSinceYear   IDCardYear
##  Length:15128       Length:15128       Min.   :1939     Min.   :1904
##  Class :character   Class :character   1st Qu.:1943     1st Qu.:1942
##  Mode  :character   Mode  :character   Median :1943     Median :1942
##                                        Mean   :1943     Mean   :1942
##                                        3rd Qu.:1943     3rd Qu.:1943
##                                        Max.   :1947     Max.   :1944
##                                        NA's   :10591    NA's   :2883
##  ResidenceCity      ResidenceCountry     DeathYear          Age
##  Length:15128       Length:15128       Min.   :1941    Min.   : 1.00
##  Class :character   Class :character   1st Qu.:1944    1st Qu.:16.00
##  Mode  :character   Mode  :character   Median :1944    Median :29.00
##                                        Mean   :1944    Mean   :30.89
##                                        3rd Qu.:1944    3rd Qu.:43.75
##                                        Max.   :1952    Max.   :73.00
##                                        NA's   :13869   NA's   :14858
##  WorkerNumber          Photo            CardNumber
##  Length:15128       Length:15128       Length:15128
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##
##
##
##
##  FactoryNumber      FactoryName        FirstNameofParents
##  Length:15128       Length:15128       Length:15128
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##
##
##
## 
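The NA's rows in the summary show that some columns are mostly empty: DeathYear is missing for 13,869 of the 15,128 records and Age for 14,858. As a small sketch of how to count missing values per column directly, the example below uses a hypothetical four-row data frame called toy; in the session you would pass lodz instead.

```r
# Hypothetical stand-in with the same kind of gaps as the real data.
toy <- data.frame(
  BirthYear = c(1901, NA, 1922, 1912),
  DeathYear = c(NA, NA, 1944, NA)
)

# is.na() marks every missing cell; colSums() adds the marks up per column.
missing_counts <- colSums(is.na(toy))
missing_counts

# In the session: colSums(is.na(lodz))
```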

We have already used the most basic of all plots, the histogram, which consists of rectangles whose area is proportional to the frequency of a variable’s values. Let’s look at the birth years of prisoners. In base R, this is quickly done with hist() using lodz$BirthYear, xlab = 'Birth Year' and main = 'Lodz Ghetto Prisoners'.

hist(lodz$BirthYear, xlab = 'Birth Year', main = 'Lodz Ghetto Prisoners')

With ggplot, creating a simple histogram is slightly more complicated. Type in ggplot(lodz, aes(x = BirthYear)) + geom_histogram().

ggplot(lodz, aes(x = BirthYear)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
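The message is ggplot2 pointing out that it fell back to a default of 30 bins. A minimal sketch of choosing the bin width yourself follows. The value of 5 years is an arbitrary assumption rather than a recommendation, and the randomly generated stand-in data is only there so the sketch runs on its own; it is not real records.

```r
library(ggplot2)

# Stand-in data, randomly generated and NOT real records.
# This step is skipped when the real lodz data is already loaded.
if (!exists("lodz")) {
  set.seed(1)
  lodz <- data.frame(BirthYear = sample(1862:1944, 500, replace = TRUE))
}

# binwidth = 5 groups the birth years into five-year bins.
p <- ggplot(lodz, aes(x = BirthYear)) +
  geom_histogram(binwidth = 5)
p
```

With the real data, ggplot2 also warns that it dropped the rows where BirthYear is missing.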