This session is about the art of statistics. It will deal mostly with survey data and demonstrate the right statistical tools for the job. We concentrate on data from a survey that evaluates learning and teaching objectives. The survey presents students’ views regarding the teaching and content of a particular module of 500 students. We first explore the survey with visualisations and apply other descriptive statistics techniques before we attempt to capture general relationships with inferential statistics techniques.

Surveys are of course everywhere nowadays. They help to market products, to evaluate the government’s performance or to gauge the mood of a nation. All of these surveys follow similar patterns and types of questioning. They include questions that ask respondents to check a box, fill in a text field or choose from multiple options. Surveys can be conducted online or offline, stand-alone or as part of wider investigations such as interviews or medical drug evaluations.

Our survey data is artificially generated in order to avoid re-using data that contains personal information about respondents. But the data is composed in such a way that it covers many of the question types that are typical in surveys. Such data is mostly ordinal and numerical, which allows us to apply a range of techniques for statistical inference. If at any time you would like to regenerate the data, you can simply run student_survey <- generate_student_survey().
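
To give an idea of how such artificial survey data could be put together, here is a minimal sketch. It is not the lesson's generate_student_survey() function, whose implementation is not shown here. The function name make_toy_survey, the chosen columns and all value ranges are assumptions made for illustration only.

```r
# Hypothetical sketch: NOT the lesson's generate_student_survey() function.
set.seed(42)  # make the random draws reproducible

likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")

make_toy_survey <- function(n) {
  data.frame(
    Respondent = seq_len(n),
    Age        = sample(18:30, n, replace = TRUE),          # numerical
    Gender     = sample(c("F", "M"), n, replace = TRUE),    # categorical
    Overall    = sample(likert_levels, n, replace = TRUE),  # ordinal
    # Marks drawn from a normal distribution and clipped to the range 0..100
    Marks      = round(pmin(pmax(rnorm(n, mean = 65, sd = 8), 0), 100)),
    stringsAsFactors = FALSE
  )
}

toy_survey <- make_toy_survey(20)
str(toy_survey)
```

Because every value is drawn at random, no real respondent is ever involved, which is the point of working with generated data.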

One of the main challenges of all statistics is the often very specific and confusing language. If you are new to statistics, I can highly recommend reading through http://www.statisticsforhumanities.net while working through this session. We will make frequent use of its ideas and definitions in this session. We refer to the book as Canning (2014) in the rest of this session.

The first important consideration in statistics is the population we consider (https://en.wikipedia.org/wiki/Statistical_population). For our survey, these are all the respondents, of which there are 500 in our example. Describing only these respondents, however, tells us nothing about the other students or the student population as a whole. For this we need inferential statistics, which is about moving from a sample of data to the larger picture in order to gain insights. We will cover inferential statistics later in this session.
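
The difference between describing the respondents and saying something about a larger population can be sketched with a small simulation. The population below is entirely made up (5000 hypothetical student ages), and the names population_age and sample_age are ours; the sketch only illustrates the idea and is not part of the survey data.

```r
set.seed(1)

# Hypothetical population: ages of 5000 students (made-up values)
population_age <- sample(18:35, 5000, replace = TRUE)

# A sample of 500 respondents, drawn without replacement
sample_age <- sample(population_age, 500)

mean(population_age)  # usually unknown in practice
mean(sample_age)      # descriptive statistic of the sample only

# Standard error: a rough measure of how far the sample mean
# may lie from the population mean
se <- sd(sample_age) / sqrt(length(sample_age))
se
```

The sample mean describes the 500 respondents exactly; the standard error is a first step towards a statement about everybody else.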

There are lots of technical solutions for publishing, disseminating and filling in surveys as well as for survey analysis. We use R of course for the data analysis, focusing on packages and functions dedicated to exploratory data analysis, survey plots and inferential statistics. But the actual survey was conducted with Google Forms, which is a free tool to conduct detailed surveys (https://docs.google.com/forms/). Via an easy-to-use online interface the user can define a set of questions, which can be free or short text, multiple choice or drop-down but can also contain specialised fields such as date and time. After submission, the information is collected in an online Google spreadsheet. While in this exercise we connect directly to the sheet, any survey data can also be downloaded and then imported into R as a CSV file or an MS Excel file. This way, R can be used to analyse data from any online survey tool that provides these kinds of export options.
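
As a minimal sketch of the download-and-import route, the following writes a tiny CSV file to a temporary location and reads it back with read.csv. The file content and the names csv_path and local_survey are made up for illustration; with a real export you would simply point read.csv at the downloaded file.

```r
# Hypothetical export: we write a tiny CSV ourselves so the sketch runs offline
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Respondent,Age,Overall",
             "1,30,Neutral",
             "2,25,Strongly Agree"), csv_path)

# The same kind of import call works for any downloaded CSV export
local_survey <- read.csv(csv_path, stringsAsFactors = FALSE)
str(local_survey)
```

For Excel exports, one option is read_excel from the readxl package, which would need to be installed separately.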

In order to import the data from Google Forms we first need to set the sharing permissions of the sheet so that anybody with the link can view it. Afterwards, we copy the URL and record it in R. Please define survey_url <- "https://docs.google.com/spreadsheets/d/1K5uS3vCcvOUrojKb-uQRB_vhfNeptes-t-VYypf3GlY/".

survey_url <- "https://docs.google.com/spreadsheets/d/1K5uS3vCcvOUrojKb-uQRB_vhfNeptes-t-VYypf3GlY/"

Next we load a library to retrieve a Google sheet. Run library(gsheet).

library(gsheet)

Now, we load the sheet into a data frame with student_survey <- read.csv(text = gsheet2text(survey_url), stringsAsFactors = FALSE). The new function here is gsheet2text, which downloads the sheet and returns its content as text that read.csv can parse. Please note that in this session we often talk about variables or groups of them rather than features, as this is the label given to columns in data frames in statistics. Features as labels for the data frame columns are more commonly used in machine learning.

student_survey <- read.csv(text = gsheet2text(survey_url) , stringsAsFactors = FALSE)
## No encoding supplied: defaulting to UTF-8.
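
The text argument of read.csv is what makes this work: it parses a character string as if it were the content of a file. A small offline sketch, with a made-up string standing in for what gsheet2text would return:

```r
# Stand-in for the text that gsheet2text() would return (made-up rows)
sheet_text <- "Respondent,Age,Gender\n1,30,M\n2,28,F"

# read.csv parses the string exactly as it would parse a file
toy <- read.csv(text = sheet_text, stringsAsFactors = FALSE)
str(toy)
```

With stringsAsFactors = FALSE the Gender column stays a character vector, which is why we convert such columns to factors ourselves later on.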

Take a look at the imported data with View(student_survey).

View(student_survey)

As always when we import data, a good first step is to check its structure with str(student_survey).

str(student_survey)
## 'data.frame':    500 obs. of  12 variables:
##  $ Respondent        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  30 30 25 28 26 18 25 29 29 21 ...
##  $ Gender            : chr  "M" "M" "M" "F" ...
##  $ Height            : int  179 179 182 164 192 172 188 194 182 171 ...
##  $ Overall           : chr  "Neutral" "Strongly Agree" "Agree" "Strongly Agree" ...
##  $ Tutor             : chr  "Agree" "Agree" "Strongly Agree" "Strongly Agree" ...
##  $ Content           : chr  "Strongly Agree" "Agree" "Agree" "Agree" ...
##  $ Assessment        : chr  "Agree" "Agree" "Strongly Agree" "Strongly Agree" ...
##  $ Environment       : chr  "Neutral" "Agree" "Agree" "Agree" ...
##  $ Preparation_Hours : int  4 2 9 7 5 3 7 2 6 4 ...
##  $ Participation_Mins: int  20 31 46 47 270 164 24 104 221 24 ...
##  $ Marks             : int  63 57 79 74 65 58 73 57 68 62 ...

We can see that the import has already recognised most of the columns correctly as either numbers or characters. However, six columns that should be factors are currently stored as characters. Next, we transform these into factors. The first one is the Gender column and is easy. We simply run student_survey$Gender <- factor(student_survey$Gender), which updates the Gender column.

student_survey$Gender <- factor(student_survey$Gender)
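
To see what factor does with such a column, here is a small sketch using the first four Gender values shown in the str output above. The variable names gender and gender_f are ours.

```r
# First four Gender values from the str() output above
gender <- c("M", "M", "M", "F")

gender_f <- factor(gender)

levels(gender_f)      # levels are sorted alphabetically by default
as.integer(gender_f)  # the integer codes stored internally
table(gender_f)       # counts per level
```

The integer codes refer to positions in the levels, so "M" is stored as 2 and "F" as 1.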

Columns 5 to 9 are slightly more complicated. They stand for so-called Likert scales (https://en.wikipedia.org/wiki/Likert_scale). In many surveys, respondents can express preferences (or dislikes) using Likert scales. In our case, students were asked to assess (Q1) the general performance of the module, (Q2) the tutor, (Q3) the content, (Q4) the assessments and (Q5) the environment. You can print out all the Likert questions by simply typing in questions now.

questions
## $Overall
## [1] "Overall I liked the class"
## 
## $Tutor
## [1] "The tutor was well prepared"
## 
## $Content
## [1] "The content of the class was what I expected"
## 
## $Assessment
## [1] "The assessments were appropriate"
## 
## $Environment
## [1] "The teaching environment was good"

Each of the five evaluation questions was assessed using a typical five-level Likert scale. The levels are saved as l. Type in l.

l
## [1] "Strongly Disagree" "Disagree"          "Neutral"          
## [4] "Agree"             "Strongly Agree"

Using a combination of lapply and factor we can transform each of the five assessment columns into factors with the levels l. Run student_survey[,5:9] <- data.frame(lapply(student_survey[,5:9], function(x) factor(x, levels = l))) to first select columns 5 to 9 from student_survey, then convert these to factors using the l levels and finally assign the results back to student_survey.

student_survey[,5:9] <- data.frame(lapply(student_survey[,5:9], function(x) factor(x, levels = l)))
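
The effect of passing explicit levels can be sketched on a tiny made-up data frame. The columns Q1 and Q2, their answers and the name likert_levels are assumptions for illustration; the labels are the same as in l.

```r
likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")  # same labels as l in the lesson

# Made-up answers for two hypothetical questions
answers <- data.frame(
  Q1 = c("Neutral", "Strongly Agree", "Agree"),
  Q2 = c("Agree", "agree", "Disagree"),  # "agree" is a deliberate typo
  stringsAsFactors = FALSE
)

# Same lapply/factor pattern as above, applied to both columns
answers[, 1:2] <- data.frame(lapply(answers[, 1:2],
                                    function(x) factor(x, levels = likert_levels)))

levels(answers$Q1)      # levels keep the order given, not alphabetical order
as.integer(answers$Q1)  # position of each answer on the scale
answers$Q2              # values that match no level become NA
```

Two things are worth checking after such a conversion: that the levels run from "Strongly Disagree" to "Strongly Agree", and that no unexpected NA values appeared because of misspelt answers.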

Running str(student_survey) once more will convince us that all the transformations were performed as expected and that we are ready to continue with the exploratory analysis.

str(student_survey)
## 'data.frame':    500 obs. of  12 variables:
##  $ Respondent        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  30 30 25 28 26 18 25 29 29 21 ...
##  $ Gender            : Factor w/ 2 levels "F","M": 2 2 2 1 2 1 2 2 2 1 ...
##  $ Height            : int  179 179 182 164 192 172 188 194 182 171 ...
##  $ Overall           : Factor w/ 5 levels "Strongly Disagree",..: 3 5 4 5 3 5 5 4 5 4 ...
##  $ Tutor             : Factor w/ 5 levels "Strongly Disagree",..: 4 4 5 5 4 3 5 1 5 1 ...
##  $ Content           : Factor w/ 5 levels "Strongly Disagree",..: 5 4 4 4 3 2 3 4 2 1 ...
##  $ Assessment        : Factor w/ 5 levels "Strongly Disagree",..: 4 4 5 5 2 3 5 5 3 3 ...
##  $ Environment       : Factor w/ 5 levels "Strongly Disagree",..: 3 4 4 4 ...
##  $ Preparation_Hours : int  4 2 9 7 5 3 7 2 6 4 ...
##  $ Participation_Mins: int  20 31 46 47 270 164 24 104 221 24 ...
##  $ Marks             : int  63 57 79 74 65 58 73 57 68 62 ...

We begin our investigation with a distribution analysis of the students’ age. First, we plot a histogram with hist(student_survey$Age, main = "Age of Students", col="lightskyblue", xlab="Age (years)", freq = FALSE). This should be very familiar to you by now. With freq = FALSE, we tell hist to plot probability densities rather than absolute counts. Just try the command and check the y-axis, which now shows densities for the bins rather than frequencies, so that the areas of the bars add up to 1.

hist(student_survey$Age, main = "Age of Students", col="lightskyblue",  xlab="Age (years)", freq = FALSE)
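
What the y-axis shows with freq = FALSE can be checked by hand. The sketch below uses made-up ages (the vector ages is an assumption, not the survey data) and computes the bins without drawing them.

```r
set.seed(7)
ages <- sample(18:30, 200, replace = TRUE)  # made-up ages

h <- hist(ages, plot = FALSE)  # compute the bins without drawing
bin_widths <- diff(h$breaks)

# Density = share of observations in the bin divided by the bin width
manual_density <- h$counts / sum(h$counts) / bin_widths
all.equal(manual_density, h$density)

# The bar areas (density times width) add up to 1
sum(h$density * bin_widths)
```

A density is therefore not a probability by itself; only the area of a bar is the share of observations that fall into its bin.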

These kinds of plots are also called density plots. They are very popular among statisticians because they provide a more effective way to view the distribution of a variable (http://www.statmethods.net/graphs/density.html). So-called kernel density plots overcome the discreteness of the histogram bins by centering a smooth kernel function at each data point and then summing these to get a density estimate. We can easily add a kernel density plot to the histogram with the density function. We add the density plot as a line plot and enter lines(density(student_survey$Age), col="red", lty=2, lwd=2).

hist(student_survey$Age, main = "Age of Students", col="lightskyblue",  xlab="Age (years)", freq = FALSE)
lines(density(student_survey$Age), col="red", lty=2, lwd=2)
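
The idea of centering a kernel at each data point can be written out by hand. This is a sketch on made-up ages, assuming a Gaussian kernel and the bw.nrd0 bandwidth rule, which are the defaults of density; the names ages, grid and manual_kde are ours.

```r
set.seed(7)
ages <- sample(18:30, 200, replace = TRUE)  # made-up ages

bw <- bw.nrd0(ages)  # the default bandwidth rule used by density()
grid <- seq(min(ages) - 3 * bw, max(ages) + 3 * bw, length.out = 512)

# One Gaussian bump per observation, averaged at every grid point
manual_kde <- sapply(grid, function(g) mean(dnorm(g, mean = ages, sd = bw)))

# Riemann-sum check: the area under the curve should be close to 1
area <- sum(manual_kde) * (grid[2] - grid[1])
area

# Compare with density(); it uses a fast binned approximation,
# so we expect small differences rather than exact equality
d <- density(ages, bw = bw, from = min(grid), to = max(grid), n = 512)
max_diff <- max(abs(d$y - manual_kde))
max_diff
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely; you can experiment with this through the bw or adjust arguments of density.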