Art of Statistics 1

This session is about the art of statistics. It will deal mostly with survey data and demonstrate the right statistical tools for the job. We concentrate on data from a survey that evaluates learning and teaching objectives. The survey presents students’ views regarding the teaching and content of a particular module of 500 students. We first explore the survey with visualisations and apply other descriptive statistics techniques before we attempt to capture general relationships with inferential statistics techniques.

But surveys are of course everywhere nowadays. They help with marketing products, or evaluate the government's performance or the mood of a nation. All of these surveys follow similar patterns and types of questioning. They include questions that ask respondents to check a box, fill in a text field or choose from multiple options. Surveys can be conducted online or offline, stand-alone or as part of wider investigations such as interviews or medical drug evaluations.

Our survey data is artificially generated in order to avoid re-using data that contain personal information about respondents. The data is nevertheless composed in such a way that it has many of the types of questions that are typical in surveys. Such data is mostly ordinal and numerical, which allows us to apply a range of techniques for statistical inference. If at any time you would like to regenerate the data, you can simply run student_survey <- generate_student_survey().
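The generator used in this session is not shown here, so the following is only a rough sketch of how artificial survey data of this kind could be simulated. The function name simulate_survey, the distributions and all parameter values are assumptions made for illustration; they are not the session's generate_student_survey().

```r
# Hypothetical stand-in, NOT the session's generate_student_survey().
simulate_survey <- function(n = 500, seed = 1) {
  set.seed(seed)
  likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                     "Agree", "Strongly Agree")
  gender <- sample(c("F", "M"), n, replace = TRUE)
  data.frame(
    Respondent = seq_len(n),
    Age        = sample(18:30, n, replace = TRUE),
    Gender     = gender,
    # Assumed average heights per gender, in cm.
    Height     = round(rnorm(n, mean = ifelse(gender == "M", 182, 168), sd = 7)),
    Overall    = sample(likert_levels, n, replace = TRUE),
    # Marks are clipped to the range 0 to 100.
    Marks      = pmin(pmax(round(rnorm(n, mean = 66, sd = 8)), 0), 100),
    stringsAsFactors = FALSE
  )
}

toy_survey <- simulate_survey(n = 20)
str(toy_survey)
```

Fixing the seed makes the simulated data reproducible, which is convenient when the same plots have to be regenerated.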

One of the main challenges of all statistics is the often very specific and confusing language. If you are new to statistics, I can highly recommend reading through http://www.statisticsforhumanities.net while working through this session. We will make frequent use of its ideas and definitions in this session. We refer to the book as Canning (2014) in the rest of this session.

The first important consideration in statistics is the population we consider (https://en.wikipedia.org/wiki/Statistical_population). For our survey, these are all the respondents, of which there are 500 in our example. Looking at the respondents alone, however, we do not learn anything about the other students or the student population as a whole. For this we need inferential statistics, which is about moving from a sample of data to the larger picture in order to gain insights. We will cover inferential statistics later in this session.

There are lots of technical solutions for publishing, disseminating and filling in surveys as well as for survey analysis. We use R of course for the data analysis, focusing on packages and functions dedicated to exploratory data analysis, survey plots and inferential statistics. But the actual survey was conducted with Google Forms, which is a free tool to conduct detailed surveys (https://docs.google.com/forms/). Via an easy-to-use online interface the user can define a set of questions, which can be free or short text, multiple choice or drop-down but can also contain specialised fields such as date and time. The information is collected after submission in an online Google spreadsheet. While in this exercise we connect directly to the sheet, any survey data can also be downloaded and then imported into R as a CSV file or an MS Excel file. This way, R can be used to analyse data from any online survey tool that provides these kinds of export options.
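As a minimal sketch of the download-and-import route, the following writes a tiny made-up export to a temporary CSV file and reads it back with read.csv. The file and its three columns are invented for illustration; a real export would simply be read from wherever it was saved. Excel files need an additional package such as readxl.

```r
# Stand-in for a downloaded export: write a tiny made-up survey to a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
toy_export <- data.frame(Respondent = 1:3,
                         Gender = c("M", "F", "M"),
                         Marks = c(63, 57, 79))
write.csv(toy_export, csv_path, row.names = FALSE)

# Import it the same way a real download would be imported.
survey_from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
str(survey_from_csv)
```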

In order to import the data from Google Forms we first need to set the sharing permissions of the sheet so that anybody with the link can view it. Afterwards, we copy the URL and record it in R. Please define survey_url <- "https://docs.google.com/spreadsheets/d/1K5uS3vCcvOUrojKb-uQRB_vhfNeptes-t-VYypf3GlY/".

survey_url <- "https://docs.google.com/spreadsheets/d/1K5uS3vCcvOUrojKb-uQRB_vhfNeptes-t-VYypf3GlY/"

Next we load a library to retrieve a Google sheet. Run library(gsheet).

library(gsheet)

Now, we load the sheet into a data frame with student_survey <- read.csv(text = gsheet2text(survey_url), stringsAsFactors = FALSE). The new function here is gsheet2text, which does what it says on its label. Please note that in this session we often talk about variables or groups of them rather than features, as this is the label given to columns in data frames in statistics. Features as labels for the data frame columns are more commonly used in machine learning.

student_survey <- read.csv(text = gsheet2text(survey_url) , stringsAsFactors = FALSE)

## No encoding supplied: defaulting to UTF-8.

Take a look at the imported data with View(student_survey).

View(student_survey)

As always when we import data, a good first step is to check its structure with str(student_survey).

str(student_survey)

## 'data.frame':    500 obs. of  12 variables:
##  $ Respondent        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  30 30 25 28 26 18 25 29 29 21 ...
##  $ Gender            : chr  "M" "M" "M" "F" ...
##  $ Height            : int  179 179 182 164 192 172 188 194 182 171 ...
##  $ Overall           : chr  "Neutral" "Strongly Agree" "Agree" "Strongly Agree" ...
##  $ Tutor             : chr  "Agree" "Agree" "Strongly Agree" "Strongly Agree" ...
##  $ Content           : chr  "Strongly Agree" "Agree" "Agree" "Agree" ...
##  $ Assessment        : chr  "Agree" "Agree" "Strongly Agree" "Strongly Agree" ...
##  $ Environment       : chr  "Neutral" "Agree" "Agree" "Agree" ...
##  $ Preparation_Hours : int  4 2 9 7 5 3 7 2 6 4 ...
##  $ Participation_Mins: int  20 31 46 47 270 164 24 104 221 24 ...
##  $ Marks             : int  63 57 79 74 65 58 73 57 68 62 ...

We can see that read.csv has already recognised most of the columns correctly as either numbers or characters. However, there are 6 columns that should be factors and are stored as characters at the moment. Next, we transform these into factors. The first one is the Gender column and is easy. We simply run student_survey$Gender <- factor(student_survey$Gender), which updates the Gender column.

student_survey$Gender <- factor(student_survey$Gender)

Columns 5 to 9 are slightly more complicated. They stand for so-called Likert scales (https://en.wikipedia.org/wiki/Likert_scale). In many surveys, respondents can express preferences (or dislikes) using Likert scales. In our case, students were asked to assess (Q1) the general performance of the module, (Q2) the tutor, (Q3) the content, (Q4) the assessments and (Q5) the environment. You can print out all the Likert questions by simply typing in questions now.

questions

## $Overall
## [1] "Overall I liked the class"
## 
## $Tutor
## [1] "The tutor was well prepared"
## 
## $Content
## [1] "The content of the class was what I expected"
## 
## $Assessment
## [1] "The assessments were appropriate"
## 
## $Environment
## [1] "The teaching environment was good"

Each of the five evaluation questions was assessed using a typical five-level Likert scale. The levels are saved as l. Type in l.

l

## [1] "Strongly Disagree" "Disagree"          "Neutral"          
## [4] "Agree"             "Strongly Agree"

Using a combination of lapply and factor we can transform each of the five assessment columns into factors with the levels l. Run student_survey[,5:9] <- data.frame(lapply(student_survey[,5:9], function(x) factor(x, levels = l))) to first select columns 5 to 9 from student_survey, then convert these to factors using the l levels and finally assign the results back to student_survey. Make sure that both sides select the same five columns: if the right-hand side covered only 5:8, the Environment column would not be converted from its own answers.

student_survey[,5:9] <- data.frame(lapply(student_survey[,5:9], function(x) factor(x, levels = l)))
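The same lapply and factor pattern can be tried on a small made-up data frame. This is only a sketch: the toy answers and the names toy, likert_levels and likert_cols are invented for illustration. It also shows that an answer which does not match any level exactly becomes NA, which is worth checking after any such conversion.

```r
likert_levels <- c("Strongly Disagree", "Disagree", "Neutral",
                   "Agree", "Strongly Agree")

# Made-up answers; the last Tutor answer is deliberately misspelt (lower case).
toy <- data.frame(
  Respondent = 1:4,
  Overall = c("Agree", "Neutral", "Strongly Agree", "Disagree"),
  Tutor   = c("Agree", "Agree", "Strongly Disagree", "agree"),
  stringsAsFactors = FALSE
)

# Convert the selected columns to factors that share the same ordered set of levels.
likert_cols <- c("Overall", "Tutor")
toy[likert_cols] <- lapply(toy[likert_cols], factor, levels = likert_levels)

str(toy)
as.integer(toy$Overall)   # 4 3 5 2: positions of the answers within likert_levels
sum(is.na(toy$Tutor))     # 1: the misspelt answer matched no level
```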

Running str(student_survey) once more will convince us that all the transformations performed as expected and that we are ready to continue with the exploratory analysis.

str(student_survey)

## 'data.frame':    500 obs. of  12 variables:
##  $ Respondent        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age               : int  30 30 25 28 26 18 25 29 29 21 ...
##  $ Gender            : Factor w/ 2 levels "F","M": 2 2 2 1 2 1 2 2 2 1 ...
##  $ Height            : int  179 179 182 164 192 172 188 194 182 171 ...
##  $ Overall           : Factor w/ 5 levels "Strongly Disagree",..: 3 5 4 5 3 5 5 4 5 4 ...
##  $ Tutor             : Factor w/ 5 levels "Strongly Disagree",..: 4 4 5 5 4 3 5 1 5 1 ...
##  $ Content           : Factor w/ 5 levels "Strongly Disagree",..: 5 4 4 4 3 2 3 4 2 1 ...
##  $ Assessment        : Factor w/ 5 levels "Strongly Disagree",..: 4 4 5 5 2 3 5 5 3 3 ...
##  $ Environment       : Factor w/ 5 levels "Strongly Disagree",..: 3 5 4 5 3 5 5 4 5 4 ...
##  $ Preparation_Hours : int  4 2 9 7 5 3 7 2 6 4 ...
##  $ Participation_Mins: int  20 31 46 47 270 164 24 104 221 24 ...
##  $ Marks             : int  63 57 79 74 65 58 73 57 68 62 ...

We begin our investigation with a distribution analysis of the students' age. First, we plot a histogram with hist(student_survey$Age, main = "Age of Students", col = "lightskyblue", xlab = "Age (years)", freq = FALSE). This should be very familiar to you by now. With freq = FALSE, we tell hist to plot probability densities rather than absolute counts. Just try the command and check the y-axis, which now shows densities for the bins rather than frequencies.

hist(student_survey$Age, main = "Age of Students", col="lightskyblue",  xlab="Age (years)", freq = FALSE)

These kinds of plots are also called density plots. They are very popular among statisticians because they provide a more effective way to view the distribution of a variable (http://www.statmethods.net/graphs/density.html). So-called kernel density plots overcome the discreteness of the histogram bins by centring a smooth kernel function at each data point and then summing these to get a density estimate. We can easily add a kernel density plot to the histogram with the density function. We add the density plot as a line plot and enter lines(density(student_survey$Age), col = "red", lty = 2, lwd = 2).

hist(student_survey$Age, main = "Age of Students", col="lightskyblue",  xlab="Age (years)", freq = FALSE)
lines(density(student_survey$Age), col="red", lty=2, lwd=2)
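To see what "centring a kernel at each data point and summing" means in practice, here is a minimal sketch that computes a Gaussian kernel density estimate by hand and compares it with density(). The ages are simulated (an assumption, not the survey data), and the bandwidth comes from bw.nrd0, the rule density() uses by default.

```r
set.seed(1)
ages <- round(rnorm(200, mean = 25, sd = 3))   # simulated ages, not the survey data
h <- bw.nrd0(ages)                             # default bandwidth rule of density()

# Density estimate at x0: the average of Gaussian bumps centred at each observation.
kde_at <- function(x0) mean(dnorm(x0, mean = ages, sd = h))

grid <- seq(min(ages), max(ages), length.out = 50)
manual <- sapply(grid, kde_at)

# density() evaluates the same estimate with a fast approximation on its own grid,
# so we interpolate its result onto our grid before comparing.
builtin <- density(ages, bw = h)
builtin_on_grid <- approx(builtin$x, builtin$y, xout = grid)$y

max(abs(manual - builtin_on_grid))   # a very small number
```

A larger bandwidth h gives a smoother curve, a smaller one follows the individual data points more closely.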

But we prefer of course ggplot graphs for more advanced plotting. Load the library with library(ggplot2).

library(ggplot2)

To plot both histogram and density plot, we use two geoms. We already know geom_histogram but this time we use for the y aesthetic the keyword ..density.. instead of the default count. geom_density adds the kernel density curve. Please enter ggplot(student_survey, aes(x=Age)) + geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") + geom_density(alpha=.2, fill="#FF6666") + ggtitle("Age of Students") + xlab("Age (years)"). The plot is taken from an online R cookbook (http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/#histogram-and-density-plots), where further details are explained. Recent versions of ggplot2 prefer the spelling after_stat(density) over ..density.. and may print a deprecation warning for the older form.

ggplot(student_survey, aes(x=Age)) + geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") + geom_density(alpha=.2, fill="#FF6666") + ggtitle("Age of Students") + xlab("Age (years)")

The ggplot clearly demonstrates the advantages of a kernel density plot. Compared to the histogram, the trends of the distribution and the changes between the ages become much more apparent.

In descriptive, exploratory data analysis we often want to compare several populations with each other. In our case, we might, for instance, try to understand differences between male and female attitudes to teaching and split up the investigation into the two genders. Let us first compare the heights of the two genders with a facet-wrap. Run ggplot(student_survey, aes(x=Height)) + geom_density() + facet_wrap(~Gender) + ggtitle("Height of Students") + xlab("Height (cm)"). It will show that men are in general taller than women. Later in this session, we will try to generalise these kinds of relationships beyond the survey data with inferential statistics. To this end, we will detail the application scenarios for the t-test, logistic regression, etc.

ggplot(student_survey, aes(x=Height)) + geom_density() + facet_wrap(~Gender) + ggtitle("Height of Students") + xlab("Height (cm)")

Continuing with our exploratory data analysis, we compare age and gender of students by plotting two density plots on top of each other. The alpha transparency parameter produces nice overlap effects. Try ggplot(student_survey,aes(x=Age, color=Gender)) + geom_density(data=subset(student_survey, Gender=="F"), fill="red", alpha=0.2) + geom_density(data=subset(student_survey, Gender=="M"), fill="green", alpha=0.2) + ggtitle("Age of Students") + xlab("Age (years)").

ggplot(student_survey,aes(x=Age, color=Gender)) + geom_density(data=subset(student_survey, Gender=="F"), fill="red", alpha=0.2) + geom_density(data=subset(student_survey, Gender=="M"), fill="green", alpha=0.2) + ggtitle("Age of Students") + xlab("Age (years)")

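We can see in the plot that there is no obvious difference in age with respect to gender. Note that each density curve is normalised to an area of one, so the plot itself says nothing about how many men and women are in the module; table(student_survey$Gender) returns those counts.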
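Group sizes and group averages are easy to check numerically alongside such plots. The following sketch uses a handful of made-up values (not the survey data) to show table for counts and tapply for per-group means.

```r
# Made-up toy data for illustration.
toy <- data.frame(
  Gender = factor(c("F", "M", "M", "F", "M", "M")),
  Age    = c(24, 30, 25, 28, 26, 21)
)

table(toy$Gender)                    # F: 2, M: 4
tapply(toy$Age, toy$Gender, mean)    # F: 26, M: 25.5
```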

With the next two plots we begin to investigate the core of our survey, the module assessments of the students. As discussed earlier, the module assessments follow a Likert scale and are ordinal data. A good plot for such categorical data is the dodged bar plot. Check out the above cookbook link again. Enter ggplot(student_survey, aes(x=Tutor, fill=Gender)) + geom_bar(stat="count", position = "dodge") + ggtitle(questions$Tutor). This kind of graph is often used to compare two populations, in this case the genders, and their attitudes.

ggplot(student_survey, aes(x=Tutor, fill=Gender)) + geom_bar(stat="count", position = "dodge") + ggtitle(questions$Tutor)

A bit more exciting is the coxcomb plot, combining bar plot and polar coordinates (http://docs.ggplot2.org/current/coord_polar.html). Run ggplot(student_survey, aes(x = Overall, fill = Overall)) + geom_bar(width = 1) + ggtitle(questions$Overall) + guides(fill = "none") + coord_polar(theta = "x"). It looks pretty but might be a bit more difficult to read than more traditional bar plots.

ggplot(student_survey, aes(x = Overall, fill = Overall)) + geom_bar(width = 1) + ggtitle(questions$Overall) + guides(fill = "none") + coord_polar(theta = "x")

Bar and coxcomb plots work well with Likert data. The exploration of Likert level scales in R is further supported by the dedicated R Likert package (https://cran.r-project.org/web/packages/likert/likert.pdf). With this package, we can produce beautiful plots that go deep into the details of the students’ assessments. Load the library with library(likert).

library(likert)

In order to explore the statistics of the five assessment questions, we can use the function likert, which provides various statistics about a set of likert items. Please, define student_survey_l <- likert(student_survey[,5:9]).

student_survey_l <- likert(student_survey[,5:9])

Entering student_survey_l returns the proportional distribution of all the answers in the five assessment questions.

student_survey_l

##          Item Strongly Disagree Disagree Neutral Agree Strongly Agree
## 1     Overall              11.6     10.2    23.8  23.8           30.6
## 2       Tutor              15.0     22.8    13.8  25.4           23.0
## 3     Content               7.0     23.0    28.4  30.8           10.8
## 4  Assessment               7.6     18.2    17.6  41.4           15.2
## 5 Environment              11.6     10.2    23.8  23.8           30.6

summary(student_survey_l) displays other typical statistics for Likert scale data such as mean, standard deviation, etc. Try it.

summary(student_survey_l)

##          Item  low neutral high  mean       sd
## 4  Assessment 25.8    17.6 56.6 3.384 1.167592
## 1     Overall 21.8    23.8 54.4 3.516 1.328636
## 5 Environment 21.8    23.8 54.4 3.516 1.328636
## 2       Tutor 37.8    13.8 48.4 3.186 1.404047
## 3     Content 30.0    28.4 41.6 3.154 1.108486
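The summary columns can be reproduced by hand, which helps to see what they mean. The sketch below starts from the Assessment percentages printed above and assumes that the five levels are coded 1 to 5 and that all 500 respondents answered. Under these assumptions, low and high turn out to be the two disagreeing and the two agreeing categories added together.

```r
# Percentages for the Assessment item, copied from the likert output above.
pct <- c("Strongly Disagree" = 7.6, "Disagree" = 18.2, "Neutral" = 17.6,
         "Agree" = 41.4, "Strongly Agree" = 15.2)
n <- 500        # number of respondents (assumes no missing answers)
scores <- 1:5   # levels coded 1 (Strongly Disagree) to 5 (Strongly Agree)

low  <- sum(pct[1:2])                   # 25.8
high <- sum(pct[4:5])                   # 56.6
mean_score <- sum(scores * pct) / 100   # 3.384

# Sample standard deviation reconstructed from the proportions (about 1.1676).
sd_score <- sqrt(sum(pct / 100 * (scores - mean_score)^2) * n / (n - 1))

c(low = low, high = high, mean = mean_score, sd = sd_score)
```

Treating Likert levels as the numbers 1 to 5 is a common convention, but it assumes equal distances between the levels, so the mean and standard deviation should be read with some care.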

A simple overview plot can be produced with plot(student_survey_l) + ggtitle("Overview of Student Survey"). Please note the integration of ggplot commands.

plot(student_survey_l) + ggtitle("Overview of Student Survey")

You can also visualise the data as our favourite density curve with the type parameter. Enter plot(student_survey_l, type="density"). This looks good, though we might want to override the default colouring. Do you know how?

plot(student_survey_l, type="density")

Another option is a pretty heatmap showing the distribution of answers in one plot, which you can generate with plot(student_survey_l, type = "heat").

plot(student_survey_l, type = "heat")

Finally, let us use the likert package to set groups side by side. Let us compare the assessments of females and males using likert’s grouping parameter. Simply run student_survey_l_gender <- likert(student_survey[,5:9], grouping=student_survey$Gender).

student_survey_l_gender <- likert(student_survey[,5:9], grouping=student_survey$Gender)

The plot is generated with plot(student_survey_l_gender) and displays an overview of how the answers are distributed for each gender across all assessments.

plot(student_survey_l_gender)

This concludes our discussion of Likert scale data.

Next we move on to inferential statistics to draw further conclusions on the students’ performance as expressed in the Marks variable of student_survey. Statistical inference is one of the core activities of data analysis. There is an excellent chapter on statistical inference in (Schutt/O’Neil, 2013, Doing data science - Straight talk from the frontline. O’Reilly Media, Inc.). The authors state that “[t]he world we live in is complex, random, and uncertain. At the same time, it’s one big data-generating machine. (…). Statistical inference is the discipline that concerns itself with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.” (p. 19). Canning (2014) beautifully summarises statistical inference as drawing conclusions through evidence. He provides in Chapter 7 of his book a number of examples of how inferences can fail if we do not follow the rules of statistical (and logical) reasoning.

Every statistical inference project starts again with a detailed descriptive analysis in order to find clues for the relationships that statistical inference tries to discover. We have already presented descriptive analytics methods and visualisations for ordinal data such as Likert scales. Next, we present common methods for numerical data such as the Marks variable. We begin with the worst mark. Please, run min(student_survey$Marks). Nobody failed our module, which is good!

min(student_survey$Marks)

## [1] 50

The best mark is returned by max(student_survey$Marks).

max(student_survey$Marks)

## [1] 85

The range of marks is displayed with range(student_survey$Marks).

range(student_survey$Marks)

## [1] 50 85

The marks’ average can be calculated with mean(student_survey$Marks).

mean(student_survey$Marks)

## [1] 66.182

Finally, in order to calculate the standard deviation (https://en.wikipedia.org/wiki/Standard_deviation) as a measure of the spread of the marks we enter sd(student_survey$Marks). It is ~8. Roughly speaking, the bulk of the marks therefore lies between 58 (mean - sd) and 74 (mean + sd).

sd(student_survey$Marks)

## [1] 8.443664
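How much of the data falls within one standard deviation of the mean can be checked directly. The sketch below uses simulated, normally distributed marks (an assumption; the real marks may deviate from this) and counts the share of values between mean - sd and mean + sd, which is roughly two thirds for normal data.

```r
set.seed(123)
marks_sim <- rnorm(10000, mean = 66, sd = 8)   # simulated marks, not the survey data

# TRUE for every mark that lies within one standard deviation of the mean.
within_one_sd <- abs(marks_sim - mean(marks_sim)) <= sd(marks_sim)
mean(within_one_sd)   # close to 0.68 for normally distributed data
```

The same two lines work on student_survey$Marks once the data is loaded.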

Before we go into the details of statistical inference and reasoning, let us discuss a brief example of a more complex descriptive analysis of the data than single values such as mean and standard deviation. We would like to describe and explore the likelihood of marks given the evidence of the observed marks from our module. We start by plotting the histogram of marks with hist(student_survey$Marks, main = "Student Marks", xlab = "Marks", col = "lightgrey"). This time we use the count-based version.

hist(student_survey$Marks, main = "Student Marks", xlab = "Marks", col = "lightgrey")

We would like to determine what it takes to reach a 70 (distinction) in class. How likely is this and what percentage of students get a distinction? Let us draw a red line to indicate the distinction mark in the histogram with abline(v=70, col="red", lwd = 4).

hist(student_survey$Marks, main = "Student Marks", xlab = "Marks", col = "lightgrey")
abline(v=70, col="red", lwd = 4)

Looking at the red line, it becomes clear that we need to add up all the grey blocks on the right of the red line. We could do this manually, but fortunately R has a function for the empirical cumulative distribution function. The cumulative distribution function (cdf) gives the probability that the variable takes a value less than or equal to x. The empirical cumulative distribution function delivers these probabilities for an empirical distribution like our module's marks. Run marks_ecdf <- ecdf(student_survey$Marks) to define the empirical cumulative distribution (https://cran.r-project.org/doc/manuals/R-intro.html).

marks_ecdf <- ecdf(student_survey$Marks)

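marks_ecdf(70) returns the proportion of students with a mark of 70 or less (to the left of the red line in the histogram, including 70 itself), while 1 - marks_ecdf(70) returns the proportion with a mark above 70. Try 1 - marks_ecdf(70) and you will see that almost a third of our class scored above 70, which is clearly an indicator of our excellent teaching. Because marks are whole numbers, students with exactly 70 are counted on the left; 1 - marks_ecdf(69) would include them among the distinctions.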

1 - marks_ecdf(70)

## [1] 0.326
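What ecdf computes can be verified on a few made-up marks: the value at x is simply the share of observations that are less than or equal to x. The ten marks below are invented for illustration.

```r
marks <- c(52, 58, 61, 66, 70, 70, 74, 79, 85, 63)   # made-up marks
toy_ecdf <- ecdf(marks)

toy_ecdf(70)          # 0.7: share of marks less than or equal to 70
mean(marks <= 70)     # 0.7: the same value computed by hand
1 - toy_ecdf(70)      # 0.3: share of marks strictly above 70
mean(marks >= 70)     # 0.5: share of marks of 70 or more
```

The last two lines differ because two students scored exactly 70, which is why the treatment of the boundary value matters.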

The quantile function produces the complementary output of how all the observations divide up into groups. We know it already, but just to repeat: it offers a simple way to calculate the partitions of data. quantile(student_survey$Marks, c(0.2, 0.5, 0.8)), for instance, returns the marks below which the lowest 20%, 50% and 80% of the students fall.

quantile(student_survey$Marks, c(0.2, 0.5, 0.8))

## 20% 50% 80% 
##  58  66  74
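Quantiles and the empirical cumulative distribution answer opposite questions: quantile goes from a proportion to a mark, ecdf from a mark to a proportion. The sketch below shows this round trip on ten made-up marks, using R's default quantile definition, which interpolates between neighbouring observations.

```r
marks <- c(52, 58, 61, 66, 70, 70, 74, 79, 85, 63)   # made-up marks

q <- quantile(marks, c(0.2, 0.5, 0.8))
q                         # 20%: 60.4, 50%: 68, 80%: 75

# Feeding the quantiles back into the ecdf recovers the proportions.
ecdf(marks)(unname(q))    # 0.2 0.5 0.8
```

With tied or heavily rounded values the round trip is not always this exact, because several students can share the same mark.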

This concludes our discussion of the descriptive analysis of the students’ marks. We move on to investigate how we can generalise relationships with statistical inference and tests.

Let us next take a look once more at the marks' distribution with hist(student_survey$Marks, main = "Student Marks", xlab = "Marks", freq = FALSE). We can see that the marks are distributed fairly symmetrically around an average of ~66 marks.

hist(student_survey$Marks, main = "Student Marks", xlab = "Marks", freq = FALSE)

This time, we can add a kernel density plot for the marks with lines(density(student_survey$Marks), col="red", lty=2, lwd=2). The resulting plot is similar to what is known as a normal distribution (https://en.wikipedia.org/wiki/Normal_distribution).

hist(student_survey$Marks, main = "Student Marks", xlab = "Marks", freq = FALSE)
lines(density(student_survey$Marks), col="red", lty=2, lwd=2)

A normal distribution occurs commonly in nature. The marks in our module more or less follow a normal distribution. Another example is the height of men and women in the module. “If a dataset is normally distributed the mean, median and mode will coincide. Most of the observations will be near this central point with a smaller number far away from the central point. Imagine a crowd of people and consider their heights. Most people seem to be similar height, give or take a few centimetres. A small number of the people are clearly shorter or taller than the average and an even smaller number of people seem to be very short or very tall.” (Canning 2014, p. 25).

I have prepared a test function for you to visualise the typical characteristics of normal curves. Please enter plot_normal().

plot_normal()

The plot contains three normal distributions with three different standard deviations but the same mean. If these stood for the distribution of marks, then the dark red curve would stand for a module with a wide spread of marks, and the black one for a module with marks that are fairly close to each other. From all of these, we can see that if a dataset is normally distributed the mean and median will coincide and that most of the observations will be near the centre with fewer further away. The normal distribution is one of the most famous concepts of statistical science and takes centre stage in many types of statistical reasoning and hypothesis testing.
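The helper plot_normal() is provided by the session and its code is not shown, so the following is only a guess at how such a plot could be drawn in base R. The function name plot_normal_sketch, the three standard deviations and the colours are assumptions chosen to match the description above.

```r
# Hypothetical sketch; the session's own plot_normal() is not shown in the text.
plot_normal_sketch <- function(mu = 0, sds = c(0.5, 1, 2),
                               cols = c("black", "red", "darkred")) {
  x <- seq(mu - 4 * max(sds), mu + 4 * max(sds), length.out = 400)
  # One column of density values per standard deviation.
  dens <- sapply(sds, function(s) dnorm(x, mean = mu, sd = s))
  matplot(x, dens, type = "l", lty = 1, lwd = 2, col = cols,
          main = "Normal distributions with the same mean",
          xlab = "x", ylab = "Density")
  legend("topright", legend = paste("sd =", sds), col = cols, lty = 1, lwd = 2)
  invisible(dens)
}

dens <- plot_normal_sketch()
```

The narrowest curve has the highest peak because the area under every density curve is one.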

Canning (2014) explains in Chapter 8 the rationale behind the normal distribution, statistical tests and several other key statistical concepts to make sure that statistical reasoning is sound. His first key concept is the null hypothesis. Each time we conduct a statistical test we start with two hypotheses. The first one is the null hypothesis, which states that any difference in the data is due to chance. The second one is the alternative hypothesis, which states that there really is a difference or something exciting in the data.

The aim of hypothesis testing is generally to be able to reject the null hypothesis and accept the alternative hypothesis. The first aspect of accepting or rejecting the null hypothesis we would like to discuss is the confidence level. Can we be 50% sure that we can reject the null hypothesis? 90% sure? 99.9% sure? Please note that, in order to assess the reliability of our conclusions, statisticians also speak of significance, which means that the observations we have made are unlikely to have occurred by chance.
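To make the idea of rejecting a null hypothesis concrete, here is a minimal sketch with simulated heights. The group means, the standard deviation and the conventional 0.05 threshold are assumptions for illustration. The null hypothesis is that both groups have the same average height; t.test returns a p-value, the probability of seeing a difference at least this large if the null hypothesis were true.

```r
set.seed(2024)
height_m <- rnorm(100, mean = 182, sd = 7)   # simulated heights, not the survey data
height_f <- rnorm(100, mean = 168, sd = 7)

# Welch two-sample t-test (the default of t.test for two groups).
test <- t.test(height_m, height_f)

test$p.value          # extremely small here, as the simulated difference is large
test$p.value < 0.05   # TRUE: we reject the null hypothesis at the 5% level
test$conf.int         # 95% confidence interval for the difference in means
```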

How to determine the confidence level is best explained with an example from our survey data. We would like to concentrate on the height of men in the module and establish first the relevant heights with height_men_class <- subset(student_survey, Gender == "M")$Height.

height_men_class <- subset(student_survey, Gender == "M")$Height

Let us take a look at the distribution by plotting hist(height_men_class, main = "Height of Men in the Class", xlab = "Height (cm)"). Once again, the normal distribution jumps out at us. We can see the typical shape of a normal distribution of heights with a few outliers of especially tall men.

hist(height_men_class, main = "Height of Men in the Class", xlab = "Height (cm)")

If you read through the Wikipedia article on confidence intervals, you will see that a confidence interval is a range of values that expresses how confident we can be in the mean or other population parameters (https://en.wikipedia.org/wiki/Confidence_interval). In this part of the session, we would like to show, using the height of men, that increasing the sample size reduces the standard error of the mean, so that the confidence interval becomes narrower. Let us concentrate on the first 100 men in our module and define their mean with mean_100 <- mean(height_men_class[1:100]).

mean_100 <- mean(height_men_class[1:100])

The standard deviation of the first 100 is sd_100 <- sd(height_men_class[1:100]).

sd_100 <- sd(height_men_class[1:100])

The standard error of the mean estimates how far the sample mean is likely to lie from the population mean. It is the standard deviation divided by the square root of the sample size. Let us run sem_100 <- sd_100 / sqrt(length(height_men_class[1:100])).

sem_100 <- sd_100 / sqrt(length(height_men_class[1:100]))
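That the standard deviation divided by the square root of the sample size really describes the spread of sample means can be checked with a small simulation. The population below is simulated (mean 180 cm, standard deviation 7 cm are assumptions); we draw many samples of 100 and look at how much their means vary.

```r
set.seed(42)
population <- rnorm(100000, mean = 180, sd = 7)   # simulated population of heights
n <- 100

# Draw 2000 samples of size n and record the mean of each.
sample_means <- replicate(2000, mean(sample(population, n)))

sd(sample_means)             # observed spread of the sample means
sd(population) / sqrt(n)     # standard error predicted by the formula, about 0.7
```

The two numbers agree closely, and both shrink when n is increased.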

For a 95% confidence level, the interval extends roughly 1.96 standard errors on either side of the sample mean; 1.96 is the matching quantile of the normal distribution. The upper boundary of the confidence interval is then upper_boundary <- mean_100 + (1.96 * sem_100).

upper_boundary <- mean_100 + (1.96 * sem_100)

The lower boundary is lower_boundary <- mean_100 - (1.96 * sem_100).

lower_boundary <- mean_100 - (1.96 * sem_100)

Together the boundaries define the confidence interval. Let us add abline(v = upper_boundary, col="red", lwd = 4) to the histogram.

hist(height_men_class, main = "Height of Men in the Class", xlab = "Height (cm)")
abline(v = upper_boundary, col="red", lwd = 4)

And also abline(v = lower_boundary, col="red", lwd = 4).

hist(height_men_class, main = "Height of Men in the Class", xlab = "Height (cm)")
abline(v = upper_boundary, col="red", lwd = 4)
abline(v = lower_boundary, col="red", lwd = 4)

Using a sample of 100, we can say with about 95% confidence that the average height lies between these two boundaries. If we increase the sample size, the interval becomes narrower. Let us start again with a sample size of 200 and set mean_200 <- mean(height_men_class[1:200]).

mean_200 <- mean(height_men_class[1:200])

Followed by sd_200 <- sd(height_men_class[1:200]).

sd_200 <- sd(height_men_class[1:200])

And finally sem_200 <- sd_200 / sqrt(length(height_men_class[1:200])).

sem_200 <- sd_200 / sqrt(length(height_men_class[1:200]))

The upper boundary is then upper_boundary_200 <- mean_200 + (1.96 * sem_200). We use new variable names so that the boundaries for the sample of 100 stay available for comparison.

upper_boundary_200 <- mean_200 + (1.96 * sem_200)

The lower boundary is lower_boundary_200 <- mean_200 - (1.96 * sem_200).

lower_boundary_200 <- mean_200 - (1.96 * sem_200)

We first plot the upper boundary of the new confidence interval with abline(v = upper_boundary_200, col="darkred", lwd = 4).

hist(height_men_class, main = "Height of Men in the Class", xlab = "Height (cm)")
abline(v = upper_boundary, col="red", lwd = 4)
abline(v = lower_boundary, col="red", lwd = 4)
abline(v = upper_boundary_200, col = "darkred", lwd = 4)

Adding abline(v = lower_boundary_200, col = "darkred", lwd = 4), too, reveals that our confidence has really increased, as the interval got narrower.

hist(height_men_class, main = "Height of Men in the Class", xlab = "Height (cm)")
abline(v = upper_boundary, col="red", lwd = 4)
abline(v = lower_boundary, col="red", lwd = 4)
abline(v = upper_boundary_200, col = "darkred", lwd = 4)
abline(v = lower_boundary_200, col="darkred", lwd = 4)

This exercise visualises nicely how to improve the confidence in our inferential statistics. With an increased sample size, we can be more confident about making general statements about what the mean will look like for the whole student population.
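A common textbook form of the confidence interval for a mean is the sample mean plus or minus a normal quantile (about 1.96 for 95%) times the standard error of the mean. The sketch below wraps this in a small helper and applies it to simulated heights. The helper name ci_mean and the simulated values are assumptions; for small samples the t-based interval returned by t.test is the more cautious choice.

```r
# Normal-approximation confidence interval for the mean (illustrative helper).
ci_mean <- function(x, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)       # about 1.96 for a 95% level
  sem <- sd(x) / sqrt(length(x))        # standard error of the mean
  c(lower = mean(x) - z * sem, upper = mean(x) + z * sem)
}

set.seed(7)
heights <- rnorm(300, mean = 184, sd = 7)   # simulated heights, not the survey data

ci_100 <- ci_mean(heights[1:100])
ci_200 <- ci_mean(heights[1:200])

unname(diff(ci_100))   # width of the interval for 100 observations
unname(diff(ci_200))   # narrower width for 200 observations

# The t-based interval from t.test is slightly wider than the normal approximation.
t.test(heights[1:100])$conf.int
```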

Let’s test our knowledge next.

What are rating scales in surveys also called?

Likert
Lakert
Lambert
Don’t know

Likert

Print out the mean age of students.

mean(student_survey$Age)

## [1] 24.85

Using the aggregate function, print out the average age (FUN = mean) of students per level of Overall assessment of the module.

aggregate(student_survey$Age ~ student_survey$Overall, FUN = mean)

##   student_survey$Overall student_survey$Age
## 1      Strongly Disagree           23.56897
## 2               Disagree           25.03922
## 3                Neutral           24.89916
## 4                  Agree           25.47059
## 5         Strongly Agree           24.75163

Create a histogram and density plot of the height of students. Use the same ggplot as for the age distribution, but for student heights. The title should be “Height of Students” and x-axis should be called “Height (cm)”.

ggplot(student_survey, aes(x=Height)) + geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") + geom_density(alpha=.2, fill="#FF6666") + ggtitle("Height of Students") + xlab("Height (cm)")

Create a density plot of student_survey_l_gender.

plot(student_survey_l_gender, type = "density")

In a single plot using subset and hist together create a histogram of the female students’ marks captured in the Marks variable. Do not name the title and axes.

hist(subset(student_survey, Gender == "F")$Marks)

Let’s move on to the second part of this session.