In the last session we showed how to develop a stronger degree of confidence in the height of the students. But when probabilities are involved there is always a chance that we get it wrong, even with a larger population. Canning (2014) explains that whatever our level of confidence, we can still make two possible errors. The first is to reject the null hypothesis when it is actually true. This kind of error is called a false positive: it would mean concluding, for instance, that you did not pass a difficult module when you actually passed it. In statistics, this is called a Type I error. The second error we can make is the false negative, where we accept the null hypothesis even though it is actually false. In statistics, this is called a Type II error.
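
To make the Type I error concrete, here is a minimal simulation sketch (our own illustration, not from Canning): we draw many samples for which the null hypothesis is true by construction and count how often a standard test wrongly rejects it at the 5% level.

set.seed(42)  # for reproducibility
# Draw 10,000 samples of size 30 from a normal distribution with mean 0,
# so the null hypothesis of t.test (true mean equals 0) is true every time
p_values <- replicate(10000, t.test(rnorm(30))$p.value)
# Proportion of false positives (Type I errors); it should be close to 0.05
mean(p_values < 0.05)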

Statistical reasoning cannot eliminate these errors, but it delivers a set of summary values that allow you to assess the risk that you got it wrong. In R, these are often produced at the same time as the results of the inference. The most commonly cited value used to verify the results of statistical inference is the p-value. The p-value approach decides between “likely” and “unlikely” by determining the probability, assuming the null hypothesis were true, of observing a test statistic at least as extreme, in the direction of the alternative hypothesis, as the one actually observed. If the p-value is small, the observation is unlikely under the null hypothesis; if the p-value is large, it is likely. Let’s go through an example next to make this clearer, and in particular what a large p-value is.

Let us assume, for instance, that we want to find out whether it is common to get 80 marks in the module. This means our null hypothesis is that it is common to get 80 in this module. To get the p-value, we first map the observed distribution onto our old friend the normal distribution by shifting the centre of the curve to the mean and then using the standard deviation to measure the distance from this centre. https://statistics.laerd.com/statistical-guides/standard-score.php explains this process very well.
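
In R, this shift-and-scale step is a one-liner. The sketch below (assuming the student_survey data frame from the earlier sessions) computes the standard score of a mark of 80, i.e. how many standard deviations it lies from the mean:

# Distance of a mark of 80 from the mean, in units of standard deviations
(80 - mean(student_survey$Marks)) / sd(student_survey$Marks)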

The p-value describes the area of the normal distribution’s curve where the very unlikely results lie. We can calculate the p-value of 80 using R’s pnorm, which takes the value, the mean and the standard deviation (https://www.r-bloggers.com/normal-distribution-functions/). pnorm returns the probability of observing a value up to 80, so we need to subtract it from one to get the part of the normal curve that describes the highly unlikely results above 80. Run 1 - pnorm(80, mean(student_survey$Marks), sd(student_survey$Marks)) to calculate this right-hand p-value.

1 - pnorm(80, mean(student_survey$Marks), sd(student_survey$Marks))
## [1] 0.05086819
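
As an aside, pnorm can return the right-hand tail directly via its lower.tail argument, which saves the subtraction and yields the same value:

# Equivalent to 1 - pnorm(...): probability mass to the right of 80
pnorm(80, mean(student_survey$Marks), sd(student_survey$Marks), lower.tail = FALSE)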

You will get back a value of around 0.05. Statisticians often define the significance level to be 5% (or 0.05): if the p-value is less than 5%, the null hypothesis is rejected. As our p-value sits right at around 5% (0.05), we can say that the null hypothesis can be rejected and that it is not common to score 80 marks in this module. A mark of 80 is too far away from the mean to count among the common cases. An excellent further explanation of the p-value can be found at http://www.dummies.com/education/math/statistics/what-a-p-value-tells-you-about-statistical-data/.

The p-value thus uses the mean and standard deviation to relate the empirical observations and their distribution to the normal distribution. If the p-value is very small, it is very unlikely that the observed value is the product of mere chance under the null hypothesis. This almost concludes our discussion in this section. But before we move on to the next topic, we would like to introduce one final concept related to the normal distribution: the z-score, which helps with standardising a distribution.

A popular way of standardising the values of a distribution is the z-score, which expresses each value by its distance from the mean, measured in multiples of the standard deviation. Run plot(student_survey$Marks, scale(student_survey$Marks), xlab = "Marks", ylab = "Distance from Mean") and you will see what I mean with regard to the marks. The z-score is also called the standard score because it represents any value as a standardised distance from the mean. z-scores and p-values often work in conjunction.

plot(student_survey$Marks, scale(student_survey$Marks), xlab = "Marks", ylab = "Distance from Mean")
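
Under the hood, scale() simply subtracts the mean and divides by the standard deviation. A quick sketch to convince yourself:

# Manual z-scores: distance from the mean in standard deviations
z_scores <- (student_survey$Marks - mean(student_survey$Marks)) / sd(student_survey$Marks)
# scale() returns a one-column matrix, so we flatten it before comparing
all.equal(z_scores, as.vector(scale(student_survey$Marks)))  # should be TRUE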

This concludes our discussion of statistical inference for a single variable such as the students’ marks or heights. Next we will continue with statistical inference about the relationships between two or more variables and their groups.

We move on now to investigating how two variables relate to each other using inferential statistics. First we investigate the correlation of two numerical variables (https://en.wikipedia.org/wiki/Correlation_and_dependence) and the dependence it shows. In particular, we will analyse the relationship between the hours spent on the class and the marks. Check out Chapter 13 of Canning (2014) for an overview of understanding variable relationships with statistics. Further statistical tests we introduce in this part are the z-test, the t-test, the chi-square test and the Analysis of Variance (ANOVA). All of these are designed to work out relationships in different contexts, such as for discrete or continuous variables, or for dependent or independent variables. A statistical test helps us to identify differences between two or more groups of data. In our case, we could, for instance, evaluate whether module performance differs with the gender of the students, as sketched below.
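
As a taste of what is to come, such a group comparison could look like the following sketch. Note that the Gender column name is our assumption; your survey data may label it differently.

# Hypothetical two-group comparison: do marks differ between genders?
# t.test's formula interface expects Gender to have exactly two levels
t.test(Marks ~ Gender, data = student_survey)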

Our survey data also contains information about how many hours a week the students spent on the module. Now we can investigate whether those students who spent more time on the class also got better marks. Plot the hours in relation to the marks with plot(student_survey$Preparation_Hours, student_survey$Marks, col = "red", xlab = "Hours", ylab = "Mark", main = "Student Time spent for a Mark").

plot(student_survey$Preparation_Hours, student_survey$Marks, col = "red", xlab = "Hours", ylab = "Mark", main = "Student Time spent for a Mark")
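
To complement the plot with a number, a short sketch: cor computes the Pearson correlation coefficient of the two variables, and cor.test adds a p-value for the hypothesis that the true correlation is zero.

# Strength of the linear relationship between preparation hours and marks
cor(student_survey$Preparation_Hours, student_survey$Marks)
# Significance test of the correlation (null hypothesis: true correlation is 0)
cor.test(student_survey$Preparation_Hours, student_survey$Marks)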