Stat 481: Project 2

Summary of Problem

The Professional Bowlers Association maintains records for each of its bowlers. This project asks two questions about three bowlers in the dataset: whether their average bowling scores differ, and whether there is significant variability in scores among bowlers.

Data Description

The dataset contains an identifying number for each bowler, their scores from various games, and the corresponding game numbers. Each bowler bowled 56 games; the descriptive statistics for their scores follow.

##                Bowler 7 Bowler 8 Bowler 12
## Minimum        156.0000 144.0000  126.0000
## First Quartile 186.7500 193.7500  179.0000
## Median         206.5000 208.0000  190.5000
## Mean           209.5357 208.4286  194.7500
## Third Quartile 229.2500 221.0000  209.5000
## Maximum        276.0000 264.0000  263.0000
##                              Bowler 7 Bowler 8 Bowler 12
## Standard Deviation of Score: 29.69418 25.86529  26.53043

[Figure: histograms and boxplots of the scores of each bowler]

Differences in Average Score

First, it must be determined whether the bowlers differ in mean score. To do this, we regress score on bowler, treating bowler as a categorical predictor.

## 
## Call:
## lm(formula = Score ~ Bowler)
## 
## Coefficients:
## (Intercept)      Bowler8     Bowler12  
##     209.536       -1.107      -14.786
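
As a quick check on how to read these coefficients, the sketch below rebuilds each bowler's fitted mean from the printed estimates. It assumes R's default treatment coding with Bowler 7 as the reference level; the vector names are illustrative and not part of the original analysis.

```r
# Coefficients copied from the lm() output above (rounded as printed).
coefs <- c(intercept = 209.536, Bowler8 = -1.107, Bowler12 = -14.786)

# Under treatment coding, the intercept is the reference group's mean and
# each remaining coefficient is that group's mean minus the reference mean.
fitted_means <- c(
  bowler7  = unname(coefs["intercept"]),
  bowler8  = unname(coefs["intercept"] + coefs["Bowler8"]),
  bowler12 = unname(coefs["intercept"] + coefs["Bowler12"])
)

print(round(fitted_means, 2))
```

The three values agree with the group means in the descriptive statistics table to the printed precision.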

Bowler8 and Bowler12 are dummy variables that equal 1 for the corresponding bowler and 0 otherwise; Bowler 7 is the reference level. To predict the score for Bowler 7, both Bowler8 and Bowler12 are set to 0, so the predicted score is the intercept, 209.536. Each of the other two coefficients is the difference between that bowler's mean score and Bowler 7's. The corresponding ANOVA table for this model is:

## Analysis of Variance Table
## 
## Response: Score
##            Df Sum Sq Mean Sq F value   Pr(>F)   
## Bowler      2   7596  3798.2  5.0538 0.007409 **
## Residuals 165 124004   751.5                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can test whether the mean scores differ among the bowlers using the following hypothesis test:

H0: µ7 = µ8 = µ12
H1: At least one is different
P-value = 0.007409 (from ANOVA table)
P-value is less than α = 0.05, so reject H0
At least one bowler's mean score differs from the others
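
The F statistic and p-value in this test can be reproduced from the mean squares in the ANOVA table. The sketch below does so with base R's `pf()`; the variable names are illustrative, and the inputs are the rounded values printed in the table, so the last digits may differ slightly from the original output.

```r
# Values copied from the ANOVA table above.
ms_treatment <- 3798.2   # Mean Sq for Bowler (MSTR)
ms_error     <- 751.5    # Mean Sq for Residuals (MSE)
df_treatment <- 2
df_error     <- 165

# The F statistic is the ratio of the two mean squares.
f_stat <- ms_treatment / ms_error

# The upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_treatment, df_error, lower.tail = FALSE)

cat("F =", round(f_stat, 3), " p =", signif(p_value, 3), "\n")
```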

Next, we can use a multiple comparison test to determine which specific means are different. In this case, Bonferroni is a reasonable choice because we know ahead of time that we want to run pairwise comparisons, and there are only three of them. The results of this test are:

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  Score and Bowler 
## 
##    7     8    
## 8  1.000 -    
## 12 0.015 0.027
## 
## P value adjustment method: bonferroni

This table gives the adjusted p-value for each pair-specific hypothesis test. For example, the hypothesis test for a difference between µ7 and µ8 is:

H0: µ7 = µ8
H1: µ7 ≠ µ8
P-value = 1
P-value is greater than α = 0.05, so do not reject H0

There is no evidence of a difference between the mean scores of Bowler 7 and Bowler 8.

Similarly, tests for differences between Bowler 7 and Bowler 12 and between Bowler 8 and Bowler 12 can be conducted. Because the corresponding p-values from the table above (0.015 and 0.027) are both less than α = 0.05, we reject the null hypothesis in both cases and conclude that the mean scores differ for both of these pairs.
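
To make the adjustment concrete, the sketch below recomputes the three Bonferroni-adjusted p-values from the summary statistics alone. It assumes the pooled-SD t test reported in the output (pooled variance equal to the ANOVA MSE, 165 residual degrees of freedom); the variable names are illustrative, and the inputs are rounded, so results should match the table only to the printed precision.

```r
# Summary values copied from the outputs above.
means <- c(b7 = 209.5357, b8 = 208.4286, b12 = 194.7500)
n_per_bowler <- 56
ms_error <- 751.5      # pooled variance estimate (MSE) from the ANOVA table
df_error <- 165

# Standard error of a difference between two group means with pooled SD.
se_diff <- sqrt(ms_error * (2 / n_per_bowler))

# All three pairwise differences.
diffs <- c(
  "7 vs 8"  = means[["b7"]] - means[["b8"]],
  "7 vs 12" = means[["b7"]] - means[["b12"]],
  "8 vs 12" = means[["b8"]] - means[["b12"]]
)

# Two-sided p-values from the t distribution on the residual df.
raw_p <- 2 * pt(abs(diffs / se_diff), df_error, lower.tail = FALSE)

# Bonferroni: multiply each p-value by the number of comparisons, cap at 1.
adj_p <- p.adjust(raw_p, method = "bonferroni")

print(signif(adj_p, 2))
```

The cap at 1 explains why the Bowler 7 versus Bowler 8 comparison is reported as exactly 1.000.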

Testing Assumptions

We must check the regression assumptions for the model used above.

Normality of Residuals

The points on the normal Q-Q plot of the residuals fall close to a straight line, suggesting normality. Just to be sure, we can conduct a formal Shapiro-Wilk hypothesis test.

## 
##  Shapiro-Wilk normality test
## 
## data:  fit$residual
## W = 0.98914, p-value = 0.2247

H0: εi ∼ Normal
H1: εi ≁ Normal
P-value = 0.2247
P-value is greater than α = 0.05, so do not reject H0
There is no evidence that the errors depart from a normal distribution, so this assumption is considered met

Equal Variance of Residuals

We rely on this assumption again later when testing for significant variability, so it is important to verify it. We can do so by conducting Levene’s test.

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  1.7429 0.1782
##       165

H0: Population variances are equal
H1: Population variances are not equal
P-value = 0.1782
P-value is greater than α = 0.05, so do not reject H0
There is no evidence that the population variances differ, so this assumption is considered met
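
For readers curious what the median-centered Levene's test computes, it is a one-way ANOVA on each score's absolute deviation from its own group median. The sketch below shows this with base R only. The data are simulated stand-ins with the same layout as the project data (they are not the PBA scores), and the object names are hypothetical.

```r
# Simulated stand-in data: NOT the PBA scores, only the same layout
# (3 bowlers, 56 games each) so the sketch runs on its own.
set.seed(481)
sim <- data.frame(
  Bowler = factor(rep(c(7, 8, 12), each = 56)),
  Score  = round(rnorm(168, mean = 200, sd = 27))
)

# Absolute deviation of each score from its own bowler's median.
group_median <- ave(sim$Score, sim$Bowler, FUN = median)
sim$AbsDev <- abs(sim$Score - group_median)

# A one-way ANOVA on those deviations; its F test is the Levene statistic.
levene_table <- anova(lm(AbsDev ~ Bowler, data = sim))
print(levene_table)
```

The degrees of freedom (2 and 165) match the Levene output above because the layout is the same; the F value and p-value will differ because the scores are simulated.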

Variability in Bowlers

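Next, we can test whether there is evidence of significant variability in mean scores among bowlers. Here the three bowlers are treated as a random sample from a larger population of bowlers, and σ_τ² denotes the variance of the bowler effects. The linear model assumes that the within-bowler error variance is the same for all three bowlers (checked above), so what remains is a hypothesis test to determine whether σ_τ² is significantly greater than zero.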

To conduct this test, we must use the random effects model. However, the F-statistic for the random effects model will be the same as the F-statistic in the model shown above because it is still equal to MSTR/MSE. Therefore, we can use the p-value of 0.007409 once again for this test. The assumption checking is the same as for the first model above.

H0: σ_τ² = 0
H1: σ_τ² > 0 (Variance cannot be negative)
P-value = 0.007409
P-value is less than α = 0.05, so reject H0
The variance among bowlers' mean scores is non-zero

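To estimate the value of σ_τ², one can use the formula:

$$\hat{\sigma}_\tau^2 = \frac{MSTR - MSE}{\tilde{n}}$$

To compute ñ, where a = 3 is the number of bowlers and n_i = 56 is the number of games for bowler i:

$$\tilde{n} = \frac{1}{a-1}\left[\sum_{i=1}^{a} n_i - \frac{\sum_{i=1}^{a} n_i^2}{\sum_{i=1}^{a} n_i}\right] = \frac{1}{2}\left[168 - \frac{3(56^2)}{168}\right] = 56$$

Finally:

$$\hat{\sigma}_\tau^2 = \frac{3798.2 - 751.5}{56} \approx 54.41$$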

The estimated variance among bowlers' mean scores is about 54.41.
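
The estimate above can be reproduced from the ANOVA table. The sketch below assumes the usual method-of-moments estimator, (MSTR - MSE)/ñ, with ñ computed from the group sizes; the variable names are illustrative.

```r
# Values copied from the ANOVA table and data description above.
ms_treatment <- 3798.2          # MSTR
ms_error     <- 751.5           # MSE
games <- c(56, 56, 56)          # games bowled by each of the three bowlers

# Effective per-group sample size; reduces to 56 when all groups are equal.
a <- length(games)
n_tilde <- (sum(games) - sum(games^2) / sum(games)) / (a - 1)

# Method-of-moments estimate of the between-bowler variance component.
sigma2_tau_hat <- (ms_treatment - ms_error) / n_tilde

cat("n-tilde =", n_tilde, " estimate =", round(sigma2_tau_hat, 2), "\n")
```

Because every bowler has the same number of games, ñ is simply 56; the general formula only matters for unbalanced data.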

Conclusions

Of the three bowlers in the provided dataset, Bowler 7 and Bowler 8 are the only pair without a significant difference in mean scores. This makes sense, as their means shown in the descriptive statistics table are very close to each other. The mean scores of Bowlers 7 and 12 and of Bowlers 8 and 12 differ significantly. The within-bowler variances are assumed to be equal, and Levene's test gave no evidence otherwise. The variance among bowlers' mean scores, σ_τ², is significantly greater than zero and can be estimated at about 54.41.

R Code

Please see my R code for this project:

dat = read.csv("C:/Users/Britney/Documents/R/STAT 481 Project 2/P2_Dataset2.csv")
dat
library(moments)
#Separating out each individual bowler
bowler7 = dat[c(1:56),c(1:3)]
bowler7
bowler8 = dat[c(57:112),c(1:3)]
bowler8
bowler12 = dat[c(113:168),c(1:3)]
bowler12
#Summary statistics for each bowler
summary(bowler7)
sd(bowler7$Score)
kurtosis(bowler7$Score)
skewness(bowler7$Score)
summary(bowler8)
sd(bowler8$Score)
kurtosis(bowler8$Score)
skewness(bowler8$Score)
summary(bowler12)
sd(bowler12$Score)
kurtosis(bowler12$Score)
skewness(bowler12$Score)
#Histograms and boxplots for each bowler's scores
par(mfrow=c(1,3))
hist(bowler7$Score, main = "Bowler 7 Histogram ", xlab = "Score", xlim=c(100,300))
hist(bowler8$Score, main = "Bowler 8 Histogram", xlab = "Score", xlim=c(100,300))
hist(bowler12$Score, main = "Bowler 12 Histogram", xlab = "Score", xlim=c(100,300))
boxplot(bowler7$Score, main = "Bowler 7 Boxplot", ylab = "Score", ylim=c(100,300))
boxplot(bowler8$Score, main = "Bowler 8 Boxplot", ylab = "Score", ylim=c(100,300))
boxplot(bowler12$Score, main = "Bowler 12 Boxplot", ylab = "Score", ylim=c(100,300))
attach(dat)
#Turn bowler into a factor instead of an int
Bowler=as.factor(Bowler)
class(Bowler)
#Question 1: Differences in means
#Linear model to predict score
fit = lm(Score~Bowler)
fit
summary(fit)
anova(fit)
#Bonferroni is best since we want pairwise and we know beforehand
#Bonferroni Test
pairwise.t.test(x = Score,
                g = Bowler,
                p.adj = "bonferroni")
#Provides p-values
#Question 2: Variability
#The random effects F-test uses the same F = MSTR/MSE, so the same model is refit here
fit2 = lm(Score~Bowler)
summary(fit2)
anova(fit2)
#Checking equal variance assumption
library(car)
leveneTest(fit) 
#Checking normality of errors
par(mfrow=c(1,1))
qqnorm(fit$residual)
qqline(fit$residual)
shapiro.test(fit$residual)

knitr::opts_chunk$set(echo = TRUE)
Britney Scott
Masters Candidate

I am a Master's candidate at the University of Illinois at Chicago. I will graduate with a Master of Science in Business Analytics in December 2020. I earned a Bachelor of Science in Statistics and a Bachelor of Science in Marketing in December 2019.
