Stat 481: Project 1

Summary of Problem

The cost of a claim filed by a subscriber with heart disease needs to be predicted as accurately as possible through statistical regression. To do so, the insurance company collected data over a two-year period on six independent variables: the subscriber's age, gender, number of drugs prescribed, number of emergency room visits, number of other diseases, and duration of treatment. Each patient was also given a unique identification number in the dataset.

Data Description

The data collected include information on 383 individuals. The cost of claims ranged from $10.60 to $45,447.70, with a median of $528.60. Note that the mean ($2,816.00) is far above the median, indicating a strongly right-skewed distribution. Descriptive statistics for the cost are as follows.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.6   169.5   528.6  2816.0  1858.7 45447.7

Following are graphical representations of the response variable.

The six predictor variables included the following:

  • Age: Age of subscriber (years)
  • Gender: Gender of subscriber: 1 if male; 0 otherwise
  • Drugs: Number of tracked drugs prescribed
  • Emergency: Number of emergency room visits
  • Comorbidities: Number of other diseases that the subscriber had during the period
  • Duration: Duration of treatment of the condition (days)

Following are histograms for each of the predictor variables, as well as a summary of the descriptive statistics for each.

##               Min. 1st Qu. Median        Mean 3rd Qu. Max.
## Age             39    55.0     59  58.5091384      64   70
## Gender           0     0.0      0   0.2245431       0    1
## Drugs            0     0.0      0   0.4151436       0    6
## Emergency        0     2.0      3   3.3002611       5   12
## Comorbidities    0     0.0      1   3.5013055       5   30
## Duration         0    34.5    150 160.5639687     282  359
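
Note that Gender is a 0/1 indicator, so its mean of roughly 0.22 indicates that about 22% of the 383 subscribers are male.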

Initial Regression Analysis

Without any adjustments, the regression analysis using R outputs the following:

## 
## Call:
## lm(formula = Cost ~ Age + Gender + Drugs + Emergency + Comorbidities + 
##     Duration, data = df, na.action = na.exclude)
## 
## Coefficients:
##   (Intercept)            Age         Gender          Drugs      Emergency  
##      5427.743       -124.948      -1304.825        -83.740       1105.955  
## Comorbidities       Duration  
##       220.079          3.775

It is necessary to check for multicollinearity to be sure that none of the predictor variables are strongly correlated with the others.

##           Age        Gender         Drugs     Emergency Comorbidities 
##      1.020317      1.013285      1.336626      1.347070      1.464050 
##      Duration 
##      1.474743

None of the predictor variables have a variance inflation factor (VIF) greater than ten, so all can be included in the regression model.
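
For reference, the VIF of predictor j is 1 / (1 − R²j), where R²j is the R-squared from regressing predictor j on all of the others. As a minimal sketch (reusing the df and fit objects from the appendix code; r2_duration is a new helper name), the Duration VIF reported above can be reproduced by hand:

#Hand-computed VIF for Duration; should match car::vif(fit), about 1.47
r2_duration <- summary(lm(Duration ~ Age + Gender + Drugs + Emergency +
                            Comorbidities, data = df))$r.squared
1 / (1 - r2_duration)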

Testing Assumptions

Next, it is important to determine whether this model meets the four main assumptions of regression analysis: linearity, independence, normality of residuals, and equal variance of residuals.

Linearity

Following are plots of the residuals against each of the predictor variables; several show clear patterns.

In order for this assumption to be met, we should see no patterns in how the residuals are spread about the horizontal line at zero in the above graphs. However, this does not seem to be the case. For example, the graph for the Emergency variable shows a clear pattern of observations below the line, with the residual values decreasing as the emergency visit count increases. This assumption is not met.

Independence

Because this dataset does not involve any time series data, testing for this assumption is not required.
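
If the observations did have a natural time ordering, a Durbin-Watson test would be the standard check for autocorrelated errors. A minimal sketch, using the car package already loaded for the VIFs:

#Durbin-Watson test; a statistic near 2 with a large p-value is
#consistent with independent errors
durbinWatsonTest(fit)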

Normality of Residuals

Following is a QQ-Plot of the residuals.

To consider this assumption met, the points should fall approximately on the QQ-line. However, they deviate substantially from it. To further support this, a Shapiro-Wilk test for normality returns a very small p-value, leading us to reject the hypothesis that the errors are normally distributed.

## 
##  Shapiro-Wilk normality test
## 
## data:  fit$residual
## W = 0.66973, p-value < 2.2e-16

Equal Variance of Residuals

For this check, the residuals are plotted against the fitted values. In a scenario where this assumption is met, the red line on that plot would be approximately horizontal at zero, with the points randomly scattered about it. Since neither of these is the case, this model does not meet the equal variance assumption.

Box-Cox Transformation

A Box-Cox transformation will be used to address the fact that only one of the four regression assumptions was met by the initial model. Running the Box-Cox procedure on the data in R gives a lambda of 0, meaning the best way to transform the Y-variable is to take its natural log.
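
For reference, the Box-Cox family transforms the response as

$$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0, \end{cases}$$

where the $\lambda = 0$ case is the limit of the general expression as $\lambda \to 0$; this is why a lambda of 0 corresponds to the natural log. Running a second regression analysis with log(Cost) as the dependent variable results in the following.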

## 
## Call:
## lm(formula = logCost ~ Age + Gender + Drugs + Emergency + Comorbidities + 
##     Duration, data = df, na.action = na.exclude)
## 
## Coefficients:
##   (Intercept)            Age         Gender          Drugs      Emergency  
##      6.477509      -0.027208      -0.307886       0.118807       0.202887  
## Comorbidities       Duration  
##      0.092829       0.003161

Testing for multicollinearity again reveals no issues: the VIF values are unchanged, since the set of predictors is the same, and none exceed ten.

##           Age        Gender         Drugs     Emergency Comorbidities 
##      1.020317      1.013285      1.336626      1.347070      1.464050 
##      Duration 
##      1.474743

Testing Assumptions Again

The regression assumptions must be tested again for the new model.

Linearity

The above plots look much better for the second regression model. No clear patterns exist in the spread of the residuals about the horizontal line at zero, so the assumption is met.

Independence

Because this dataset does not involve any time series data, testing for this assumption is not required.

Normality of Residuals

Following is a QQ-plot of the residuals of the second regression model, which looks much better.

The points fall nearly on the QQ-line, suggesting normality. More formally, the p-value of a Shapiro-Wilk test is much higher and exceeds a standard alpha of 0.05, so normality of the errors is not rejected.

## 
##  Shapiro-Wilk normality test
## 
## data:  fit2$residual
## W = 0.9927, p-value = 0.05888

Equal Variance of Residuals

The plot looks much better for the transformed regression. In this case, the red line is approximately horizontal at zero, and the points around it do not follow any pattern.

Because of this, the equal variance assumption can be considered met as well.

Forward Selection

To further refine the model, forward or backward selection can be used. This chooses the subset of the predictor variables that best predicts the cost of the claim. In this case, I chose to use forward selection. Following are the results of the stepAIC function in R.

## 
## Call:
## lm(formula = logCost ~ Age + Gender + Drugs + Emergency + Comorbidities + 
##     Duration, data = df, na.action = na.exclude)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9870 -1.0648 -0.1599  0.9300  3.9529 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.4775088  0.7136507   9.077  < 2e-16 ***
## Age           -0.0272083  0.0120614  -2.256   0.0247 *  
## Gender        -0.3078858  0.1887999  -1.631   0.1038    
## Drugs          0.1188073  0.0929438   1.278   0.2019    
## Emergency      0.2028873  0.0395292   5.133 4.59e-07 ***
## Comorbidities  0.0928292  0.0178549   5.199 3.30e-07 ***
## Duration       0.0031614  0.0007764   4.072 5.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.532 on 376 degrees of freedom
## Multiple R-squared:  0.2913, Adjusted R-squared:   0.28 
## F-statistic: 25.76 on 6 and 376 DF,  p-value: < 2.2e-16

Only variables significant at the 0.10 level or lower should be included in this model. The rightmost column shows that the Gender and Drugs variables do not meet this requirement, so they should not be included in the final regression equation.
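
One caution: because stepAIC above was started from the full model, direction = "forward" has no terms left to add, so it simply returns the full model whose summary is shown; the screening here was therefore done by p-value. A forward search in the usual sense would start from an intercept-only model. A minimal sketch (nullfit and forwardfit are new helper names, not part of the original code):

#Forward selection from the intercept-only model up to all six predictors
nullfit <- lm(logCost ~ 1, data = df, na.action = na.exclude)
forwardfit <- stepAIC(nullfit, direction = "forward",
                      scope = list(lower = ~ 1,
                                   upper = ~ Age + Gender + Drugs + Emergency +
                                     Comorbidities + Duration),
                      trace = FALSE)
summary(forwardfit)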

“Best” Model

One final regression is constructed using the transformed Y-variable and only the significant predictor variables. This regression analysis produces the following results.

## 
## Call:
## lm(formula = logCost ~ Age + Emergency + Comorbidities + Duration, 
##     data = df, na.action = na.exclude)
## 
## Coefficients:
##   (Intercept)            Age      Emergency  Comorbidities       Duration  
##      6.319267      -0.025801       0.222831       0.090972       0.003141
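
Written out, the fitted “best” model is

$$\widehat{\log(\text{Cost})} = 6.3193 - 0.0258\,\text{Age} + 0.2228\,\text{Emergency} + 0.0910\,\text{Comorbidities} + 0.0031\,\text{Duration}.$$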

Conclusions

The “best” regression model is superior to the first. The R-squared of the “best” model is a little over seven percentage points higher than that of the original, meaning it explains more of the variance in the claim costs; the adjusted R-squared is also larger by roughly the same margin. (One caveat: the two models have different response scales, raw cost versus log cost, so the R-squared comparison is only indicative.)

##                    Original Model Best Model
## R-Squared               0.2115871  0.2830895
## Adjusted R-Squared      0.1990060  0.2755032

It has been determined that Age, Emergency, Comorbidities, and Duration are the predictor variables that have an impact on the cost of a subscriber’s claim. The y-intercept of this line is 6.319267, on the log scale. This means that a subscriber who is zero years old, has had no emergency room visits, has no other diseases, and had a treatment duration of zero days has a predicted log claim cost of about 6.32, corresponding to a cost of about exp(6.32) ≈ $555. The variable with the highest impact on log claim cost per unit change is Emergency, while Duration has the lowest impact per unit change.
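
Because the model is fit on the log scale, the coefficients are easiest to read after exponentiating. A minimal illustration, using the coefficient values above:

#Back-transform the log-scale coefficients to the dollar scale
exp(6.319267)   #baseline predicted cost, about $555
exp(0.222831)   #each ER visit multiplies the predicted cost by about 1.25
exp(0.003141)   #each additional day of treatment multiplies it by about 1.003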

R Code

Please see my R code for this project:

#Read data and convert to a dataframe
data = read.csv("C:/Users/Britney/Documents/R/STAT 481 Project 1/P1_Dataset1F.csv")
df = data.frame(data)

#Some descriptive statistics
library(moments)
summary(df$Cost)
kurtosis(df$Cost)
skewness(df$Cost)
Age <- summary(df$Age)
Gender <- summary(df$Gender)
Drugs <- summary(df$Drugs)
Emergency <- summary(df$Emergency)
Comorbidities <- summary(df$Comorbidities)
Duration <- summary(df$Duration)
rbind(Age, Gender, Drugs, Emergency, Comorbidities, Duration)

#Histograms of predictor variables
par(mfrow = c(2,3))
hist(df$Age, main = "Histogram of Age", xlab= "Age")
hist(df$Gender, main = "Histogram of Gender", xlab= "Gender")
hist(df$Drugs, main = "Histogram of Drugs", xlab= "Drugs")
hist(df$Emergency, main = "Histogram of Emergencies", xlab= "Emergency")
hist(df$Comorbidities, main = "Histogram of Comorbidities", xlab= "Comorbidities")
hist(df$Duration, main = "Histogram of Duration", xlab= "Duration")

#Obtaining sample size
length(df$ID)


#Run a regression to predict cost
fit <- lm(Cost ~ Age + Gender + Drugs + Emergency + Comorbidities + Duration, data = df, na.action=na.exclude)
fit

#Checking for multicollinearity
library(car)
vif(fit)

#Check the model assumptions

#Linearity check
par(mfrow=c(2,3))
plot(x = df$Age, y = fit$residual, xlab = "Age", 
     ylab = "Residuals", main = "Predictor Variable: Age")
abline(h=0)
plot(x = df$Gender, y = fit$residual, xlab = "Gender", 
     ylab = "Residuals", main = "Predictor Variable: Gender")
abline(h=0)
plot(x = df$Drugs, y = fit$residual, xlab = "Drugs", 
     ylab = "Residuals", main = "Predictor Variable: Drugs")
abline(h=0)
plot(x = df$Emergency, y = fit$residual, xlab = "Emergency",
     ylab = "Residuals", main = "Predictor Variable: Emergency")
abline(h=0)
plot(x = df$Comorbidities, y = fit$residual, xlab = "Comorbidities", 
     ylab = "Residuals", main = "Predictor Variable: Comorbidities")
abline(h=0)
plot(x = df$Duration, y = fit$residual, xlab = "Duration", 
     ylab = "Residuals", main = "Predictor Variable: Duration")
abline(h=0)

#Normality of residuals
par(mfrow = c(1,1))
qqnorm(fit$residual)
qqline(fit$residual)
shapiro.test(fit$residual)

#Equal variance check
par(mfrow = c(1,1))
plot(fit, 1)

#Box-Cox transformation to fix non-normal errors
library(MASS)
bc <- boxcox(Cost ~ Age + Gender + Drugs + Emergency + Comorbidities + Duration,
             data=df,
             plotit=F,
             lambda = seq(-3, 3, by=0.125))
maxyentry <- which.max(bc$y)
bc$x[maxyentry]
#Returns lambda = 0, so use ln(Cost)
df$logCost <- log(df$Cost)
#New model using log(y)
fit2 <- lm(logCost ~ Age + Gender + Drugs + Emergency + Comorbidities + Duration, data=df, na.action=na.exclude)
fit2

#Multicollinearity check
vif(fit2)

#Redo assumption checks for new model

#Linearity check
par(mfrow=c(2,3))
plot(x = df$Age, y = fit2$residual, xlab = "Age", 
     ylab = "Residuals", main = "Predictor Variable: Age")
abline(h=0)
plot(x = df$Gender, y = fit2$residual, xlab = "Gender", 
     ylab = "Residuals", main = "Predictor Variable: Gender")
abline(h=0)
plot(x = df$Drugs, y = fit2$residual, xlab = "Drugs", 
     ylab = "Residuals", main = "Predictor Variable: Drugs")
abline(h=0)
plot(x = df$Emergency, y = fit2$residual, xlab = "Emergency",
     ylab = "Residuals", main = "Predictor Variable: Emergency")
abline(h=0)
plot(x = df$Comorbidities, y = fit2$residual, xlab = "Comorbidities", 
     ylab = "Residuals", main = "Predictor Variable: Comorbidities")
abline(h=0)
plot(x = df$Duration, y = fit2$residual, xlab = "Duration", 
     ylab = "Residuals", main = "Predictor Variable: Duration")
abline(h=0)

#Normality of residuals check
par(mfrow = c(1,1))
qqnorm(fit2$residual)
qqline(fit2$residual)
shapiro.test(fit2$residual)

#Equal variance check
plot(fit2, 1)

#Build the "best" model
#Note: stepAIC starting from the full model fit2 with direction =
#"forward" has no terms left to add, so it returns fit2 itself;
#Gender and Drugs are then dropped manually by p-value
stepmodel <- stepAIC(fit2, direction = "forward", 
                     trace = FALSE)
summary(stepmodel)
bestmodel <- lm(logCost ~ Age + Emergency + Comorbidities + Duration, data=df, na.action=na.exclude)
bestmodel

#Find r-squared for the original and best model
summary(fit)$r.squared
summary(bestmodel)$r.squared
summary(fit)$adj.r.squared
summary(bestmodel)$adj.r.squared