Stat 382: Final Project

Introduction

The dataset used in this assignment is weather data from Kaggle. It contains hourly weather observations from Szeged, Hungary between 2006 and 2016, for a total of 96,453 observations. All temperature measurements are in Celsius, and the other measurements use metric units. I downloaded the data as a .csv file and loaded it into R using the read.csv() function. Following is a list of the columns included in the dataset. I renamed a few of the variables I will be working with often for ease of use.

## 'data.frame':    96453 obs. of  12 variables:
##  $ Formatted.Date        : chr  "2006-04-01 00:00:00.000 +0200" "2006-04-01 01:00:00.000 +0200" "2006-04-01 02:00:00.000 +0200" "2006-04-01 03:00:00.000 +0200" ...
##  $ Summary               : chr  "Partly Cloudy" "Partly Cloudy" "Mostly Cloudy" "Partly Cloudy" ...
##  $ Precip.Type           : chr  "rain" "rain" "rain" "rain" ...
##  $ Temp                  : num  9.47 9.36 9.38 8.29 8.76 ...
##  $ ApparentTemp          : num  7.39 7.23 9.38 5.94 6.98 ...
##  $ Humidity              : num  0.89 0.86 0.89 0.83 0.83 0.85 0.95 0.89 0.82 0.72 ...
##  $ WindSpeed             : num  14.12 14.26 3.93 14.1 11.04 ...
##  $ Wind.Bearing..degrees.: num  251 259 204 269 259 258 259 260 259 279 ...
##  $ Visibility..km.       : num  15.8 15.8 15 15.8 15.8 ...
##  $ Loud.Cover            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pressure              : num  1015 1016 1016 1016 1017 ...
##  $ Daily.Summary         : chr  "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Partly cloudy throughout the day." ...
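A minimal sketch of the loading and renaming step. The temporary two-row file stands in for the Kaggle download, and the original header names (such as "Temperature (C)") are assumptions; only the renamed columns appear in the listing above.

```r
# Hypothetical stand-in for the downloaded file: two rows, assumed headers.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Formatted Date,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Pressure (millibars)",
  "2006-04-01 00:00:00.000 +0200,9.47,7.39,0.89,14.12,1015",
  "2006-04-01 01:00:00.000 +0200,9.36,7.23,0.86,14.26,1016"
), csv_path)

# read.csv() turns spaces and parentheses in headers into dots.
dat <- read.csv(csv_path)

# Map the long (assumed) names to the short names used in the analysis.
new_names <- c(
  "Temperature..C."          = "Temp",
  "Apparent.Temperature..C." = "ApparentTemp",
  "Wind.Speed..km.h."        = "WindSpeed",
  "Pressure..millibars."     = "Pressure"
)
hits <- names(dat) %in% names(new_names)
names(dat)[hits] <- unname(new_names[names(dat)[hits]])

str(dat)
```

Renaming through a lookup vector leaves any column that is not listed untouched, so the same code works whether or not the file has extra columns.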

For this analysis, I began by looking at the summary statistics for five of the quantitative columns in the dataset.

##                   Min.     1st Qu.    Median        Mean    3rd Qu.       Max.
## Pressure       0.00000 1011.900000 1016.4500 1003.235956 1021.09000 1046.38000
## Temp         -21.82222    4.688889   12.0000   11.932678   18.83889   39.90556
## ApparentTemp -27.71667    2.311111   12.0000   10.855029   18.83889   39.34444
## Humidity       0.00000    0.600000    0.7800    0.734899    0.89000    1.00000
## WindSpeed      0.00000    5.828200    9.9659   10.810640   14.13580   63.85260
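One way to build a table like this is to apply summary() to each column and transpose the result. The sketch below uses simulated values rather than the real observations, and the data frame and column selection are illustrative assumptions.

```r
set.seed(382)
n <- 200

# Simulated stand-in for the weather data (not the real observations).
dat <- data.frame(
  Temp      = rnorm(n, mean = 12, sd = 9),
  Humidity  = runif(n, min = 0.3, max = 1),
  WindSpeed = rgamma(n, shape = 2.5, scale = 4)
)
cols <- c("Temp", "Humidity", "WindSpeed")

# summary() returns six statistics per numeric column. sapply() stacks them
# as columns, and t() flips the matrix so each variable becomes a row.
stats_table <- t(sapply(dat[cols], summary))
print(stats_table)
```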

I also wanted to ensure that none of these variables contained many missing values. To do this, I used the is.na() function, which flags each value as missing ("TRUE") or present ("FALSE"), and tabulated the results for each variable. There are no missing values for any of the 5 selected fields, since no variable has any "TRUE" counts. All 96,453 observations are accounted for.

##              FALSE
## Pressure     96453
## Temp         96453
## ApparentTemp 96453
## Humidity     96453
## WindSpeed    96453
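A small sketch of the same check, assuming a toy data frame with one value deliberately set to NA so that both outcomes show up. It counts with sum() instead of table(), which keeps a "missing" column even when the count is zero.

```r
set.seed(382)

# Toy data, not the real observations.
dat <- data.frame(Temp = rnorm(10, mean = 12, sd = 9), Humidity = runif(10))
dat$Humidity[3] <- NA  # one missing value, added only for illustration

# For every column, count the values that are present and those that are NA.
na_counts <- t(vapply(
  dat,
  function(x) c(present = sum(!is.na(x)), missing = sum(is.na(x))),
  c(present = 0L, missing = 0L)
))
print(na_counts)
```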

Outliers

Next, I wanted to check for outliers, both visually and using an IQR calculation. First, I created boxplots and histograms for each of these five variables.

It is difficult to see from the histograms of Temperature and Apparent Temperature where the outliers are, but it appears that Humidity may have some outliers on the low end and Wind Speed may have some on the high end, as they are left and right skewed respectively. The histogram for Pressure shows little variation across observations, as almost all data points fall into a single column.

Boxplots are useful because they show outliers with dots on either end. It looks like there are outliers on the low end of Temperature, Apparent Temperature, and Humidity, and outliers on the high end of Wind Speed. Pressure has a couple of outliers on either end, but again shows very little variation within the boxplot.
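The two plot types can be drawn side by side with base R graphics. This is only a sketch on simulated, right-skewed values (the variable name and distribution are assumptions), not the code behind the original figures.

```r
set.seed(382)
wind_speed <- rgamma(500, shape = 2.5, scale = 4)  # simulated, right-skewed

pdf(NULL)              # null device; remove this line to see the plots on screen
par(mfrow = c(1, 2))   # two panels in one row
h  <- hist(wind_speed, main = "Wind Speed", xlab = "km/h")
bp <- boxplot(wind_speed, main = "Wind Speed", ylab = "km/h")
invisible(dev.off())

# boxplot() also returns the points it draws as dots beyond the whiskers.
length(bp$out)
```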

I then chose to find all of the actual outliers using an IQR calculation. The code here finds the first and third quartiles, which are used to find the interquartile range (IQR). Outliers are defined as values more than 1.5 times the IQR below the first quartile or more than 1.5 times the IQR above the third, so the code then looks to see which values meet those conditions. As is evident from the boxplots, some of the variables, most notably Wind Speed and Pressure, have a large number (thousands) of outliers. Therefore, I chose not to list all of them, and instead included a table which reports the number of outliers for each variable and the minimum and maximum values they hold.

##               Number of Outliers Min Outlier Max Outlier
## Pressure                 4400.00        0.00       48.30
## Temperature                44.00      -21.82      -16.67
## Apparent Temp              22.00      -27.72      -22.74
## Humidity                   46.00        0.00        0.16
## Wind Speed               3028.00       26.60       63.85
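The IQR rule described above can be sketched as a short function. The helper name iqr_outliers() is hypothetical, and the quartiles in the second half are the Wind Speed values from the summary statistics table earlier in the post.

```r
# Hypothetical helper: return the values lying more than 1.5 * IQR
# below the first quartile or above the third quartile.
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

# Small worked example: only 100 falls outside the fences.
iqr_outliers(c(1, 2, 3, 4, 5, 100))

# Upper fence for Wind Speed from the reported quartiles.
q1 <- 5.8282
q3 <- 14.1358
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence
```

The upper fence computed from the reported quartiles lands just below the smallest Wind Speed outlier in the table (26.60), which is what the rule predicts.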

Correlations

Next, it's important to check for correlation between the columns in the dataset. Below is a color- and size-coded chart of the correlations. I isolated the 5 columns of interest into their own data frame and calculated the correlation values. I then passed the correlations to the corrplot() function, which displays them visually. A blue circle means a positive correlation, while a red circle means a negative one. The larger the circle, the stronger the correlation.
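A sketch of those steps on simulated data. The relationships between the simulated columns are assumptions chosen to resemble the patterns discussed here, and corrplot is only called if the package happens to be installed.

```r
set.seed(382)
n <- 300
temp <- rnorm(n, mean = 12, sd = 9)

# Simulated columns of interest (not the real observations).
sim <- data.frame(
  Temp         = temp,
  ApparentTemp = -2.4 + 1.1 * temp + rnorm(n, sd = 1.3),
  Humidity     = pmin(1, pmax(0, 0.9 - 0.015 * temp + rnorm(n, sd = 0.12))),
  WindSpeed    = rgamma(n, shape = 2.5, scale = 4)
)

# Pairwise Pearson correlations between all columns.
cors <- cor(sim)
print(round(cors, 2))

# Draw the correlation chart when the corrplot package is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cors)
}
```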

The highest correlation by far was between Temperature and Apparent Temperature, which is positive. This makes sense, since Apparent Temperature is the temperature perceived by humans and Temperature is the actual Temperature. Also, there is a somewhat strong negative correlation between the Humidity and both the Temperature and the Apparent Temperature. This makes sense given what scientists know about the relationship between the two. According to The Chicago Tribune,

"Relative humidity changes when temperatures change. Because warm air can hold more water vapor than cool air, relative humidity falls when the temperature rises if no moisture is added to the air."

Regression

I began my regression analysis with a very simple linear regression. I wanted to see how accurately we could predict Apparent Temperature in the area with only the Temperature. Following is the linear model using these two columns. Here, the code attempts to create a linear equation that predicts the Apparent Temperature given the Temperature. It uses the trends in the provided dataset to do so.

## 
## Call:
## lm(formula = dat$ApparentTemp ~ dat$Temp)
## 
## Coefficients:
## (Intercept)     dat$Temp  
##      -2.410        1.112

With this model, the predicted Apparent Temperature is -2.41 degrees Celsius when the actual Temperature is 0 degrees Celsius. For every one degree increase in Temperature, the predicted Apparent Temperature increases by 1.112 degrees.

The following plot includes the values of Temperature and Apparent Temperature from the original dataset. In red, you can see the regression line. The regression line was fitted with the lm() command, which minimizes the sum of the squared residuals.
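A minimal sketch of fitting and using such a model. The data are simulated from the reported coefficients with an assumed noise level, and the formula uses the data argument (rather than dat$ prefixes) so that predict() can accept new values.

```r
set.seed(382)
temp <- runif(500, min = -20, max = 40)

# Simulated data built around the reported line; the noise sd is an assumption.
sim <- data.frame(
  Temp         = temp,
  ApparentTemp = -2.41 + 1.112 * temp + rnorm(500, sd = 1.3)
)

fit <- lm(ApparentTemp ~ Temp, data = sim)
coef(fit)

# Predictions for new temperatures from the fitted model.
predict(fit, newdata = data.frame(Temp = c(0, 20)))

# The same calculation by hand with the reported coefficients, for 20 degrees.
pred_20 <- -2.410 + 1.112 * 20
pred_20
```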

The ANOVA table for this regression model is shown below. The mean squared error of this model is about 1.68 (162107 / 96451), which the table displays rounded to 2.

## Analysis of Variance Table
## 
## Response: dat$ApparentTemp
##              Df   Sum Sq  Mean Sq F value    Pr(>F)    
## dat$Temp      1 10874176 10874176 6469964 < 2.2e-16 ***
## Residuals 96451   162107        2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
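The residual mean square can be read from the ANOVA table at full precision instead of relying on the rounded display. A sketch on simulated data, with the model and noise level as assumptions:

```r
set.seed(382)
temp <- runif(200, min = -20, max = 40)
sim <- data.frame(
  Temp         = temp,
  ApparentTemp = -2.41 + 1.112 * temp + rnorm(200, sd = 1.3)  # simulated
)
fit <- lm(ApparentTemp ~ Temp, data = sim)

# anova() returns a data frame, so the residual mean square can be indexed.
aov_table <- anova(fit)
mse <- aov_table["Residuals", "Mean Sq"]

# It equals the residual sum of squares divided by the residual df.
mse_by_hand <- sum(residuals(fit)^2) / df.residual(fit)
all.equal(mse, mse_by_hand)
```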

Apparent Temperature is the temperature perceived by humans. In addition to the actual temperature, factors such as Humidity, Pressure and Wind Speed can affect the temperature that a person perceives, so I chose to include those variables in the regression as well. Although the boxplots and histograms showed little variation in Pressure between observations, Pressure is included in the model below along with the other two variables.

## 
## Call:
## lm(formula = dat$ApparentTemp ~ dat$Temp + dat$Humidity + dat$WindSpeed + 
##     dat$Pressure)
## 
## Coefficients:
##   (Intercept)       dat$Temp   dat$Humidity  dat$WindSpeed   dat$Pressure  
##    -2.5300486      1.1259596      1.0573057     -0.0946956      0.0001954

The final regression equation is Apparent Temperature = -2.53005 + 1.12596(Temperature) + 1.05731(Humidity) - 0.09470(WindSpeed) + 0.00020(Pressure). Each coefficient describes the effect of one variable with the others held fixed. For a 1 unit increase in Humidity, the Apparent Temperature increases by just over 1 degree Celsius. Humidity is given as a decimal between 0 and 1 in this dataset, so a more practical reading is that an increase of 0.10 (10 percentage points) in humidity raises the Apparent Temperature by about 0.106 degrees Celsius. For an increase in Wind Speed of 1 km/hour, the temperature is perceived as about 0.095 degrees Celsius lower. Pressure is measured in millibars in this dataset, so an increase in Pressure of 1 millibar is associated with an increase in perceived temperature of about 0.0002 degrees Celsius. This is a small effect: even across the full range of Pressure in the dataset, 0 to 1046.38 millibars, it amounts to only about 0.2 degrees.
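The arithmetic behind these interpretations can be checked directly from the reported coefficients. In this sketch the helper predict_apparent() is hypothetical, and the example inputs are the rounded values of the first observation in the column listing.

```r
# Coefficients copied from the lm() output above.
b <- c(intercept = -2.5300486, temp = 1.1259596, humidity = 1.0573057,
       wind = -0.0946956, pressure = 0.0001954)

# Hypothetical helper: evaluate the fitted equation for given conditions.
predict_apparent <- function(temp, humidity, wind, pressure) {
  unname(b["intercept"] + b["temp"] * temp + b["humidity"] * humidity +
           b["wind"] * wind + b["pressure"] * pressure)
}

# First listed observation: 9.47 C, humidity 0.89, wind 14.12 km/h, 1015 mbar.
first_row <- predict_apparent(9.47, 0.89, 14.12, 1015)

# Effect of a 0.10 rise in humidity, and of the full reported pressure range.
humidity_effect <- unname(b["humidity"] * 0.10)
pressure_effect <- unname(b["pressure"] * (1046.38 - 0))

round(c(first_row = first_row, humidity = humidity_effect,
        pressure = pressure_effect), 3)
```

The prediction for the first observation comes out near 7.9 degrees, compared with the recorded Apparent Temperature of 7.39, a gap that sits well inside the residual range reported in the model summary.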

## Analysis of Variance Table
## 
## Response: dat$ApparentTemp
##                  Df   Sum Sq  Mean Sq    F value    Pr(>F)    
## dat$Temp          1 10874176 10874176 9.3327e+06 < 2.2e-16 ***
## dat$Humidity      1    11512    11512 9.8805e+03 < 2.2e-16 ***
## dat$WindSpeed     1    38165    38165 3.2755e+04 < 2.2e-16 ***
## dat$Pressure      1       50       50 4.3148e+01 5.102e-11 ***
## Residuals     96448   112379        1                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the above ANOVA table for the second model, the mean squared error is about 1.17 (112379 / 96448, displayed rounded to 1), less than the 1.68 of the first model. The inclusion of Humidity, Pressure and Wind Speed helps to better predict the Apparent Temperature.
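Because the printed tables round the residual mean squares to whole numbers, it can help to recompute them from the sums of squares and degrees of freedom. This sketch uses only the figures printed in the two ANOVA tables; the variable names are illustrative.

```r
# Residual sum of squares divided by residual degrees of freedom.
mse_simple <- 162107 / 96451   # model with Temperature only
mse_full   <- 112379 / 96448   # model with all four predictors

# The square root of the MSE is the residual standard error.
rse_full <- sqrt(mse_full)

round(c(mse_simple = mse_simple, mse_full = mse_full, rse_full = rse_full), 4)
```

The square root of the second value should agree with the residual standard error that summary() reports for the full model, which gives a quick consistency check between the two outputs.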

## 
## Call:
## lm(formula = dat$ApparentTemp ~ dat$Temp + dat$Humidity + dat$WindSpeed + 
##     dat$Pressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3229 -0.7156 -0.1073  0.6837  5.3715 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)   -2.530e+00  3.883e-02  -65.154  < 2e-16 ***
## dat$Temp       1.126e+00  4.772e-04 2359.502  < 2e-16 ***
## dat$Humidity   1.057e+00  2.393e-02   44.182  < 2e-16 ***
## dat$WindSpeed -9.470e-02  5.249e-04 -180.420  < 2e-16 ***
## dat$Pressure   1.954e-04  2.975e-05    6.569  5.1e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.079 on 96448 degrees of freedom
## Multiple R-squared:  0.9898, Adjusted R-squared:  0.9898 
## F-statistic: 2.344e+06 on 4 and 96448 DF,  p-value: < 2.2e-16

The summary above shows that each of the independent variables has a very low p-value, indicating that they are all significant within the model. I also used the stepAIC() function to check whether any variable should be dropped, and it kept all four. I ran backward elimination, but forward selection results in the same conclusion. Backward elimination in stepAIC() works by deleting each variable from the linear model in turn and comparing the AIC of the reduced model with that of the current model. A variable is removed only if deleting it lowers the AIC; if no deletion does, the variable should be kept, as is the case with all four predictor variables here.
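A sketch of backward elimination on simulated data. The coefficients and noise level are assumptions loosely based on the fitted model, Pressure is left out for brevity, and stats::step() is used as a fallback with the same AIC-based logic if the MASS package is not installed.

```r
set.seed(382)
n <- 400

# Simulated predictors and response (not the real observations).
sim <- data.frame(
  Temp      = runif(n, min = -20, max = 40),
  Humidity  = runif(n, min = 0.2, max = 1),
  WindSpeed = rgamma(n, shape = 2.5, scale = 4)
)
sim$ApparentTemp <- -2.5 + 1.13 * sim$Temp + 1.06 * sim$Humidity -
  0.095 * sim$WindSpeed + rnorm(n, sd = 1.08)

full <- lm(ApparentTemp ~ Temp + Humidity + WindSpeed, data = sim)

# AIC of the model after deleting each term, next to the current AIC.
drop1(full)

# Backward elimination: keep removing terms while doing so lowers the AIC.
run_selection <- if (requireNamespace("MASS", quietly = TRUE)) {
  MASS::stepAIC
} else {
  stats::step
}
reduced <- run_selection(full, direction = "backward", trace = 0)

# Terms that survived the elimination.
kept <- attr(terms(reduced), "term.labels")
kept
```

With effects as strong as the simulated ones, deleting a predictor raises the AIC sharply, so the procedure is expected to stop without removing it.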

Conclusion

In conclusion, many factors come into play when measuring people's perception of the temperature. In addition to the temperature itself, factors in the environment such as wind speed, humidity, and air pressure can all influence the apparent temperature. In the future, it would be interesting to continue this analysis on some of the other columns in the dataset. For example, the dataset includes a categorical variable for precipitation (rain, snow, or null) which could be turned into dummy variables and included in the model. It would also be interesting to see how accurately the model created here predicts apparent temperature in other cities. Location-specific factors which were held constant here, such as longitude/latitude or elevation, could hinder its accuracy as-is in different locations.

Britney Scott
Masters Candidate

I am a Master's candidate at the University of Illinois at Chicago. I will graduate with a Master of Science in Business Analytics in December 2020. I earned a Bachelor of Science in Statistics and a Bachelor of Science in Marketing in December 2019.
