Stat 382: Final Project
The dataset used in this assignment is weather data from Kaggle. It contains hourly weather observations from Szeged, Hungary, between 2006 and 2016, for a total of 96,453 observations. All temperature measurements are in Celsius, and the other measurements use metric units. I downloaded the data as a .csv file and loaded it into R using the read.csv() function. Following is a list of the columns included in the dataset. I renamed a few of the variables I will be working with frequently for ease of use.
## 'data.frame':    96453 obs. of  12 variables:
##  $ Formatted.Date        : chr  "2006-04-01 00:00:00.000 +0200" "2006-04-01 01:00:00.000 +0200" "2006-04-01 02:00:00.000 +0200" "2006-04-01 03:00:00.000 +0200" ...
##  $ Summary               : chr  "Partly Cloudy" "Partly Cloudy" "Mostly Cloudy" "Partly Cloudy" ...
##  $ Precip.Type           : chr  "rain" "rain" "rain" "rain" ...
##  $ Temp                  : num  9.47 9.36 9.38 8.29 8.76 ...
##  $ ApparentTemp          : num  7.39 7.23 9.38 5.94 6.98 ...
##  $ Humidity              : num  0.89 0.86 0.89 0.83 0.83 0.85 0.95 0.89 0.82 0.72 ...
##  $ WindSpeed             : num  14.12 14.26 3.93 14.1 11.04 ...
##  $ Wind.Bearing..degrees.: num  251 259 204 269 259 258 259 260 259 279 ...
##  $ Visibility..km.       : num  15.8 15.8 15 15.8 15.8 ...
##  $ Loud.Cover            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pressure              : num  1015 1016 1016 1016 1017 ...
##  $ Daily.Summary         : chr  "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Partly cloudy throughout the day." "Partly cloudy throughout the day." ...
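The loading and renaming step might look roughly like the sketch below. The file name `weatherHistory.csv` and the pre-rename column names are assumptions (they are what read.csv() would typically produce from headers such as "Temperature (C)"), and a two-row stand-in file with illustrative values is written first so the snippet runs on its own.

```r
# Stand-in for the downloaded Kaggle file; the file name and the two rows
# below are illustrative assumptions, not the real data.
csv_path <- file.path(tempdir(), "weatherHistory.csv")
writeLines(c(
  "Formatted Date,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Pressure (millibars)",
  "2006-04-01 00:00:00.000 +0200,9.47,7.39,0.89,14.12,1015.1",
  "2006-04-01 01:00:00.000 +0200,9.36,7.23,0.86,14.26,1015.6"
), csv_path)

# read.csv() replaces spaces and parentheses in headers with dots.
dat <- read.csv(csv_path, stringsAsFactors = FALSE)

# Map the long generated names to the short names used in the analysis.
rename_map <- c(
  "Temperature..C."          = "Temp",
  "Apparent.Temperature..C." = "ApparentTemp",
  "Wind.Speed..km.h."        = "WindSpeed",
  "Pressure..millibars."     = "Pressure"
)
idx <- match(names(rename_map), names(dat))
names(dat)[idx] <- unname(rename_map)

# List the columns and their types.
str(dat)
```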
For this analysis, I began by looking at the summary statistics for five of the quantitative columns in the dataset.
##                   Min.     1st Qu.    Median        Mean    3rd Qu.       Max.
## Pressure       0.00000 1011.900000 1016.4500 1003.235956 1021.09000 1046.38000
## Temp         -21.82222    4.688889   12.0000   11.932678   18.83889   39.90556
## ApparentTemp -27.71667    2.311111   12.0000   10.855029   18.83889   39.34444
## Humidity       0.00000    0.600000    0.7800    0.734899    0.89000    1.00000
## WindSpeed      0.00000    5.828200    9.9659   10.810640   14.13580   63.85260
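A table of this shape can be built by applying summary() to each column and transposing the result. This is a minimal sketch on synthetic stand-in data (the values and the generating distributions are assumptions, chosen only so the code runs); only the column names match the report.

```r
# Synthetic stand-in for the renamed weather data (illustrative values only).
set.seed(382)
n <- 200
dat <- data.frame(
  Pressure  = rnorm(n, mean = 1016, sd = 8),
  Temp      = rnorm(n, mean = 12, sd = 9),
  Humidity  = runif(n, min = 0.3, max = 1),
  WindSpeed = rexp(n, rate = 1 / 10)
)
dat$ApparentTemp <- dat$Temp - 0.1 * dat$WindSpeed + rnorm(n, sd = 1)

cols <- c("Pressure", "Temp", "ApparentTemp", "Humidity", "WindSpeed")

# summary() on a numeric vector returns Min, 1st Qu., Median, Mean, 3rd Qu., Max.
# sapply() stacks those as columns, so t() gives one row per variable.
stats_table <- t(sapply(dat[cols], summary))
print(stats_table)
```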
I also wanted to ensure that none of these variables contained missing values. To do this, I used the is.na() function, which flags each value as missing or not, and tabulated the result for each variable. The table counts missing values as "TRUE" and values that are present as "FALSE". There are no missing values for any of the five selected fields, since no variable has a "TRUE" count. All 96,453 observations are accounted for.
##              FALSE
## Pressure     96453
## Temp         96453
## ApparentTemp 96453
## Humidity     96453
## WindSpeed    96453
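One way to produce such a count is sketched below. The tiny data frame is made up, and it deliberately contains one NA so that the "TRUE" column is visible; in the report's data that column would be empty.

```r
# Made-up example with a single missing Temp value.
dat <- data.frame(
  Temp     = c(9.47, NA, 9.38),
  Humidity = c(0.89, 0.86, 0.89)
)

# is.na() returns TRUE for missing values; table() counts TRUE and FALSE.
# Fixing the factor levels keeps both columns even when one count is zero.
count_na <- function(x) table(factor(is.na(x), levels = c(FALSE, TRUE)))
na_table <- t(sapply(dat, count_na))
print(na_table)

# A shorter alternative that reports only the number of missing values.
missing_counts <- colSums(is.na(dat))
print(missing_counts)
```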
Next, I wanted to check for outliers, both visually and using an IQR calculation. First, I created boxplots and histograms for each of these five variables.
It is difficult to see from the histograms of Temperature and Apparent Temperature where the outliers are, but it appears that Humidity may have some outliers on the low end and Wind Speed may have some on the high end, as they are left and right skewed, respectively. The histogram for Pressure shows little variation across observations, as almost all data points fall into a single bin.
Boxplots are useful because they show outliers with dots on either end. It looks like there are outliers on the low end of Temperature, Apparent Temperature, and Humidity, and outliers on the high end of Wind Speed. Pressure has a couple of outliers on either end, but again shows very little variation within the boxplot.
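The histograms and boxplots could be drawn with a loop over the five columns, as in this sketch. The data are synthetic stand-ins and the plots are written to a temporary PDF (an assumption made so the code also runs non-interactively); dropping the pdf() and dev.off() lines draws them on screen instead.

```r
# Synthetic stand-in for the renamed weather data (illustrative values only).
set.seed(382)
n <- 200
dat <- data.frame(
  Pressure  = rnorm(n, mean = 1016, sd = 8),
  Temp      = rnorm(n, mean = 12, sd = 9),
  Humidity  = runif(n, min = 0.3, max = 1),
  WindSpeed = rexp(n, rate = 1 / 10)
)
dat$ApparentTemp <- dat$Temp - 0.1 * dat$WindSpeed + rnorm(n, sd = 1)
cols <- c("Pressure", "Temp", "ApparentTemp", "Humidity", "WindSpeed")

plot_file <- file.path(tempdir(), "weather_plots.pdf")
pdf(plot_file)

# Page 1: one histogram per variable.
par(mfrow = c(2, 3))
for (col in cols) hist(dat[[col]], main = col, xlab = col)

# Page 2: one boxplot per variable; points beyond the whiskers are drawn as dots.
par(mfrow = c(2, 3))
for (col in cols) boxplot(dat[[col]], main = col, horizontal = TRUE)

dev.off()
```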
I then chose to find all of the actual outliers using an IQR calculation. The code here finds the first and third quartiles, which are used to compute the interquartile range (IQR). An outlier is defined as a value more than 1.5 times the IQR below the first quartile or more than 1.5 times the IQR above the third quartile, so the code then checks which values meet those conditions. As is evident from the boxplots, some of the variables, most notably Wind Speed and Pressure, have a large number (thousands) of outliers. Therefore, I chose not to list all of them, and instead included a table which reports the number of outliers for each variable along with the minimum and maximum outlier values.
##               Number of Outliers Min Outlier Max Outlier
## Pressure                 4400.00        0.00       48.30
## Temperature                44.00      -21.82      -16.67
## Apparent Temp              22.00      -27.72      -22.74
## Humidity                   46.00        0.00        0.16
## Wind Speed               3028.00       26.60       63.85
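The 1.5 × IQR rule described above can be sketched as follows. The helper names `iqr_outliers` and `summarise_outliers` are hypothetical, and the ten-value columns are made up so the result is easy to check by hand.

```r
# Return the values lying more than 1.5 * IQR outside the quartiles.
iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  lower <- q[[1]] - 1.5 * iqr
  upper <- q[[2]] + 1.5 * iqr
  x[!is.na(x) & (x < lower | x > upper)]
}

# Count the outliers and report the smallest and largest of them.
summarise_outliers <- function(x) {
  out <- iqr_outliers(x)
  c(
    "Number of Outliers" = length(out),
    "Min Outlier" = if (length(out) > 0) min(out) else NA_real_,
    "Max Outlier" = if (length(out) > 0) max(out) else NA_real_
  )
}

# Made-up data: one high Wind Speed value and one low Humidity value.
dat <- data.frame(
  WindSpeed = c(5, 6, 7, 8, 9, 10, 11, 12, 13, 60),
  Humidity  = c(0.05, 0.70, 0.72, 0.75, 0.78, 0.80, 0.82, 0.85, 0.88, 0.90)
)

outlier_table <- t(sapply(dat, summarise_outliers))
print(outlier_table)
```

With these made-up values, the Wind Speed fences are 0.5 and 18.5, so only the value 60 is flagged; for Humidity only 0.05 falls below the lower fence.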
Next, it's important to check for correlation between the columns in the dataset. Below is a color- and size-coded chart of the correlations. I isolated the five columns of interest into their own data frame and calculated the pairwise correlations, then passed the correlation matrix to the corrplot() function, which displays it visually. A blue circle means a positive correlation, while a red circle means a negative one. The larger the circle, the stronger the correlation.
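A sketch of that workflow is shown here on synthetic data whose relationships were chosen to loosely resemble the ones in the report (an assumption made for illustration). The corrplot package is optional in this sketch, so the plot is only drawn if the package is installed.

```r
# Synthetic stand-in data with a built-in Temp/Humidity relationship.
set.seed(382)
n <- 500
temp <- rnorm(n, mean = 12, sd = 9)
weather_subset <- data.frame(
  Pressure     = rnorm(n, mean = 1016, sd = 8),
  Temp         = temp,
  ApparentTemp = -2.4 + 1.1 * temp + rnorm(n, sd = 1.3),
  Humidity     = pmin(pmax(0.75 - 0.012 * (temp - 12) + rnorm(n, sd = 0.1), 0), 1),
  WindSpeed    = rexp(n, rate = 1 / 10)
)

# Pairwise Pearson correlations between the five columns.
cor_matrix <- cor(weather_subset)
print(round(cor_matrix, 2))

# Circle plot: color encodes the sign, size encodes the magnitude.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "circle")
}
```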
The highest correlation by far was between Temperature and Apparent Temperature, which is positive. This makes sense, since Apparent Temperature is the temperature perceived by humans and Temperature is the actual temperature. There is also a moderately strong negative correlation between Humidity and both Temperature and Apparent Temperature. This is consistent with what scientists know about the relationship between humidity and temperature. According to The Chicago Tribune,
"Relative humidity changes when temperatures change. Because warm air can hold more water vapor than cool air, relative humidity falls when the temperature rises if no moisture is added to the air."
I began my regression analysis with a very simple linear regression. I wanted to see how accurately we could predict Apparent Temperature in the area using only the Temperature. Following is the linear model using these two columns. Here, the code fits a linear equation that predicts the Apparent Temperature from the Temperature, based on the trend in the provided dataset.
##
## Call:
## lm(formula = dat$ApparentTemp ~ dat$Temp)
##
## Coefficients:
## (Intercept)     dat$Temp
##      -2.410        1.112
With this model, the predicted Apparent Temperature is -2.41 degrees Celsius when the actual Temperature is 0 degrees Celsius. For every one degree increase in Temperature, the predicted Apparent Temperature increases by 1.112 degrees.
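The fit and its interpretation can be reproduced in miniature as follows. The data are synthetic, generated from a line with intercept -2.4 and slope 1.1 (assumed values picked to resemble the reported coefficients), and the model is written with `data = dat` rather than `dat$` so that predict() accepts new temperatures.

```r
# Synthetic data from an assumed line: ApparentTemp = -2.4 + 1.1 * Temp + noise.
set.seed(382)
n <- 300
dat <- data.frame(Temp = runif(n, min = -20, max = 40))
dat$ApparentTemp <- -2.4 + 1.1 * dat$Temp + rnorm(n, sd = 1.3)

# Simple linear regression of Apparent Temperature on Temperature.
simple_model <- lm(ApparentTemp ~ Temp, data = dat)
print(coef(simple_model))

# The intercept is the prediction at Temp = 0; each extra degree of Temp
# adds one slope's worth to the prediction.
preds <- predict(simple_model, newdata = data.frame(Temp = c(0, 10)))
print(preds)
```

Writing the formula as `dat$ApparentTemp ~ dat$Temp`, as in the output above, gives the same coefficients, but predict() would then ignore `newdata`, which is why the sketch uses the `data` argument.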
The following plot shows the values of Temperature and Apparent Temperature from the original dataset, with the regression line in red. The regression line was fitted with the lm() command, which chooses the intercept and slope that minimize the sum of the squared residuals.
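A plot like this can be drawn with plot() and abline(), as in the sketch below on synthetic data (the generating line and noise level are assumptions). The helper `rss` is hypothetical and is included only to illustrate the least-squares idea: moving the fitted line increases the residual sum of squares.

```r
# Synthetic data from an assumed line plus noise.
set.seed(382)
n <- 300
dat <- data.frame(Temp = runif(n, min = -20, max = 40))
dat$ApparentTemp <- -2.4 + 1.1 * dat$Temp + rnorm(n, sd = 1.3)
simple_model <- lm(ApparentTemp ~ Temp, data = dat)

# Scatter plot with the fitted line in red, written to a temporary PDF.
plot_file <- file.path(tempdir(), "regression_line.pdf")
pdf(plot_file)
plot(dat$Temp, dat$ApparentTemp,
     xlab = "Temperature (C)", ylab = "Apparent Temperature (C)")
abline(simple_model, col = "red", lwd = 2)
dev.off()

# Residual sum of squares for an arbitrary intercept and slope.
rss <- function(intercept, slope) {
  sum((dat$ApparentTemp - (intercept + slope * dat$Temp))^2)
}
b <- coef(simple_model)
rss_fit     <- rss(b[[1]], b[[2]])        # the least-squares line
rss_shifted <- rss(b[[1]] + 0.5, b[[2]])  # same slope, shifted up by 0.5
print(c(fit = rss_fit, shifted = rss_shifted))
```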
The ANOVA table for this regression model is shown below. The mean squared error of this model is about 1.68 (162,107 / 96,451), which the table displays rounded to 2.
## Analysis of Variance Table
##
## Response: dat$ApparentTemp
##              Df   Sum Sq  Mean Sq F value    Pr(>F)
## dat$Temp      1 10874176 10874176 6469964 < 2.2e-16 ***
## Residuals 96451   162107        2
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
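The mean squared error is the residual sum of squares divided by the residual degrees of freedom, so it can be read from the "Residuals" row of anova() or recomputed from the two numbers printed in the table. The sketch below does both; the fitted model uses synthetic data (an assumption), while `reported_mse` uses only the values shown above.

```r
# Synthetic data and a simple model, only to demonstrate the anova() call.
set.seed(382)
n <- 300
dat <- data.frame(Temp = runif(n, min = -20, max = 40))
dat$ApparentTemp <- -2.4 + 1.1 * dat$Temp + rnorm(n, sd = 1.3)
simple_model <- lm(ApparentTemp ~ Temp, data = dat)

# Mean squared error from the ANOVA table of the synthetic fit.
anova_table <- anova(simple_model)
mse <- anova_table["Residuals", "Mean Sq"]

# The same quantity computed directly from the residuals.
mse_by_hand <- sum(residuals(simple_model)^2) / df.residual(simple_model)
print(c(anova = mse, by_hand = mse_by_hand))

# Unrounded MSE implied by the reported table: Sum Sq / Df for the residuals.
reported_mse <- 162107 / 96451
print(reported_mse)
```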
Apparent Temperature is the temperature perceived by humans. In addition to the actual temperature, factors such as Humidity, Pressure and Wind Speed can affect the temperature that a person perceives, so I chose to include those variables in the regression as well. The boxplots and histograms showed little variation in Pressure between the observations, but it is still included in the model below.
##
## Call:
## lm(formula = dat$ApparentTemp ~ dat$Temp + dat$Humidity + dat$WindSpeed +
##     dat$Pressure)
##
## Coefficients:
##   (Intercept)       dat$Temp   dat$Humidity  dat$WindSpeed   dat$Pressure
##    -2.5300486      1.1259596      1.0573057     -0.0946956      0.0001954
The final regression equation is Apparent Temperature = -2.53005 + 1.12596(Temperature) + 1.05731(Humidity) - 0.09470(WindSpeed) + 0.00020(Pressure). Holding the other predictors fixed, a 1 unit increase in Humidity raises the predicted Apparent Temperature by just over 1 degree Celsius. Humidity is given as a decimal in this dataset, so an increase of 10 percentage points (0.10) raises the Apparent Temperature by about 0.106 degrees Celsius. For an increase in Wind Speed of 1 km/hour, the temperature is perceived as about 0.095 degrees Celsius lower. Pressure is measured in millibars in this dataset, so an increase in Pressure of 1 millibar is associated with an increase in perceived temperature of about 0.0002 degrees Celsius. This is a small effect: even across the full range of Pressure in the dataset, from 0 to 1046.38 millibars, it amounts to only about 0.2 degrees Celsius.
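To make the coefficient arithmetic concrete, the sketch below plugs the reported coefficients into a small prediction function. The function name `predict_apparent` is hypothetical, and the example inputs are the rounded first-row values from the column listing near the top of the report, so the result is only approximate.

```r
# Coefficients as reported in the model output above.
coefs <- c(
  Intercept = -2.5300486,
  Temp      =  1.1259596,
  Humidity  =  1.0573057,
  WindSpeed = -0.0946956,
  Pressure  =  0.0001954
)

# Evaluate the regression equation for given conditions.
predict_apparent <- function(temp, humidity, wind_speed, pressure) {
  coefs[["Intercept"]] +
    coefs[["Temp"]] * temp +
    coefs[["Humidity"]] * humidity +
    coefs[["WindSpeed"]] * wind_speed +
    coefs[["Pressure"]] * pressure
}

# Effect of raising Humidity by 0.10 (10 percentage points).
humidity_effect <- coefs[["Humidity"]] * 0.10

# Effect of Pressure across its whole observed range (0 to 1046.38 millibars).
pressure_range_effect <- coefs[["Pressure"]] * (1046.38 - 0)

print(c(humidity = humidity_effect, pressure = pressure_range_effect))

# Prediction for the (rounded) first observation in the dataset.
print(predict_apparent(temp = 9.47, humidity = 0.89,
                       wind_speed = 14.12, pressure = 1015))
```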
## Analysis of Variance Table
##
## Response: dat$ApparentTemp
##                  Df   Sum Sq  Mean Sq    F value    Pr(>F)
## dat$Temp          1 10874176 10874176 9.3327e+06 < 2.2e-16 ***
## dat$Humidity      1    11512    11512 9.8805e+03 < 2.2e-16 ***
## dat$WindSpeed     1    38165    38165 3.2755e+04 < 2.2e-16 ***
## dat$Pressure      1       50       50 4.3148e+01 5.102e-11 ***
## Residuals     96448   112379        1
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the ANOVA table above for the second model, the mean squared error is about 1.17 (112,379 / 96,448, displayed rounded to 1), lower than that of the first model. Including Humidity, Pressure and Wind Speed helps to better predict the Apparent Temperature.
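The comparison between the two models can be sketched as follows. The data are synthetic, with assumed effects for Humidity and Wind Speed and no true Pressure effect, so the numbers will not match the report; the helper `mse` is hypothetical.

```r
# Synthetic data with assumed Temp, Humidity and WindSpeed effects.
set.seed(382)
n <- 1000
dat <- data.frame(
  Temp      = runif(n, min = -20, max = 40),
  Humidity  = runif(n, min = 0.3, max = 1),
  WindSpeed = rexp(n, rate = 1 / 10),
  Pressure  = rnorm(n, mean = 1016, sd = 8)
)
dat$ApparentTemp <- -2.5 + 1.1 * dat$Temp + 1.0 * dat$Humidity -
  0.1 * dat$WindSpeed + rnorm(n, sd = 1)

simple_model <- lm(ApparentTemp ~ Temp, data = dat)
full_model   <- lm(ApparentTemp ~ Temp + Humidity + WindSpeed + Pressure,
                   data = dat)

# Mean squared error: residual sum of squares over residual degrees of freedom.
mse <- function(model) sum(residuals(model)^2) / df.residual(model)
mse_simple <- mse(simple_model)
mse_full   <- mse(full_model)
print(c(simple = mse_simple, full = mse_full))

# F-test for whether the extra predictors improve on the simple model.
print(anova(simple_model, full_model))
```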
##
## Call:
## lm(formula = dat$ApparentTemp ~ dat$Temp + dat$Humidity + dat$WindSpeed +
##     dat$Pressure)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.3229 -0.7156 -0.1073  0.6837  5.3715
##
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)
## (Intercept)   -2.530e+00  3.883e-02  -65.154  < 2e-16 ***
## dat$Temp       1.126e+00  4.772e-04 2359.502  < 2e-16 ***
## dat$Humidity   1.057e+00  2.393e-02   44.182  < 2e-16 ***
## dat$WindSpeed -9.470e-02  5.249e-04 -180.420  < 2e-16 ***
## dat$Pressure   1.954e-04  2.975e-05    6.569  5.1e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.079 on 96448 degrees of freedom
## Multiple R-squared:  0.9898, Adjusted R-squared:  0.9898
## F-statistic: 2.344e+06 on 4 and 96448 DF,  p-value: < 2.2e-16
I also used the stepAIC() function to check whether each of the independent variables belongs in the model. In the summary above, all four have very low p-values, indicating that they are all significant. Shown is the result of backward elimination, but forward selection leads to the same conclusion. Backward elimination in stepAIC() works by removing each variable from the linear model in turn and comparing the AIC of the reduced model with that of the current one. If no removal lowers the AIC, every variable is kept, as is the case with all four predictor variables here.
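Both search directions can be sketched as below. The data are synthetic with assumed effects, so which variables survive here says nothing about the real dataset. The sketch uses MASS::stepAIC() when the MASS package is available and otherwise falls back to base R's step(), which performs the same kind of AIC-based stepwise search.

```r
# Synthetic data with assumed Temp, Humidity and WindSpeed effects.
set.seed(382)
n <- 2000
dat <- data.frame(
  Temp      = runif(n, min = -20, max = 40),
  Humidity  = runif(n, min = 0.3, max = 1),
  WindSpeed = rexp(n, rate = 1 / 10),
  Pressure  = rnorm(n, mean = 1016, sd = 8)
)
dat$ApparentTemp <- -2.5 + 1.1 * dat$Temp + 1.0 * dat$Humidity -
  0.1 * dat$WindSpeed + rnorm(n, sd = 1)

# Prefer MASS::stepAIC(); fall back to stats::step() if MASS is missing.
step_fun <- if (requireNamespace("MASS", quietly = TRUE)) MASS::stepAIC else stats::step

# Backward elimination: start from the full model and try dropping terms.
full_model <- lm(ApparentTemp ~ Temp + Humidity + WindSpeed + Pressure,
                 data = dat)
backward_fit <- step_fun(full_model, direction = "backward", trace = FALSE)

# Forward selection: start from the intercept-only model and try adding terms.
null_model <- lm(ApparentTemp ~ 1, data = dat)
forward_fit <- step_fun(
  null_model,
  scope = list(lower = ~ 1, upper = ~ Temp + Humidity + WindSpeed + Pressure),
  direction = "forward",
  trace = FALSE
)

print(formula(backward_fit))
print(formula(forward_fit))
```

Setting `trace = TRUE` prints the AIC of each candidate model at every step, which shows why a term was kept or dropped.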
In conclusion, many factors come into play in people's perception of the temperature. In addition to the temperature itself, environmental factors such as wind speed, humidity, and air pressure can all influence the apparent temperature. In the future, it would be interesting to continue this analysis on some of the other columns in the dataset. For example, the dataset includes a categorical variable for precipitation (rain, snow, or null) which could be turned into dummy variables and included in the model. It would also be interesting to see how accurately the model created here predicts apparent temperature in other cities. Location-specific factors that were held constant here, such as longitude/latitude or elevation, could hinder its accuracy as-is in different locations.
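As a small illustration of the dummy-variable idea, model.matrix() can expand a three-level precipitation factor into indicator columns. The four example values and the choice of "null" as the baseline level are assumptions made for this sketch.

```r
# Made-up precipitation values; "null" is taken as the baseline level.
precip <- data.frame(
  PrecipType = factor(c("rain", "snow", "null", "rain"),
                      levels = c("null", "rain", "snow"))
)

# One indicator column per non-baseline level, plus the intercept column.
dummies <- model.matrix(~ PrecipType, data = precip)
print(dummies)
```

Passing a factor directly in an lm() formula performs the same expansion automatically.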