- Mastering Machine Learning with R
- Cory Lesmeister
- 2211字
- 2021-07-09 21:28:20
Univariate linear regression
We begin by looking at a simple way to predict a quantitative response, Y, with one predictor variable, x, assuming that Y has a linear relationship with x. The model for this can be written as, Y = B0 + B1x + e. We can state it as the expected value of Y being a function of the parameters B0 (the intercept) plus B1 (the slope) times x, plus an error term. The least squares approach chooses the model parameters that minimize the Residual Sum of Squares (RSS) of the predicted y values versus the actual Y values. For a simple example, let's say we have the actual values of Y1 and Y2 equal to 10 and 20 respectively, along with the predictions of y1 and y2 as 12 and 18. To calculate RSS, we add the squared differences RSS = (Y1 – y1)2 + (Y2 – y2)2, which, with simple substitution, yields (10 – 12)2 + (20 – 18)2 = 8.
I once remarked to a peer during our Lean Six Sigma Black Belt training that it's all about the sum of squares; understand the sum of squares and the rest will flow naturally. Perhaps that is true, at least to some extent.
Before we begin with an application, I want to point out that if you read the headlines of various research breakthroughs, do so with a jaded eye and a skeptical mind as the conclusion put forth by the media may not be valid. As we shall see, R, and any other software for that matter, will give us a solution regardless of the inputs. However, just because the math makes sense and a high correlation or R-squared statistic is reported, doesn't mean that the conclusion is valid.
To drive this point home, a look at the famous Anscombe dataset available in R is in order. The statistician Francis Anscombe produced this set to highlight the importance of data visualization and outliers when analyzing data. It consists of four pairs of X and Y variables that have the same statistical properties, but when plotted, show something very different. I have used the data to train colleagues and to educate business partners on the hazards of fixating on statistics without exploring the data and checking assumptions. I think this is a good place to start with the following R code should you have a similar need. It is a brief tangent before moving on to serious modeling.
> #call up and explore the data > data(anscombe) > attach(anscombe) > anscombe x1 x2 x3 x4 y1 y2 y3 y4 1 10 10 10 8 8.04 9.14 7.46 6.58 2 8 8 8 8 6.95 8.14 6.77 5.76 3 13 13 13 8 7.58 8.74 12.74 7.71 4 9 9 9 8 8.81 8.77 7.11 8.84 5 11 11 11 8 8.33 9.26 7.81 8.47 6 14 14 14 8 9.96 8.10 8.84 7.04 7 6 6 6 8 7.24 6.13 6.08 5.25 8 4 4 4 19 4.26 3.10 5.39 12.50 9 12 12 12 8 10.84 9.13 8.15 5.56 10 7 7 7 8 4.82 7.26 6.42 7.91 11 5 5 5 8 5.68 4.74 5.73 6.89
As we shall see, each of the pairs has the same correlation coefficient of 0.816
. The first two are as follows:
> cor(x1, y1) #correlation of x1 and y1 [1] 0.8164205 > cor(x2, y1) #correlation of x2 and y2 [1] 0.8164205
The real insight here, as Anscombe intended, is when we plot all the four pairs together, as follows:
> par(mfrow=c(2,2)) #create a 2x2 grid for plotting > plot(x1, y1, main="Plot 1") > plot(x2, y2, main="Plot 2") > plot(x3, y3, main="Plot 3") > plot(x4, y4, main="Plot 4")
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
The output of the preceding code is as follows:
As we can see, Plot 1 appears to have a true linear relationship, Plot 2 is curvilinear, Plot 3 has a dangerous outlier, and Plot 4 is driven by the one outlier. There you have it, a cautionary tale of sorts.
Business understanding
The data collected measures two variables. The goal is to model the water yield (in inches) of the Snake River Watershed in Wyoming as a function of the water content of the year's snowfall. This forecast will be useful in managing the water flow and reservoir levels as the Snake River provides the much needed irrigation water for the farms and ranches of several western states. The dataset snake
is available in the alr3
package (note that alr stands for applied linear regression):
> install.packages("alr3") > library(alr3) > data(snake) > attach(snake) > dim(snake) [1] 17 2 > head(snake) X Y 1 23.1 10.5 2 32.8 16.7 3 31.8 18.2 4 32.0 17.0 5 30.4 16.3 6 24.0 10.5
Now that we have 17
observations, data exploration can begin. But first, let's change X
and Y
into meaningful variable names, as follows:
> names(snake) = c("content", "yield") > attach(snake) #reattach data with new names > head(snake) content yield 1 23.1 10.5 2 32.8 16.7 3 31.8 18.2 4 32.0 17.0 5 30.4 16.3 6 24.0 10.5 > plot(content, yield, xlab="water content of snow", ylab="water yield")
The output of the preceding code is as follows:
This is an interesting plot as the data is linear, and has a slight curvilinear shape driven by two potential outliers at both ends of the extreme. As a result, a transformation of the data or deletion of an outlying observation may be warranted.
To perform a linear regression in R, one uses the lm()
function to create a model in the standard form of fit = lm(Y~X). You can then test your assumptions using various functions on your fitted model by using the following code:
> yield.fit = lm(yield~content) > summary(yield.fit) Call: lm(formula = yield ~ content) Residuals: Min 1Q Median 3Q Max -2.1793 -1.5149 -0.3624 1.6276 3.1973 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.72538 1.54882 0.468 0.646 content 0.49808 0.04952 10.058 4.63e-08 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1.743 on 15 degrees of freedom Multiple R-squared: 0.8709, Adjusted R-squared: 0.8623 F-statistic: 101.2 on 1 and 15 DF, p-value: 4.632e-08
With the summary()
function, we can examine a number of items including the model specification, descriptive statistics about the residuals, the coefficients, codes to model significance, and a summary on model error and fit. Right now, let's focus on the parameter coefficient estimates, see if our predictor variable has a significant p-value, and if the overall model F-test has a significant p-value. Looking at the parameter estimates, the model tells us that the yield
is equal to 0.72538
plus 0.49808
times the content
. It can be stated that for every one unit change in the content, the yield will increase by 0.49808
units. F-statistic
is used to test the null hypothesis that the model coefficients are all 0.
Since the p-value
is highly significant, we can reject the null and move on to the t-test for content, which tests the null hypothesis that it is 0. Again, we can reject the null. Additionally, we can see Multiple R-squared
and Adjusted R-squared
values. The Adjusted R-squared
will be covered under the multivariate regression topic, so let's zero in on Multiple R-squared
; here we see that it is 0.8709
. In theory, it can range from 0 to 1 and is a measure of the strength of the association between X and Y. The interpretation in this case is that 87 percent of the variation in the water yield can be explained by the water content of snow. On a side note, R-squared is nothing more than the correlation coefficient of [X, Y] squared.
We can recall our scatterplot, and now add the best fit line produced by our model using the following code:
> plot(content, yield) > abline(yield.fit, lwd=3, col="red")
The output of the preceding code is as follows:
A linear regression model is only as good as the validity of its assumptions, which can be summarized as follows:
- Linearity: This is a linear relationship between the predictor and the response variables. If this relationship is not clearly present, transformations (log, polynomial, exponent and so on) of the X or Y may solve the problem.
- Non-correlation of errors: A common problem in the time series and panel data where en = betan-1; if the errors are correlated, you run the risk of creating a poorly specified model.
- Homoscedasticity: Normally the distributed and constant variance of errors, which means that the variance of the errors is constant across the different values of inputs. Violations of this assumption can create biased coefficient estimates, leading to statistical tests for significance that can be either too high or too low. This, in turn, leads to the wrong conclusion. This violation is referred to as heteroscedasticity.
- No collinearity: No linear relationship between two predictor variables, which is to say that there should be no correlation between the features. This, again, can lead to biased estimates.
- Presence of outliers: Outliers can severely skew the estimation and, ideally, must be removed prior to fitting a model using linear regression; this again can lead to a biased estimate.
As we are building a univariate model not dependent on time, we will concern ourselves only with linearity and heteroscedasticity. The other assumptions will become important in the next section. The best way to initially check the assumptions is by producing plots. The plot()
function, when combined with a linear model fit, will automatically produce four plots allowing you to examine the assumptions. R produces the plots one at a time and you advance through them by hitting the Enter key. It is best to examine all four simultaneously and we do it in the following manner:
> par(mfrow=c(2,2)) > plot(yield.fit)
The output of the preceding code is as follows:
The two plots on the left allow us to examine the homoscedasticity of errors and nonlinearity. What we are looking for is some type of pattern or, more importantly, that no pattern exists. Given the sample size of only 17 observations, nothing obvious can be seen. Common heteroscedastic errors will appear to be u-shaped, inverted u-shaped, or will cluster close together on the left side of the plot and become wider as the fitted values increase (a funnel shape). It is safe to conclude that no violation of homoscedasticity is apparent in our model.
The Normal Q-Q plot in the upper-right corner helps us to determine if the residuals are normally distributed. The Quantile-Quantile (Q-Q), represent the quantile values of one variable plotted against the quantile values of another. It appears that the outliers (observations 7, 9, and 10), may be causing a violation of the assumption. The Residuals vs Leverage plot can tell us what observations, if any, are unduly influencing the model; in other words, if there are any outliers we should be concerned about. The statistic is Cook's distance or Cook's D, and it is generally accepted that a value greater than one should be worthy of further inspection.
What exactly is further inspection? This is where art meets science. The easy way out would be to simply delete the observation, in this case number 9, and redo the model. However, a better option may be to transform the predictor and/or the response variables. If we just delete observation 9, then maybe observations 10 and 13 would fall outside the band of greater than 1. I believe that this is where domain expertise can be critical. More times than I can count, I have found that exploring and understanding the outliers can yield valuable insights. When we first examined the previous scatterplot I pointed out the potential outliers and these happen to be observations number 9 and number 13. As an analyst, it would be critical to discuss with the appropriate subject matter experts to understand why this would be the case. Is it a measurement error? Is there a logical explanation for these observations? I certainly don't know, but this is an opportunity to increase the value that you bring to an organization.
Having said that, we can drill down on the current model by examining, in more detail, the Normal Q-Q plot. R does not provide confidence intervals to the default Q-Q plot, and given our concerns in looking at the base plot, we should check the confidence intervals. The qqPlot()
function of the car
package automatically provides these confidence intervals. Since the car
package is loaded along with the alr3
package, I can produce the plot with one line of code as follows:
> qqPlot(yield.fit)
The output of the preceding code is as follows:
According to the plot, the residuals are normally distributed. I think this can give us some confidence to select the model with all the observations. Clear rationale and judgment would be needed to attempt other models. If we could clearly reject the assumption of normally distributed errors, then we would probably have to examine the variable transformations and/or observation deletion.