Error variation. Total variation in y. Deviation of the y scores from the mean of y. Does this look familiar at all? It should. Oh no! Not again!! Yes, again. Imagine a predictor X that is dichotomous and a criterion (dependent) variable that is continuous. Technically speaking, the equivalence depends on a couple of details: how you decide to code the dichotomous predictor and whether the two groups are of equal size.
But the bottom line is that ANOVA is a special case of regression analysis, because under certain conditions, they are equal. Regression analysis is more general, however, because it is not limited to independent variables that are dichotomous.
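To make the equivalence concrete, here is a minimal sketch with made-up scores for two groups, showing that the regression t-test for a 0/1-coded dichotomous predictor matches the pooled two-sample t-test that underlies a two-group ANOVA (where F = t squared):

```python
import math

# Hypothetical scores for two equal-sized groups
group0, group1 = [3, 5, 4], [7, 6, 8]
x = [0] * len(group0) + [1] * len(group1)   # dummy-coded (0/1) predictor
y = group0 + group1
n = len(y)

# Regression slope and its t-test
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
syy = sum((yi - ybar) ** 2 for yi in y)
b = sxy / sxx                               # equals the difference in group means
s_b = math.sqrt((syy - b * sxy) / (n - 2) / sxx)
t_regression = b / s_b

# Pooled two-sample t-test (a two-group ANOVA gives F = t squared)
n0, n1 = len(group0), len(group1)
m0, m1 = sum(group0) / n0, sum(group1) / n1
ss0 = sum((v - m0) ** 2 for v in group0)
ss1 = sum((v - m1) ** 2 for v in group1)
sp2 = (ss0 + ss1) / (n - 2)                 # pooled variance
t_pooled = (m1 - m0) / math.sqrt(sp2 * (1 / n0 + 1 / n1))

print(round(t_regression, 4), round(t_pooled, 4))  # identical t values
```

With 0/1 coding, the slope b is literally the difference between the two group means, which is why the two tests coincide.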
Although b and r are not exactly equal, their tests are equivalent: if we have tested r, we don't really need to test b (a little later on, however, we will want to test them separately). You should arrive at the same t-test value for both. When we test b for significance, we are testing the null hypothesis that in the population the slope is 0. The population slope is represented by the Greek letter beta (β), so our statistical hypotheses are H0: β = 0 and H1: β ≠ 0. The equation for testing b for significance is

t = b / s_b, with s_b = s_y.x / √(Σ(x − x̄)²)

where s_y.x is the standard error of estimate. As usual, we need an estimate of the standard error of our statistic b to get to the t-test. In the equation above, s_b is the standard error of b, representing the estimate of the variability of the slope from sample to sample.
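As a quick numerical check that testing b and testing r give the same t value, here is a minimal sketch with hypothetical data, using the standard formulas for the slope, its standard error, and the t-test for r:

```python
import math

# Hypothetical toy data (any small x, y sample would do)
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = sxy / sxx                        # slope
sse = syy - b * sxy                  # error sum of squares
s_yx = math.sqrt(sse / (n - 2))      # standard error of estimate
s_b = s_yx / math.sqrt(sxx)          # standard error of the slope
t_b = b / s_b                        # t-test for the slope

r = sxy / math.sqrt(sxx * syy)
t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # t-test for r

print(round(t_b, 4), round(t_r, 4))  # identical: 3.5762 3.5762
```

For this sample, b = 0.9 and r = 0.9 happen to coincide, but the point is the t values: they agree for any data set.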
Notice that the x's are involved in the computation of the standard error of b. That is because we use the x's to get b, our estimate of the slope.

Standardized b

The slope, b, is interpreted as the number of units (points on the scale of y) that y increases for each one-unit increase in x. The slope therefore depends on what scales we are using for x and y.
If we measure fat intake as a percentage of the daily recommended intake rather than in grams of fat, we would wind up with a different slope, because a one-percentage-point change is not the same as a one-gram change. Similarly, if we used a different scale of measurement for cholesterol, the slope would be different.
Notice that in these situations the actual variables being measured (fat intake and cholesterol) are identical either way; they are just represented on different scales. Think about measuring height in feet versus inches: both are measures of height, and they are equivalent measures, but they use different scales. Although using different scales of measurement will affect the slope value, the relationship between x and y will be the same. The correlation will be the same regardless of which scaling you use.
This creates a problem, because the slope value by itself is sometimes not that informative. We need a standardized measure of the slope, just as correlation is a standardized measure of association.
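One standard way to standardize the slope is to multiply b by the ratio of the standard deviations, s_x / s_y; in simple regression this standardized slope equals r, and it is unaffected by rescaling either variable. A sketch with made-up fat-intake and cholesterol numbers:

```python
import math

def slope(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))

def corr(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def sd(v):
    m = sum(v) / len(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

fat_g = [10, 20, 30, 40, 50]        # hypothetical fat intake, in grams
chol  = [150, 180, 170, 210, 220]   # hypothetical cholesterol readings

b = slope(fat_g, chol)
b_std = b * sd(fat_g) / sd(chol)    # standardized slope
print(round(b_std, 4), round(corr(fat_g, chol), 4))   # equal

# Rescale fat intake (say, to a hypothetical percent-of-allowance scale):
fat_pct = [g * 1.5 for g in fat_g]
print(round(slope(fat_pct, chol), 4))   # the raw slope changes...
print(round(corr(fat_pct, chol), 4))    # ...but r does not
```

The 1.5 conversion factor is invented purely for illustration; any linear rescaling of x or y changes b but leaves both r and the standardized slope alone.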
To ensure that you don't fall victim to the most common mistakes, we review a set of seven cautions here. Master these and you'll be a master of the measures! The coefficient of determination r² and the correlation coefficient r quantify the strength of a linear relationship only. Consider the following example. If you didn't understand that r² and r summarize the strength of a linear relationship, you would likely misinterpret the measures, concluding from small values that there is no relationship between x and y.
But it's just not true! There is indeed a relationship between x and y; it's just not linear. The lower plot better reflects the curved relationship between x and y. What is this all about? We'll learn when we study multiple linear regression later in the course that the coefficient of determination r² associated with the simple linear regression model for one predictor extends to a "multiple coefficient of determination," denoted R², for the multiple linear regression model with more than one predictor.
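Returning to the first caution, a tiny synthetic illustration: data that follow a perfect but symmetric curve can have r and r² of exactly zero.

```python
import math

x = list(range(-3, 4))
y = [xi ** 2 for xi in x]   # a perfect quadratic relationship

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)

print(r ** 2)   # 0.0 -- yet y is completely determined by x
```

A plot of these seven points would show the relationship immediately; r² alone says nothing about it.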
The lowercase r and uppercase R are used to distinguish between the two situations. Statistical software typically doesn't distinguish between the two, calling both measures "R²." A large r² value should not be interpreted as meaning that the estimated regression line fits the data well; another function might better describe the trend in the data. Consider the following example, in which the relationship between year (in decades) and the population of the United States (in millions) is examined.
The correlation between year and population, and with it the r² value, is very high. The plot suggests, though, that a curve would describe the relationship even better. That is, the large r² value should not be interpreted to mean that the fitted line is the best possible model; its large value does suggest that taking year into account is better than not doing so.
It just doesn't tell us that we could still do better. Again, the r² value doesn't tell us that the regression model fits the data well. This is the most common misuse of the r² value! When you are reading the literature in your research area, pay close attention to how others interpret r²; I am confident that you will find some authors misinterpreting it in this way.
And when you are analyzing your own data, make sure you plot the data: 99 times out of a hundred, the plot will tell more of the story than a simple summary measure like r or r² ever could. The coefficient of determination r² and the correlation coefficient r can both be greatly affected by just one data point (or a few data points).
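This caution is easy to demonstrate with made-up numbers before looking at real data: a single extreme point can turn a weak correlation into a seemingly very strong one.

```python
import math

def corr(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [3, 1, 4, 1, 5]                  # essentially unrelated to x
print(round(corr(x, y) ** 2, 3))     # small r-squared

# Add a single extreme point far from the rest of the data:
print(round(corr(x + [50], y + [50]) ** 2, 3))   # r-squared jumps close to 1
```

Nothing about the first five points changed; one influential observation did all the work.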
Consider the following example, in which the relationship between the number of deaths in an earthquake and its magnitude is examined. The apparent correlation between deaths and magnitude is driven largely by a single unusual data point; the second plot shows the same data with that point removed, and the relationship looks quite different.

In some situations the variables under consideration have very strong and intuitively obvious relationships, while in other situations you may be looking for very weak signals in very noisy data.
The decisions that depend on the analysis could have either narrow or wide margins for prediction error, and the stakes could be small or large. For example, in medical research, a new drug treatment might have highly variable effects on individual patients, in comparison to alternative treatments, and yet have statistically significant benefits in an experimental study of thousands of subjects.
Even in the context of a single statistical decision problem, there may be many ways to frame the analysis, resulting in different standards and expectations for the amount of variance to be explained in the linear regression stage. We have seen by now that there are many transformations that may be applied to a variable before it is used as a dependent variable in a regression model: deflation, logging, seasonal adjustment, differencing. All of these transformations will change the variance and may also change the units in which variance is measured.
Logging completely changes the units of measurement: roughly speaking, the error measures become percentages rather than absolute amounts. Deflation and seasonal adjustment also change the units of measurement, and differencing usually reduces the variance dramatically when applied to nonstationary time series data.
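A quick sketch of the differencing point, using a simulated trending series (the growth rates are invented for illustration): the variance of the differenced series is a small fraction of the variance of the original levels.

```python
import random
import statistics

random.seed(0)

# Simulate a nonstationary series whose level grows in percentage terms
y = [100.0]
for _ in range(199):
    y.append(y[-1] * (1 + random.gauss(0.01, 0.02)))

dy = [b - a for a, b in zip(y, y[1:])]   # first differences

ratio = statistics.pvariance(y) / statistics.pvariance(dy)
print(ratio > 10)   # True: differencing removed most of the variance
```

Any model fit to the differenced series starts from a much smaller pool of variance, so its R-squared is not comparable to that of a model fit to the levels.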
Therefore, if the dependent variable in the regression model has already been transformed in some way, it is possible that much of the variance has already been "explained" merely by that process. With respect to which variance should improvement be measured in such cases: that of the original series, the deflated series, the seasonally adjusted series, the differenced series, or the logged series? You cannot meaningfully compare R-squared between models that have used different transformations of the dependent variable, as the example below will illustrate.
Moreover, variance is a hard quantity to think about because it is measured in squared units (dollars squared, beer cans squared, ...). It is easier to think in terms of standard deviations, because they are measured in the same units as the variables and they directly determine the widths of confidence intervals.
The proportional reduction in the standard error of the regression implied by a given R-squared is equal to one minus the square root of 1-minus-R-squared. You should ask yourself: is a given increase in R-squared worth the increase in model complexity? Only a fairly large increase begins to rise to the level of a perceptible reduction in the widths of confidence intervals.
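Since the conversion is a one-liner, here is a sketch that recomputes it for a few R-squared values:

```python
import math

for r2 in (0.10, 0.25, 0.50, 0.75, 0.90, 0.99):
    shrink = 1 - math.sqrt(1 - r2)   # proportional reduction in standard error
    print(f"R-squared = {r2:.2f} -> standard error shrinks by {100 * shrink:4.1f}%")
```

For example, an R-squared of 0.50 only narrows the standard error, and hence confidence intervals, by about 29 percent; it takes an R-squared above 0.90 to cut interval widths by more than two-thirds.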
When adding more variables to a model, you need to think about the cause-and-effect assumptions that implicitly go with them, and you should also look at how their addition changes the estimated coefficients of other variables.
Do they become easier to explain, or harder? If not, your problems may lie elsewhere. How high does R-squared need to be? That depends on the decision-making situation, on your objectives or needs, and on how the dependent variable is defined. The following section gives an example that highlights these issues.
An example in which R-squared is a poor guide to analysis: consider U.S. monthly auto sales. Suppose that the objective of the analysis is to predict monthly auto sales from monthly total personal income. I am using these variables and this antiquated date range for two reasons: (i) this very silly example was used to illustrate the benefits of regression analysis in a textbook that I was using in that era, and (ii) I have seen many students undertake self-designed forecasting projects in which they blindly fitted regression models using macroeconomic indicators such as personal income, gross domestic product, unemployment, and stock prices as predictors of nearly everything, the logic being that they reflect the general state of the economy and therefore have implications for every kind of business activity.
Perhaps so, but the question is whether they do so in a linear, additive fashion that stands out against the background noise in the variable to be predicted, whether they adequately explain time patterns in the data, and whether they yield useful predictions and inferences in comparison to other ways you might choose to spend your time. There is no seasonality in the income data; in fact, there is almost no pattern in it at all, except for a trend that increased slightly in the earlier years.
This is not a good sign if we hope to get forecasts that have any specificity. By comparison, the seasonal pattern is the most striking feature of the auto sales, so the first thing that needs to be done is to seasonally adjust the latter. Seasonally adjusted auto sales (independently obtained from the same government source) and personal income line up closely when plotted on the same graph. The strong and generally similar-looking trends suggest that we will get a very high value of R-squared if we regress sales on income, and indeed we do.
The summary table for that regression bears this out. However, a result like this is to be expected when regressing a strongly trended series on any other strongly trended series, regardless of whether they are logically related.
The line fit plot and the residuals-versus-time plot for the model indicate that the model has some terrible problems. First, there is very strong positive autocorrelation in the errors: consecutive errors tend to have the same sign, and the lag-1 autocorrelation is very high. It is clear why this happens: the two curves do not have exactly the same shape.
The trend in the auto sales series tends to vary over time, while the trend in income is much more consistent, so the two variables get out of sync with each other. This is typical of nonstationary time series data. And finally, the local variance of the errors increases steadily over time. The reason is that random variations in auto sales (like most other measures of macroeconomic activity) tend to be consistent over time in percentage terms rather than absolute terms, and the absolute level of the series has risen dramatically due to a combination of inflationary growth and real growth.
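Both problems, the inflated R-squared from shared trends and the strongly autocorrelated residuals, can be reproduced with two simulated, logically unrelated random-walk series (stand-ins with made-up parameters, not the actual sales and income data):

```python
import random

random.seed(1)
n = 120

# Two unrelated nonstationary series: random walks with drift
sales, income = [50.0], [10.0]
for _ in range(n - 1):
    sales.append(sales[-1] + 0.8 + random.gauss(0, 0.5))
    income.append(income[-1] + 0.5 + random.gauss(0, 0.5))

# Simple regression of sales on income
xbar, ybar = sum(income) / n, sum(sales) / n
sxx = sum((x - xbar) ** 2 for x in income)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(income, sales))
syy = sum((y - ybar) ** 2 for y in sales)
b = sxy / sxx
a = ybar - b * xbar
resid = [y - (a + b * x) for x, y in zip(income, sales)]

r_squared = sxy * sxy / (sxx * syy)
print(round(r_squared, 3))   # very high, despite no causal link at all

# Lag-1 autocorrelation of the residuals
m = sum(resid) / n
lag1 = (sum((resid[t] - m) * (resid[t + 1] - m) for t in range(n - 1))
        / sum((e - m) ** 2 for e in resid))
print(round(lag1, 3))        # strongly positive
```

The shared drifts do all the explanatory work, and because the wandering components of the two walks never line up, the residuals stay on one side of the line for long stretches.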
As the level has grown, the variance of the random fluctuations has grown with it. Confidence intervals for forecasts in the near future will therefore be way too narrow, being based on average error sizes over the whole history of the series. So, despite the high value of R-squared, this is a very bad model. One way to try to improve the model would be to deflate both series first.
This would at least eliminate the inflationary component of growth, which hopefully will make the variance of the errors more consistent over time. A time series plot of auto sales and personal income after they have been deflated (by dividing them by a U.S. price index) shows that this does indeed flatten out the trend somewhat, and it also brings out some fine detail in the month-to-month variations that was not so apparent in the original plot.
In particular, we begin to see some small bumps and wiggles in the income data that roughly line up with larger bumps and wiggles in the auto sales data. If we fit a simple regression model to these two deflated variables, the following results are obtained.