Akaike’s information criterion (AIC) is a measure of the goodness of fit of an estimated statistical model. The AIC was developed by Hirotsugu Akaike, under the name ``an information criterion'', in 1971. The AIC is a method of comparing two or more candidate regression models: it attempts to find the model that best explains the data with a minimum of parameters (i.e. in keeping with the law of parsimony).
The AIC is calculated from the ``likelihood function'' and the number of parameters (the likelihood function is not on this course). The likelihood value is generally given in the code output, alongside the AIC. Given a data set, several competing models may be ranked according to their AIC, with the one having the lowest AIC being the best. (A difference in AIC values of less than two is, however, considered negligible.)
\[\mbox{AIC} = 2p - 2\ln(L)\]
where \(p\) is the number of estimated parameters and \(L\) is the maximised value of the likelihood function.
An alternative to the AIC is the Schwarz Bayesian information criterion (BIC), which additionally takes the sample size \(n\) into account.
\[\mbox{BIC} = p\ln{n} - 2\ln(L)\]
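A minimal illustrative sketch of ranking two candidate regression models by AIC and BIC, using simulated data and the Python packages \texttt{numpy} and \texttt{statsmodels} (the data and model names are purely for demonstration):
\begin{verbatim}
# Illustrative sketch with simulated data: rank two candidate
# regression models by AIC and BIC.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 + 3 * x1 + rng.normal(size=n)     # x2 is irrelevant by construction

m1 = sm.OLS(y, sm.add_constant(x1)).fit()                          # y ~ x1
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()   # y ~ x1 + x2

# statsmodels reports AIC/BIC directly; they can also be reproduced
# from the log-likelihood (llf) and the number of coefficients p.
for name, m in [("y ~ x1", m1), ("y ~ x1 + x2", m2)]:
    p = int(m.df_model) + 1                 # coefficients incl. intercept
    aic = 2 * p - 2 * m.llf                 # AIC = 2p - 2 ln(L)
    bic = p * np.log(n) - 2 * m.llf         # BIC = p ln(n) - 2 ln(L)
    print(f"{name}: AIC = {aic:.2f} (reported {m.aic:.2f}), "
          f"BIC = {bic:.2f} (reported {m.bic:.2f})")
\end{verbatim}
The model with the lower AIC (here, usually the simpler model containing only \texttt{x1}) would be preferred, bearing in mind that a difference of less than about two is treated as negligible.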
Multicollinearity occurs when two or more independent variables in the model are highly correlated and, as a consequence, provide redundant information about the response when placed together in a model.
(Everyday examples of multicollinear independent variables are height and weight of a person, years of education and income, and assessed value and square footage of a home.)
Overfitting occurs when a statistical model does not adequately describe the underlying relationship between the variables in a regression model. Overfitting generally occurs when the model is excessively complex, such as having too many parameters (i.e. predictor variables) relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
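A minimal illustrative sketch of overfitting (simulated data, Python with \texttt{numpy} only): a ninth-degree polynomial fitted to ten noisy points from a linear relationship reproduces the training data almost exactly, but predicts new observations poorly.
\begin{verbatim}
# Illustrative sketch with simulated data: a high-degree polynomial
# overfits ten noisy observations from a truly linear relationship.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.3, size=10)
x_test = np.linspace(0, 1, 50)
y_test = 2 * x_test + rng.normal(scale=0.3, size=50)

for degree in (1, 9):    # degree 9 means 10 parameters for 10 observations
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, "
          f"test MSE = {test_mse:.3f}")

# The degree-9 fit has near-zero training error but typically a much
# larger test error: it has chased minor fluctuations in the data.
\end{verbatim}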
* In multiple regression, two or more predictor variables are collinear if they show strong
linear relationships with one another. Perfect collinearity makes estimation of the regression
coefficients impossible, while strong (but imperfect) collinearity can produce unexpectedly
large estimated standard errors for the coefficients of the X variables involved.
* Multicollinearity occurs when two or more predictors in the model are correlated
and provide redundant information about the response.
* This is why an exploratory analysis of the data should first be carried out, to see if any
collinearity among the explanatory variables exists.
* Multicollinearity is suggested by non-significant results in individual tests on the regression
coefficients for important explanatory (predictor) variables.
* Multicollinearity may make it difficult to determine which predictor variable has the main
effect on the outcome.
* When choosing a predictor variable you should select one that might be correlated with
the criterion variable, but that is not strongly correlated with the other predictor variables.
* Examples of pairs of multicollinear predictors are years of education and income, height and weight of a
person, and assessed value and square footage of a house.
* However, correlations amongst the predictor variables are not unusual. The term
multicollinearity is used to describe the situation when a high correlation is detected between
two or more predictor variables.
* Such high correlations cause problems when trying to draw inferences about the relative
contribution of each predictor variable to the success of the model.
There are two types of multicollinearity:
\begin{enumerate}
\item Structural multicollinearity
\item Data-based multicollinearity
\end{enumerate}
Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor \(x^2\) from the predictor \(x\). Data-based multicollinearity, on the other hand, is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected. In the case of structural multicollinearity, the multicollinearity is induced by what you have done. Data-based multicollinearity is the more troublesome of the two types of multicollinearity. Unfortunately it is the type we encounter most often!
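A minimal illustrative sketch of structural multicollinearity (simulated data, Python with \texttt{numpy} only): the derived predictor \(x^2\) is strongly correlated with \(x\) itself, although centring \(x\) before squaring largely removes the induced correlation.
\begin{verbatim}
# Illustrative sketch with simulated data: the derived predictor x^2
# is strongly correlated with x (structural multicollinearity).
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=200)

print(f"corr(x, x^2) = {np.corrcoef(x, x ** 2)[0, 1]:.3f}")   # close to 1

# Centring x before squaring is a common way to reduce this
# induced correlation.
x_c = x - x.mean()
print(f"corr(x_c, x_c^2) = {np.corrcoef(x_c, x_c ** 2)[0, 1]:.3f}")
\end{verbatim}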
Multicollinearity leads to decreased reliability and predictive power of statistical models, and hence, very often, confusing and misleading results.
* Increased standard error of estimates of the regression coefficients (i.e. decreased reliability of fitted
model).
* Often confusing and misleading results.
* Multicollinearity will be dealt with in a future component of this course: Variable Selection Procedures.
* This issue is not a serious one with respect to the
usefulness of the overall model, but it does affect any attempt to interpret the meaning of the partial regression
coefficients in the model.
In statistics, multicollinearity is the occurrence of several independent variables in a multiple regression model that are closely correlated to one another. Multicollinearity can cause strange results when attempting to study how well individual independent variables contribute to an understanding of the dependent variable. In general, multicollinearity can cause wide confidence intervals and strange p-values for independent variables.
You can also assess multicollinearity in regression in the following ways:
[(1)] Examine the correlations (and, for nominal variables, the associations) between independent variables to detect a high level of association. High bivariate correlations are easy to spot by running correlations among your variables. If high bivariate correlations are present, you can delete one of the two variables. However, this may not always be sufficient.
[(2)] Regression coefficients will change dramatically according to whether other variables are included or excluded from the model. Play around with this by adding and then removing variables from your regression model.
[(3)] The standard errors of the regression coefficients will be large if multicollinearity is an issue.
[(4)] Predictor variables with known, strong relationships to the outcome variable will fail to achieve statistical significance: neither collinear predictor may contribute significantly to the model once the other one is included, even though together they contribute a lot, and removing both variables from the model would make the fit much worse. So the overall model fits the data well, but neither X variable makes a significant contribution when it is added to the model last. When this happens, multicollinearity may be present (a short sketch of checks (1)--(3) follows this list).
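A minimal illustrative sketch of checks (1)--(3) (simulated data, Python with \texttt{numpy} and \texttt{statsmodels}): with two highly correlated predictors, the pairwise correlation is large, and the estimated coefficient of the first predictor, together with its standard error, changes dramatically when the second predictor is dropped.
\begin{verbatim}
# Illustrative sketch with simulated data: two nearly identical
# predictors show a large pairwise correlation, and the coefficient
# and standard error of x1 change dramatically when x2 is dropped.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)      # x2 is almost a copy of x1
y = 1 + 2 * x1 + 2 * x2 + rng.normal(size=n)

print(f"corr(x1, x2) = {np.corrcoef(x1, x2)[0, 1]:.3f}")

both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
x1_only = sm.OLS(y, sm.add_constant(x1)).fit()

print(f"x1 with x2 in the model:  b = {both.params[1]:.2f}, "
      f"se = {both.bse[1]:.2f}")
print(f"x1 with x2 dropped:       b = {x1_only.params[1]:.2f}, "
      f"se = {x1_only.bse[1]:.2f}")
\end{verbatim}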
All variables involved in the linear relationship will have a small tolerance (tolerance, defined below, is the reciprocal of the VIF).
The VIF provides an index that measures how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity.
You should consider the options for breaking up the multicollinearity: collecting additional data, deleting predictors, using different predictors, or using an alternative to least squares regression.
When the individual \(R^2\) values (from regressing each predictor on the remaining predictors) and the VIF values are high for any of the variables in your model, multicollinearity is probably an issue.
When the VIF is high there is high multicollinearity and instability of the b (unstandardised) and beta (standardised) coefficients. It is often difficult to sort this out.
%===============================================%
The variance inflation factor (VIF) is used to detect whether one predictor has a strong linear association with the remaining predictors (the presence of multicollinearity among the predictors).
The VIF measures how much the variance of an estimated regression coefficient increases when the predictors are correlated (multicollinear). A VIF of 1 indicates that a predictor is unrelated to the other predictors; a VIF greater than 1 indicates some degree of correlation with them.
A common rule of thumb is that if the VIF is greater than 5 then multicollinearity is high. Also a VIF level of 10 has been proposed as a cut off value.
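A minimal illustrative sketch of computing VIFs (simulated data, Python with \texttt{numpy} and \texttt{statsmodels}, whose \texttt{variance\_inflation\_factor} function carries out the calculation):
\begin{verbatim}
# Illustrative sketch with simulated data: compute the VIF (and its
# reciprocal, the tolerance) for each predictor; x1 and x2 are
# deliberately collinear, x3 is unrelated to the others.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))  # intercept column included

for idx, name in zip(range(1, X.shape[1]), ["x1", "x2", "x3"]):
    vif = variance_inflation_factor(X, idx)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")

# x1 and x2 should show VIFs well above the rule-of-thumb cut-off of 5,
# while x3 should have a VIF close to 1.
\end{verbatim}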
Tolerance is simply the reciprocal of the VIF, and is computed as
\[ \mbox{Tolerance} = \frac{1}{\mbox{VIF}}\]
Whereas large values of the VIF are undesirable, tolerance works the other way around: since tolerance is the reciprocal of the VIF, larger values of tolerance indicate a lesser problem with collinearity. In other words, we want large tolerances.
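For example, under the common VIF cut-off of 5 the corresponding tolerance is \(1/5 = 0.2\), and the VIF cut-off of 10 corresponds to a tolerance of \(0.1\); tolerances below roughly 0.1--0.2 therefore flag the same collinearity problems as large VIFs.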
\end{document}