For any given dataset there is usually a range of possible regression models, and model selection depends on the variables included and the transformations applied. This article presents criteria useful in selecting the 'best' model and aims to provide a systematic approach.
The usual assumptions of normality and independence are not always guaranteed. The methods described here check the validity of these model assumptions before insights are drawn from the model. Suppose we have a linear regression in the general form
$$ { Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k + \epsilon } $$and we make n independent observations, y1, y2, … yn, on Y. We can write the observations yi as
$$ { y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_k x_{ik} + \epsilon_i } $$where $x_{ij}$ is the value of the jth independent variable for the ith observation, i = 1, 2, ..., n. In matrix form, with $x_{i0} = 1$,
$$ \left[\begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{matrix} \right] = \left[\begin{matrix} 1 & x_{11} & ... & x_{1k} \\ 1 & x_{21} & ... & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & ... & x_{nk} \end{matrix} \right] \left[\begin{matrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{matrix} \right] + \left[\begin{matrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{matrix} \right] $$Given these definitions we proceed to model selection, before solving this matrix equation with least-squares estimation in Part 2.
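For reference, the same system can be written compactly (a standard restatement of the definitions above, nothing new) as
$$ Y = X\beta + \epsilon, $$with Y the n-vector of responses, X the n by (k+1) design matrix whose first column is all ones, beta the vector of k+1 coefficients and epsilon the vector of errors.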
Model Selection and Checking
To determine the 'best' regression equation from a multiple regression model that involves k regressors, X1, X2, X3 … Xk, there are two contradictory criteria.
- We should include as many regressors as possible so that the model is useful for prediction.
- We should exclude as many regressors as possible to save the cost of collecting data.
SELECTION CRITERIA
Selecting the best regression model (the essential regressors) is a reasonable compromise between these two extremes. Here are five popular criteria for assessing candidate models.
- Coefficient of multiple determination (R^2). – Maximise
$$ { R^2 = \frac{ \hat Y' \hat Y - n \bar Y^2 } { Y'Y - n \bar Y^2 } = \frac{ SSRegression}{SSTotal} } $$
$$ { = 1 - \frac{ (Y - \hat Y)' ( Y- \hat Y)} { Y'Y - n \bar Y^2 } = 1 - \frac{ SSError}{SSTotal} } $$
The coefficient of multiple determination, R^2, is the proportion of the variation in Y (about its mean) explained by the regressors in the model.
If SSRegression becomes larger, or equivalently SSError decreases, then R^2 increases.
Including additional independent variables (regressors) always increases R^2, even if their coefficients are not significantly different from zero. So maximising the number of variables included in the model may also maximise R^2, and we should therefore be careful about selecting a model solely on the basis of a larger R^2.
* As a safeguard we can penalise the inclusion of new variables unless they make a significant contribution, using the adjusted coefficient of multiple determination. Here n - 1 is the total degrees of freedom and n - p is the residual degrees of freedom for the ith model, with p the number of fitted parameters.
- Mean square residual. – Minimise
$$ { MSE_i = \frac{SSE_i} { n-p}} $$
where (n - p) is the residual degrees of freedom for the ith model. The most appropriate model is the one with the smallest MSE, that is, the lowest unexplained variation per degree of freedom. The equivalence of MSE to the adjusted R^2 is shown as
$$ { \bar R^2_i = 1 - \frac{SSError_i/(n-p)} { SSTotal/(n-1) } = 1 - \frac{n-1} { n-p } (1- R^2_i)} $$
- Cm statistic (Mallows). – Minimise $$ { C_m = \frac{SSE_m} {s^2} - (n-2m) } $$
- PRESS statistic – Minimise
$$ { PRESS_i = \sum_{j=1}^n (y_j - \hat y_{(j)} )^2 = \sum_{j=1}^n \left( \frac{e_j} {1- h_{jj}} \right)^2 } $$
The Prediction Error Sum of Squares (PRESS) measures how well a model predicts each response when that observation is left out of the fit: here $\hat y_{(j)}$ is the prediction of $y_j$ from the model fitted without observation j, $e_j$ is the ordinary residual and $h_{jj}$ is the jth leverage.
Steps
– Fix the set of regressors in the given model.
– Omit observation j from the data and refit the model to this reduced dataset.
– Predict the omitted observation with the refitted model and record the prediction error.
– Repeat the steps above for each observation j = 1, 2, ..., n.
- AIC. – Minimise
$$ { AIC = -2 \log(\text{likelihood}) + 2p } $$
Akaike's Information Criterion recognises that fitting more and more covariates can only reduce the residual sum of squares, and hence increase R^2, so it uses the penalty term 2p to discourage the introduction of unnecessary covariates.
Similar to MSE, C_m tries to minimise the unexplained variation, through the sum of squared residuals of the candidate model with m parameters and s^2, the mean square residual of the full model. C_m is also closely related to R^2, the proportion of variation explained by the regression.
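As a minimal sketch only (not code from the article), the criteria above could be computed for one candidate model with NumPy as follows; the function name, its arguments and the Gaussian-likelihood form of AIC are my own assumptions.

```python
import numpy as np

def selection_criteria(X, y, s2_full):
    """Selection criteria for one candidate model.

    X: n x p design matrix (first column of ones), y: response vector,
    s2_full: mean square residual of the full model (needed for C_m).
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y                      # least-squares estimates
    e = y - X @ beta                              # ordinary residuals

    sse = e @ e                                   # SSError
    sst = y @ y - n * y.mean() ** 2               # SSTotal (about the mean)
    r2 = 1 - sse / sst                            # R^2
    r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)     # adjusted R^2
    mse = sse / (n - p)                           # mean square residual

    c_m = sse / s2_full - (n - 2 * p)             # Mallows-type C_m with m = p

    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_jj
    press = np.sum((e / (1 - h)) ** 2)            # PRESS via the leverage identity

    aic = n * np.log(sse / n) + 2 * p             # AIC under normal errors, constants dropped
    return {'R2': r2, 'adj_R2': r2_adj, 'MSE': mse,
            'C_m': c_m, 'PRESS': press, 'AIC': aic}
```

In practice these values would be computed for every candidate subset of regressors and compared alongside the guidelines above.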
The following are standard procedures for choosing the 'best' model.
- All possible regressions. Fit the equations for every combination of the regressors and then select the 'best' model based on the previous criteria. This approach can be very time consuming and requires large computing power, since there are 2^k candidate models.
- Backward elimination. Start with all regressors and then choose a good model by eliminating the regressor variables with no or only a small effect on the response, using the previous statistical criteria; the partial F-statistic or AIC is recommended.
- Forward selection. Start with only the intercept term fitted in the model. As in backward elimination, the F-statistic or AIC is used to check the contribution of each regressor as it is added. The regressor variable with the highest simple correlation with the response, and an F-value greater than F_alpha, enters the model first (a sketch of forward selection by AIC follows this list).
- Stepwise regression. Similar to forward selection, the procedure starts with just the intercept fitted. However, at each step all regressors already entered into the model are reassessed using a backward elimination approach.
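Purely as an illustration, here is a minimal sketch of forward selection driven by AIC; the inputs (a response vector `y` and a dict of candidate regressor columns) and the helper names are hypothetical, and a real analysis would normally rely on a statistics package's own routines.

```python
import numpy as np

def aic_linear(X, y):
    """AIC of a least-squares fit under normal errors (constants dropped)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    return n * np.log(sse / n) + 2 * p

def forward_selection(y, candidates):
    """Greedily add the regressor that lowers AIC most, stopping when none helps."""
    n = len(y)
    selected = []                      # names of regressors currently in the model
    X = np.ones((n, 1))                # start with the intercept only
    best_aic = aic_linear(X, y)
    improved = True
    while improved and len(selected) < len(candidates):
        improved = False
        trials = {name: aic_linear(np.column_stack([X, col]), y)
                  for name, col in candidates.items() if name not in selected}
        name, aic = min(trials.items(), key=lambda kv: kv[1])
        if aic < best_aic:             # keep the addition only if AIC improves
            selected.append(name)
            X = np.column_stack([X, candidates[name]])
            best_aic, improved = aic, True
    return selected, best_aic
```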
Moving on from model selection, we will investigate model adequacy with residual analysis, diagnostic checks and multicollinearity.
Model Adequacy
When specifying a linear model, the usual assumptions are not always guaranteed. For least squares, the estimation of the parameters as well as prediction depend on the validity of these assumptions. It is therefore critical to check whether any assumptions are violated and to apply the necessary measures if needed.
The checks on the adequacy of the model assumptions are based on the analysis of the residuals. When the model is correctly fitted, the residuals should confirm the assumptions and show no contradictions.
- Residual Analysis \[ \text{$e_j = y_j - \hat{y}_j $, where $j = 1,2, ...,n$} \]
- Diagnostic Checks
- Multicollinearity
The residuals measure the extent of departure from the regression assumptions, so we use them to detect any violations. A measure of their variability is
$$ { \frac {\sum_{j=1}^n (e_j - \bar e )^2} {n - p} = \frac {\sum_{j=1}^n e_j^2} {n - p} = \frac {SSE} {n - p} = MSE = s^2, } $$the unbiased least-squares estimate of the error variance,
$$ Var[\epsilon_j] = \sigma^2. $$Working with scaled residuals may convey more information than the ordinary residuals; they are defined as
$$ { d_j =\frac {e_j} {\sqrt {MSE}} = \frac {e_j} {s} }$$for j = 1, 2, ..., n.
In addition to residual analysis for verifying the model assumptions, here are some further graphical and numerical diagnostic methods that are useful for understanding the sample data.
$$ {C_l = \frac {r^2_l } {p} \left ( \frac {h_{ll}} {1-h_{ll}} \right ) } $$Cook's distance is based on the change in the estimated regression coefficients caused by excluding an influential observation from the analysis, where $r_l$ is the studentised residual and $h_{ll}$ the leverage of observation l.
$$ { DFFITS_l =\frac {\hat y_l - \hat y_{(l)}} {\hat \sigma_{(l)} \sqrt { h_{ll}}} }$$DFFITS is the standardised difference in the fitted value for observation l when that observation is excluded, with $\hat y_{(l)}$ and $\hat \sigma_{(l)}$ estimated without observation l.
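As a rough sketch (my own illustration, not code from the article), the scaled residuals d_j, Cook's distance and DFFITS can all be computed directly from the design matrix X and response y; the leave-one-out variance used for DFFITS is the standard closed-form shortcut, assumed rather than taken from the text.

```python
import numpy as np

def influence_diagnostics(X, y):
    """Scaled residuals, Cook's distance and DFFITS for each observation."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
    h = np.diag(H)                             # leverages h_ll
    e = y - H @ y                              # ordinary residuals
    s2 = e @ e / (n - p)                       # MSE = s^2

    d = e / np.sqrt(s2)                        # scaled residuals d_j = e_j / s
    r = e / np.sqrt(s2 * (1 - h))              # internally studentised residuals
    cooks = (r ** 2 / p) * (h / (1 - h))       # Cook's distance C_l

    s2_loo = ((n - p) * s2 - e ** 2 / (1 - h)) / (n - p - 1)   # leave-one-out s_(l)^2
    t = e / np.sqrt(s2_loo * (1 - h))          # externally studentised residuals
    dffits = t * np.sqrt(h / (1 - h))          # standardised change in fitted value
    return d, cooks, dffits
```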
Aside from the error distribution and the residuals, there are other types of problems associated with fitting a regression model, such as multicollinearity.
Multicollinearity occurs when there is a linear or near-linear relationship between two or more regressors. It can be identified with the following methods.
The correlation matrix provides the sample correlation coefficients of the regressors; these coefficients show the strength of the linear relationship between pairs of regressors. A generally adopted rule is that if a correlation coefficient is greater than 0.8 in absolute value then there is a strong linear association, which is indicative of the presence of multicollinearity. Therefore, one may simply check the off-diagonal elements of the correlation matrix to detect multicollinearity.
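A minimal sketch of this check, assuming the regressor columns (without the intercept) are held in a NumPy array X; the function name and the 0.8 threshold default are just illustrative:

```python
import numpy as np

def correlation_check(X, threshold=0.8):
    """Return regressor pairs whose absolute sample correlation exceeds the threshold."""
    corr = np.corrcoef(X, rowvar=False)
    i, j = np.triu_indices_from(corr, k=1)     # off-diagonal pairs only
    return [(a, b, corr[a, b]) for a, b in zip(i, j) if abs(corr[a, b]) > threshold]
```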
The characteristic roots (eigenvalues) of the (X'X) matrix can also be used to measure the extent of multicollinearity in a given data set. The presence of one or more small characteristic roots indicates one or more linear dependencies among the regressors. The condition number of (X'X) is defined as:
$$ {\Psi = \frac{\lambda^*} {\lambda_* } }$$where lambda* and lambda_* are the largest and smallest characteristic roots of (X'X). Usually, if Psi < 100 the problem of multicollinearity is not serious; if 100 <= Psi < 1000 the problem is moderate to strong; and when Psi > 1000 there is an acute problem of multicollinearity.
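A corresponding sketch (same illustrative X as above) computes the condition number from the characteristic roots of X'X:

```python
import numpy as np

def condition_number(X):
    """Psi = largest / smallest characteristic root of X'X."""
    roots = np.linalg.eigvalsh(X.T @ X)        # eigenvalues of the symmetric matrix X'X
    return roots.max() / roots.min()
```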
Possible solutions are to collect additional data or to respecify the model. Collecting more data, with an appropriate sampling design for the analysis, can remove the effect of multicollinearity. Respecification, by avoiding redundant variables, reduces the chance of a collinear effect and the effects of multicollinearity. Nearly linearly dependent regressors can be replaced by a function of them, for example,
$$ { X^*_1 = X_1 X_2 X_3 \text{ or } X^*_2 = \frac{X_1 + X_2} { X_3 } } $$
Conclusion
In this post we have shown that selecting a 'best' model is not easy, given that nature is complicated: the regressors that could determine a model are potentially many, and model assumptions are often unrealistic. We have presented methods for selecting the regressors of a 'best' model and stressed that verification of the regression assumptions is critical. Sound inference is made when the model is carefully selected and its assumptions are duly met.