Sankhya: The Indian Journal of Statistics

2003, Volume 65, Pt. 1, 1--22

Two-Dimensional Graphical Representation Of Regression Submodels


Jean-Daniel Rolle, Haute Ecole de Gestion de Fribourg, Switzerland

SUMMARY. In the framework of regression, consider the set of regression submodels (or simply models). By submodel, we mean one or more response variables together with a subset of the potential regressors. Imagine the submodels as points in some space. How can we ``project'' these points onto a two-dimensional map so as to visualize and compare them, with the aim of isolating a small cluster of submodels having desirable properties? The core of the idea developed here is geometrical. To illustrate it, consider the one-response case: each submodel is characterized by two clouds of points, $\cc_X\subset \real^{p-1}$ and $\cc_Z\subset \real^{p}$, where $p-1$ is the number of regressors in the submodel. These clouds provide the two coordinates of the map: $\cc_X$ (the regressor cloud) yields a ``complexity'' coordinate, and $\cc_Z$ (the all-data cloud) a ``lack of fit'' coordinate. A submodel is said to be too ``complex'' if it has too many regressors and/or if its regressors suffer from near-linear dependencies (multicollinearity). The measures used for model complexity and lack of fit have to be validated and combined in some way. To this end, we balance complexity and lack of fit by defining a new criterion of the form $LC_p=\log [\hbox{RSE}] + \theta \log [(p-1)/\lambda_{p-1}]$, where $\theta$ ($0\le\theta\le 1$) is a constant, $\lambda_{p-1}$ is the smallest eigenvalue of the sample correlation matrix of the $p-1$ regressors, and RSE is the residual squared error, that is, the sum of the squared residuals for a $(p-1)$-regressor submodel.
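
As a rough illustration of the criterion, the following Python sketch computes the two map coordinates and $LC_p$ for one submodel in the one-response case. It is not taken from the paper: the function name `lc_p`, the use of NumPy, the ordinary least-squares fit with an intercept, and the default $\theta = 0.5$ are all assumptions made here for illustration.

```python
import numpy as np

def lc_p(X, y, theta=0.5):
    """Return (LC_p, lack_of_fit, complexity) for one submodel.

    X     : (n, p-1) array holding the regressors of the submodel
    y     : (n,) response vector
    theta : weight in [0, 1]; the default 0.5 is an arbitrary choice
    Assumption: the submodel is fitted by least squares with an intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape                      # k plays the role of p - 1

    # Lack of fit: log of the sum of squared residuals (RSE).
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    lack_of_fit = np.log(float(resid @ resid))

    # Complexity: log[(p-1) / smallest eigenvalue of the correlation matrix].
    corr = np.atleast_2d(np.corrcoef(X, rowvar=False))
    lam_min = float(np.linalg.eigvalsh(corr)[0])   # eigenvalues are ascending
    complexity = np.log(k / lam_min)

    return lack_of_fit + theta * complexity, lack_of_fit, complexity
```

Plotting `complexity` against `lack_of_fit` for every candidate subset of regressors would give one point per submodel on a map of the kind described above. Note that a nearly collinear regressor drives $\lambda_{p-1}$ toward zero and therefore inflates the complexity coordinate.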

AMS (1991) subject classification. 62J05.

Key words and phrases. Regression, map of submodels, variable selection, lack of fit, multicollinearity.
