In robust statistics, robust regression is a form of regression analysis designed to circumvent some limitations of traditional parametric and non-parametric methods. In particular, least squares estimates for regression models are highly sensitive to outliers.

Heteroskedastic errors

One instance in which robust estimation should be considered is when there is a strong suspicion of heteroskedasticity. The homoskedastic model assumes that the variance of the error term is constant for all values of x. Heteroskedasticity allows the variance to depend on x, which is more accurate for many real scenarios. For example, the variance of expenditure is often larger for individuals with higher income than for individuals with lower income. Software packages usually default to a homoskedastic model, even though such a model may be less accurate than a heteroskedastic one.

Presence of outliers

Another common use of robust estimation is when the data contain outliers. In the presence of outliers, least squares estimation is inefficient and can be biased. Because the least squares predictions are dragged towards the outliers, and because the variance of the estimates is artificially inflated, outliers can be masked. (In many situations, including some areas of geostatistics and medical statistics, it is precisely the outliers that are of interest.)

Although it is sometimes claimed that least squares (or classical statistical methods in general) are robust, they are robust only in the sense that the type I error rate does not increase under violations of the model. In fact, the type I error rate tends to be lower than the nominal level when outliers are present, and there is often a dramatic increase in the type II error rate. The reduction of the type I error rate has been labelled the conservatism of classical methods. Other labels might include inefficiency or inadmissibility.
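
The pull that a single outlier exerts on a least squares fit can be illustrated with a minimal sketch. The data, the helper names `ols_fit` and `residual_scale`, and the size of the outlier are all hypothetical choices made for illustration; they do not come from any data set discussed in this article.

```python
def ols_fit(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residual_scale(x, y, intercept, slope):
    """Residual standard error with n - 2 degrees of freedom."""
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return (rss / (len(x) - 2)) ** 0.5


# Hypothetical data: y = 1 + 2x with small alternating noise.
x = [float(i) for i in range(1, 11)]
y_clean = [1.0 + 2.0 * xi + (0.1 if i % 2 == 0 else -0.1)
           for i, xi in enumerate(x)]

# The same data with one gross outlier in the response at the largest x.
y_out = list(y_clean)
y_out[-1] += 30.0

a0, b0 = ols_fit(x, y_clean)
a1, b1 = ols_fit(x, y_out)
s0 = residual_scale(x, y_clean, a0, b0)
s1 = residual_scale(x, y_out, a1, b1)

print(f"clean data:   slope = {b0:.3f}, residual scale = {s0:.3f}")
print(f"with outlier: slope = {b1:.3f}, residual scale = {s1:.3f}")
```

In this sketch one contaminated observation out of ten is enough to move the fitted slope well away from the value used to generate the data and to inflate the residual scale, which is the mechanism behind the masking described above.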

History and unpopularity of robust regression

Despite their superior performance over least squares estimation in many situations, robust methods for regression are still not widely used. Several reasons may help explain their unpopularity (Hampel et al. 1986, 2005). One possible reason is that there are several competing methods and the field got off to many false starts. Also, robust estimates are much more computationally intensive to obtain than least squares estimates; in recent years, however, this objection has become less relevant as computing power has increased greatly. Another reason may be that some popular statistical software packages failed to implement the methods (Stromberg, 2004). The belief of many statisticians that classical methods are robust may be another reason.

Although uptake of robust methods has been slow, modern mainstream statistics textbooks often include discussion of these methods (for example, the books by Seber and Lee, and by Faraway). Also, modern statistical software packages such as R and S-PLUS include considerable functionality for robust estimation (see, for example, the books by Venables and Ripley, and by Maronna et al.). It is possible that these methods will come into wider use in the future.

Least squares alternatives

The simplest method of estimating parameters in a regression model that is less sensitive to outliers than least squares is least absolute deviations. Even then, gross outliers can still have a considerable impact on the model, motivating research into even more robust approaches.

In 1973, Huber introduced M-estimation for regression (see robust statistics for a description of M-estimation). The M in M-estimation stands for "maximum likelihood type". The method is robust to outliers in the response variable, but turned out not to be resistant to outliers in the explanatory variables (leverage points).
In fact, when there are outliers in the explanatory variables, the method has no advantage over least squares.

In the 1980s, several alternatives to M-estimation were proposed as attempts to overcome this lack of resistance. See the book by Rousseeuw and Leroy for a very practical review. Least median of squares and least trimmed squares both appeared to be viable alternatives. However, both of these methods are inefficient, producing parameter estimates with high variability.

Another proposed solution was S-estimation. This method finds a line (plane or hyperplane) that minimizes a robust estimate of the scale of the residuals (the S in the method's name comes from "scale"). This method is highly resistant to leverage points and is robust to outliers in the response. However, it was also found to be inefficient.

MM-estimation attempts to retain the robustness and resistance of S-estimation whilst gaining the efficiency of M-estimation. The method proceeds by finding a highly robust and resistant S-estimate that minimizes an M-estimate of the scale of the residuals (the first M in the method's name). The estimated scale is then held constant whilst a nearby M-estimate of the parameters is located (the second M).

Parametric alternatives

Another approach to robust estimation of regression models is to replace the normal distribution with a heavy-tailed distribution. A t-distribution with between 4 and 6 degrees of freedom has been reported to be a good choice in various practical situations. Bayesian robust regression, being fully parametric, relies heavily on such distributions (see, for example, Gelman et al., 2003).

Under the assumption of t-distributed residuals, the distribution is a location-scale family. That is, $x \leftarrow (x - \mu)/\sigma$. The degrees of freedom of the t-distribution is sometimes called the kurtosis parameter.
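
One way to fit a regression with t-distributed residuals is an EM-type iteration in which each observation receives the weight (ν + 1) / (ν + r²/σ²), where r is its current residual, followed by a weighted least squares update. The sketch below assumes a single predictor and holds the degrees of freedom fixed at ν = 4, a value taken from the range quoted above. The function names and the data are hypothetical, and this is an illustrative outline, not the procedure of any particular software package.

```python
def weighted_line(x, y, w):
    """Weighted least squares for one predictor: returns (intercept, slope)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def t_regression(x, y, nu=4.0, n_iter=500, tol=1e-10):
    """EM-type fit of y = a + b*x + sigma*e, e ~ t(nu), with nu held fixed."""
    n = len(x)
    w = [1.0] * n
    a, b = weighted_line(x, y, w)  # start from the least squares fit
    sigma2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / n
    for _ in range(n_iter):
        r = [yi - a - b * xi for xi, yi in zip(x, y)]
        # E-step: observations with large scaled residuals get small weights.
        w = [(nu + 1.0) / (nu + ri * ri / sigma2) for ri in r]
        # M-step: weighted least squares, then update the squared scale.
        a_new, b_new = weighted_line(x, y, w)
        sigma2 = sum(wi * (yi - a_new - b_new * xi) ** 2
                     for wi, xi, yi in zip(w, x, y)) / n
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b, sigma2 ** 0.5, w


# Hypothetical data: y = 1 + 2x with small noise and one gross outlier.
x = [float(i) for i in range(1, 11)]
y = [1.0 + 2.0 * xi + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
y[-1] += 30.0

a_ls, b_ls = weighted_line(x, y, [1.0] * len(x))
a_t, b_t, sigma_t, w = t_regression(x, y)

print(f"least squares slope: {b_ls:.3f}")
print(f"t(4) slope:          {b_t:.3f}")
print(f"weight of the contaminated point: {w[-1]:.5f}")
```

The weights make the link with M-estimation visible: the heavy tails of the t-distribution translate into automatic downweighting of observations with large residuals.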

An alternative parametric approach is to assume that the residuals follow a mixture of normal distributions; in particular, a contaminated normal distribution in which the majority of observations are from a specified normal distribution, but a small proportion are from a normal distribution with much higher variance. That is, residuals have probability $1 - \varepsilon$ of coming from a normal distribution with variance $\sigma^2$ and probability $\varepsilon$ of coming from a normal distribution with variance $c\sigma^2$ for some $c > 1$. Typically, $\varepsilon < 0.1$. This is sometimes called the $\varepsilon$-contamination model.

Parametric approaches have the advantage that likelihood theory provides an 'off the shelf' approach to inference (although for mixture models such as the $\varepsilon$-contamination model, the usual regularity conditions might not apply), and it is possible to build simulation models from the fit. However, such parametric models still assume that the underlying model is literally true. As such, they do not account for skewed residual distributions or finite observation precisions.

Example: educational expenditure

Rousseeuw and Leroy (1987) describe a data set on educational expenditure. The data can be found via the Classic data sets page. The plot shows per capita income versus educational expenditure. The two lines are the least squares (LS) fit and an MM-estimated fit, using bisquare weight functions with 85% efficiency at the normal. The analysis was performed in R.

The least squares fit is consistently above the MM-estimate, having been dragged upwards by a small number of outliers. The MM-estimate has effectively ignored these outliers, so that its line fits the bulk of the data more closely. What is not clear from the graph is that the LS estimate is highly inefficient: the residual estimate of scale is 18.8, compared to 9.57 for the MM-estimated model.
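
The second stage of an MM-type fit, an M-estimate with bisquare weights and a fixed scale, can be sketched as iteratively reweighted least squares. Several simplifications are assumed here and should not be read as the method used in the analysis above: a Theil-Sen line (median of pairwise slopes) stands in for the S-estimate, the scale is a normalized median absolute residual, there is a single predictor, and the data are hypothetical. The tuning constant 3.44 is the value commonly tabulated for 85% efficiency at the normal.

```python
from statistics import median


def weighted_line(x, y, w):
    """Weighted least squares for one predictor: returns (intercept, slope)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def theil_sen(x, y):
    """Robust starting line (a stand-in for an S-estimate, not an S-estimate)."""
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    b = median(slopes)
    a = median(yi - b * xi for xi, yi in zip(x, y))
    return a, b


def bisquare_weight(u, c=3.44):
    """Tukey bisquare weight: smooth downweighting, zero beyond |u| >= c."""
    return (1.0 - (u / c) ** 2) ** 2 if abs(u) < c else 0.0


def m_step_bisquare(x, y, a, b, scale, c=3.44, n_iter=50, tol=1e-10):
    """Iteratively reweighted least squares with the scale held fixed."""
    for _ in range(n_iter):
        w = [bisquare_weight((yi - a - b * xi) / scale, c)
             for xi, yi in zip(x, y)]
        a_new, b_new = weighted_line(x, y, w)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b


# Hypothetical data: y = 1 + 2x with small noise and two gross outliers.
x = [float(i) for i in range(1, 13)]
y = [1.0 + 2.0 * xi + (0.2 if i % 2 == 0 else -0.2) for i, xi in enumerate(x)]
y[10] += 15.0
y[11] += 20.0

a_ls, b_ls = weighted_line(x, y, [1.0] * len(x))
a0, b0 = theil_sen(x, y)
# Normalized median absolute residual, consistent for sigma at the normal.
scale = 1.4826 * median(abs(yi - a0 - b0 * xi) for xi, yi in zip(x, y))
a_mm, b_mm = m_step_bisquare(x, y, a0, b0, scale)

print(f"least squares slope:     {b_ls:.3f}")
print(f"bisquare M-step slope:   {b_mm:.3f}")
print(f"robust scale (fixed):    {scale:.3f}")
```

Because the bisquare weight is exactly zero for residuals larger than 3.44 times the scale, gross outliers drop out of the final fit entirely, which corresponds to the behaviour described for the MM-estimated line.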

Outlier detection

One consequence of the line being dragged towards the outliers and the scale being overestimated by least squares is that the method effectively masks outliers by making them look more ordinary than they are. Plots of the residuals, scaled by the respective LS and MM-estimated scale parameters, appear in the plot below. The variable on the horizontal axis, Index, is simply the observation number as it appears in the data set. The horizontal reference lines are at -2 and 2, so that any point beyond these lines can be considered an outlier if the data are assumed normal. Clearly, the LS method makes the outliers look closer to the bulk of the data than the MM method does. See Rousseeuw and Leroy (1987) for many such plots.

Examination of standard diagnostic measures (including DFBETAS, DFFITS, Cook's distance and hat values) for the LS fit suggests several outliers. In particular, observations 3, 42, 43 and 50 appear to be unusual. Deleting these observations, refitting the model and examining the diagnostic measures again reveals further outliers: observations 44 and 49. This phenomenon is known as masking: the outliers in the data inflate the residual variance so that some outliers look ordinary. Refitting with a few outliers removed reduces the residual variance and reveals more outliers.

Whilst in one or two dimensions outlier detection using classical methods can be performed manually, with large data sets and in high dimensions the problem of masking can make identification of many outliers impossible. Robust methods automatically detect these observations, offering a serious advantage over classical methods when outliers are present.
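
The masking effect described above can be reproduced on a small scale. The sketch below uses hypothetical data with one large and one moderate outlier, flags observations whose least squares residual exceeds twice the residual scale, deletes them and refits. The cutoff of 2 mirrors the reference lines mentioned above; the helper names are illustrative assumptions, and raw residuals divided by the residual scale are used in place of the fuller set of diagnostics.

```python
def ols_fit(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def flag_outliers(x, y, active, cutoff=2.0):
    """Fit LS to the active observations; return indices with |r| / s > cutoff."""
    xs = [x[i] for i in active]
    ys = [y[i] for i in active]
    a, b = ols_fit(xs, ys)
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(xs, ys))
    s = (rss / (len(xs) - 2)) ** 0.5
    return [i for i in active if abs(y[i] - a - b * x[i]) / s > cutoff]


# Hypothetical data: y = 1 + 2x with alternating noise of size 0.5.
x = [float(i) for i in range(1, 21)]
y = [1.0 + 2.0 * xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
y[9] += 40.0   # large outlier
y[11] += 8.0   # moderate outlier, hidden while the large one is present

# Repeat: flag, delete, refit, until nothing further is flagged.
active = list(range(len(x)))
passes = []
while True:
    flagged = flag_outliers(x, y, active)
    if not flagged:
        break
    passes.append(flagged)
    active = [i for i in active if i not in flagged]

for k, flagged in enumerate(passes, start=1):
    print(f"pass {k}: flagged observations (0-based index) {flagged}")
```

In this sketch the moderate outlier is only flagged on the second pass, after the large one has been deleted and the residual scale has shrunk. A fit with a robust scale estimate would be expected to flag both in a single pass, which is the advantage claimed for robust methods above.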