Bias and Efficiency Loss Due to Categorizing an Explanatory Variable
In biomedical research it is common for one or more variables to be known to be associated with the outcome of interest. Researchers often discretize some of these variables and fit a regression model using the discretized versions. Although convenient for illustration, this approach can introduce bias and lead to a loss of efficiency. In this article, we consider a regression model with two explanatory variables under an assumption of multivariate normality. We investigate the effect of dichotomizing or categorizing one variable on the estimate of the coefficient of the other, continuous, variable and on prediction from the models. Algebraic expressions are presented for the asymptotic bias and variance of the coefficient of the continuous explanatory variable and for the residual sum of squares for prediction. Some numerical examples are presented in which we find that the bias of the coefficient of the continuous explanatory variable is always smaller for the categorized model than for the dichotomized model. The size of the test of a zero coefficient for the continuous variable depends only on the correlations between the response variable, the discretized variable, and the continuous variable. The size of the test for the categorized model is always smaller than for the dichotomized model; however, both can differ substantially from the nominal level if the correlation between the response and the categorical variable, or between the two explanatory variables, is high. The (predictive) relative efficiency of the models also depends only on the correlations among the three variables. There is a substantial loss of efficiency due to categorization if the correlation between the categorized variable and the response variable is high. The predictive relative efficiency is always higher for the categorized model. The relative predictive efficiency due to dichotomization depends on the choice of cut point, with the least loss of efficiency being achieved at the median.
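The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the article's algebraic derivation. It assumes a bivariate normal pair (X, Z) with correlation `rho`, a response that depends on Z only (so the true coefficient of X is zero), a median split for the dichotomized model, and quartile cut points with dummy coding for the categorized model. The function name `simulate_bias` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_bias(n=200_000, rho=0.6, beta_x=0.0, beta_z=1.0, seed=0):
    """Estimate the coefficient of X when Z enters the model in three forms.

    All defaults are illustrative assumptions, not values from the article.
    """
    rng = np.random.default_rng(seed)

    # Draw (X, Z) from a standard bivariate normal with correlation rho.
    cov = [[1.0, rho], [rho, 1.0]]
    x, z = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

    # Response depends linearly on X and Z plus independent normal noise.
    y = beta_x * x + beta_z * z + rng.standard_normal(n)

    def coef_x(z_cols):
        # Ordinary least squares with intercept, X, and the chosen form of Z.
        design = np.column_stack([np.ones(n), x, z_cols])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        return coefs[1]  # coefficient of the continuous variable X

    # Dichotomized Z: indicator of being above the sample median.
    z_dich = (z > np.median(z)).astype(float)

    # Categorized Z: four groups split at the sample quartiles,
    # entered as three dummy variables (lowest group is the reference).
    cuts = np.quantile(z, [0.25, 0.5, 0.75])
    groups = np.digitize(z, cuts)
    z_cat = np.column_stack([(groups == k).astype(float) for k in (1, 2, 3)])

    return {
        "continuous": coef_x(z),
        "dichotomized": coef_x(z_dich),
        "categorized": coef_x(z_cat),
    }


if __name__ == "__main__":
    for model, estimate in simulate_bias().items():
        print(f"{model:>12}: estimated coefficient of X = {estimate:.3f}")
```

Because the true coefficient of X is set to zero here, each estimate can be read directly as an approximate bias. Under these assumptions the continuous model should return a value near zero, while the discretized models attribute part of the effect of Z to X, more so with the median split than with the four-group coding.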
Volume (Year): 83 (2002)
Issue (Month): 1 (October)
|Contact details of provider:|| Web page: http://www.elsevier.com/wps/find/journaldescription.cws_home/622892/description#description|
When requesting a correction, please mention this item's handle: RePEc:eee:jmvana:v:83:y:2002:i:1:p:248-263. See general information about how to correct material in RePEc.