Squeezing the Data? The Effect of Data Handling Practices in Stratification Research

Friday, 20 July 2018: 17:45
Oral Presentation
Florian HERTEL, University of Hamburg, Germany
Henning LOHMANN, University of Hamburg, Germany
Before being presented in a research paper or at a conference, data undergo a tedious generative process that includes the handling of missing information. Secondary quantitative analyses are commonly based on a sub-sample of the available observations. It is well known that missing data points, due either to item non-response (INR) or to unit non-response (UNR), can inadvertently bias the outcome of empirical inquiry. We study the extent of the error induced in this way for different outcomes with regard to a focal variable in stratification research: social origins.

In the case of longitudinal household panel data, the problem of missing data becomes even more complex. Longitudinal household data offer more information to address the problems accompanying INR and UNR, because earlier data points can be used to extrapolate missing items and other household members’ data can serve as a proxy in the case of UNR. This advantage, however, can easily become a pitfall if the assumptions about the underlying process that generated the missing data are not only wrong but also bias the estimators. The opposite strategy of simply ignoring partial observations (i.e. list-wise deletion) may also bias results by curtailing their representativeness.

Based on the large body of literature on imputation techniques, we study the effect of various strategies for handling missing information in panel data. We compare results of stratification analyses that use social origin as a predictor variable across several specifications. These specifications are obtained by applying the “persistence” approach (i.e. carrying older information forward or backward), by chained-regression imputation based on information from the same time point and, additionally, on prior information, by using proxy information from other household members, and by employing retrospective versus prospective information. Results are compared to those obtained by restricting the analysis sample to observed values. As a litmus test for the effect of data handling practices, we employ three different applications from educational, social mobility and labor market research.
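
To make the contrast between the "persistence" approach and restricting the sample to observed values concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the toy panel, the variable it stands for (a social-origin item recorded per wave), and the helper names `carry_forward`, `persistence_fill` and `usable_observations` are all hypothetical and chosen purely for illustration.

```python
# Hypothetical toy panel: person id -> social-origin item per wave.
# None marks item non-response. All values are invented for illustration.
panel = {
    "p1": [None, "service", None],
    "p2": ["manual", None, None],
    "p3": [None, None, None],
}


def carry_forward(values):
    """Fill each gap with the most recent earlier observed value, if any."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def persistence_fill(values):
    """Carry information forward, then backward for gaps at the start."""
    forward = carry_forward(values)
    # Carrying backward is carrying forward on the reversed sequence.
    return carry_forward(forward[::-1])[::-1]


def usable_observations(data):
    """Count person-wave observations with a non-missing value."""
    return sum(v is not None for values in data.values() for v in values)


# Restricting to observed values keeps only the originally observed cells.
observed_only = usable_observations(panel)

# The persistence approach fills gaps within a person across waves.
filled_panel = {pid: persistence_fill(values) for pid, values in panel.items()}
after_persistence = usable_observations(filled_panel)

print(observed_only, after_persistence)
```

In this sketch a person with no observed value in any wave (`p3`) stays missing under the persistence approach; recovering such cases would require one of the other strategies, such as proxy information from household members or model-based imputation.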