Seminar: Martin Hyllienmark, Department of Statistics, Stockholm University

Date: Wednesday 28 February 2024

Time: 13.00 – 14.00

Location: Campus Albano, Lecture room 15, house 2, level 2

Selection bias in big data in official statistics from a practitioner’s point of view

Abstract

Which is better: simple random sampling (SRS) with nonresponse or a large dataset with selection bias?

Selection bias is increasingly problematic in surveys. Inspired by Meng’s (2018) paper “Statistical paradises and paradoxes in big data”, which highlights the statistical issues that the bigness of data sets incurs, we simulated sequences of growing populations under two different sampling designs: simple random sampling (SRS) with nonresponse, and nonprobability data with selection bias (“big data”). The results showed that nonresponse combined with selection bias led to a constant bias and MSE only if the proportion of respondents remained constant as the population size grew. Furthermore, there was a trade-off between bias and coverage probability, driven by the amount of data available. Applying weights improved the results. The results of this study suggest that the choice of data source may depend on the application of the statistics. Tools for comparing different data sources were investigated and are discussed from a practical point of view.
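
To make the comparison in the abstract concrete, the following is a minimal sketch of one simulation round. It is not the speaker's code. The standard normal population, the logistic response model, and all names and parameter values (simulate, intercept, slope, srs_size) are assumptions chosen for illustration. The same value-dependent response mechanism drives both nonresponse in the SRS and self-selection in the "big data" set, and the error of the big-data mean is checked against the identity from Meng (2018): error = rho * sqrt((1 - f) / f) * sigma, where f is the fraction of the population included, rho is the correlation between inclusion and the study variable, and sigma is the population standard deviation.

```python
import math
import random
import statistics


def logistic(x):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate(pop_size, srs_size=1000, intercept=-0.5, slope=1.0, seed=1):
    """One simulation round on a finite population (illustrative assumptions).

    Units with larger y are more likely to respond or self-select, which is
    the assumed source of selection bias in both designs.
    """
    rng = random.Random(seed)
    population = [rng.gauss(0.0, 1.0) for _ in range(pop_size)]
    true_mean = statistics.fmean(population)

    # Design 1: SRS of fixed size, followed by value-dependent nonresponse.
    sample = rng.sample(population, srs_size)
    respondents = [y for y in sample
                   if rng.random() < logistic(intercept + slope * y)]

    # Design 2: nonprobability "big data", every unit may self-select.
    included = [rng.random() < logistic(intercept + slope * y)
                for y in population]
    big = [y for y, r in zip(population, included) if r]

    # Meng's identity, using finite-population (divide-by-N) moments.
    f = len(big) / pop_size
    sigma_y = statistics.pstdev(population)
    cov_ry = sum(big) / pop_size - f * true_mean
    rho = cov_ry / (math.sqrt(f * (1.0 - f)) * sigma_y)
    identity = rho * math.sqrt((1.0 - f) / f) * sigma_y

    return {
        "srs_n": len(respondents),
        "srs_error": statistics.fmean(respondents) - true_mean,
        "big_n": len(big),
        "big_error": statistics.fmean(big) - true_mean,
        "meng_identity": identity,
    }


if __name__ == "__main__":
    # A growing sequence of populations with a fixed SRS size.
    for n in (10_000, 100_000):
        print(n, simulate(n))
```

Under these assumptions the big-data set is far larger than the set of SRS respondents, yet its error does not shrink as the population grows, because the included fraction f and the correlation rho stay roughly constant. Weighting by the inverse of an estimated response propensity would be one way to explore the improvement mentioned in the abstract.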