Seminar: Hannes Waldetoft, Dept. of Statistics, Uppsala University

Date: Wednesday 23 October 2024

Time: 13.00 – 14.00

Location: Campus Albano, lecture room 30, house 4, level 2

Transformer assisted survey sampling for efficient finite population statistics in highly imbalanced textual data: public hate crime estimation

Abstract

Estimating population parameters in finite populations of text documents can be challenging when obtaining the labels (true values) for the target variable requires manual annotation. To address this problem, we combine predictions from a transformer encoder neural network with well-established survey sampling estimators. This is done by training a classifier and then using the model predictions as an auxiliary variable in the estimators. The applicability is demonstrated on Swedish hate crime statistics, which are based on Swedish police reports, of which approximately 1.5 million are filed annually. Estimates of the yearly number of hate crimes and the police's under-reporting are derived using probability proportional to size (pps) sampling, regression estimation, and stratified random sampling. We conclude that if labeled training data is available, the proposed method can provide efficient estimates with reduced time spent on manual annotation.
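
To illustrate the idea of using model predictions as an auxiliary variable, the following is a minimal sketch of pps sampling with a classifier score as the size measure. It is not the speaker's implementation: it assumes sampling with replacement and the standard Hansen-Hurwitz estimator (the abstract does not specify either), and the population, the scores, and the function names `hansen_hurwitz_total` and `pps_total_estimate` are hypothetical. Scores must be strictly positive so that every document can be drawn.

```python
import random


def hansen_hurwitz_total(y_sampled, p_sampled):
    """Hansen-Hurwitz estimate of a population total.

    y_sampled: annotated values of the drawn units (one entry per draw).
    p_sampled: single-draw selection probability of each drawn unit.
    """
    n = len(y_sampled)
    return sum(y / p for y, p in zip(y_sampled, p_sampled)) / n


def pps_total_estimate(scores, annotate, n, rng):
    """Draw n units with replacement, with probability proportional to
    `scores`, annotate only the drawn units, and estimate the total."""
    score_sum = sum(scores)
    draw_probs = [s / score_sum for s in scores]
    drawn = rng.choices(range(len(scores)), weights=draw_probs, k=n)
    y = [annotate(i) for i in drawn]      # manual annotation happens here
    p = [draw_probs[i] for i in drawn]
    return hansen_hurwitz_total(y, p)


# Synthetic, highly imbalanced population (made-up numbers, not real data):
# 40 positive documents among 2000, with hypothetical classifier scores.
labels = [1] * 40 + [0] * 1960
scores = [0.9] * 40 + [0.02] * 1960

rng = random.Random(0)
estimate = pps_total_estimate(scores, lambda i: labels[i], n=400, rng=rng)
print(f"true total = {sum(labels)}, pps estimate = {estimate:.1f}")
```

Because high-scoring documents are drawn more often, most of the annotation effort in this sketch lands on likely positives, while dividing by the draw probability keeps the estimate of the total unbiased. If the scores were exactly proportional to the true values, every draw would return the true total.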