Seminar: Miriam Hurtado Bodell, Linköping University

Wednesday 19 May 2021, 13:00-14:00


From Documents to Data: a Framework for Total Corpus Quality

Time: 19 May 2021, 13:00-14:00
Location: This seminar is given online. E-mail Dan Hedlin if you want to attend.


As large-scale digitized text corpora and novel methodologies become increasingly available, researchers are rediscovering the potential of textual sources for inquiries into social and cultural phenomena. Yet while textual corpora show great promise for enriching our knowledge of the social, empirical research faces the challenge of avoiding “garbage in, garbage out” problems: our scientific inferences are only as good as the data we analyze. This paper argues that evaluating the quality of a processed, machine-readable corpus is pivotal for later social science inquiries. The paper proposes a framework of total corpus quality, which identifies three dimensions that affect the potential of large corpora for research. Our conceptual framework helps to diagnose and understand errors in studies based on large-scale textual analyses.