The Potential and Limit of Google Ngram Data and Other Historical Corpora in Sociological Research

Tuesday, 12 July 2016
Location: Hörsaal 26 (Main Building)
Distributed Paper
Jeffrey SWINDLE, University of Michigan, USA
The recent explosion of corpora containing millions of digital historical texts, such as the Google Ngram database, creates many new opportunities for social research. Central to such inquiry are keyword analyses of how often a given term has appeared over time. The creators of many of these databases of historical texts claim that their data allow for quantitative measurement of cultural change. Many scholars, journalists, and others now use these databases to support such arguments. The rub is that these claims almost invariably ignore issues of representation bias in historical texts and suffer from significant measurement error. I outline how these massive, powerful historical corpora can be exploited more accurately and appropriately for sociological inquiry. In doing so, I summarize research on historical literacy rates, publishing industries, and newspaper reporting practices that informs issues of representation bias, and I show several instances of common measurement errors made by many who use historical corpora. I then present two empirical examples of what types of claims these data can support and how researchers can minimize measurement error with these data. The first involves the appearances of various terms that have been used to refer to the “Third World,” and the second is a measure of when labels for Americans of African descent shifted from “Negro” to “Black” to “African American.”
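As a rough illustration of the keyword analyses mentioned above, the following Python sketch converts raw yearly counts of a term into a rate per million tokens, so that years with very different corpus sizes can be compared. It is a minimal sketch and not the author's method: the function name `relative_frequency`, the `per` parameter, and all counts are hypothetical and invented for illustration.

```python
def relative_frequency(term_counts, total_counts, per=1_000_000):
    """Return occurrences of a term per `per` tokens for each year.

    term_counts:  dict mapping year -> raw count of the term (hypothetical data)
    total_counts: dict mapping year -> total tokens in the corpus that year
    Years with a missing or zero total are skipped rather than guessed.
    """
    rates = {}
    for year, count in term_counts.items():
        total = total_counts.get(year, 0)
        if total > 0:
            rates[year] = count * per / total
    return rates


# Invented counts, for illustration only: the raw count rises in every
# period, but the corpus grows faster, so the relative rate falls.
term_counts = {1900: 50, 1950: 200, 2000: 400}
total_counts = {1900: 1_000_000, 1950: 10_000_000, 2000: 40_000_000}

rates = relative_frequency(term_counts, total_counts)
for year in sorted(rates):
    print(year, rates[year])
```

In this invented example the raw count grows eightfold while the rate per million tokens drops, which is one kind of distortion that raw counts can hide. Normalizing by corpus size does not address representation bias, that is, which texts were published and preserved in the first place.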