Separating the Wheat from the Chaff: A Topic and Keyword-Based Procedure for Identifying Research-Relevant Text

Science Direct
2021

Social scientists are using computational tools to expand their content research beyond what is humanly readable. This often requires filtering corpora for complex research concepts. The commonly used off-the-shelf filtering techniques are untested at this task. Dictionaries may not recognize language outside of investigators’ expectations and thresholding on topic proportions from topic models may fail to identify brief references to concepts. We develop a typology of texts as they relate to a research concept and use this to structure a filtering procedure. We compare our procedure's performance with dictionary-only and topic-proportion-only approaches on two corpora—government speeches and academic articles—and two research concepts—housing crisis and inequality. Our procedure outperforms overall and on each type of relevant text in the typology. An open-source software package is available for implementing the procedure. This provides researchers with a more structured and tested approach for filtering text data. Additionally, the types-of-text typology analysis provides a unique examination of what constitutes a filtered dataset, allowing researchers to consider how conclusions may be affected.

Authors

  • Alexandra Schofield
  • David Mimno
  • Rens Wilderom

Publication Type

Journal Name

Poetics: Journal of Empirical Research on Culture, the Media and the Arts