Separating the Wheat from the Chaff: A Topic and Keyword-Based Procedure for Identifying Research-Relevant Text

ScienceDirect

2021

Social scientists are using computational tools to expand their content research beyond what is humanly readable. This often requires filtering corpora for complex research concepts. The commonly used off-the-shelf filtering techniques are untested at this task. Dictionaries may not recognize language outside of investigators’ expectations, and thresholding on topic proportions from topic models may fail to identify brief references to concepts. We develop a typology of texts as they relate to a research concept and use this to structure a filtering procedure. We compare our procedure's performance with dictionary-only and topic-proportion-only approaches on two corpora (government speeches and academic articles) and two research concepts (housing crisis and inequality). Our procedure outperforms both alternatives overall and on each type of relevant text in the typology. An open-source software package is available for implementing the procedure. This provides researchers with a more structured and tested approach for filtering text data. Additionally, the types-of-text typology analysis provides a unique examination of what constitutes a filtered dataset, allowing researchers to consider how conclusions may be affected.
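The two off-the-shelf approaches named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' procedure or their software package. The function names, the toy documents, the hand-written topic proportions, and the 0.3 threshold are all hypothetical, chosen only to show how a dictionary match can miss unexpected wording and how a topic-proportion threshold can miss a brief reference.

```python
import re

def dictionary_match(text, keywords):
    """Return True if any keyword occurs as a whole word or phrase in the text."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(k.lower()) + r"\b", lowered) for k in keywords
    )

def topic_match(topic_proportions, relevant_topics, threshold):
    """Return True if the summed proportion of relevant topics meets the threshold."""
    total = sum(topic_proportions.get(t, 0.0) for t in relevant_topics)
    return total >= threshold

def filter_corpus(docs, keywords, relevant_topics, threshold):
    """Record which approach flags each document, plus their simple union."""
    labels = {}
    for doc_id, doc in docs.items():
        by_keyword = dictionary_match(doc["text"], keywords)
        by_topic = topic_match(doc["topics"], relevant_topics, threshold)
        labels[doc_id] = {
            "keyword": by_keyword,
            "topic": by_topic,
            "either": by_keyword or by_topic,
        }
    return labels

# Hypothetical toy corpus; the topic proportions are made up by hand,
# not produced by a fitted topic model.
docs = {
    "a": {"text": "The housing crisis left many families facing foreclosure.",
          "topics": {"housing": 0.6, "other": 0.4}},
    "b": {"text": "A brief aside on the housing crisis, then the budget at length.",
          "topics": {"housing": 0.05, "budget": 0.95}},
    "c": {"text": "Mortgage defaults rose sharply across the region.",
          "topics": {"housing": 0.5, "other": 0.5}},
    "d": {"text": "The committee discussed agricultural subsidies.",
          "topics": {"housing": 0.01, "farming": 0.99}},
}

labels = filter_corpus(
    docs,
    keywords=["housing crisis"],   # assumed dictionary
    relevant_topics=["housing"],   # assumed relevant topic label
    threshold=0.3,                 # assumed cutoff
)
for doc_id, flags in labels.items():
    print(doc_id, flags)
```

In this toy setup, document "b" is caught only by the keyword and document "c" only by the topic threshold, which mirrors the two failure modes the abstract describes. The union shown here is just one simple way of combining the two signals and should not be read as the procedure the article evaluates.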

Authors

Alexandra Schofield
Alicia Eads
David Mimno
Rens Wilderom

Publication Type

Article

Journal Name

Poetics: Journal of Empirical Research on Culture, the Media and the Arts
