The University of Massachusetts Amherst


Arthur Spirling - "Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It"

CSSI Seminar
Friday, March 31, 2017 - 3:00pm
Computer Science Building, Room 150/151


The UMass Computational Social Science Institute invites you to our weekly seminar, co-sponsored with the Center for Data Science. Please note the non-standard time.

Arthur Spirling
Associate Professor of Politics and Data Science, New York University

Title:  Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Abstract: Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher's substantive understanding of a problem by providing a characterization of the variability that changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts. We make easy-to-use software available for this purpose.

Bio: Arthur Spirling is Associate Professor of Politics and Data Science. He received a bachelor's and master's degree from the London School of Economics, and a master's degree and PhD from the University of Rochester. Spirling's research centers on quantitative methods for analyzing political behavior, and he is particularly interested in institutional development and the use of text-as-data. His work on these subjects has appeared in outlets such as the American Political Science Review, the American Journal of Political Science and the Journal of the American Statistical Association. He has guest edited an edition of Legislative Studies Quarterly devoted to 'British Political Development', an area in which he continues to be active. Before coming to NYU, Spirling was an Assistant Professor and then the John L Loeb Associate Professor of the Social Sciences at Harvard University. There he received university-wide awards for graduate student mentoring and undergraduate teaching. He also directed the IQSS Program on Text Research. At NYU he coordinates the university-wide 'Text-as-Data' seminar speaker series.

Refreshments at 2:45 p.m.