University of Massachusetts Amherst


Texts Come from People - How Demographic Factors Influence NLP Models

April 26

Computer Science Building, Room 151
Abstract: The way we express ourselves is heavily influenced by our demographic background. For example, we don't expect teenagers to talk the same way as retirees. Natural Language Processing (NLP) models, however, are based on a small demographic sample and treat all language as uniform. As a result, NLP models perform worse on language from demographic groups that differ from those represented in the training data, i.e., they encode a demographic bias. This bias harms performance and can disadvantage entire user groups.

Sociolinguistics has long investigated the interplay of demographic factors and language use, and it seems likely that the same factors are also present in the data we use to train NLP systems.

In this talk, I will show how we can combine statistical NLP methods and sociolinguistic theories. I present ongoing research into large-scale statistical analysis of demographic language variation to detect factors that influence the performance (and fairness) of NLP systems, and how we can incorporate demographic information into statistical models to address both problems.
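As a rough illustration of what detecting such a performance difference can look like, the sketch below computes a tagger's accuracy separately for each demographic group and reports the largest gap. It is not the speaker's method. The function names (`accuracy_by_group`, `max_accuracy_gap`), the group labels, and the toy data are all assumptions made up for this example.

```python
from collections import defaultdict


def accuracy_by_group(gold, predicted, groups):
    """Return {group: accuracy} for parallel lists of labels and group tags."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold_label, pred_label, group in zip(gold, predicted, groups):
        total[group] += 1
        correct[group] += int(gold_label == pred_label)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group):
    """Difference between the best- and worst-served group."""
    return max(per_group.values()) - min(per_group.values())


# Toy, invented data: gold POS tags, model predictions, and a
# hypothetical age-group tag for the author of each token.
gold = ["NOUN", "VERB", "NOUN", "ADJ", "VERB", "NOUN"]
pred = ["NOUN", "VERB", "VERB", "ADJ", "NOUN", "NOUN"]
groups = ["under_35", "under_35", "over_45", "under_35", "over_45", "over_45"]

per_group = accuracy_by_group(gold, pred, groups)
gap = max_accuracy_gap(per_group)
print(per_group, gap)
```

A gap close to zero would suggest the model serves the groups about equally well on this sample. With real data, the gap would also need a significance test before drawing conclusions.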

The results of this research benefit practitioners in both NLP and sociolinguistics, as well as society, by creating fairer NLP applications. The research also has the potential to improve language-based commercial applications such as machine translation, educational tools, or personal assistants by making them more attuned to individual language differences.

Bio: Dirk Hovy is currently a postdoc at the University of Copenhagen. His research interests include the interaction of extra-linguistic factors, language use, and statistical models. He received his PhD in Computer Science from the University of Southern California, where he worked on unsupervised relation extraction. Dirk also holds an MA in sociolinguistics from the University of Marburg, Germany, where he worked on language variation. Dirk has authored multiple papers on a variety of NLP topics, including semantic and morphological analysis (supersenses, named entities, and POS tagging), annotation, NLP for social media, domain adaptation, and demographic factors. He has also published several tutorials on programming topics.

Dirk recently shared best paper awards at EACL 2014 and *SEM 2014 for the work with his colleagues in Copenhagen. Outside of research, Dirk enjoys cooking, tango, and leather-crafting, as well as picking up heavy things and putting them back down. You can find an updated biography and more at

A reception will be held at 12:40 in the atrium, outside the presentation room.