Building Evaluation Scales For NLP Using Item Response Theory
Evaluation of NLP methods typically requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The implicit assumption is that all items in a given test set are equal with regard to difficulty and discriminating power. This talk introduces Item Response Theory (IRT), from psychometrics, as an alternative means of gold-standard test-set generation and NLP system evaluation. IRT describes characteristics of individual items - their difficulty and discriminating power - and accounts for these characteristics when estimating ability. In this talk, I will give an introduction to Item Response Theory and describe an IRT gold-standard test set for Recognizing Textual Entailment. Our IRT model compares NLP systems against the performance of a human population and provides more insight into performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT ability score, since the latter depends on item characteristics and the response pattern.
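As background for the abstract above: the item characteristics it mentions (difficulty and discriminating power) are commonly modeled with the two-parameter logistic (2PL) item response function from psychometrics. A minimal sketch follows; the function name and parameter values are illustrative, not taken from the talk:

```python
import math

def irt_2pl(theta, a, b):
    """Two-parameter logistic (2PL) item response function.

    Returns the probability that a respondent with ability `theta`
    answers an item correctly, given the item's discrimination `a`
    and difficulty `b`.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# When ability equals item difficulty, the probability is 0.5.
p_match = irt_2pl(theta=0.0, a=1.5, b=0.0)

# A harder item (larger b) is less likely to be answered correctly
# by the same respondent.
p_hard = irt_2pl(theta=0.0, a=1.5, b=1.0)
```

A larger discrimination `a` makes the probability curve steeper around the item's difficulty, which is what lets an item separate respondents of similar ability.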
John Lalor is a second-year Ph.D. student at UMass Amherst, working with Prof. Hong Yu in the Bio-NLP lab. His research interests include NLP and its applications in the medical domain. Prior to UMass, he received an M.S. degree from DePaul University and a B.B.A. from the University of Notre Dame.
Personal website: http://jplalor.github.io