The University of Massachusetts Amherst

Raw Data Considered Harmful: Systems and Algorithms for Synthetic Training Set Management

Wednesday, February 27, 2019 - 4:00 p.m.
Computer Science Building, Room 150/151

Speaker: Bill Howe
A reception for attendees will be held at 3:30 p.m. in CS 150.


Abstract: Synthetic datasets have long been thought of as second-rate, to be used only when "real" data is unavailable.  But this perspective assumes that raw data is clean, unbiased, and trustworthy, which it rarely is.  Synthetic data has frequently been proposed as a way to protect privacy, but as the deployment of automated decision tools continues to accelerate, additional use cases emerge: properly generated synthetic datasets derived from minimal adjustments of raw data enable early-stage product development and collaboration, afford reproducibility, increase dataset diversity in AI research, focus attention on problems of national priority, and avoid propagating systematic discrimination.

In this talk, I'll describe systems and algorithms to generate and manage curated synthetic datasets for ML and AI applications in the public sector.  Building on significant prior work in the privacy and fairness literature, our focus is on building tools that avoid making simplifying assumptions about the structure of the data or the expertise of the user, while exploiting the fact that the target application is typically to train a predictive model.  I'll also touch on the governance infrastructure to legally protect the source data and describe some of the applications we are pursuing in housing, education, and mobility.  I'll wrap up with some thoughts about using data perturbation to enforce regulation in an increasingly automated ecosystem.


Bio:  Bill Howe is Associate Professor in the Information School at the University of Washington (UW) and Adjunct Associate Professor in the Allen School of Computer Science & Engineering and the Department of Electrical Engineering. His research interests are in data management, curation, analytics, and visualization in the sciences. As Founding Associate Director of the UW eScience Institute, Howe played a leadership role in the Data Science Environment program at UW through a $32.8 million grant awarded jointly to UW, NYU, and UC Berkeley, and founded UW's Data Science for Social Good Program. With support from the MacArthur Foundation and Microsoft, Howe directs UW's participation in the Cascadia Urban Analytics Cooperative, where he focuses on responsible data science. He founded the UW Data Science Masters Degree, serving as its inaugural Program Chair, and created a first MOOC on data science that attracted over 200,000 students. His research has been featured in the Economist and Nature News, and he co-authored what have remained the most-cited papers from VLDB 2010 and SIGMOD 2012. He has received two Jim Gray Seed Grant awards from Microsoft Research and two "Best of Conference" invited papers from VLDB Journal. He has a Ph.D. in Computer Science from Portland State University and a Bachelor's degree in Industrial & Systems Engineering from Georgia Tech.