Abstract: The first decade of data science was about developing new technology for extracting signal from large, noisy, and heterogeneous datasets. The next decade will be about epistemic issues: How do we know if the answers we give are actually useful? For example, we know our training data is biased; how do we avoid propagating discrimination when we use this data? How do we ensure the privacy of individuals as we mosaic different datasets together? As data science techniques and technologies become increasingly democratized, how do we avoid junk science (spurious, non-reproducible findings)? How do we curate and expose existing data to make them "safe" for useful science?
In this talk, I'll describe some work we're doing on systems and algorithms to help answer these questions. To combat bias and protect privacy, we generate synthetic datasets that can selectively suppress certain causal relationships while preserving others (potentially in combination with differential privacy techniques). To automate curation, we combine distant supervision and co-learning methods to provide high-quality labels with zero training data, and show that this approach outperforms even state-of-the-art supervised methods. To help automate claim verification, we use a claim checked against one dataset to help disambiguate schema mappings for other datasets.
I'll show how these systems and algorithms are being deployed in various contexts, including the Trusted Data Collaborative, a public-private partnership emphasizing transportation data. I'll also describe how these features might be combined into a new kind of database system that plays a more active role in decision-making.
Bio: Bill Howe is Associate Professor in the Information School and Adjunct Associate Professor in the Allen School of Computer Science & Engineering and the Department of Electrical Engineering. His research interests are in data management, curation, analytics, and visualization in the sciences. As Founding Associate Director of the UW eScience Institute, Howe played a leadership role in the Data Science Environment program at UW through a $32.8 million grant awarded jointly to UW, NYU, and UC Berkeley, and founded UW's Data Science for Social Good Program. With support from the MacArthur Foundation and Microsoft, Howe directs UW's participation in the Cascadia Urban Analytics Cooperative, where he focuses on responsible data science. He founded the UW Data Science Masters Degree, serving as its inaugural Program Chair, and created a first MOOC on data science that attracted over 200,000 students. His research has been featured in the Economist and Nature News, and he co-authored what have remained the most-cited papers from VLDB 2010 and SIGMOD 2012. He has received two Jim Gray Seed Grant awards from Microsoft Research and two "Best of Conference" invited papers from VLDB Journal. He has a Ph.D. in Computer Science from Portland State University and a Bachelor's degree in Industrial & Systems Engineering from Georgia Tech.