Data Science versus Internet Censorship: case studies from ICLab
Information Controls Lab (ICLab; https://iclab.org/) is a project focused on collecting and analyzing reliable information controls data on the Internet at scale. “Information controls” means traffic manipulation within the network, motivated by a desire to control or monitor what people can access, say, and do online. We have primarily focused on overt and covert blocking of access to specific websites, but we have also detected surveillance and malware injection, and we have plans to broaden our monitoring to include popular chat protocols, file sharing, online multiplayer games, etc.
Over the past three years, we have collected roughly 40 terabytes of data from tests conducted at vantage points all over the world; more data rolls in at a rate of roughly a terabyte a month. To analyze this volume of data, we have developed heuristic, supervised, and unsupervised algorithms that minimize human effort, avoid false censorship alerts (it is particularly important not to confuse obscure network errors with censorship), and classify censored websites by their content. I will present case studies of a detection algorithm for DNS-based censorship, a clustering algorithm that distinguishes “block pages” from HTTP error pages, and a text classification algorithm that finds politically sensitive websites hidden within a long list of pornographic websites (reverse-engineered from a “parental controls” device used in Germany). I will also describe some of our unsolved research questions that data scientists might find interesting.
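To give a flavor of the kind of heuristic involved, here is a minimal sketch of one way DNS-based censorship detection can work. This is an illustration, not ICLab's actual pipeline: the function name, the feature set, and the decision rule are all assumptions. It compares the DNS answers observed at a vantage point against answers from an uncensored control measurement, and flags responses that either contain non-routable addresses (a common signature of injected DNS replies) or share no addresses at all with the control.

```python
# Hypothetical sketch of DNS-tampering detection (not ICLab's real code).
# Idea: answers from a censored resolver often point at private or
# loopback addresses, or at servers completely unrelated to the ones
# seen from an uncensored control vantage point.
from ipaddress import ip_address

def dns_tampering_suspected(test_answers, control_answers):
    """Return True if the vantage point's DNS answers look manipulated.

    test_answers / control_answers: iterables of IP address strings
    resolved for the same hostname at the test and control locations.
    """
    test = set(test_answers)
    if not test:
        # Empty answers (NXDOMAIN, timeouts) need separate handling;
        # this sketch only judges non-empty responses.
        return False
    # Injected replies frequently carry private or loopback addresses.
    if any(ip_address(a).is_private or ip_address(a).is_loopback
           for a in test):
        return True
    # No overlap with the control answers is also suspicious (though a
    # real system must first rule out CDN-based geographic variation).
    return test.isdisjoint(set(control_answers))

# Example: a private-range answer is flagged; a matching answer is not.
print(dns_tampering_suspected(["10.10.34.36"], ["93.184.216.34"]))
print(dns_tampering_suspected(["93.184.216.34"], ["93.184.216.34"]))
```

A real detector has to be far more careful than this, which is exactly the point of the case studies: CDNs legitimately return different addresses in different countries, so naive disjointness tests produce the false alerts mentioned above.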
Zachary Weinberg is a post-doctoral fellow with ICLab and the Center for Data Science. Before earning his PhD in electrical and computer engineering from Carnegie Mellon University, he spent time both in and out of academia: in reverse chronological order, two years working on Web security for Mozilla, a master's degree in cognitive linguistics from UC San Diego, five years as a full-time paid maintainer of GCC, and undergraduate work in chemistry. His current research interests include: expanding our understanding of online information controls beyond the small number of governments that have been studied in detail; machine classification of text in many different languages, including languages for which large corpora are not available; human factors related to computer security (e.g. making it so one does not need either software expertise or trust in a large corporation to be safe online); and computational literacy (programming as a tool for everyone, not just the sort of people who write operating systems).