Abstract: When scaling up and streaming statistical inference algorithms, one immediately runs into two issues: merging distributed inference results and coping with unbounded data. In this talk, we discuss how to tackle these problems by employing feature hashing, sketching, and approximate counters to represent the model's sufficient statistics. We provide novel approximation bounds for algorithms that combine the count-min sketch with approximate counters and support a merge operation. Finally, we show that combining these approximations yields a distributed inference algorithm that produces approximations of the same quality while using less than a quarter of the memory.
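To give a flavor of the data structures involved, here is a minimal Python sketch (not from the talk) of a count-min sketch whose cells are Morris-style approximate counters, with a cell-wise merge operation. The merge rule for the counters below is a naive simplification chosen for brevity; the algorithms analyzed in the talk use more careful mergeable counters with provable bounds.

```python
import math
import random


class MorrisCounter:
    """Approximate counter: stores only an exponent c, estimating ~ 2^c - 1."""

    def __init__(self):
        self.c = 0

    def increment(self):
        # Increment the exponent with probability 2^-c, so storage grows
        # logarithmically in the true count.
        if random.random() < 2.0 ** (-self.c):
            self.c += 1

    def estimate(self):
        return 2 ** self.c - 1

    def merge(self, other):
        # Naive merge (an illustrative simplification): sum the two estimates
        # and round back down to an exponent.
        total = self.estimate() + other.estimate()
        self.c = int(math.log2(total + 1))


class CountMinSketch:
    """Count-min sketch over approximate counters, supporting merge."""

    def __init__(self, width, depth, seed=0):
        self.width, self.depth = width, depth
        self.tables = [[MorrisCounter() for _ in range(width)]
                       for _ in range(depth)]
        # One hash seed per row; sketches built with the same seed are mergeable.
        self.seeds = [hash((seed, d)) for d in range(depth)]

    def _index(self, item, d):
        return hash((self.seeds[d], item)) % self.width

    def add(self, item):
        for d in range(self.depth):
            self.tables[d][self._index(item, d)].increment()

    def query(self, item):
        # Standard count-min estimate: minimum over the rows.
        return min(self.tables[d][self._index(item, d)].estimate()
                   for d in range(self.depth))

    def merge(self, other):
        # Cell-wise merge of two sketches that share width, depth, and seeds.
        for d in range(self.depth):
            for w in range(self.width):
                self.tables[d][w].merge(other.tables[d][w])
```

For example, two workers could each build a `CountMinSketch(width=1024, depth=4)` over their shard of the stream, then combine results with a single `merge` call, keeping memory at `width * depth` small exponents per worker regardless of stream length.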
Joint work with Jean-Baptiste Tristan and Michael Wick of Oracle Labs.
Bio: Joseph Tassarotti is a fifth-year PhD student in the Computer Science Department at Carnegie Mellon University, advised by Robert Harper. His dissertation work is on the formal verification of concurrent randomized algorithms, and he is interested in the use of these algorithms in scalable machine learning systems. Joseph is the recipient of an NDSEG fellowship, and he earned an A.B. in Computer Science from Harvard College in 2013.