University of Massachusetts Amherst



Certificate in Statistical and Computational Data Science

There are three pillars of data science: statistical skills, computer science, and domain expertise. This certificate is offered jointly through the Statistics and Computer Science departments. The program blends topics in statistical methods, statistical computing, machine learning, and algorithm development to train students to become effective data scientists in any domain. Students will also develop the ability to work with large databases, to manage and evaluate data sets, and to create meaningful output that supports effective decision making.



The certificate comprises a total of 15 credits and can be completed in one year. It consists of at least two computer science courses and two statistics courses.


Required Course

COMPSCI 589: Machine Learning

This course will introduce core machine learning models and algorithms for classification, regression, clustering, and dimensionality reduction. On the theory side, the course will focus on understanding models and the relationships between them. On the applied side, the course will focus on effectively using machine learning methods to solve real-world problems with an emphasis on model selection, regularization, design of experiments, and presentation and interpretation of results. The course will also explore the use of machine learning methods across different computing contexts including desktop, cluster, and cloud computing. The course will include programming assignments, a midterm exam, and a final project. Python is the required programming language for the course.
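Since Python is the required language for the course, here is a minimal, illustrative sketch of the kind of classification task covered: a k-nearest-neighbors classifier written from scratch on toy data. The function name and the toy data set are hypothetical, not part of the course materials.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (features, label) pairs; classify query by majority
    # vote among its k nearest neighbors under Euclidean distance
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy 2-D data: two well-separated clusters
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]

print(knn_predict(train, (0.5, 0.5)))  # a point near the first cluster
```

The real course covers far more (regularization, model selection, experimental design), but this captures the basic shape of a classifier: fit to labeled examples, then predict labels for new points.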

One or two of the following:

COMPSCI 585: Introduction to Natural Language Processing

Natural Language Processing (NLP) is the engineering art and science of teaching computers to understand human language. NLP is a type of artificial intelligence technology, and it is now ubiquitous: NLP lets us talk to our phones, use the web to answer questions, map out discussions in books and social media, and even translate between human languages. Since language is rich, subtle, ambiguous, and very difficult for computers to understand, these systems can sometimes seem like magic, but they are engineering problems we can tackle with data, math, machine learning, and insights from linguistics. This course will introduce NLP methods and applications including probabilistic language models, machine translation, and parsing algorithms for syntax and the deeper meaning of text. During the course, students will (1) learn and derive mathematical models and algorithms for NLP; (2) become familiar with basic facts about human language that motivate those models and help practitioners know what problems are possible to solve; and (3) complete a series of hands-on projects to implement, experiment with, and improve NLP models, gaining practical skills for natural language systems engineering.

COMPSCI 590D: Algorithms for Data Science

Big Data promises to revolutionize our society, from business to government and from healthcare to academia. As data volumes explode, there is increasing demand for unified toolkits for data processing and analysis. The main goal of this course is to rigorously study the mathematical foundations of big data processing, develop algorithms, and learn how to analyze them. Specific topics to be covered include: 1) clustering; 2) estimating statistical properties of data; 3) near-neighbor search; 4) algorithms over massive graphs and social networks; 5) learning algorithms; 6) randomized algorithms. This course counts as a CS elective toward the CS major. 3 credits.

COMPSCI 590S: Systems for Data Science

In this course, students will learn the fundamentals behind large-scale systems in the context of data science. We will cover the issues involved in scaling up (to many processors) and scaling out (to many nodes) in order to perform fast analyses on large datasets. These issues include locality and data representation, concurrency, distributed databases and systems, and performance analysis. We will explore the details of existing and emerging data science platforms, including map-reduce and graph analytics systems such as Hadoop and Apache Spark.

COMPSCI 590V: Data Visualization and Exploration

In this course students will learn the fundamental principles of exploring and presenting complex data, both algorithmically and visually. We will cover systems infrastructure for collating large data sets, basic visualization of summary statistics, algorithms for exploring patterns in data (such as rule mining, graph analysis, clustering, topic models, and dimensionality reduction), and the artistic and cognitive aspects of data presentation (including interactive visualization, human perception, and decision making). Domains will include numeric data, relational data, geographic data, graphs, and text. Hands-on labs and projects will be completed in Python and D3.

Two or three of the following:

STAT 597A: Computational Statistics

This course provides an introduction to statistical computing using SAS and R. The primary objective is to teach students useful programming skills for addressing a variety of problems in statistics and probability, including carrying out Monte Carlo simulations.
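To give a flavor of what a Monte Carlo simulation looks like, here is a minimal sketch of the classic example of estimating pi by random sampling. The course itself uses SAS and R; this illustration is in Python, and the function name is hypothetical.

```python
import random

def estimate_pi(n_samples, seed=0):
    # Monte Carlo estimate of pi: the fraction of uniform points in the
    # unit square falling inside the quarter circle of radius 1 is pi/4
    rng = random.Random(seed)
    inside = sum(1 for _ in range(n_samples)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4 * inside / n_samples

print(estimate_pi(100_000))  # close to 3.14159
```

The same idea, drawing many random samples and averaging, underlies far more sophisticated simulations of probabilities, expectations, and sampling distributions.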

STAT 597S: Intro to Probability and Math Statistics

This course provides a calculus-based introduction to probability and statistical inference. Topics include the axioms of probability, sample spaces, counting rules, conditional probability, independence, random variables and distributions, expected value, variance, covariance and correlation, the central limit theorem, random samples and sampling distributions, basic concepts of statistical inference (point estimation, confidence intervals and hypothesis testing) and their use in one and two-sample problems.

STAT 607: Mathematical Statistics I

This course is the first half of the STAT 607-608 sequence, which together provide the foundational theory of mathematical statistics. STAT 607 emphasizes concepts in probability, while 608 builds on those concepts to develop statistical theory. STAT 607 addresses probability theory, including random variables, independence, the laws of large numbers, and the central limit theorem, and may briefly touch on statistical models and an introduction to point estimation, confidence intervals, and hypothesis testing.

STAT 608: Mathematical Statistics II

This is the second part of a two-semester sequence on probability and mathematical statistics. STAT 607 covered probability, basic statistical modeling, and an introduction to the basic methods of statistical inference, applied mainly to one-sample problems. In STAT 608 we pick up additional probability topics as needed and examine further issues in methods of inference, including more on likelihood-based methods, optimal methods of inference, more large-sample methods, Bayesian inference, and decision-theoretic approaches. The theory is applied to problems in nonparametric methods, two- and multi-sample problems, and categorical, regression, and survival models. As with STAT 607, this is primarily a theory course emphasizing fundamental concepts and techniques.

STAT 697R: Regression

Regression is the most widely used statistical technique. In addition to covering regression methods, this course will reinforce basic statistical concepts and expose students, many for the first time, to "statistical thinking" in a broader context. The primary focus of the course is on the understanding and presentation of regression models and associated methods, data analysis, interpretation of results, statistical computation, and model building. Topics covered include simple and multiple linear regression; correlation; the use of dummy variables; residuals and diagnostics; model building and variable selection; regression models and methods in matrix form; and an introduction to weighted least squares, regression with correlated errors, and nonlinear (including binary) regression.

STAT 705: Linear Models

Coverage includes i) a brief review of important definitions and results from linear and matrix algebra, followed by some topics in linear algebra likely to be new to students (idempotency, generalized inverses, etc.); ii) random vectors, multivariate distributions, the multivariate normal, and linear and quadratic forms, including an introduction to the non-central t, chi-square, and F distributions; iii) development of the basic theory of inference (estimation, confidence intervals, hypothesis testing, power) for the general linear model, with application to both full-rank regression and correlation models as well as some treatment of less-than-full-rank models arising in the analysis of variance (one-factor and some two-factor models). The emphasis in applications is on understanding and using the models and on some computational aspects, including understanding the documentation and methods used in some of the computing packages.