The University of Massachusetts Amherst



Masters Concentration in Data Science

The Computer Science Masters with a Concentration in Data Science was created to help meet the need for expanded and enhanced training in data science. It requires coursework in Theory for Data Science, Systems for Data Science, Data Analysis, and Statistics.


The Masters Concentration in Data Science teaches you to develop and apply methods to collect, curate, and analyze large-scale data, and to make discoveries and decisions using those analyses.


Requirements and Admissions


Who should apply?

Applicants need a bachelor’s degree and a solid undergraduate background in computer science.




The Masters Degree requires a total of 30 credits and is usually completed in two years. Course requirements include:

  • Four Data Science core courses (12 credits): one each from the areas of Theory for Data Science, Systems for Data Science, and Data Analysis, plus one additional core course from any area.
  • Two courses (6 credits) from the set of courses designated as satisfying the Data Science Elective requirement.
  • One course (3 credits) from the set of courses satisfying the Data Science Probability and Statistics requirement.




The full-time graduate program admission deadlines are:

  • October 1 for Spring enrollment (Master's Program only)
  • December 15 for Fall enrollment

Courses offered Spring 2020

COMPSCI 501: Formal Language Theory

Introduction to formal language theory. Topics include finite state languages, context-free languages, the relationship between language classes and formal machine models, the Turing Machine model of computation, theories of computability, resource-bounded models, and NP-completeness.

COMPSCI 514: Algorithms for Data Science

With the advent of social networks, ubiquitous sensors, and large-scale computational science, data scientists must deal with data that is massive in size, arrives at blinding speeds, and often must be processed within interactive or quasi-interactive time frames. This course studies the mathematical foundations of big data processing, developing algorithms and learning how to analyze them. We explore methods for sampling, sketching, and distributed processing of large-scale databases, graphs, and data streams for purposes of scalable statistical description, querying, pattern mining, and learning. Formerly COMPSCI 590D. Undergraduate prerequisites: COMPSCI 240 and COMPSCI 311. 3 credits.

COMPSCI 520/620: Software Engineering: Synthesis and Development

Introduces students to the principal activities involved in developing high-quality software systems in a variety of application domains. Topics include: requirements analysis, formal and informal specification methods, process definition, software design, testing, and risk management. The course will pay particular attention to differences in software development approaches in different contexts.

COMPSCI 589: Machine Learning

This course will introduce core machine learning models and algorithms for classification, regression, clustering, and dimensionality reduction. On the theory side, the course will focus on understanding models and the relationships between them. On the applied side, the course will focus on effectively using machine learning methods to solve real-world problems with an emphasis on model selection, regularization, design of experiments, and presentation and interpretation of results. The course will also explore the use of machine learning methods across different computing contexts including desktop, cluster, and cloud computing. The course will include programming assignments, a midterm exam, and a final project. Python is the required programming language for the course.

COMPSCI 590M: Introduction to Simulation

How can we use computers to design systems and, more generally, make decisions in the face of complexity and uncertainty? Simulation techniques apply the power of the computer to study complex stochastic systems when analytical or numerical techniques do not suffice. Simulation is the most frequently used methodology for the design and evaluation of computer, telecommunication, manufacturing, healthcare, financial, and transportation systems, to name just a few application areas. It is an interdisciplinary subject, incorporating ideas and techniques from computer science, probability, statistics, optimization, and number theory. Simulation models, which embody deep domain expertise, can effectively complement machine-learning approaches. This course will provide the student with a hands-on introduction to this fascinating and useful subject.

COMPSCI 590V: Data Visualization and Exploration

In this course, students will learn the fundamental principles of exploring and presenting complex data, both algorithmically and visually. We will cover systems infrastructure for collating large data, basic visualization of summary statistics, algorithms for exploring patterns in the data (such as rule-mining, graph analysis, clustering, topic models, and dimensionality reduction), and artistic and cognitive aspects of data presentation (including interactive visualization, human perception, and decision-making). Domains will include numeric data, relational data, geographic data, graphs, and text. Hands-on labs and projects will be performed in Python and D3.

COMPSCI 611: Advanced Algorithms

Principles underlying the design and analysis of efficient algorithms. Topics to be covered include: divide-and-conquer algorithms, graph algorithms, matroids and greedy algorithms, randomized algorithms, NP-completeness, approximation algorithms, and linear programming.

COMPSCI 645: Database Design and Implementation

This course covers the design and implementation of traditional relational database systems and advanced data management systems. The course will treat fundamental principles of databases: the relational model, conceptual design, query languages, and selected theoretical topics. We also cover core database implementation issues including storage and indexing, query processing and optimization, as well as transaction management, concurrency, and recovery. Additional topics will address the challenges of modern Internet-based data management. These include data mining, provenance, information integration, incomplete and probabilistic databases, and database security.

COMPSCI 677: Distributed and Operating Systems

This course provides an in-depth examination of the principles of distributed systems in general, and distributed operating systems in particular. Covered topics include processes and threads, concurrent programming, distributed interprocess communication, distributed process scheduling, virtualization, distributed file systems, security in distributed systems, distributed middleware and applications such as the web and peer-to-peer systems. Some coverage of operating system principles for multiprocessors will also be included. A brief overview of advanced topics such as cloud computing, green computing, and mobile computing will be provided, time permitting.


COMPSCI 683: Artificial Intelligence

In-depth introduction to Artificial Intelligence focusing on techniques that allow intelligent systems to reason effectively with uncertain information and cope with limited computational resources. Topics include: problem-solving using search, heuristic search techniques, constraint satisfaction, local search, abstraction and hierarchical search, resource-bounded search techniques, principles of knowledge representation and reasoning, logical inference, reasoning under uncertainty, belief networks, decision-theoretic reasoning, planning under uncertainty using Markov decision processes, multi-agent planning, and computational models of bounded rationality.

COMPSCI 685 (previously 690N): Advanced Natural Language Processing

This course covers a broad range of advanced-level topics in natural language processing. It is intended for graduate students in computer science who are familiar with machine learning fundamentals. It may also be appropriate for computationally sophisticated students in linguistics and related areas. Topics include probabilistic models of language, computationally tractable linguistic representations for syntax and semantics, neural network models for language, and selected topics in discourse and text mining. After completing the course, students should be able to read and evaluate current NLP research papers. Coursework includes homework assignments and a final project.

COMPSCI 688: Probabilistic Graphical Models

Probabilistic graphical models are an intuitive visual language for describing the structure of joint probability distributions using graphs. They enable the compact representation and manipulation of exponentially large probability distributions, which allows them to efficiently manage the uncertainty and partial observability that commonly occur in real-world problems. As a result, graphical models have become invaluable tools in a wide range of areas from computer vision and sensor networks to natural language processing and computational biology. The aim of this course is to develop the knowledge and skills necessary to effectively design, implement and apply these models to solve real problems. The course will cover (a) Bayesian and Markov networks and their dynamic and relational extensions; (b) exact and approximate inference methods; (c) estimation of both the parameters and structure of graphical models.

COMPSCI 690OP: Optimization in Computer Science

Much recent work in computer science in a variety of areas, from game theory to machine learning and sensor networks, exploits sophisticated methods of optimization. This course is intended to give students an in-depth background in both the foundations and some recent trends in the theory and practice of optimization for computer science. There is currently no other course in the department that covers these topics, yet they are critical to a large number of research projects done within the department.

COMPSCI 690V: Visual Analytics

In this course, students will work on solving complex problems in data science using exploratory data visualization and analysis in combination. Students will learn to deal with the Five V’s (Volume, Variety, Velocity, Veracity, and Variability), that is, with large data, complex heterogeneous data, streaming data, uncertainty in data, and variations in data flow, density, and complexity. Students will be able to select the appropriate tools and visualizations in support of problem solving in different application areas. The course is a practical continuation of CS590V (Data Visualization and Exploration) and focuses on complex problems and applications. It does not require CS590V. The data sets and problems will be selected mainly from the IEEE VAST Challenges, but also from the KDD CUP, Amazon, Netflix, GroupLens, MovieLens, Wiki releases, Biology competitions, and others. We will solve crime, cyber security, health, social, communication, marketing, and similar large-scale problems. Data sources will be quite broad and include text, social media, audio, image, video, sensor, and communication collections representing very real problems. Hands-on projects will be based on Python or R and various visualization libraries, both open source and commercial.