# Graduate Data Science Courses

### COMPSCI 501: Formal Language Theory

Introduction to formal language theory. Topics include finite state languages, context-free languages, the relationship between language classes and formal machine models, the Turing Machine model of computation, theories of computability, resource-bounded models, and NP-completeness.

### COMPSCI 520/620: Software Engineering: synthesis and development

Introduces students to the principal activities involved in developing high-quality software systems in a variety of application domains. Topics include: requirements analysis, formal and informal specification methods, process definition, software design, testing, and risk management. The course will pay particular attention to differences in software development approaches in different contexts.

### COMPSCI 521/621: Advanced Software Engineering: analysis and evaluation

Software has become ubiquitous in our society. It controls life-critical applications, such as air traffic control and medical devices, and is of central importance in telecommunication and electronic commerce. In this course, we will examine state-of-the-art practices for software testing and analysis to verify software quality. We will initially look at techniques for testing and analyzing sequential programs, and then examine the complexity that arises from distributed programs. The students will be required to complete regular homework assignments and exams, and carry out a group research project extending techniques described in class and/or applying them to new domains.

### COMPSCI 585: Introduction to Natural Language Processing

Natural Language Processing (NLP) is the engineering art and science of how to teach computers to understand human language. NLP is a type of artificial intelligence technology, and it's now ubiquitous -- NLP lets us talk to our phones, use the web to answer questions, map out discussions in books and social media, and even translate between human languages. Since language is rich, subtle, ambiguous, and very difficult for computers to understand, these systems can sometimes seem like magic -- but these are engineering problems we can tackle with data, math, machine learning, and insights from linguistics. This course will introduce NLP methods and applications including probabilistic language models, machine translation, and parsing algorithms for syntax and the deeper meaning of text. During the course, students will (1) learn and derive mathematical models and algorithms for NLP; (2) become familiar with basic facts about human language that motivate them, and help practitioners know what problems are possible to solve; and (3) complete a series of hands-on projects to implement, experiment with, and improve NLP models, gaining practical skills for natural language systems engineering.

### COMPSCI 589: Machine Learning

This course will introduce core machine learning models and algorithms for classification, regression, clustering, and dimensionality reduction. On the theory side, the course will focus on understanding models and the relationships between them. On the applied side, the course will focus on effectively using machine learning methods to solve real-world problems with an emphasis on model selection, regularization, design of experiments, and presentation and interpretation of results. The course will also explore the use of machine learning methods across different computing contexts including desktop, cluster, and cloud computing. The course will include programming assignments, a midterm exam, and a final project. Python is the required programming language for the course.

### COMPSCI 590D: Algorithms for Data Science

Big Data brings us to interesting times and promises to revolutionize our society from business to government, from healthcare to academia. In this digitized age of rapidly growing data, there is an increasing demand to develop unified toolkits for data processing and analysis. In this course our main goal is to rigorously study the mathematical foundations of big data processing, develop algorithms, and learn how to analyze them. Specific topics to be covered include: 1) clustering; 2) estimating statistical properties of data; 3) near neighbor search; 4) algorithms over massive graphs and social networks; 5) learning algorithms; 6) randomized algorithms. This course counts as a CS Elective toward the CS major. 3 credits.

### COMPSCI 590N: Introduction to Numerical Computing with Python

This course is an introduction to computer programming for numerical computing. The course is based on the computer programming language Python and is suitable for students with no programming or numerical computing background who are interested in taking courses in machine learning, natural language processing, or data science. The course will cover fundamental programming, numerical computing, and numerical linear algebra topics, along with the Python libraries that implement the corresponding data structures and algorithms. The course will include hands-on programming assignments and quizzes. No prior programming experience is required. Familiarity with undergraduate-level probability, statistics and linear algebra is assumed.

### COMPSCI 590R: Applied Information Retrieval

Information Retrieval (IR) is the theory and practice that underlies technologies such as search engines. It deals with models and methods for representing, indexing, searching, browsing, and summarizing information in response to a person's information need.

### COMPSCI 590S: Systems for Data Science

In this course, students will learn the fundamentals behind large-scale systems in the context of data science. We will cover the issues involved in scaling parallelism up (to many processors) and out (to many nodes) in order to perform fast analyses on large datasets. These include locality and data representation, concurrency, distributed databases and systems, and performance analysis and understanding. We will explore the details of existing and emerging data science platforms, including map-reduce and graph analytics systems like Hadoop and Apache Spark.

### COMPSCI 590V: Data Visualization and Exploration

In this course students will learn the fundamental principles of exploring and presenting complex data, both algorithmically and visually. We will cover systems infrastructure for collating large data, basic visualization of summary statistics, algorithms for exploring patterns in the data (such as rule-mining, graph analysis, clustering, topic models and dimensionality reduction), and artistic and cognitive aspects of data presentation (including interactive visualization, human perception, and decision-making). Domains will include numeric data, relational data, geographic data, graphs and text. Hands-on labs and projects will be performed in Python and D3.

### COMPSCI 611: Advanced Algorithms

Principles underlying the design and analysis of efficient algorithms. Topics to be covered include: divide-and-conquer algorithms, graph algorithms, matroids and greedy algorithms, randomized algorithms, NP-completeness, approximation algorithms, linear programming.

### COMPSCI 645: Database Design and Implementation

This course covers the design and implementation of traditional relational database systems and advanced data management systems. The course will treat fundamental principles of databases: the relational model, conceptual design, query languages, and selected theoretical topics. We also cover core database implementation issues including storage and indexing, query processing and optimization, as well as transaction management, concurrency, and recovery. Additional topics will address the challenges of modern Internet-based data management. These include data mining, provenance, information integration, incomplete and probabilistic databases, and database security.

### COMPSCI 646: Information Retrieval

The course will cover basic and advanced techniques for text-based information systems. Topics covered include retrieval models, indexing and text representation, browsing and query reformulation, data-intensive computing approaches, evaluation, and issues surrounding implementation. The course will include a substantial project such as the implementation of major elements of search engines and applications.

### COMPSCI 677: Distributed and Operating Systems

This course provides an in-depth examination of the principles of distributed systems in general, and distributed operating systems in particular. Covered topics include processes and threads, concurrent programming, distributed interprocess communication, distributed process scheduling, virtualization, distributed file systems, security in distributed systems, distributed middleware and applications such as the web and peer-to-peer systems. Some coverage of operating system principles for multiprocessors will also be included. A brief overview of advanced topics such as cloud computing, green computing, and mobile computing will be provided, time permitting.

### COMPSCI 682: Neural Networks

This course will focus on modern, practical methods for deep learning. The course will begin with a description of simple classifiers such as perceptrons and logistic regression classifiers, and move on to standard neural networks, convolutional neural networks, and some elements of recurrent neural networks, such as long short-term memory networks (LSTMs). The emphasis will be on understanding the basics and on practical application more than on theory. Most applications will be in computer vision, but we will make an effort to cover some natural language processing (NLP) applications as well, contingent upon TA support. The current plan is to use Python and associated packages such as Numpy and TensorFlow. Prerequisites include Linear Algebra, Probability and Statistics, and Multivariate Calculus. Some assignments will be in Python and some in C++. 3 credits.

### COMPSCI 683: Artificial Intelligence

In-depth introduction to Artificial Intelligence focusing on techniques that allow intelligent systems to reason effectively with uncertain information and cope with limited computational resources. Topics include: problem-solving using search, heuristic search techniques, constraint satisfaction, local search, abstraction and hierarchical search, resource-bounded search techniques, principles of knowledge representation and reasoning, logical inference, reasoning under uncertainty, belief networks, decision theoretic reasoning, planning under uncertainty using Markov decision processes, multi-agent planning, and computational models of bounded rationality.

### COMPSCI 688: Probabilistic Graphical Models

Probabilistic graphical models are an intuitive visual language for describing the structure of joint probability distributions using graphs. They enable the compact representation and manipulation of exponentially large probability distributions, which allows them to efficiently manage the uncertainty and partial observability that commonly occur in real-world problems. As a result, graphical models have become invaluable tools in a wide range of areas from computer vision and sensor networks to natural language processing and computational biology. The aim of this course is to develop the knowledge and skills necessary to effectively design, implement and apply these models to solve real problems. The course will cover (a) Bayesian and Markov networks and their dynamic and relational extensions; (b) exact and approximate inference methods; (c) estimation of both the parameters and structure of graphical models.

### COMPSCI 689: Machine Learning

Machine learning is the computational study of artificial systems that can adapt to novel situations, discover patterns from data, and improve performance with practice. This course will cover the popular frameworks for learning, including supervised learning, reinforcement learning, and unsupervised learning. The course will provide a state-of-the-art overview of the field, emphasizing the core statistical foundations. Detailed course topics: overview of different learning frameworks such as supervised learning, reinforcement learning, and unsupervised learning; mathematical foundations of statistical estimation; maximum likelihood and maximum a posteriori (MAP) estimation; missing data and expectation maximization (EM); graphical models including mixture models and hidden Markov models; logistic regression and generalized linear models; maximum entropy and undirected graphical models; nonparametric models including nearest neighbor methods and kernel-based methods; dimensionality reduction methods (PCA and LDA); computational learning theory and VC-dimension; reinforcement learning; state-of-the-art applications including bioinformatics, information retrieval, robotics, sensor networks and vision.

### COMPSCI 690N: Advanced Natural Language Processing

This course covers a broad range of advanced level topics in natural language processing. It is intended for graduate students in computer science who have familiarity with machine learning fundamentals. It may also be appropriate for computationally sophisticated students in linguistics and related areas. Topics include probabilistic models of language, computationally tractable linguistic representations for syntax and semantics, neural network models for language, and selected topics in discourse and text mining. After completing the course, students should be able to read and evaluate current NLP research papers. Coursework includes homework assignments and a final project.

### COMPSCI 690V: Visual Analytics

In this course, students will work on solving complex problems in data science using exploratory data visualization and analysis in combination. Students will learn to deal with the Five V’s (Volume, Variety, Velocity, Veracity, and Variability), that is, with large data, complex heterogeneous data, streaming data, uncertainty in data, and variations in data flow, density and complexity. Students will be able to select the appropriate tools and visualizations in support of problem solving in different application areas. The course is a practical continuation of COMPSCI 590V (Data Visualization and Exploration) and focuses on complex problems and applications. It does not require COMPSCI 590V. The data sets and problems will be selected mainly from the IEEE VAST Challenges, but also from the KDD CUP, Amazon, Netflix, GroupLens, MovieLens, Wiki releases, Biology competitions and others. We will solve crime, cyber security, health, social, communication, marketing and similar large-scale problems. Data sources will be quite broad and include text, social media, audio, image, video, sensor, and communication collections representing very real problems. Hands-on projects will be based on Python or R, and various visualization libraries, both open source and commercial.

### COMPSCI 691DD: Research Methods in Empirical Computer Science

This course introduces graduate students to basic ideas about conducting an ethical personal research program. Students will learn basic methods for activities such as reading technical papers, selecting research topics, devising research questions, planning research, analyzing experimental results, modeling and simulating computational phenomena, and synthesizing broader theories. The course will be structured around three activities: lectures on basic concepts of research strategy and techniques, discussions of technical papers, and preparation and review of written assignments. Significant reading, reviewing, and writing will be required, and students will be expected to participate actively in class discussions.

### STATISTC 501: Methods of Applied Statistics

For graduate and upper-level undergraduate students, with focus on practical aspects of statistical methods. Topics include data description and display, probability, estimation and modeling. Includes data analysis using the R software.

### STATISTC 515: Statistics I

This course is a calculus-based introduction to probabilistic concepts and their use in statistical modeling. Coverage includes basic axioms of probability, sample spaces, counting rules, conditional probability, independence, random variables (and various associated discrete and continuous distributions), expectation, variance, covariance and correlation, sampling distributions, distributions of transformed variables, order statistics, the law of large numbers and the central limit theorem.

### STATISTC 516: Statistics II

The overall objective of the course is the development of basic theory and methods for statistical inference. Topics include: sampling distributions; general techniques for statistical inference (point estimation, confidence intervals, tests of hypotheses); development of methods for inferences on one or more means (one-sample, two-sample, and many-sample problems, the last via one-way analysis of variance); inference on proportions (including contingency tables); simple linear regression; and non-parametric methods (time permitting).

### STATISTC 526: Design of Experiments

An applied statistics course on planning, statistical analysis and interpretation of experiments of various types. Coverage includes factorial designs, randomized blocks, incomplete block designs, nested and crossover designs. Computer analysis of data using the statistical package SAS (no prior SAS experience assumed).

### STATISTC 535: Statistical Computing

The course will introduce computing tools needed for statistical analysis, including data acquisition from databases, data exploration and analysis, numerical analysis, and result presentation. Advanced topics include parallel computing, simulation and optimization, and package creation. The class will be taught in a modern statistical computing language.

### STATISTC 597S: Intro to Probability and Math Statistics

This course provides a calculus-based introduction to probability and statistical inference. Topics include the axioms of probability, sample spaces, counting rules, conditional probability, independence, random variables and distributions, expected value, variance, covariance and correlation, the central limit theorem, random samples and sampling distributions, basic concepts of statistical inference (point estimation, confidence intervals and hypothesis testing) and their use in one and two-sample problems.

### STATISTC 605: Probability Theory

The subject matter of probability theory is the mathematical analysis of random events, which are empirical phenomena having some statistical regularity but not deterministic regularity. The theory combines aesthetic beauty, deep results, and the ability to model and to predict the behavior of a wide range of physical systems as well as systems arising in technological applications. In order to properly handle applications involving continuous state spaces, a measure-theoretic treatment of probability is required. The purpose of this course is to present such a treatment, which is based on Kolmogorov’s axiomatic approach. Topics to be covered include the following:

- Random variables, expectation, independence, laws of large numbers, weak convergence, central limit theorems, and large deviations.
- The concepts of conditional probability and conditional expectation.
- Basic properties of certain classes of random processes such as martingales and random walks.

### STATISTC 607: Mathematical Statistics I

This course is the first half of the STATISTC 607-608 sequence, which together provide the foundational theory of mathematical statistics. STATISTC 607 emphasizes concepts in probability, while STATISTC 608 builds on those concepts to develop statistical theory. STATISTC 607 addresses probability theory, including random variables, independence, laws of large numbers, and the central limit theorem. It may also briefly touch on statistical models and give an introduction to point estimation, confidence intervals, and hypothesis testing.

### STATISTC 608: Mathematical Statistics II

This is the second part of a two-semester sequence on probability and mathematical statistics. STATISTC 607 covered probability, basic statistical modelling, and an introduction to the basic methods of statistical inference with application mainly to one-sample problems. In STATISTC 608 we pick up some additional probability topics as needed and examine further issues in methods of inference, including more on likelihood-based methods, optimal methods of inference, more large-sample methods, Bayesian inference, and decision-theoretic approaches. The theory is utilized in addressing problems in nonparametric methods, two- and multi-sample problems, and categorical, regression and survival models. As with STATISTC 607, this is primarily a theory course emphasizing fundamental concepts and techniques.