University of Massachusetts Amherst


Industry Mentorship Program

The Industry Mentorship Program is an exclusive benefit of the Industry Affiliates Program. The program matches small teams of data science Master's students with an industry-proposed project. Over the course of an academic semester each team works under the guidance of an industry mentor.


Program Objectives:

  • Small teams of MS-level data science students at UMass Amherst get the opportunity to work on industry-relevant problems, with guidance and mentoring from industry data science professionals.

  • Companies get the opportunity to make cost-effective progress on data science exploratory problems of interest, leveraging the effort of students who are in the midst of data science training. Company professionals “learn alongside” the student teams.

  • Through working with these students, companies may find candidates for future internships and full-time roles.


How It Works:

  • Industry partners submit a short proposal (one or two paragraphs) for one or more projects suitable for a team of 3-5 Master's students working up to 20% of their time for one semester (14 weeks from January through April).
  • Center for Data Science (CDS) faculty and staff will work with industry partners to collaboratively refine the proposals.
  • CDS faculty and staff organize student teams (drawing from a pool of applicants) for each project, finding a good mix of training and interests.
  • Supporting datasets should be made available by the company for use by students who will be working on University premises with University equipment and resources.
  • Per University policy, any intellectual property created by the student teams during the project will be owned by the University. Software that is created is typically open sourced, and the results published or publicly disclosed without restrictions.
  • Each proposal should identify at least one active data science professional from the company who will be able to meet weekly by video conference with the student team throughout the effort.
  • During the semester, CDS faculty and a team of PhD students will support the efforts of the student teams.


Key Dates for the Spring 2019 semester:

  • 11/16/18: Deadline for submitting project descriptions. Please do not change the projects substantially after this date, as students base their decision to participate on these descriptions.

  • 12/15/18: Students are matched to projects

  • 1/22/19: Spring semester starts

  • Feb: First report due

  • Mar: First project presentation

  • April: Second report

  • May: Final report and poster session


To participate, contact us at


Project Descriptions from Previous Years:


A core problem in artificial intelligence is to capture, in machine-usable form, the collection of information that an ordinary person would have, known as commonsense knowledge. This background knowledge is crucial for solving many difficult, ambiguous natural language problems in co-reference resolution and question answering, as well as for building other reasoning machines. We focus here on learning hypernym ontologies: hierarchical and partially ordered structures describing “is-a” relations, including (a) lexical entailment, e.g. a poodle is a dog, is a mammal, is an animal, and (b) phrasal concept entailment, e.g. “a night out dancing” is a “good date”. In this project we will gather and integrate large-scale data from pre-existing sources (such as ConceptNet) and solicit additional input through crowdsourcing. We will implement and experiment with multiple recent and novel methods for learning embeddings that represent partial orderings, such as order embeddings (Vendrov et al., ICLR 2015), and probabilistic alternatives capable of representing conditional probabilities, all implemented in TensorFlow. Student team members should be familiar with TensorFlow and embedding methods; experience in crowdsourcing is also a plus.
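To make the order-embedding idea concrete, the following is a minimal sketch of the order-violation penalty, assuming the convention that more general concepts sit closer to the origin, so "x is-a y" holds when every coordinate of x is at least the matching coordinate of y. The function name and the hand-picked 2-d vectors are hypothetical illustrations, not project data or a trained model, and plain Python stands in for TensorFlow.

```python
def order_violation(specific, general):
    """Penalty sum_i max(0, general_i - specific_i)^2.

    It is zero exactly when `specific` dominates `general` in every
    coordinate, i.e. when the "specific is-a general" relation is satisfied.
    """
    return sum(max(0.0, g - s) ** 2 for s, g in zip(specific, general))


# Hand-picked toy embeddings (hypothetical, for illustration only).
animal = [0.1, 0.1]
dog = [0.5, 0.3]
poodle = [0.9, 0.6]
cat = [0.2, 0.8]

print(order_violation(poodle, dog))     # satisfied relation: no penalty
print(order_violation(poodle, animal))  # transitivity comes for free
print(order_violation(dog, poodle))     # reversed relation is penalized
print(order_violation(cat, dog))        # unrelated pair is penalized
```

During training, a penalty of this form would be pushed toward zero for known is-a pairs and above a margin for sampled negative pairs.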



In this project, we'll examine the use of machine learning methods to automate one of the first steps in the manuscript processing pipeline. The task involves parsing and segmenting citation strings from the reference section of manuscripts into multiple fine-grained entities including authors, title, venue, address, publisher, editor, etc. We'll explore the use of advanced machine learning methods including dilated CNNs and biLSTM-CRFs for this task, and also explore the use of constraints and hierarchical information in the entities to boost the performance of the tagging system. We will use the UMass Citation Extraction data set to train the machine learning models.



In this project, we will examine the use of machine learning methods to automate annotating portions of text with a specific UMLS entity. We will provide a set of ~4300 documents whose text (comprising the Title and Abstract for each document) has been manually annotated by experts. As part of this project, we will evaluate three machine learning methods: (i) TaggerOne (Leaman et al., Bioinformatics 2016), which is a rich-feature-based semi-Markov model with perceptron-based training; (ii) a bidirectional LSTM neural network; and (iii) a Transformer self-attention-based neural network as used in (Verga et al., AKBC 2017). The goal is to do the segmentation and mention recognition, and the linking to a specific UMLS entity, as a joint task. The main challenge is that UMLS consists of a large number of entities, organized in a heterarchy (i.e. a DAG), and we want to give partial credit for being ‘close’. The code for TaggerOne is publicly available, and the student working on the TaggerOne stream will need to be well versed in Java. The other two streams should use Python and your favorite neural network library (TensorFlow, Theano (or Keras), or PyTorch). The code implementing the model for the third stream should be available from Pat Verga at IESL, UMass. Students should have a good understanding of the problems of Named Entity Recognition and Linking, CRF- and perceptron-based models, and neural network models for text and NLP.
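One possible way to give partial credit for a ‘close’ prediction in a DAG is to compare the ancestor sets of the predicted and gold entities. The sketch below uses Jaccard overlap of those sets; this scoring rule is an assumption made for illustration, not the metric the project specifies, and the concept names in `PARENTS` are a made-up toy hierarchy rather than real UMLS entries.

```python
def ancestors(node, parents):
    """Return the node itself plus everything reachable via parent links."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def partial_credit(predicted, gold, parents):
    """Jaccard overlap of ancestor sets: 1.0 for an exact match, less otherwise."""
    a = ancestors(predicted, parents)
    b = ancestors(gold, parents)
    return len(a & b) / len(a | b)


# Hypothetical toy DAG: "pneumonia" has two parents, so this is not a tree.
PARENTS = {
    "disease": [],
    "infection": ["disease"],
    "lung disease": ["disease"],
    "pneumonia": ["infection", "lung disease"],
    "asthma": ["lung disease"],
}

print(partial_credit("pneumonia", "pneumonia", PARENTS))  # exact match
print(partial_credit("asthma", "pneumonia", PARENTS))     # shares two ancestors
```

Other choices, such as shortest-path distance or depth of the deepest common ancestor, would fit the same interface.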



In this project, we will build classifiers to identify what type of evidence to assign to a Gene Ontology (GO) annotation. The GO knowledge base provides hundreds of thousands of annotations with evidence codes and links to the papers in which the annotation's evidence appears. Can we learn a classifier that identifies the appropriate evidence code for an annotation, given features extracted from the annotation and its paper? Features such as the authors, the venues, the methods employed, and the text surrounding mentions of concepts in the annotations might provide clues. More sophisticated features based on citation counts could help further.

Such a classifier could also be applied to GO annotations extracted via automated methods, increasing our trust in them.  Thus, we can eventually make GO more comprehensive with automated methods, while maintaining a reasonable level of accuracy via predicted evidence codes.


  • Gene Ontology annotations

  • PubMed abstracts and PubMed Central

Evaluation: typical classifier evaluation metrics such as accuracy, F1, and the confusion matrix. We will use the GO evidence codes as ground-truth labels, so we do not need to label additional data.
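The metrics named above can be sketched in a few lines of plain Python. IDA, IEA, and ISS are real GO evidence codes, but the gold and predicted label lists below are invented purely to exercise the functions; the function names are likewise hypothetical.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, predicted))


def accuracy(gold, predicted):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def f1_per_label(gold, predicted):
    """Per-label F1 = 2*TP / (2*TP + FP + FN)."""
    counts = confusion_counts(gold, predicted)
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = counts[(label, label)]
        fp = sum(c for (g, p), c in counts.items() if p == label and g != label)
        fn = sum(c for (g, p), c in counts.items() if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Invented toy labels (not real annotation data).
gold = ["IDA", "IDA", "IEA", "IEA", "ISS", "ISS"]
pred = ["IDA", "IEA", "IEA", "IEA", "ISS", "IDA"]

scores = f1_per_label(gold, pred)
print(accuracy(gold, pred))
print(scores)
print(sum(scores.values()) / len(scores))  # macro-averaged F1
```

Macro-averaging weights every evidence code equally, which may matter if some codes are far rarer than others.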



Recent work shows that word embedding techniques tend to represent polysemous words as linear superpositions of the embeddings for the individual meanings. This project looks to replicate and expand on existing polysemy research (in particular, investigating alternate approaches to identifying the number of root senses per word), then explore the impact meaning grounding can have on sentiment analysis, whether as a simple preprocessing step for trained models, as a semi-supervised pipeline for provisioning training examples to annotators, as a method for extrapolating to plausible related sentiment phrases, or through other approaches the team comes up with.

We recommend TensorFlow and Python for this project, but are happy to defer to the team's preferred toolset.



Social engineering (“phishing”) attacks are a major threat to the security of governments, organizations, and individuals. Particularly dangerous are attacks launched from within an organization by compromised user accounts. We are interested in exploring unsupervised learning techniques to build representations of the typical behavior of users in an organization, which could form the basis for detecting abnormal behavior from a compromised account. Concretely, students would implement and apply distributed representation (“embedding”) techniques to users in the Enron e-mail corpus and would then explore the learned representations to discern what, if any, useful information about user behavior is encoded. Students will be involved in the full lifecycle of the project: defining the roles of interest, designing a coding scheme for mapping email senders/recipients to roles, and evaluating different techniques to automatically detect the roles. Students would be guided by two experienced natural language and machine learning researchers. Collaboration would be through GitHub, video-conferencing, and Slack.



Recent advances in deep learning techniques for machine reading comprehension and question answering have dramatically improved the performance of question answering technology on open-ended questions. These techniques answer questions by identifying snippets or sections of text from a larger document (or document collection) where the extracted text contains an explicit answer to the question posed.  While building a general purpose question answering model is an excellent long-term goal, there are many real world scenarios in which building a domain-dependent model is sufficient. However, there is often only a limited amount of in-domain data to train from in each new domain. In this scenario, it may be possible to apply transfer learning to adapt either a general purpose model, or a domain-dependent model from a similar domain, to produce an effective model for the new domain.



In this project we will explore the use of transfer learning in the context of a question-to-answer ranking system for identifying the best answers to technical questions posed on Stack Overflow. For this scenario we will provide a data set of original question and answer pairs collected by Stack Overflow for a variety of technical subject areas. This training data will also include many examples of "duplicate" questions posed to Stack Overflow, i.e., questions for which an answer in a pre-existing Q&A pair can be used to answer the equivalent "duplicate" question. For transfer learning, we will provide a smaller collection of Q&A pairs in a new domain, to be used for adapting the model to that domain. The new model will be evaluated by measuring how effectively it matches new "duplicate" questions to the pre-existing Q&A pairs in the new domain. The student team will be provided with pre-collected Stack Overflow data, one or more baseline models, and some supplemental scripts to assist in manipulating data, creating models, and running experiments.
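The description does not name a specific evaluation metric for duplicate matching. As one plausible option, the sketch below computes mean reciprocal rank (MRR) over ranked candidate Q&A pairs; the choice of MRR, the function name, and the identifiers such as `dup1` and `qa7` are all assumptions made for illustration.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the correct Q&A pair for each duplicate question.

    A question whose correct pair is missing from its ranking contributes 0.
    """
    total = 0.0
    for query, ranked_ids in rankings.items():
        target = gold[query]
        if target in ranked_ids:
            total += 1.0 / (ranked_ids.index(target) + 1)
    return total / len(rankings)


# Hypothetical model output: each duplicate question maps to ranked Q&A ids.
rankings = {
    "dup1": ["qa7", "qa2", "qa9"],  # correct pair at rank 2
    "dup2": ["qa4", "qa1"],         # correct pair at rank 1
    "dup3": ["qa5", "qa6"],         # correct pair not retrieved
}
gold = {"dup1": "qa2", "dup2": "qa4", "dup3": "qa8"}

print(mean_reciprocal_rank(rankings, gold))
```

Comparing this score for a model trained only on the source domains against one adapted with the smaller in-domain collection would give a simple measure of what transfer learning contributes.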



We propose to extend recent work in building classifiers to identify lab-grown strains of bacteria. Previously we used a random forest method to classify a set of Salmonella enterica genome sequences extracted from the NCBI Sequence Read Archive. We identified signatures of laboratory culture that were consistent with genes identified in previous studies of the mutational effects of laboratory culture. We would like to extend these experiments both by looking at other analysis methods in addition to random forests and by identifying a more robust test set for evaluation, as the current system was evaluated using leave-one-out cross-validation.



For this project, we will seek to extract new job skills from millions of resumes and/or job postings. After the skills are extracted from the text and deduplicated, they will be augmented with definitions extracted from publicly available resources such as Wikipedia. Skills will then be clustered, using a variety of clustering methods and cluster evaluation techniques.



Topics: Natural Language Processing, Machine Learning, Deep Learning, Knowledge Bases and Reasoning

Tools: PyTorch, TensorFlow, scikit-learn, NLTK, Lucene/Solr, and (possibly) RDF stores

To propose a new methodology for answering standardized tests, it is necessary to understand and analyze the state-of-the-art approaches with corresponding baselines [1-3]. This includes both techniques that focus on standardized tests and those that have solved question answering in the open domain and can be adapted for this task. The analysis will also involve comprehensive and comparative studies of the advantages and disadvantages of the existing approaches, in particular: (a) the types of questions from standardized tests that can be answered correctly by each of the approaches, and (b) the drawbacks of these approaches for questions that were answered incorrectly. The project entails setting up the infrastructure, implementing or reusing some of the existing approaches (from a subset of [1], [2] and [3] below), running them on the standardized tests dataset, and a comprehensive analysis of the standardized tests dataset and the obtained results.

One methodology for analyzing the questions is to connect them to specific problem-solving techniques. By mapping each of the questions to a specific problem-solving technique, or learning how to do this automatically, we may be able to improve the state of the art in deep question answering. As part of this project you could research how to classify the questions by problem-solving technique, after defining an appropriate corpus of problem-solving techniques from a literature survey.

Another possibility for a project is to combine candidate answers for multiple choice questions (MCQs) from various techniques -- existing and new -- that all attempt to solve the same question(s). The challenge here would be to collect answers from the various techniques along with disparate measures of confidence/accuracy in those answers, and combine all of these into a single ranking of the multiple choice answers. This kind of portfolio-based approach has been used in a number of complex systems that reason over multiple answer choices, and would be interesting to explore in the MCQ setting for standardized tests.
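One simple way to merge disparate confidence measures is to rescale each technique's scores to a common range before summing them. The sketch below uses min-max scaling and an optional per-technique weight; this is only one assumed combination rule among many, and the technique names and scores are invented for illustration.

```python
def combine_answers(system_scores, weights=None):
    """Rank answer choices by a weighted sum of min-max normalized scores.

    system_scores maps technique name -> {choice: raw score}. Raw scores may
    be on different scales (retrieval scores, probabilities, log-likelihoods).
    """
    combined = {}
    for system, scores in system_scores.items():
        weight = 1.0 if weights is None else weights[system]
        low, high = min(scores.values()), max(scores.values())
        span = high - low
        for choice, score in scores.items():
            # A technique that scores every choice equally carries no signal.
            normalized = (score - low) / span if span else 0.5
            combined[choice] = combined.get(choice, 0.0) + weight * normalized
    # Highest combined score first; ties broken alphabetically.
    return sorted(combined, key=lambda c: (-combined[c], c))


# Invented outputs from three hypothetical techniques for one question.
system_scores = {
    "retrieval": {"A": 12.0, "B": 30.0, "C": 18.0, "D": 3.0},  # raw match scores
    "reader": {"A": 0.10, "B": 0.35, "C": 0.50, "D": 0.05},    # probabilities
    "solver": {"A": -4.0, "B": -1.0, "C": -2.5, "D": -6.0},    # log-likelihoods
}

print(combine_answers(system_scores))
```

The weights could be set from each technique's accuracy on held-out questions, or replaced by a learned ranker over the normalized scores.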

[1] Wang, S., Yu, M., Guo, X., Wang, Z., Klinger, T., Zhang, W., … (2018). R3: Reinforced Ranker-Reader for Open-Domain Question Answering.

[2] Clark, P., Etzioni, O., Khot, T., Sabharwal, A., Tafjord, O., Turney, P. D., & Khashabi, D. (2016, February). Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions. In AAAI (pp. 2580-2586).

[3] Khashabi, D., Khot, T., Sabharwal, A., Clark, P., Etzioni, O., & Roth, D. (2016). Question answering via integer programming over semi-structured knowledge. arXiv preprint arXiv:1604.06076.



Finding small molecules able to bind to a specific protein target is a critical aspect of drug discovery. In this project, using publicly available data on known small molecule-protein bindings from structured sources such as BindingDB and PubChem, we will investigate using recently proposed deep learning representations for chemical structures and protein sequences to make drug-protein binding predictions.

Data sources:

Possible initial models:

  • Duvenaud, David K., et al. "Convolutional networks on graphs for learning molecular fingerprints." Advances in neural information processing systems. 2015.

  • Schwaller, Philippe, et al. "'Found in Translation': Predicting Outcome of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models." arXiv preprint arXiv:1711.04810 (2017).

  • Wang, Shuohang, and Jing Jiang. "Learning natural language inference with LSTM." arXiv preprint arXiv:1512.08849 (2015).

Engineering tools and environment:  PyTorch, TensorFlow