The Industry Mentorship Program is an exclusive benefit of the Industry Affiliates Program. The program matches small teams of data science Master's students with an industry-proposed project. Over the course of an academic semester, each team works under the guidance of an industry mentor.
Small teams of MS-level data science students at UMass Amherst get the opportunity to work on industry-relevant problems, with guidance and mentoring from industry data science professionals.
Companies get the opportunity to make cost-effective progress on data science exploratory problems of interest, leveraging the effort of students who are in the midst of data science training. Company professionals “learn alongside” the student teams.
Through the experience of working with these students, companies may also identify candidates for future internships and full-time roles.
Past versions of this course have resulted in multiple publications, including two best paper awards.
How It Works for Industry Partners:
- Industry partners propose one or more projects of interest, each suitable for a team of 3-5 machine-learning-trained MS students over a one-semester class. (One or two paragraphs is sufficient.)
- MS students apply and are selected and formed into teams by UMass, which builds teams with relevant backgrounds and interests.
- The industry mentor meets with their team by video conference once every one to two weeks throughout the semester to give advice and technical guidance. Each team is also locally advised by a topically relevant PhD student.
- The students also participate in a course on campus, receiving lessons on research pragmatics, methodology, presentation skills, etc.
- The company should make supporting datasets available to the students, who will work on University premises with University equipment and resources.
- Per University policy, any intellectual property created by the student teams during the project will be owned by the University. Software that is created is typically open sourced, and the results are published or publicly disclosed without restrictions.
How It Works for Master's Students:
- In mid-October, Professor McCallum emails all MS students in Computer Science inviting them to apply to the Industry Mentorship Program (COMPSCI 696DS). The program is only open to students enrolled in the Data Science concentration who have taken two of the core courses.
- In late November, students who have applied are given a list of industry projects and asked to indicate their preferences.
- In late December, students find out whether they were admitted to the course and which project they were assigned. Many factors are taken into account when forming the team for each project, so not everyone's preferences can be met. Accepted students will be enrolled in SPIRE.
Key Dates for the Spring 2020 semester:
11/15/19: Deadline for submitting project descriptions. Please do not change a project in substantial ways after this date, as students base their participation decisions on these descriptions.
12/15/19: Students are matched to projects
1/21/20: Spring semester starts
Feb: First report due
Mar: First project presentation
April: Second report
May: Final report and poster session
To participate, contact us at email@example.com.
Project Descriptions from Previous Years:
Collecting Commonsense Knowledge
A core problem in artificial intelligence is to capture, in machine-usable form, the collection of information that an ordinary person would have, known as commonsense knowledge. This background knowledge is crucial for solving many difficult, ambiguous natural language problems in coreference resolution and question answering, as well as for building other reasoning machines. We focus here on learning hypernym ontologies (hierarchical and partially-ordered structures describing “is-a” relations, including (a) lexical entailment, e.g. a poodle is a dog, is a mammal, is an animal, and (b) phrasal concept entailment, e.g. “a night out dancing” is a “good date”). In this project we will gather and integrate large-scale data both from pre-existing sources (such as ConceptNet) and by soliciting additional input through crowdsourcing. We will implement and experiment with multiple recent and novel methods for learning embeddings that represent partial orderings, such as order embeddings (Vendrov et al., ICLR 2016), and probabilistic alternatives capable of representing conditional probabilities, all implemented in TensorFlow.
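As a minimal sketch of the order-embedding idea the project builds on: a pair (x, y) where x "is-a" y should satisfy x ≥ y in every coordinate (the reversed product order), and the penalty ‖max(0, y − x)‖² measures how badly that ordering is violated. The toy vectors below are invented for illustration, not learned embeddings.

```python
def order_penalty(x, y):
    """Penalty for the hypothesis that x entails (is-a) y:
    sum of squared per-coordinate violations of x >= y."""
    return sum(max(0.0, yi - xi) ** 2 for xi, yi in zip(x, y))

# Toy embeddings (assumed, not learned): more specific concepts get
# larger coordinates under the reversed product order.
animal = [0.1, 0.2]
dog    = [0.5, 0.6]   # dog is-a animal: dog >= animal coordinate-wise
cat    = [0.6, 0.1]   # "cat is-a dog" violates the ordering

print(order_penalty(dog, animal))  # 0.0 -> ordering holds
print(order_penalty(cat, dog))     # 0.25 -> ordering violated
```

A learned model would minimize this penalty on true is-a pairs while pushing it above a margin on negative pairs.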
Automating Manuscript Citations
This project examines the use of machine learning methods to automate one of the first steps in the manuscript processing pipeline. The task involves parsing and segmenting citation strings from the reference section of manuscripts into multiple fine-grained entities including authors, title, venue, address, publisher, editor, etc. We will explore the use of advanced machine learning methods including dilated CNNs and biLSTM-CRFs for this task, and also explore the use of constraints and hierarchical information in the entities to boost the performance of the tagging system. We will use the UMass Citation Extraction data set (http://www.iesl.cs.umass.edu/data/umasscitationfield) to train the machine learning models.
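The segmentation task above is typically cast as per-token sequence labeling. As a minimal sketch (with an invented example citation), this is the BIO label space a biLSTM-CRF or dilated CNN would predict:

```python
def to_bio(segments):
    """Convert (field, [tokens]) segments into per-token BIO labels:
    B- marks the first token of a field, I- marks continuations."""
    labels = []
    for field, tokens in segments:
        for i, tok in enumerate(tokens):
            prefix = "B-" if i == 0 else "I-"
            labels.append((tok, prefix + field))
    return labels

# Invented citation segmentation for illustration only.
segments = [
    ("author", ["A.", "McCallum"]),
    ("title",  ["Conditional", "Random", "Fields"]),
    ("venue",  ["ICML"]),
]
for tok, lab in to_bio(segments):
    print(tok, lab)
```

The tagger's job is the inverse mapping: given only the token sequence, predict these labels.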
Evaluating Automated Annotation of UMLS Text
In this project, we will examine the use of machine learning methods to automate annotating portions of text with a specific UMLS entity. We will provide a set of ~4300 documents whose text (comprising the Title and Abstract for each document) has been manually annotated by experts. As part of this project, we will evaluate 3 machine learning methods: (i) TaggerOne (Leaman et al., Bioinformatics 2016), which is a rich-feature-based semi-Markov model with perceptron-based training; (ii) a bidirectional LSTM neural network; and (iii) a Transformer self-attention-based neural network as used in (Verga et al., AKBC 2017). The goal is to do the segmentation and mention recognition, and the linking to a specific UMLS entity, as a joint task. The main challenge is that UMLS consists of a large number of entities, organized in a heterarchy (i.e. a DAG), and we want to give partial credit for being ‘close’. The code for TaggerOne is publicly available, and the student working on the TaggerOne stream will need to be well versed in Java. The other two streams should use Python and the team's preferred neural network library (TensorFlow, Theano (or Keras), or PyTorch).
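One simple way to realize the "partial credit for being close" idea is to score a predicted entity by its graph distance from the gold entity in the heterarchy. This sketch uses a tiny invented ontology (UMLS itself has millions of concepts) and the arbitrary credit function 1/(1+d):

```python
from collections import deque

def distance(graph, a, b):
    """BFS shortest-path distance, treating DAG edges as undirected."""
    undirected = {}
    for u, vs in graph.items():
        for v in vs:
            undirected.setdefault(u, set()).add(v)
            undirected.setdefault(v, set()).add(u)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for nxt in undirected.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None  # disconnected

def partial_credit(graph, gold, predicted):
    d = distance(graph, gold, predicted)
    return 0.0 if d is None else 1.0 / (1 + d)

# child -> parents (a node may have several parents, hence a DAG)
toy = {"myocardial infarction": ["heart disease"],
       "heart disease": ["cardiovascular disease"],
       "stroke": ["cardiovascular disease"]}
print(partial_credit(toy, "myocardial infarction", "myocardial infarction"))  # 1.0
print(partial_credit(toy, "myocardial infarction", "heart disease"))          # 0.5
```

Other credit functions (e.g. based on lowest common ancestor depth) are equally plausible; the choice is part of the project.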
Assigning Evidence to Gene Ontology Annotation
In this project, we will build classifiers to identify what type of evidence to assign to a Gene Ontology (GO) annotation. The GO knowledge base provides hundreds of thousands of annotations with evidence codes and links to the papers in which the annotation's evidence appears. Can we learn a classifier that identifies the appropriate evidence code for an annotation, given features extracted from the annotation and its paper? Features such as the authors, the venues, the methods employed, and the text surrounding mentions of concepts in the annotations might provide clues. More sophisticated features based on citation counts could help further.
Such a classifier could also be applied to GO annotations extracted via automated methods, increasing our trust in them. Thus, we can eventually make GO more comprehensive with automated methods, while maintaining a reasonable level of accuracy via predicted evidence codes.
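A first step for such a classifier is turning one GO annotation plus its paper's metadata into a flat feature map that any standard classifier could consume. The field names and the example record below are assumptions for illustration, not the GO schema:

```python
def annotation_features(annotation, paper):
    """Sketch: flatten one annotation + paper into a feature dict."""
    feats = {"go_term": annotation["go_term"]}
    for author in paper["authors"]:
        feats["author=" + author] = 1
    feats["venue=" + paper["venue"]] = 1
    # bag of words from the text surrounding the annotated mention
    for word in annotation["context"].lower().split():
        feats["ctx=" + word] = feats.get("ctx=" + word, 0) + 1
    return feats

# Invented example record (GO:0006915 is apoptotic process).
example = annotation_features(
    {"go_term": "GO:0006915", "context": "apoptosis was observed in vitro"},
    {"authors": ["Smith", "Lee"], "venue": "Cell"},
)
print(example["venue=Cell"], example["ctx=apoptosis"])  # 1 1
```

A linear model or tree ensemble over such dicts would then predict the evidence code as a multi-class label.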
Polysemy Research and Sentiment Analysis
Recent work shows that word embedding techniques tend to represent polysemous words as linear superpositions of the embeddings of their individual meanings. This project aims to replicate and expand on existing polysemy research (in particular, investigating alternative approaches to identifying the number of root senses per word), then explore the impact that meaning grounding can have on sentiment analysis: as a simple preprocessing step for trained models, as a semi-supervised pipeline for provisioning training examples to annotators, as a method for extrapolating to plausible related sentiment phrases, or via other approaches the team comes up with.
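The superposition claim can be illustrated directly: a polysemous word vector built as a weighted sum of sense vectors stays similar to each of its senses while remaining dissimilar to unrelated directions. The vectors and mixing weights below are toy assumptions:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy orthogonal sense vectors for the two senses of "bank".
sense_river = [1.0, 0.0, 0.0]   # river bank
sense_money = [0.0, 1.0, 0.0]   # financial institution
# The word vector as a (frequency-weighted) superposition of its senses.
bank = [0.6 * a + 0.4 * b for a, b in zip(sense_river, sense_money)]

print(round(cosine(bank, sense_river), 3))   # high
print(round(cosine(bank, sense_money), 3))   # still clearly positive
print(cosine(bank, [0.0, 0.0, 1.0]))         # 0.0, unrelated direction
```

Recovering the sense vectors and their number from the mixed vector alone is the hard inverse problem the project would study.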
Detecting Behavior Indicating Sending of "Phishing" Emails
Social engineering (“phishing”) attacks are a major threat to the security of governments, organizations, and individuals. Particularly dangerous are attacks launched from within an organization by compromised user accounts. We are interested in exploring unsupervised learning techniques to build representations of the typical behavior of users in an organization which could form the basis for detecting abnormal behavior from a compromised account. Concretely, students would implement and apply distributed representation (“embedding”) techniques to users in the Enron e-mail corpus (https://www.cs.cmu.edu/~enron/) and would then explore the learned representations to discern what, if any, useful information about user behavior is encoded. Students will be involved in the full lifecycle of the project: defining the roles of interest, designing a coding scheme for mapping email-senders/recipients to roles, and evaluating different techniques to automatically detect the roles.
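Before any learned embedding, a useful baseline representation of a user is simply their recipient-count vector, with users compared by cosine similarity; abnormal behavior then shows up as a drop in similarity to a user's own history. The toy messages below are invented; the real project would use the Enron corpus:

```python
import math
from collections import Counter, defaultdict

# (sender, recipient) pairs, invented for illustration.
messages = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "bob"),
    ("dave",  "bob"), ("dave",  "carol"),
    ("eve",   "mallory"),
]

# Each sender's "behavior profile" = counts over their recipients.
profiles = defaultdict(Counter)
for sender, recipient in messages:
    profiles[sender][recipient] += 1

def cosine(p, q):
    keys = set(p) | set(q)
    dot = sum(p[k] * q[k] for k in keys)
    np_ = math.sqrt(sum(v * v for v in p.values()))
    nq = math.sqrt(sum(v * v for v in q.values()))
    return dot / (np_ * nq)

print(round(cosine(profiles["alice"], profiles["dave"]), 3))  # high: shared contacts
print(cosine(profiles["alice"], profiles["eve"]))             # 0.0: no overlap
```

Learned low-dimensional embeddings would replace these sparse count vectors, but the same similarity-based anomaly logic applies.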
Transferring Domains for Question Answering Models
Recent advances in deep learning techniques for machine reading comprehension and question answering have dramatically improved the performance of question answering technology on open-ended questions. These techniques answer questions by identifying snippets or sections of text from a larger document (or document collection) where the extracted text contains an explicit answer to the question posed. While building a general purpose question answering model is an excellent long-term goal, there are many real world scenarios in which building a domain-dependent model is sufficient. However, there is often only a limited amount of in-domain data to train from in each new domain. In this scenario, it may be possible to apply transfer learning to adapt either a general purpose model, or a domain-dependent model from a similar domain, to produce an effective model for the new domain.
Identifying the Best Answers in Stack Overflow
In this project we will explore the use of transfer learning in the context of a question-to-answer ranking system for identifying the best answers for technical questions that have been posed to Stack Overflow. For this scenario we will provide a data set of original question and answer pairs collected from Stack Overflow for a variety of technical subject areas. This training data will also include many examples of "duplicate" questions posed to Stack Overflow, i.e., questions for which an answer in a pre-existing Q&A pair can be used to answer the equivalent "duplicate" question. For transfer learning, we will provide a smaller collection of Q&A pairs in a new domain which can be used to adapt the model to that domain. The new model will be evaluated by measuring its effectiveness at matching new "duplicate" questions to the pre-existing Q&A pairs in the new domain. The student team will be provided with pre-collected Stack Overflow data, one or more baseline models, and some supplemental scripts to assist in manipulating data, creating models and running experiments.
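The evaluation described above is a ranking task: for each new "duplicate" question, the model ranks the pre-existing Q&A pairs, and we score how highly the correct pair appears. A standard metric for this, sketched here with invented model output, is mean reciprocal rank (MRR):

```python
def mean_reciprocal_rank(rankings, gold):
    """rankings: one ranked list of candidate ids per query;
    gold: the correct candidate id per query. Misses score 0."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(rankings)

# Invented rankings for two duplicate questions.
rankings = [["qa7", "qa2", "qa9"],   # correct pair ranked first
            ["qa4", "qa1", "qa3"]]   # correct pair ranked second
gold = ["qa7", "qa1"]
print(mean_reciprocal_rank(rankings, gold))  # (1/1 + 1/2) / 2 = 0.75
```

Comparing MRR with and without transfer from the source domain quantifies how much the transferred model helps.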
Building Classifiers for Strains of Bacteria
We propose to extend recent work in building classifiers to identify lab-grown strains of bacteria (https://www.biorxiv.org/content/early/2016/10/06/079541). Previously we used a random forest method to classify a set of Salmonella enterica genome sequences extracted from the NCBI sequence read archive. We identified signatures of laboratory culture that were consistent with genes identified in previous studies of the mutational effects of laboratory culture. We would like to extend these experiments, both by looking at analysis methods other than random forests and by identifying a more robust test set for evaluation, since the current system was evaluated using leave-one-out cross-validation.
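The leave-one-out cross-validation (LOOCV) protocol mentioned above can be sketched end to end: hold out each example in turn, fit on the rest, and predict the held-out example. The 1-nearest-neighbour classifier over k-mer sets and the toy sequences are assumptions chosen only to make the protocol concrete; optimism arises when near-duplicate genomes remain in the training fold:

```python
def kmers(seq, k=3):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# (genome fragment, label) pairs, invented for illustration.
data = [
    ("ACGTACGTAC", "lab"), ("ACGTACGTAA", "lab"),
    ("TTGGCCTTGG", "wild"), ("TTGGCCTTGA", "wild"),
]

def loocv_accuracy(data):
    correct = 0
    for i, (seq, label) in enumerate(data):
        train = data[:i] + data[i + 1:]          # hold out example i
        pred = max(train,
                   key=lambda ex: jaccard(kmers(seq), kmers(ex[0])))[1]
        correct += pred == label
    return correct / len(data)

print(loocv_accuracy(data))  # 1.0 on this easy toy set
```

A more robust evaluation would hold out whole strains or studies rather than individual sequences, which is part of what the project would design.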
Extraction and Analysis of Job Skills from Resumes
For this project, we will seek to extract new job skills from millions of resumes and/or job postings. After extracting and deduplicating skills from the text, they will be augmented with definitions extracted from publicly available resources such as Wikipedia. Skills will then be clustered using a variety of clustering methods and cluster-evaluation techniques.
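The dedup-then-cluster step can be sketched with one simple baseline: normalize skill strings, merge exact duplicates, then greedily cluster by token-level Jaccard similarity. The threshold and example skills are assumptions; the real project would compare several clustering methods:

```python
def normalize(skill):
    """Lowercase and collapse whitespace."""
    return " ".join(skill.lower().split())

def jaccard(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def cluster(skills, threshold=0.5):
    """Greedy single-pass clustering over deduplicated skills."""
    clusters = []
    for s in sorted({normalize(s) for s in skills}):   # dedup first
        for c in clusters:
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

raw = ["Machine Learning", "machine  learning",
       "machine learning engineering", "SQL", "sql"]
print(cluster(raw))  # two clusters: ML variants together, SQL alone
```

String-overlap clustering is only a baseline; embedding-based similarity would catch synonyms ("RDBMS" vs "SQL") that token overlap misses.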
Analyzing State of the Art Approaches to Standardized Tests
To propose a new methodology for answering standardized tests, it is necessary to understand and analyze the state-of-the-art approaches with corresponding baselines. This includes both techniques that focus on standardized tests and those that address open-domain question answering and can be adapted for this task. The analysis will also involve comprehensive and comparative studies of the advantages and disadvantages of the existing approaches, particularly: (a) the types of questions from standardized tests that can be answered correctly by each of the approaches, and (b) the drawbacks of these approaches for questions that were answered incorrectly. The project entails setting up the infrastructure, implementing or reusing some of the existing approaches, running them on the standardized-tests dataset, and comprehensively analyzing that dataset and the obtained results.
One methodology for analyzing the questions is to connect them to specific problem solving techniques, e.g., https://www.ets.org/gre/revised_general/prepare/quantitative_reasoning/problem_solving/. By mapping each of the questions to a specific problem solving technique, or learning how to do this automatically, we may be able to improve the state of the art in deep question-answering. As part of this project you could research how to classify the questions to particular problem solving techniques, after defining an appropriate corpus of problem solving techniques from a literature survey.
Another possibility for a project is to combine candidate answers for multiple choice questions (MCQs) from various techniques -- existing and new -- that all attempt to solve the same question(s). The challenge here would be to collect answers from the various techniques along with disparate measures of confidence/accuracy in those answers, and combine all of these into a single ranking of the multiple choice answers. This kind of portfolio-based approach has been used in a number of complex systems that reason over multiple answer choices, and would be interesting to explore in the MCQ setting for standardized tests.
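The portfolio-based combination described above can be sketched simply: each underlying technique returns a confidence per answer choice, and a combiner sums those confidences, weighted by how much each technique is trusted, into a single ranking. All technique names, weights, and scores below are invented for illustration:

```python
def combine(candidates, weights):
    """candidates: {technique: {choice: confidence}};
    weights: {technique: trust}. Returns choices ranked best-first."""
    totals = {}
    for technique, scores in candidates.items():
        w = weights.get(technique, 1.0)
        for choice, conf in scores.items():
            totals[choice] = totals.get(choice, 0.0) + w * conf
    return sorted(totals, key=totals.get, reverse=True)

candidates = {
    "retrieval": {"A": 0.2, "B": 0.7, "C": 0.1},
    "neural_qa": {"A": 0.5, "B": 0.4, "C": 0.1},
}
weights = {"retrieval": 1.0, "neural_qa": 2.0}
print(combine(candidates, weights))  # ['B', 'A', 'C']
```

The real challenge the paragraph notes is calibration: raw confidences from different techniques are not comparable, so the weights (or a learned combiner) must account for that.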
Predicting Drug-Protein Bindings
Finding small molecules able to bind to a specific protein target is a critical aspect of drug discovery. In this project, using publicly available data on known small molecule-protein bindings from structured sources such as BindingDB and PubChem, we will investigate using recently proposed deep learning representations for chemical structures and protein sequences to make drug-protein binding predictions.
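The input side of such a predictor pairs a representation of the small molecule with a representation of the protein. As a deliberately simple stand-in for learned deep representations, this sketch featurizes a SMILES string by character counts and a protein sequence by amino-acid 3-mer counts, then concatenates them into one feature map a downstream model could score. The example molecule and peptide are toy placeholders, not real binding data:

```python
from collections import Counter

def smiles_features(smiles):
    """Character-count features of a SMILES string (crude baseline)."""
    return {"smi=" + ch: n for ch, n in Counter(smiles).items()}

def protein_features(seq, k=3):
    """Amino-acid k-mer count features of a protein sequence."""
    kmer_iter = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return {"aa=" + km: n for km, n in Counter(kmer_iter).items()}

def pair_features(smiles, seq):
    feats = smiles_features(smiles)
    feats.update(protein_features(seq))
    return feats

feats = pair_features("CCO", "MKTAYIAK")  # ethanol + invented peptide
print(feats["smi=C"], feats["aa=MKT"])    # 2 1
```

A deep model would replace both featurizers with learned encoders (e.g. a graph network over the molecule and a sequence model over the protein), but the pairing structure is the same.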