The Spring 2019 MS Mentorship cohort consists of 15 projects (described below) with a team of three to four Master's students assigned to each. In every project, student teams are connected with a key contact from the sponsor organization, whom they communicate with regularly on the project. CDS faculty and PhD students also support the student teams.
1. Text classification using deep models with layer-wise optional inference-time early stopping for improved latency
2. American Institutes for Research: Outlier and missing data detection
AIR proposes a project examining the use of machine learning and data science methods for outlier and missing-data detection. Using a large administrative data collection (~100,000 observations and over 2,000 variables), we will examine various clustering and dimensionality-reduction methods that might improve our ability to recognize anomalous data points. Some data elements have been collected in past survey administrations, so time-series methods will also be a component of this project. The data presents some unique challenges for the analysis of missing values: the lack of any labeled training data and the difficulty of definitively identifying training cases, the large amount of missing data, and the difficulty of distinguishing true 0s from 0s entered as placeholders for missing data. This project is an excellent chance for students to work creatively on unsupervised learning problems. Student work will contribute to the literature on data science methods for data-quality analysis, as well as to our ongoing data-quality workflow for this project.
Data Resources: Public CRDC data
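A minimal sketch of the kind of unsupervised pipeline described above (impute, reduce dimensionality, flag anomalies), using scikit-learn on synthetic data in place of the CRDC collection; the impute/PCA/IsolationForest combination is one illustrative choice, not a prescribed method:

```python
# Sketch: unsupervised outlier detection on data with missing values.
# Synthetic data stands in for the administrative collection.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[rng.random(X.shape) < 0.1] = np.nan   # ~10% of cells missing at random
X[:5] += 8.0                            # inject 5 obvious outlier rows

pipe = make_pipeline(
    SimpleImputer(strategy="median"),   # crude fill; see caveat below
    PCA(n_components=5),
    IsolationForest(random_state=0),
)
labels = pipe.fit_predict(X)            # -1 = outlier, 1 = inlier

def outlier_indices(labels):
    return [i for i, l in enumerate(labels) if l == -1]
```

Note that median imputation sidesteps, rather than solves, the true-0-versus-placeholder-0 problem; distinguishing those cases would require domain rules or survey-administration history.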
3. Bloomberg: Event extraction from news headlines
Using news from a specific portion of a news archive (for example, the sports or business section of the NYT Annotated Corpus), universal schemas, and matrix factorization, jointly extract events of interest and perform fine-grained named entity recognition (NER) on that portion. Benchmark against structured data resources (e.g., MLB data for sports, Bloomberg data for business). With reference to Building a New Standard Dataset for Relation Extraction Tasks, a set of target events/relations will be supplied. If the student group makes considerable headway, the project can extend into entity and relation/event embeddings.
Data Resources: Most likely, the NYT Annotated Corpus. We can explore obtaining Bloomberg News.
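The universal-schema idea can be sketched with a toy matrix factorization: rows are entity pairs, columns mix surface patterns and KB relations, and a low-rank reconstruction scores unobserved (pair, relation) cells. The pairs and relations below are invented for illustration; real work would use learned embeddings rather than a plain SVD:

```python
# Toy universal-schema factorization: reconstruct a low-rank version of
# the observed (entity pair x relation) matrix and use it to score cells.
import numpy as np

pairs = ["(Yankees, New York)", "(Mets, New York)", "(Cubs, Chicago)"]
relations = ["plays-in", "based-in", "team-of-city"]

# Observed co-occurrence matrix (1 = pair seen with relation in text/KB).
M = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                                    # low-rank bottleneck
M_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]   # reconstructed scores

def score(pair_idx, rel_idx):
    return float(M_hat[pair_idx, rel_idx])
```

The bottleneck forces the model to share structure across columns, which is what lets textual patterns and KB relations inform each other.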
4. Chan Zuckerberg Initiative: Clustering biomedical citation graphs
Motivation: Chan Zuckerberg Biohub and the Chan Zuckerberg Initiative are interested in accelerating science. One of the tools we use is nudging scientists to publish preprints. The hypothesis is that early publication of preprints improves papers' impact. This project studies the citation network of the papers as a proxy for impact.
Goals: We consider a citation graph built from a CZI dataset in the biomedical sciences. In this dataset, nodes are papers and links are citations. We consider three kinds of nodes: (1) conventional papers, (2) bioRxiv preprints, and (3) conventional papers accompanied by bioRxiv preprints. The question is whether the citation graph separates into components (papers only, bioRxiv only) with high connectivity inside each component and low connectivity between components. If it does, preprints have little impact on regular papers; otherwise, the impact is noticeable.
Methods: Graph clustering tools (see, e.g., Satu Elisa Schaeffer, Graph clustering, Computer Science Review 1(1): 27-64, 2007).
Data Resources: CZI Knowledge graph
Tools & Environment: The students will have access to CZI/Meta database through Snowflake. They are expected to use Python and R to access the data and run models.
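Before applying full graph-clustering machinery, the connectivity question can be probed with a crude first signal: the fraction of citation edges that cross between node types. The toy graph below is hand-made for illustration, not CZI data:

```python
# Toy citation graph: measure whether citations stay within node types
# (paper vs. preprint) or cross between them.
edges = [
    ("p1", "p2"), ("p2", "p3"), ("p1", "p3"),   # papers citing papers
    ("b1", "b2"),                                # preprint citing preprint
    ("p3", "b1"),                                # cross-type citation
]
node_type = {"p1": "paper", "p2": "paper", "p3": "paper",
             "b1": "biorxiv", "b2": "biorxiv"}

def cross_type_fraction(edges, node_type):
    """Fraction of citation edges that cross between node types.
    Near 0: types form near-separate components (little preprint impact);
    higher: preprints and papers cite each other freely."""
    cross = sum(node_type[u] != node_type[v] for u, v in edges)
    return cross / len(edges)
```

A proper analysis would compare this against a null model (random rewiring) and then move to the clustering methods cited above.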
5. Chan Zuckerberg Initiative: Universal biomedical sentence embeddings
Sentence embeddings have become an essential part of many NLP applications. While many pretrained sentence embeddings trained on general-domain data are available (e.g., Google's Universal Sentence Encoder, MILA/MSR's general-purpose sentence embeddings), only one that we are aware of exists in the biomedical domain (BioSentVec).
For this project, students will research the application of deep learning and multi-task learning approaches to training universal sentence embeddings on PubMed scientific articles. Universal biomedical sentence embeddings could be used in many downstream tasks and enable better research and development in biomedical text mining, including but not limited to document clustering and measuring document similarity.
Data Resources: PubMed articles (abstract + title)
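A minimal baseline for the downstream use case (document similarity) is averaging word vectors and comparing with cosine similarity; the random vectors below stand in for pretrained biomedical embeddings such as BioSentVec:

```python
# Baseline "sentence embedding": average word vectors, compare by cosine.
# The toy vocabulary and random matrix are placeholders for pretrained vectors.
import numpy as np

rng = np.random.default_rng(1)
vocab = {"gene": 0, "expression": 1, "protein": 2, "binding": 3, "the": 4}
W = rng.normal(size=(len(vocab), 16))   # pretend these are pretrained

def embed(sentence):
    idx = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return W[idx].mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

The project's multi-task deep models would replace the averaging step; the evaluation harness (cosine similarity over document pairs) stays the same.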
6. Google: Probabilistic embeddings on taxonomies
Learning representations of symbolic data such as texts and graphs is an integral part of machine learning practice, with broad applications in information extraction. However, many embedding methods in Euclidean space fail to account for the hierarchical structure inherent in many symbolic datasets, such as ontologies and knowledge graphs.
Our work explores techniques such as probabilistic embeddings with box-lattice measures to induce taxonomies and relational graphs. The project will expand upon current techniques, with an opportunity to explore and develop other non-Euclidean techniques for hierarchical embeddings.
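The core intuition of box-lattice embeddings can be sketched in a few lines: each concept is an axis-aligned box, and conditional probabilities come from intersection volumes, so hierarchy shows up as asymmetry. The boxes below are hand-set for illustration, not learned:

```python
# Box-lattice sketch: P(child | parent) = vol(child box ∩ parent box) / vol(parent box).
import numpy as np

boxes = {
    # concept -> (min corner, max corner)
    "animal": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
    "dog":    (np.array([0.1, 0.1]), np.array([0.4, 0.4])),
}

def volume(lo, hi):
    # clip handles empty intersections (negative side lengths)
    return float(np.prod(np.clip(hi - lo, 0.0, None)))

def p_given(child, parent):
    (clo, chi), (plo, phi) = boxes[child], boxes[parent]
    ilo, ihi = np.maximum(clo, plo), np.minimum(chi, phi)
    return volume(ilo, ihi) / volume(plo, phi)
```

Here P(animal | dog) = 1 while P(dog | animal) = 0.09: exactly the asymmetric containment signal that symmetric Euclidean point embeddings struggle to express.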
7. IBM Research: Analysis of Methods and Types for Complex Question Answering
The ARC Challenge is a new dataset from AI2 that takes only the most complex questions, based on standardized science tests, from previous datasets and poses them to a question answering system. So far, progress on ARC has been slow: even current state-of-the-art solvers score barely above 40%. As part of a project last year, we began investigating how different types of questions could be classified and exactly what makes questions in the ARC dataset hard.
In order to propose a new methodology for answering standardized tests, it is necessary to understand and analyze the state-of-the-art approaches alongside corresponding baselines, including basic ones such as TextSearch and NGram search from Aristo. We hope to further analyze the ARC dataset to understand which parts make it challenging. The analysis will also involve comprehensive, comparative studies of the advantages and disadvantages of the existing approaches: in particular, (a) the types of standardized-test questions that each approach can answer correctly, and (b) the drawbacks of each approach on questions answered incorrectly. The project entails setting up the infrastructure, implementing or reusing some of the existing approaches, running them on the standardized-tests dataset, and comprehensively analyzing the dataset and the obtained results.
An interesting example to consider in the comprehensive analysis is attempting to connect question-categories to specific problem solving techniques. The NY Regents exam has the questions aligned to standards, which include things like (1) Abstraction and symbolic representation are used to communicate mathematically, (2) Deductive and inductive reasoning are used to reach mathematical conclusions, and (3) Critical thinking skills are used in the solution of mathematical problems.
By mapping each question to a specific problem-solving technique, or learning how to do this automatically, we may be able to improve the state of the art in deep question answering. As part of this project, you could research how to classify questions into particular problem-solving techniques, after defining an appropriate corpus of problem-solving techniques from a literature survey.
Tools & Environment: PyTorch, TensorFlow, scikit-learn, NLTK, Lucene/Solr, and (possibly) RDF stores
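A simple starting point for mapping questions to problem-solving categories is a TF-IDF bag-of-words classifier. The three labels below mirror the Regents standards quoted above; the questions themselves are invented for illustration:

```python
# Baseline question-type classifier: TF-IDF features + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

questions = [
    "Write an equation to represent the relationship between x and y",
    "Use symbols to express the pattern shown in the table",
    "Given that all mammals are warm-blooded, what can you conclude",
    "If the first three terms are 2, 4, 8, infer the next term",
    "Evaluate whether the student's argument about the proof is valid",
    "Critique the reasoning used to solve this word problem",
]
labels = ["symbolic", "symbolic", "deductive", "deductive",
          "critical", "critical"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(questions, labels)
train_acc = clf.score(questions, labels)
```

Such a baseline mainly serves to reveal which categories are separable from surface wording alone and which require the deeper analysis the project proposes.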
8. IBM Research: Fine-tuning BERT for AMR Parsing
Data Resources: Biomedical AMR Corpus, AMR Corpus
Tools & Environment: Python, PyTorch, AMR Parser
9. IBM Research: Learning to search over localized knowledge graphs for machine reading
This proposal focuses on learning to construct local structured knowledge from passages and to use that structured knowledge for question answering, with a specific focus on questions requiring multi-hop reasoning. An example of a question that might require multi-hop reasoning is “Who was the president of the United States when the Beijing Olympics took place?” Here a model would first have to solve the sub-question “In which year did the Beijing Olympics take place?” to answer the original question. Such questions require complex question analysis and answer aggregation over multiple documents. A possible direction is to gather multiple pieces of evidence (paragraphs, documents, etc.) from available resources (such as a search engine), form a localized knowledge graph over that evidence, and then design algorithms to reason over those network structures.
We could also link those local graphs to large general-purpose knowledge stores such as Freebase or Wikipedia to gather more background knowledge. For reasoning over these graphs to arrive at an answer, we could explore various directions: reinforcement learning for finding reasoning chains, graph neural networks such as GraphCNNs, and so on. We will work with large QA datasets such as HotpotQA, ComplexWebQA, and QAngaroo.
The research involves (1) defining a formulation of a localized knowledge graph (localKG) over passage(s); (2) designing algorithms to accurately construct such localKGs from passage(s), which potentially requires the capability to link to general KGs such as DBpedia, WikiData, or Freebase; and (3) modeling a reinforcement learning agent that learns to search over the localKGs to find the correct answers to questions.
Data Resources: HotpotQA, WikiHop, and ComplexWebQuestions
Tools & Environment: Python, PyTorch, Stanford CoreNLP
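The multi-hop pattern from the Beijing Olympics example can be sketched over a hand-written localKG of triples; in the project, such graphs would be extracted from retrieved passages rather than hard-coded:

```python
# Toy localized knowledge graph and a two-hop lookup answering the
# bridge question from the text above.
triples = [
    ("Beijing Olympics", "took_place_in", "2008"),
    ("George W. Bush", "president_during", "2008"),
    ("London Olympics", "took_place_in", "2012"),
    ("Barack Obama", "president_during", "2012"),
]

def lookup(subj, rel):
    return [o for s, r, o in triples if s == subj and r == rel]

def inverse_lookup(rel, obj):
    return [s for s, r, o in triples if r == rel and o == obj]

def president_during_event(event):
    # hop 1: resolve the sub-question "in which year did the event take place?"
    years = lookup(event, "took_place_in")
    # hop 2: aggregate the answer over the resolved year(s)
    return [p for y in years for p in inverse_lookup("president_during", y)]
```

The RL component of the proposal would learn which hops to take instead of following this fixed two-hop program.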
10. Lexalytics: Scientific key-phrase extraction
The goal of this project is to investigate methods for identifying the key concepts in long-form text. The direct business motivation is improving document summarization, but the techniques may also be applicable as a precursor step in a number of text applications. Broadly, our recommended approach is to use abstracts of academic papers as an easy approximation of a summary, with the full paper bodies as the long-form articles; use a concept or key-phrase extraction technique to identify units of discourse in both; optionally apply thesaural/similarity/clustering measurements to make approximate matches between the concepts in the abstract and in the article; and then use any techniques the students think appropriate (language models, embeddings, knowledge graphs, deep networks, etc.) to predict which concepts in the article will also appear in the abstract. The hope is that successful key-idea identification can later guide extractive or generative summarization techniques.
Data Resources: Arxiv, or another large academic paper dataset
Tools & Environment: Largely up to the students. TensorFlow or another deep learning framework seems natural.
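The supervision signal described above, i.e. which body concepts also appear in the abstract, can be sketched with exact n-gram matching; a real pipeline would use a proper key-phrase extractor plus the fuzzy/thesaural matching mentioned:

```python
# Label each candidate phrase from the body by whether it also occurs
# in the abstract. Lowercase exact n-gram matching for illustration.
def ngrams(text, n):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def candidate_labels(body, abstract, n=2):
    """Return {phrase: 1 if phrase also occurs in the abstract else 0}."""
    abstract_phrases = ngrams(abstract, n)
    return {p: int(p in abstract_phrases) for p in ngrams(body, n)}

abstract = "we study key phrase extraction for document summarization"
body = "key phrase extraction is hard and long documents make it harder"
labels = candidate_labels(body, abstract)
```

These (phrase, label) pairs are exactly the training examples the prediction step described above would consume.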
11. Microsoft: Develop industry solution sample notebooks using Azure Machine Learning
These solutions can be in any area but should include scale-up and scale-out training scenarios, plus interesting deployment scenarios, including potential uses of Spark, distributed TensorFlow and PyTorch, scaled-out hyperparameter searches, automated machine learning, IoT, etc. These examples will be hosted on our public GitHub repository and used extensively by our teams to help market and explain the capabilities of Azure ML. Your work can have a big impact!
Tools & Environment: Solutions will be provided as notebooks using open source technology on the Azure Machine Learning fabric
12. Microsoft Research Montreal: Building Reinforcement Learning agents for text-based games
Text-based games are complex, interactive simulations in which text describes the game state and players make progress by entering text commands. They are fertile ground for language-focused machine learning research. Successful play requires skills like language understanding/grounding/acquisition, long-term memory and planning, and exploration (trial and error) -- all in the context of a sequential decision-making problem.
The goal of this project is to develop a reinforcement learning (RL) agent that performs well on a suite of simplified text-based games. Likely methods to be explored include deep Q-learning (augmented with recurrent neural networks) and model-based RL. Learning to build knowledge-graph models of the environment, capturing both the layout of objects and their relations, has proved promising in previous work.
Optionally, the team may also submit the agent they develop to the ongoing TextWorld competition.
Data Resources: The TextWorld Learning Environment. TextWorld is an open-source, extensible engine that both generates and simulates text games. Using this framework, the complexity of training/test games can be carefully controlled and limited.
Tools & Environment: PyTorch or equivalent library for deep neural networks
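The RL loop itself can be illustrated with tabular Q-learning on a toy two-room text game; the project would replace the table with a recurrent deep Q-network over textual observations, and TextWorld would replace the hand-written game:

```python
# Tabular Q-learning on a toy text game: find the key in the hallway.
import random

# state -> command -> (next_state, reward); "win" is terminal
GAME = {
    "kitchen": {"go north": ("hallway", 0.0), "open fridge": ("kitchen", 0.0)},
    "hallway": {"go south": ("kitchen", 0.0), "take key": ("win", 1.0)},
}

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in acts} for s, acts in GAME.items()}
    for _ in range(episodes):
        state = "kitchen"
        for _ in range(20):                      # step limit per episode
            if rng.random() < eps:               # epsilon-greedy exploration
                a = rng.choice(list(Q[state]))
            else:
                a = max(Q[state], key=Q[state].get)
            nxt, r = GAME[state][a]
            future = 0.0 if nxt == "win" else max(Q[nxt].values())
            Q[state][a] += alpha * (r + gamma * future - Q[state][a])
            if nxt == "win":
                break
            state = nxt
    return Q

Q = train()
```

Even this toy makes the exploration problem visible: without the epsilon-greedy step, the agent can oscillate between rooms and never discover "take key".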
13. Oracle Labs: Learning Graph Embeddings for Companies in Financial Data
In this project, you'll research and apply methods for learning embeddings that integrate both natural language and graph-structured data. SEC filings are public documents, required by the US government for all publicly traded companies, that contain rich textual information about corporate actors and their relations to one another. We want to research and implement methods for incorporating the natural language data contained in filings (as well as other textual data sources) into our representations of these relations. Then, we'll evaluate the utility of these representations in several downstream tasks, such as link prediction and relation extraction.
Data Resources: SEC filings, Wikipedia, DBPedia
Tools & Environment: Python or JVM language of your choice (e.g. Java, Scala)
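One crude way to see why combining the two signals helps: score candidate company-company links with a weighted mix of shared graph neighbors and filing-text overlap. Company names, neighbors, and texts below are invented; a real system would learn joint embeddings rather than hand-weight features:

```python
# Toy link scoring combining a graph signal (Jaccard over neighbors)
# with a text signal (word overlap between filing descriptions).
neighbors = {"AcmeCorp": {"BankA", "SupplierX"},
             "BetaInc":  {"BankA", "SupplierX"},
             "GammaLLC": {"BankB"}}
texts = {"AcmeCorp": "industrial manufacturing equipment",
         "BetaInc":  "industrial manufacturing tools",
         "GammaLLC": "consumer software services"}

def jaccard(a, b):
    na, nb = neighbors[a], neighbors[b]
    return len(na & nb) / len(na | nb) if na | nb else 0.0

def text_overlap(a, b):
    ta, tb = set(texts[a].split()), set(texts[b].split())
    return len(ta & tb) / len(ta | tb)

def link_score(a, b, w=0.5):
    return w * jaccard(a, b) + (1 - w) * text_overlap(a, b)
```

Companies that look similar both structurally and textually score highest, which is the behavior the learned joint embeddings should reproduce and improve on.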
14. Quantiphi: Disease Progression from Clinical Notes
Chronic diseases, such as diabetes and chronic obstructive pulmonary disease, usually progress slowly over a long period of time, causing an increasing burden on patients, their families, and the healthcare system. A better understanding of their progression is instrumental in early diagnosis and personalized care. Modeling disease progression from real-world evidence is a very challenging task due to the incompleteness and irregularity of the observations, as well as the heterogeneity of patient conditions, but extracting such information at scale (e.g., from clinical notes) is key for epidemiological studies and for understanding how a disease develops. Clinical notes are a rich source of content-derived metadata. We intend to use these metadata representations (embeddings) to classify the medical subdomain of a note accurately and to determine the progression of disease.
Data Resources: OpenSource EHR Dataset
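Classifying the medical subdomain of a note can be sketched with a bag-of-words nearest-centroid baseline, a minimal stand-in for the embedding-based representations described above; the labeled notes are synthetic:

```python
# Nearest-neighbor subdomain classifier over bag-of-words vectors.
from collections import Counter
import math

LABELED = {
    "cardiology": "chest pain ecg showed irregular heart rhythm",
    "pulmonology": "shortness of breath copd exacerbation wheezing on exam",
    "endocrinology": "elevated hba1c diabetes insulin dose adjusted",
}

def bow(text):
    return Counter(text.lower().split())

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def classify(note):
    v = bow(note)
    return max(LABELED, key=lambda lab: cosine(v, bow(LABELED[lab])))
```

Swapping `bow` for learned note embeddings keeps the same interface while capturing synonymy and context that word overlap misses.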
15. Scripps Research: Identification and Differentiation of Diseases and Phenotypes
There are some good models for named entity recognition (NER) and entity linking (EL) in biomedical text for a range of entity types, including genes, proteins, and chemicals. However, one task that remains difficult is the identification and differentiation of diseases and phenotypes. Diseases are clinical diagnoses (e.g., "Parkinson disease"), while phenotypes are observable traits or characteristics (e.g., "tremor"). Accurate NER of diseases and phenotypes in the literature, and EL to relevant ontologies, would be useful for clinical decision support systems.
There are several challenges in this NER/EL task. Expert curation is expensive, so gold standard data sets are limited. And even among experts, there is significant ambiguity and overlap between diseases and phenotypes (e.g., "asthma"). Finally, the ontologies for both diseases and phenotypes are incomplete.
Here, we propose a disease/phenotype NER/EL project that would combine annotations across three different sources – domain experts, citizen scientists, and deep learning models. Each of these sources offers a different balance of accuracy versus throughput.
Data Resources: Crowd data (to be generated via Amazon Mechanical Turk or an existing citizen science crowd)
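Combining the three annotation sources can be sketched as a weighted vote, with weights reflecting that expert labels are more accurate but scarcer than crowd or model labels. The weights and votes below are illustrative only:

```python
# Weighted-vote aggregation of disease/phenotype annotations from the
# three sources named above. Weights are illustrative, not calibrated.
WEIGHTS = {"expert": 3.0, "crowd": 1.5, "model": 1.0}

def aggregate(annotations):
    """annotations: list of (source, label) pairs for one mention,
    with label in {"disease", "phenotype"}. Returns the weighted winner."""
    totals = {}
    for source, label in annotations:
        totals[label] = totals.get(label, 0.0) + WEIGHTS[source]
    return max(totals, key=totals.get)

votes = [("crowd", "phenotype"), ("model", "phenotype"), ("expert", "disease")]
```

A refinement would be to learn per-source reliabilities from overlap data (e.g., a Dawid-Skene-style model) instead of fixing the weights by hand.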