Project Titles & Abstracts:
Drug Target Interaction Prediction using Deep Representation Learning
Students: Janani Krishna, Sanskruti Shah, Bhanu Teja Gullapalli, Prannoy Mupparaju
Industry Mentor: IBM
Finding small molecules able to bind to a specific protein target is a critical aspect of drug discovery. In this project, using publicly available data on known small molecule-protein bindings from structured sources such as BindingDB and PubChem, we investigate recently proposed deep learning representations for chemical structures and protein sequences to make drug-protein binding predictions. We propose an end-to-end model that predicts drug-target interactions from common unprocessed representations of drugs and proteins. Drugs are represented as graphs, with the constituent atoms as nodes and the bonds as edges between them; proteins are represented as sequences of amino acids. We apply graph convolutions to drugs and temporal convolutions to proteins to learn their fingerprints. We compare the performance of networks that use hand-engineered features to that of our end-to-end network.
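The graph-convolution step described above can be illustrated with a minimal numpy sketch. This is not the project's actual model: the toy molecule, atom features, weight matrix, and sum-pooling below are all hypothetical stand-ins for learned components.

```python
import numpy as np

def graph_conv_layer(A, H, W):
    """One graph-convolution step: each atom aggregates its neighbors'
    features (plus its own) and applies a shared linear map with ReLU."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # node degrees
    H_new = (A_hat @ H) / deg               # mean-aggregate neighbor features
    return np.maximum(0, H_new @ W)         # shared weights + ReLU

# Toy 3-atom chain molecule (bonds 0-1 and 1-2), with 4-dim atom features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.eye(3, 4)                                     # hypothetical atom features
W = np.random.default_rng(0).normal(size=(4, 8))     # hypothetical layer weights
fingerprint = graph_conv_layer(A, H, W).sum(axis=0)  # pool atoms into one vector
```

Stacking several such layers and pooling over atoms yields a fixed-size molecular fingerprint regardless of molecule size, which is what makes the graph representation attractive as an end-to-end input.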
Crowdsourcing methodology for Atypical “Common-Sense” Hypernym Pairs at Large Scale
Students: Akanksha Gupta, Anubhav Singhal, Gaurav Anand & Mili Shah
Industry Mentor: Google
Lexical and phrasal entailment (e.g., “love” is a “powerful neurological condition” and “love” is a “hunger”) in natural language understanding requires a large knowledge base of hypernym pairs. We explored a cost- and time-efficient crowdsourcing approach to extract a high-quality common-sense hypernym ontology dataset. Our work involved (1) designing contextual tasks, where we show sentences and a hyponym, and verification tasks, where we verify the hypernyms collected from the contextual tasks, (2) extracting sentences containing Microsoft Concept Graph pairs and Hearst patterns from Wikipedia, BBC, and Books corpora, and (3) evaluating sentence and worker quality.
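Hearst patterns like the one sketched below are one way candidate sentences are found in the corpora. This is a simplified illustration covering a single pattern ("X such as Y") over single-word terms; the example sentence is invented and real extraction handles many more patterns and multi-word phrases.

```python
import re

# One classic Hearst pattern: "<hypernym> such as <hyponym>".
SUCH_AS = re.compile(r"(\w+) such as (\w+)")

def extract_pairs(sentence):
    """Return (hypernym, hyponym) candidates from a single sentence."""
    return [(m.group(1), m.group(2)) for m in SUCH_AS.finditer(sentence)]

pairs = extract_pairs("He studied emotions such as love and anger.")
# pairs -> [('emotions', 'love')]
```

Sentences matched this way can then be shown to crowd workers in the contextual tasks, with the verification tasks filtering out spurious matches.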
New Skill Extraction from Wikipedia
Students: Ly Harriet Bui, Bhuvana Surapaneni, Rishi Mody, Rishikesh Jha
Mentor: Burning Glass Technologies
The Burning Glass Technologies (BGT) skill taxonomy consists of 15K skills, which are further clustered into 500 skill clusters. The goal of the project is to enhance the current BGT taxonomy by leveraging both unstructured and structured Wikipedia data. The project focuses on extracting new skills from selected domains using unstructured Wikipedia data and creating a skill knowledge graph of the existing taxonomy using Wikidata. To extend our knowledge graph to include skills that don’t have a Wikidata entry, we have built an entity relationship model that extracts new relationships among those skills from unstructured Wikipedia text.
Automatically Solving Algebra Word Problems with Structured Prediction Energy Networks
Students: Arpit Jain, Gota Gando, Krishna Prasad Sankaranarayanan, Nikhil Yadav
Mentor: Center for Data Science
To solve algebra word problems automatically, typical approaches find a transformation from the given word problem into a set of equations that correctly represents it. We approach this task with structured prediction energy networks (SPENs), energy-based models into which we can inject the structural knowledge of the task. We compare our SPEN to a simple greedy baseline to determine the effectiveness of modeling structural dependencies in this problem.
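The SPEN inference idea can be sketched in a few lines of numpy: prediction is gradient descent on a relaxed (continuous) label vector under a learned energy function. The quadratic toy energy below is purely illustrative, standing in for the neural energy a real SPEN would learn.

```python
import numpy as np

def minimize_energy(energy_grad, n_labels, steps=100, lr=0.1):
    """SPEN-style inference: descend the energy over a relaxed label
    vector y, clipping back into [0, 1] after each step."""
    y = np.full(n_labels, 0.5)
    for _ in range(steps):
        y = np.clip(y - lr * energy_grad(y), 0.0, 1.0)
    return y

# Hypothetical toy energy E(y) = ||y - t||^2 with a fixed target t,
# so inference should recover t (gradient is 2 * (y - t)).
t = np.array([1.0, 0.0, 1.0])
y_hat = minimize_energy(lambda y: 2 * (y - t), n_labels=3)
```

The appeal for word problems is that structural constraints between output variables (e.g. which quantities co-occur in an equation) can be expressed inside the energy rather than decoded greedily.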
Transfer learning for Question Answering in New Domains
Students: Rohith Pesala, Udit Saxena, Rheeya Uppaal, Lopamudra Pal, Ishita Ankit
Mentor: Microsoft Research - Maluuba
We explore the problem of transferring knowledge learned from previously collected datasets and trained deep learning models to tasks for which there isn't sufficient labeled data, the cost of collecting labeled data is too high, or the task is completely different from ones previously encountered. We consider the task of closed-domain reading comprehension: question answering using the Stanford Question Answering Dataset (SQuAD) as the source domain and the NewsQA dataset as our transfer domain.
We explore joint training and active learning, which emulate the real-world scenario of having a small amount of data in the target domain and a large quantity of data from the source domain. We also explore adversarial methods, where we hope to learn domain-specific and domain-invariant features through adversarial training and optimizing a minimax loss function. The motivation for the adversarial methods is to use the available unlabeled data to our advantage.
Citation Field Extraction
Students: Jay Shah, Ankur Tomar, Shuying Guan, Chesta Singh
Mentor: Chan Zuckerberg Initiative
Citation strings from the reference section of scientific papers can be used to create manuscript profiles that help authors and publishers determine whether a manuscript is suitable for the publisher's portfolio. The manuscript profiles can also give information about researcher networks and research trends, mapping different scientific sub-communities in the process. In this project, we concentrated on the first step of the manuscript processing pipeline: automating the task of parsing and labeling citation strings into fine-grained entities. To this end, we collected and prepared a curated dataset for this task. To automate the labeling task, we implemented machine learning methods such as CRFs, Bi-LSTMs, and dilated CNNs. We also explored the use of lexicons and additional features extracted from the citation strings to enhance the performance of these models.
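At decoding time, a linear-chain CRF such as the one used here labels a citation string with Viterbi search over emission and transition scores. The sketch below is a generic Viterbi implementation with invented two-label scores (labels 0 and 1 standing in, hypothetically, for fields like AUTHOR and TITLE), not the project's trained model.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best label sequence under a linear-chain CRF score: the sum of
    per-token emission scores and label-to-label transition scores."""
    T, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t]  # (prev, cur)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy scores for a 3-token string and 2 hypothetical field labels.
em = np.array([[2.0, 0.0], [1.5, 0.2], [0.0, 2.0]])
tr = np.array([[0.5, -0.5], [-1.0, 0.5]])  # transitions[i, j]: label i -> j
best = viterbi(em, tr)  # -> [0, 0, 1]
```

The transition matrix is what lets the CRF capture field-order regularities (e.g. authors tend to precede titles) that per-token classifiers miss.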
A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset
Students: Michael Boratko, Divyendra Mikkilineni, Harshith Padiglea, Pritish Yuvraj
We propose a comprehensive set of definitions of knowledge and reasoning types necessary for answering the questions in the AI2 Reasoning Challenge (ARC) dataset. Using ten annotators and a sophisticated annotation interface, we analyze the distribution of labels across the "challenge" set and statistics related to them. Additionally, we confirm an observation made in the original paper by demonstrating that, although naive information retrieval methods return sentences that are irrelevant to answering the query, sufficient supporting text is often present in the (ARC) corpus. Evaluating with human-selected relevant sentences improves the performance of a neural machine comprehension model by 42 points.
Visual Question Answering
Students: Anish Pimpley, Srideepika Jayaraman, Shruti Gullapuram, Srikanth Grandhe
Mentor: Microsoft Research - Maluuba
Visual question answering requires an efficient joint representation of text and images in order to perform reasoning. This is a challenging problem because reasoning about the real world requires understanding how different objects in a scene interact and behave with each other. To build systems that can reason, we need to incorporate concepts such as compositionality, physics, and world knowledge, which are trivial for humans but not for current intelligent systems. We explore this task via the specific problem of question answering over plots and figures using the recently released FigureQA dataset. We build on the ideas of task-specific architectures such as Relation Networks and task-generic architectures like FiLM to improve the state-of-the-art performance on the FigureQA dataset.
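The core FiLM operation mentioned above is small enough to sketch directly: the question conditions the visual pathway by scaling and shifting each feature channel. In the sketch, the feature map and the gamma/beta vectors are random placeholders; in a real FiLM network, gamma and beta are predicted from the question encoding.

```python
import numpy as np

def film(feature_map, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel of a
    conv feature map using question-conditioned parameters."""
    # feature_map: (channels, H, W); gamma, beta: (channels,)
    return gamma[:, None, None] * feature_map + beta[:, None, None]

rng = np.random.default_rng(0)
F = rng.normal(size=(8, 4, 4))   # hypothetical conv features of a figure
gamma = rng.normal(size=8)       # in practice, predicted from the question
beta = rng.normal(size=8)
out = film(F, gamma, beta)
```

Because the modulation is per-channel rather than per-pixel, it stays cheap while still letting the question steer which visual features the network attends to.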
Gene Ontology: Evidence Code Classification
Students: Neha Choudhary, Aaron Traylor, Ashish Ranjan, Twinkle Tanna
In the biological literature, genes and proteins are referred to using a wide variety of terminology. Given the huge volume of publications every year and the inherent diversity of the field, biologists currently spend a significant amount of time and effort searching for information about genes and proteins. The Gene Ontology (GO) is a collaborative project focused on combining information about genes into one integrated database. The ontology covers three broad domains: cellular components, molecular functions, and biological processes. However, only ~5% of gene annotations are currently manually curated and hence authentic; the remaining ~95% are electronically inferred and have not been verified manually. Although electronically inferred sources have significantly increased GO coverage, research shows that inequality across annotation resources can lead to significant bias in biomedical research. This project attempts to make GO more comprehensive and trustworthy by building a classifier that identifies the type of evidence to assign to a GO annotation. This evidence detector can then be applied to the electronically inferred annotations to assess their validity.
Machine Learning Based Ticket Classification
Students: Aishwarya Sudhakar, Ariel Reches, Christopher Watson, Kruti Chauhan
Mentor: Pratt and Whitney
The goal of the project is to build a machine learning based ticket classification model. The model takes incoming tickets in a system and assigns each one to an analyst, learning from descriptive and categorical domain-specific text. Classification can be supervised, assigning each ticket to a specific analyst, or unsupervised, grouping tickets into clusters; the better-performing approach will be chosen. We also experiment with a Bi-LSTM, which learns semantics rather than relying on a bag-of-words (BOW) representation, for ticket assignment.
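A bag-of-words routing baseline of the kind the Bi-LSTM is compared against can be sketched with the standard library alone. The analyst profiles, team names, and ticket text below are invented for illustration; a real baseline would use TF-IDF weighting and a trained classifier rather than raw word overlap.

```python
from collections import Counter

def bow(text):
    """Bag-of-words vector (word counts) for a ticket description."""
    return Counter(text.lower().split())

def overlap(a, b):
    """Shared-word score between two bag-of-words vectors."""
    return sum((a & b).values())

def assign(ticket, profiles):
    """Route a ticket to the assignee whose past tickets it most resembles."""
    v = bow(ticket)
    return max(profiles, key=lambda name: overlap(v, profiles[name]))

# Hypothetical profiles built from each team's historical tickets.
profiles = {"network_team": bow("vpn outage router firewall vpn"),
            "hr_team": bow("payroll benefits leave payroll")}
who = assign("cannot connect to vpn after router change", profiles)
# who -> 'network_team'
```

The limitation this baseline exposes is exactly the one motivating the Bi-LSTM: word overlap cannot distinguish "access revoked by firewall" from "revoke firewall access", whereas a sequence model sees the word order.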
STRess Analysis INtervention (STRAIN)
Students: Rasmus Lundsgaard Christiansen, Abhinav Shaw, Ravi Agrawal, Rahul Handa
Tracking and understanding student stress levels is crucially important to college and university administrators. Monitoring stress levels allows institutions to intervene and improve not only student wellbeing and safety but also retention and graduation rates. Recent work by Wang et al. (2014) demonstrates the feasibility of recording personal student health data on a college campus. In this work, we show how such data can be used to model and predict student stress levels. The collected data take the form of multivariate time series containing activity details, location details, conversation details, etc. We present two approaches: one based on aggregating features from the time series, and a second that uses recurrent neural networks. We find that our RNN approach predicts stress levels 30 percentage points better than classical machine learning models. Furthermore, we present a new data-collection system based on the Fitbit Ionic device. The device offers passive collection of new features, including heart rate and more advanced sleep detection, as well as a more accessible way to collect user surveys and to involve the user by intervening (visually and with haptics) when stressful behavior is predicted.
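The feature-aggregation approach can be sketched as collapsing each sensor stream into a handful of summary statistics per window. The feature names and the sample values below are hypothetical; the project's actual feature set is richer.

```python
import numpy as np

def aggregate(series):
    """Hand-crafted summary features for one window of a sensor stream
    (e.g. a student-day of activity levels), as in an aggregation baseline."""
    s = np.asarray(series, dtype=float)
    return {"mean": s.mean(), "std": s.std(),
            "min": s.min(), "max": s.max(),
            "range": s.max() - s.min()}

feats = aggregate([3.0, 5.0, 4.0, 8.0])  # hypothetical activity readings
```

These fixed-length vectors feed classical classifiers, whereas the RNN consumes the raw multivariate sequence directly, which is one plausible reason for its reported advantage.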
UMLS Entity Recognition and Linking in Biomedical Journals
Students: Shikha Agarwal, Sneha Bhattacharya, Nathan Greenberg, Srijan Mishra, Yuvraj Singla
Mentor: Chan Zuckerberg Initiative
The UMLS dataset is a collection of title and abstract excerpts from biomedical journals. We explore the use of deep learning techniques to automate entity recognition and linking in this dataset. Entity recognition is the task of recognizing groups of words as entities in a given document; entity linking is the task of linking a recognized entity to a specific entity in the lexicon. We are exploring a Bi-LSTM-CRF model for entity recognition and a separate LSTM-based, modular entity linking system. We compare the results of these models against those of TaggerOne, a semi-Markov model that learns both tasks jointly.
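The glue between the two stages can be sketched as converting the tagger's per-token BIO output into entity spans for the linker. This is a generic sketch with invented example tokens, not the project's code; real schemes use typed tags like B-Disease rather than bare B/I/O.

```python
def bio_to_spans(tokens, tags):
    """Convert per-token BIO tags (a Bi-LSTM-CRF's output format) into
    (entity_text, start, end) spans ready for the linking stage."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes last span
        if tag.startswith("B") or tag == "O":
            if start is not None:
                spans.append((" ".join(tokens[start:i]), start, i))
                start = None
        if tag.startswith("B"):
            start = i
        elif tag.startswith("I") and start is None:
            start = i                            # tolerate stray I- tags
    return spans

spans = bio_to_spans(["BRCA1", "mutations", "cause", "breast", "cancer"],
                     ["B", "I", "O", "B", "I"])
# spans -> [('BRCA1 mutations', 0, 2), ('breast cancer', 3, 5)]
```

Each extracted span is then passed independently to the linker, which is what makes the two-stage pipeline modular compared with TaggerOne's joint model.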
Resolving Polysemy for NLP Applications
Students: Amol Agarwal, Vinayak Mathur, Anirudha Desai, Ananya Ganesh
The correct resolution of the multiple senses of a polysemous word is crucial for many downstream natural language processing applications. In this work, we propose a more efficient and interpretable way to perform word sense induction (WSI) by building global non-negative vector embedding bases (which are interpretable, like topics) and clustering them for each polysemous word. By adopting Distributional Inclusion Vector Embeddings (DIVE) as our basis formation model, we avoid the expensive nearest neighbor search that plagues other graph-based methods without sacrificing the quality of sense clusters. Experiments on three datasets show that our proposed method produces similar or better sense clusters and embeddings compared with previous state-of-the-art methods while being significantly more efficient. We then extend this approach to sentiment analysis, proposing a novel method that jointly solves WSI and sentiment analysis by efficiently injecting sentiment information during the WSI stage in order to discover sentiment-aware senses of each word. Our experiments show that this semi-supervised method provides a more interpretable solution to the sentiment analysis problem.
Unsupervised user modeling to detect compromised email accounts
Students: Mohit Surana, Mudit Bhargava, Rohini Kapoor, Saranya Krishnakumar
Social engineering (“phishing”) attacks are a major threat to the security of governments, organizations, and individuals. Particularly dangerous are attacks launched from within an organization by compromised user email accounts. In this work, we explore unsupervised learning techniques to build representations of the typical behavior of users in an organization which could form the basis for detecting abnormal behavior from a compromised email account. Once we are able to predict a user's normal behavior, any large deviation from the behavior can be tagged as an anomaly.
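The deviation-flagging step can be sketched as scoring each new day of activity against a per-user historical profile. The features, thresholds, and numbers below are hypothetical; the project's learned representations replace these simple per-feature statistics.

```python
import numpy as np

def anomaly_score(history, day):
    """Deviation of one day's activity from the user's historical profile,
    measured as the largest per-feature z-score; high values flag anomalies."""
    mu = history.mean(axis=0)
    sigma = history.std(axis=0) + 1e-8   # avoid division by zero
    return float(np.abs((day - mu) / sigma).max())

# Hypothetical daily features: [emails sent, distinct new recipients]
history = np.array([[20, 1], [25, 2], [22, 1], [23, 2]], dtype=float)
normal_day = np.array([24, 2], dtype=float)
attack_day = np.array([22, 40], dtype=float)  # sudden burst of new recipients
```

A day scoring far above the user's usual range (here, the burst of new recipients) would be surfaced for review, while ordinary fluctuation stays below threshold.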