There will be four presentations from the inaugural UMass Amherst Data Science MS Student Independent Study Project Program with Industry Mentors, organized by Professor Andrew McCallum.
Metadata extraction from research publications
Molly McMahon, Sheshera Mysore, Akul Siddalingaswamy, Aditya Narasimha Shastry
Mentor: Meta/Chan Zuckerberg Initiative - Ofer Shai, Shankar Vembu
Abstract: In this project, we examine the use of advanced techniques to automate and improve the very first step of the manuscript processing pipeline. The task involves identifying and extracting elements from the PDF file, including title, abstract, authors, and affiliations. Unlike published articles, PDF manuscripts arrive with no consistent formatting and with added headers for editors and reviewers, line numbers, and other noisy elements. Cleanly labeling the metadata is crucial for proper downstream processing. We will leverage existing PDF parsing technology that provides detailed information about the text and formatting of the manuscript. We will experiment with logistic regression, feed-forward neural networks, and bidirectional LSTMs.
Concept/Theme Roll Up
Tanvi Sahay (Presenter). Ramteja Tadishetti, Ankita Mehta, Shruti Jadon
Mentor: Lexalytics Inc. - Paul Barba, Al Hough and Brian Pinette
Abstract: The main ideas in any set of sentences can be represented as a set of key phrases that provide information about the theme(s) of those sentences. One downstream application of interest is to roll up similar themes together, so that a user can query all phrases that belong to a particular theme without wading through information that is not of direct interest to her. This is particularly useful in the domain of hotel reviews, where a user may be more interested in the location of the hotel than in the type of food it serves and thus only wants to see reviews about the location. We approach this problem by building a distributed representation of the phrases (we have experimented with various methods for this) and clustering similar phrases together using this representation.
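As a rough illustration of the roll-up idea, the sketch below averages word vectors to represent each phrase and groups phrases whose vectors are close in cosine similarity. Everything in it is an assumption made for illustration: the abstract does not say which representation or clustering method the team uses, the three-dimensional `TOY_VECTORS` are invented, and the names `phrase_vector` and `roll_up` are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
TOY_VECTORS = {
    "location": [0.9, 0.1, 0.0],
    "central": [0.8, 0.2, 0.0],
    "breakfast": [0.0, 0.9, 0.1],
    "food": [0.1, 0.95, 0.0],
    "tasty": [0.05, 0.8, 0.1],
    "great": [0.3, 0.3, 0.3],
}


def phrase_vector(phrase):
    """Represent a phrase as the average of its known word vectors."""
    words = [w for w in phrase.lower().split() if w in TOY_VECTORS]
    if not words:
        return None
    dims = len(TOY_VECTORS[words[0]])
    return [sum(TOY_VECTORS[w][i] for w in words) / len(words) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def roll_up(phrases, threshold=0.9):
    """Greedy clustering: a phrase joins the first cluster whose seed vector
    is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, member_phrases)
    for phrase in phrases:
        vec = phrase_vector(phrase)
        if vec is None:
            continue  # skip phrases with no known words
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(phrase)
                break
        else:
            clusters.append((vec, [phrase]))
    return [members for _, members in clusters]


themes = roll_up(["great location", "central location", "tasty food", "great breakfast"])
print(themes)
```

With these toy vectors, the location phrases and the food phrases end up in separate groups. A real system would use learned embeddings and a more robust clustering method.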
Multilingual Embeddings using ACS for Cross-Lingual NLP
Nitin Kishore, Daniel Sam Pete Thiyagu, Shamya Karumbaiah
Mentor: Oracle - Michael Wick and Pallika Kanani
Abstract: Oracle is a multinational corporation that develops products and builds tools in many different languages. An important practical problem is to make natural language processing (NLP) tools (document classification, named entity recognition, etc.) available in every such language. Traditionally, an NLP practitioner would collect training data in every language for every task for every domain, but such data collection is expensive and time-consuming. Further, many resources available in a language such as English are not available in languages with fewer speakers. In this project, we want to explore a solution to multilingual NLP that does not require labeling so much data. In particular, we would like to harness unlabeled multilingual data to learn a common representation under which structure is shared across different languages. For example, in such a space, the vector for the English word "good" is close to the vector for the French word "bon." Then, by employing artificial code switching (ACS) and using the multilingual representation as features, we can train a classifier in one language and have it generalize to other languages without much additional labeled data. In this project we
(1) explore how to learn a good multilingual representation,
(2) study how the number and class of languages affect the quality of the multilingual embedding space, and
(3) study how well the multilingual representations allow us to transfer NLP models across different languages.
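One way to picture artificial code switching is as a data-augmentation step in which some words in a sentence are swapped for their translations, so that words from different languages appear in shared contexts during embedding training. The sketch below follows that reading. It is an assumption, not the project's implementation: `TOY_DICTIONARY`, `code_switch`, and the `swap_prob` parameter are hypothetical names and values chosen for illustration.

```python
import random

# Hypothetical English-to-French dictionary, invented for illustration only.
TOY_DICTIONARY = {
    "good": ["bon"],
    "movie": ["film"],
    "very": ["très"],
}


def code_switch(tokens, dictionary, swap_prob, rng):
    """Replace each token that has a translation with probability swap_prob.

    Tokens without a dictionary entry are always kept unchanged.
    """
    switched = []
    for token in tokens:
        translations = dictionary.get(token.lower())
        if translations and rng.random() < swap_prob:
            switched.append(rng.choice(translations))
        else:
            switched.append(token)
    return switched


rng = random.Random(0)  # fixed seed so repeated runs give the same mixture
sentence = "a very good movie".split()
print(code_switch(sentence, TOY_DICTIONARY, 0.5, rng))
```

Text mixed in this way could then be fed to a standard embedding trainer, so that "good" and "bon" are pushed toward nearby vectors because they occur in the same contexts.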
Career Path Analysis With Topical Sequence Models
Dan Saunders, Ananya Suraj (Presenters). Kartik Chhapia, Suraj Subraveti
Mentor: Center for Data Science - Matt Rattigan
Abstract: Previous efforts (Mimno and McCallum, 2008) have demonstrated the usefulness of topic models for understanding the dynamics of the job market. Using a corpus of resumes as training data, we can build a topic model that captures the important facets of a resume, where the “topics” are distributions over words typically found in job descriptions. We can use these topics to construct a “topical sequence” model to predict job transitions for individuals over time. The goal of this project is to build on the previous work in this area and expand its scope to better understand workforce characteristics more generally. Example questions that we try to answer are: What is the next role for a particular person given their resume? What types of roles have the most variability in terms of career path?
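To make the job-transition idea concrete, the sketch below treats each job in a resume as a single topic label and estimates first-order transition probabilities between labels. This is a deliberately simplified stand-in, not the model from the cited work or from this project: it assumes each job has already been assigned one dominant topic, and the toy careers and the names `fit_transitions` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transitions(careers):
    """Estimate P(next topic | current topic) from sequences of topic labels."""
    counts = defaultdict(Counter)
    for path in careers:
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }


def predict_next(transitions, current):
    """Return the most probable next topic, or None if the topic was never
    observed with a successor."""
    options = transitions.get(current)
    if not options:
        return None
    return max(options, key=options.get)


# Toy career paths, invented for illustration; each entry is a job's topic label.
careers = [
    ["engineer", "senior engineer", "manager"],
    ["engineer", "senior engineer", "architect"],
    ["engineer", "manager"],
    ["senior engineer", "manager"],
]
transitions = fit_transitions(careers)
print(predict_next(transitions, "engineer"))
```

The spread of the probabilities in `transitions[role]` also gives a crude handle on the second question in the abstract: roles whose outgoing probabilities are spread over many successors have more variable career paths.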