University of Massachusetts Amherst


Data Science Applications and Industry Collaboration

UMass Amherst is also internationally renowned for many applications of data science. The following are just a few examples:


Business logistics: The collection of data about business operations has become nearly ubiquitous due to a combination of inexpensive sensors, GPS tracking, RFID coding, optical scanning, and other technologies. This has driven a revolution in logistical awareness: businesses now know what they are doing. The next revolution will come when businesses know what they should be doing to achieve particular goals of customer service, efficiency, inventory control, etc. This next revolution will require data science that moves beyond descriptive analysis toward analyses that are predictive and prescriptive. These latter goals will require statistical modeling of complex relational, spatial, and temporal data, and models that are both predictive and causal.
Jensen is working with Pratt & Whitney to model aircraft engine manufacturing and shop visits. Diao works with Cisco and BAE on data stream processing, and has developed methods for probabilistic reasoning about RFID tracking data in manufacturing and shipping contexts. McCallum has collaborated on trend analysis in career trajectory dynamics from large-scale resume data. Krishnamurthy is working with Microsoft on interactive learning for complex prediction problems.


Finance & Insurance: Although modeling has always been a part of some areas of finance (such as risk management and portfolio analysis), data-intensive techniques have been less common. That is beginning to change, and many believe that big data will play an important role in capital markets, fund management, insurance, and retail banking. Recently, researchers have mined Twitter logs to find signals that could predict economic indicators such as inflation and growth before standard surveys are released. Others have proposed that crowd-sourced data or arrays of sensing apparatus could provide live streams of information for use as market indicators.
Jensen is collaborating with MassMutual to identify key customer groups; he previously worked with NASD to develop new fraud detection methods. O’Connor has worked on interpretation of large-scale Twitter data and other news media. Diao has developed methods for handling streaming data relevant to market data. McCallum has developed models of topical trends in text, with applications to market data. Levine has worked on improved privacy for Bitcoin.


Health & Biomedicine: The increased adoption of electronic health record systems by hospitals and individual practitioners following the introduction of the Health Information Technology for Economic and Clinical Health Act of 2009 is resulting in a rapidly expanding volume of clinical data. Simultaneously, emerging mobile health (mHealth) technologies (smartphones, smartwatches and wearable physiological sensors) are enabling the collection of ever-larger volumes of physiological, behavioral, and activity data in non-clinical settings. The use of data science tools to curate, analyze, and derive value from these data sources is an exciting area of research with enormous potential for positive societal impact.
Marlin is collaborating with multiple hospitals on methods for analyzing electronic medical records. Ganesan and Marlin have been instrumental in developing next-generation wearable health sensors and data analytics through their “Mobile Data to Knowledge” Big Data Center of Excellence, sponsored by the National Institutes of Health. Clarke and Osterweil have collaborated with hospitals nationally, using data to develop new techniques for detecting errors and safety vulnerabilities. McCallum has developed competition-winning natural language processing methods for bioinformatics.


Information economy and social networks: Billions of people around the world write their thoughts and opinions in social media and connect to each other in social networks. This yields a vast trove of fine-grained behavioral data that we could use to better understand society, for example by predicting consumer behavior, analyzing political trends, forecasting outbreaks of crime, tracking health behaviors, or understanding how disease spreads. But current computational analysis techniques are highly imperfect for this type of data. For example, casual conversation is full of non-standard spellings ("lol", "idk") that pose significant problems for standard natural language processing tools, which are brittle and work reliably only on well-edited newspaper text. Better artificial intelligence systems could give us a deeper understanding of meaning in social media. Interdisciplinary research that combines this with the social sciences will allow much better social insight for policy and business applications, and will help practitioners design better online experiences for users.
McCallum has worked closely with Google, Yahoo, Oracle and multiple federal agencies on entity resolution/disambiguation and information extraction from text, the web and database sources. Croft has collaborated with over 50 companies and government agencies to develop information retrieval solutions; for example, he is working with Adobe on mining and searching for opinions in social media. Meliou is collaborating with Google on automatic data cleaning: identifying errors in databases, finding patterns in irregularities, and in some cases automatically patching the errors. Houmansadr has developed methods for network traffic analysis and avoiding Internet censorship.


Smart cities: In the future, cities will account for nearly 90% of global population growth, 80% of wealth creation, and 60% of total energy consumption. We must improve our understanding and develop new strategies for city planning, management and interventions that are enabled by emerging data science technologies, such as sensors, data gathering, data integration and data analysis, all at large scale. Doing so will also enable many new business and entrepreneurial opportunities.
Sheldon has worked with municipal and state-wide planners, using graph analysis to improve the efficiency and impact of repair crews.


Natural sciences: Materials science, geoscience, physics, chemistry, astronomy and most other sciences have progressed from recording observations in lab notebooks to using digital instruments capable of recording terabytes of data. This vast quantity of data creates groundbreaking opportunities for discovery.
For example, McCallum is collaborating with materials scientists to automatically extract the “recipes” for new battery materials from hundreds of thousands of research articles, and then build pattern analyses on these recipes to suggest promising new recipes for further advances. Moss has collaborated with astronomers studying the shape of the Milky Way galaxy, working on efficient methods for loading, transforming, and computing likelihood distributions of physical models from observed data.


Education: Online instruction can record nearly every interaction between a student and the system, including keystrokes and small bodily movements. Patterns in the data can provide knowledge about student performance, collaboration and learning approaches. Data helps to formulate new scientific hypotheses about human learning and to gather new evidence about education’s ‘wicked problems,’ such as performance gaps that produce cycles of underachievement and cultural-racial differences in learning. Data analysis might identify children with similar learning difficulties, successful teaching strategies, and gender differences in problem-solving or collaboration. Instructors can analyze what students know on an hourly or weekly basis and identify the techniques that are most effective for each pupil. Data gathering tools include body sensors to measure emotion (eye tracking, body posture, facial features, and mutual gaze), digital bracelets, RFID chips in student ID cards, multi-tabletop environments, the ‘internet of things’ (e.g. laboratory instruments for science labs) and the quantified self (e.g. phones and watches). Human learning has become more accessible through educational data; the grain of recordable and analyzable data has become finer and the grain size of possible interventions more focused. Human learning can be studied in far more nuanced ways, potentially improving research, evaluation, and accountability.
Woolf, a White House Presidential Innovation Fellow, collaborates with Apple, Microsoft and the Gates Foundation to build electronic tutoring systems that adapt to students' needs for personalized learning. Moll has developed interactive online teaching materials for introductory programming.


Energy, sustainability and climate: Data science represents a tremendous opportunity to harness energy-efficiency savings and promote sustainability. The increasingly pervasive deployment of smart meters and the availability of inexpensive sensors generate a vast amount of information about energy generation, use and waste. However, according to Cisco, a third of utility industry managers say bridging the gap between simply collecting data and converting it into meaningful operational insights is a top priority. Applied to energy, data science techniques can identify opportunities for conservation, detect problems as (or before) they arise, and improve overall building performance. In Massachusetts, there is particular interest in the intersection of energy, sustainability and data science. The Commonwealth’s 2015 Data Innovation and Workforce Fund names energy among three key areas of public policy concern where data science can play a critical role.
Shenoy has been instrumental in assisting Holyoke Gas & Electric to increase efficiency with smart grid technology. Sheldon develops state-of-the-art methods for conservation decision-making and sustainability using terabytes of data from weather radar and bird migration. Kurose, Zink and others have developed new sensors and analysis methods that revolutionize our ability to predict tornadoes using collaborative distributed computing.


The industrial collaborations above are a sample of our activities, which have encompassed over $30 million in industrial research grants and joint projects and hundreds of student internships in the past ten years.