
Ten Research Challenge Areas in Data Science



To drive progress in the field of data science, the authors propose 10 challenge areas for the research community to pursue. Because data science is broad, with methods drawing from computer science, statistics, and other disciplines and with applications appearing in all sectors, these challenge areas speak to the breadth of issues spanning science, technology, and society. The authors preface their enumeration with metaquestions about whether data science is a discipline. They then describe each of the 10 challenge areas. The goal of this article is to start a discussion on what could constitute a basis for a research agenda in data science, while recognizing that the field of data science is still evolving.

Although data science builds on knowledge from computer science, engineering, mathematics, statistics, and other disciplines, data science is a unique field with many mysteries to unlock: fundamental scientific questions and pressing problems of societal importance.

This article enumerates 10 areas of research in which to make progress to advance the field of data science. The goal is to start a discussion on what could constitute a basis for a research agenda in data science, while recognizing that the field of data science is still evolving.

Ten Research Areas

What are the research challenge areas that drive the study of data science? Here is a list of 10. They are not in any priority order, and some of them are related to each other. They are phrased as challenge areas, not challenge questions; each area suggests many questions. They are not necessarily the top 10, but they are a good 10 to start the community discussing what a broad research agenda for data science might look like.

1. Scientific Understanding of Learning, Especially Deep Learning Algorithms

As much as we admire the astonishing successes of deep learning, we still lack a scientific understanding of why deep learning works so well, though we are making headway.

2. Causal Reasoning

Machine learning is a powerful tool to find patterns and to examine associations and correlations, particularly in large data sets. While the adoption of machine learning has opened many fruitful areas of research in economics, social science, public health, and medicine, these fields require methods that move beyond correlational analyses and can tackle causal questions. A rich and growing area of current study is revisiting causal inference in the presence of large amounts of data.
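The gap between correlation and causation can be illustrated with a small simulation. The sketch below is not taken from the article; it assumes a hypothetical setup in which a hidden confounder z drives both x and y, while x has no effect on y at all. The names `simulate` and `pearson` and the Gaussian noise model are illustrative assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, intervene, rng):
    """Draw n samples of (x, y); y never depends on x (assumed toy model)."""
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)              # hidden confounder
        if intervene:
            x = rng.gauss(0, 1)          # x assigned at random, cut off from z
        else:
            x = z + rng.gauss(0, 1)      # x observed "in the wild", driven by z
        y = z + rng.gauss(0, 1)          # y is driven by z only
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(0)
observational_corr = pearson(*simulate(20000, False, rng))
interventional_corr = pearson(*simulate(20000, True, rng))
print(f"observational:  {observational_corr:.2f}")
print(f"interventional: {interventional_corr:.2f}")
```

In the observational data, x and y are clearly correlated (about 0.5 in expectation under this model) even though x does not cause y. When x is assigned at random, the correlation falls to roughly zero. A purely correlational analysis cannot distinguish these two situations.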

3. Precious Data

Data can be precious for one of three reasons: the data set is expensive to collect; the data set contains a rare event (low signal-to-noise ratio); or the data set is artisanal (small, task-specific, or targeted at a limited audience).

4. Multiple, Heterogeneous Data Sources

For some problems, we can collect lots of data from different data sources to improve our models and to increase knowledge.

5. Inferring From Noisy or Incomplete Data

The real world is messy, and we often do not have complete information about every data point. Yet data scientists want to build models from such data to do prediction and inference. This long-standing problem in statistics comes to the fore as (1) the volume of data, especially about people, that we can generate and collect grows unboundedly; (2) the means of generating and collecting data are not under our control (for example, data from mobile phone and web apps vary, by design, across different users and across different populations); and (3) many sectors, from finance to retail to transportation, embrace the desire to do real-time personalization.

6. Trustworthy AI

We have seen rapid deployment of systems using artificial intelligence (AI) and machine learning in critical domains such as autonomous vehicles, criminal justice, health care, hiring, housing, human resource management, law enforcement, and public safety, where decisions taken by AI agents directly impact human lives. Consequently, there is increasing concern about whether these decisions can be trusted to be correct, fair, ethical, interpretable, private, reliable, robust, safe, and secure, especially under adversarial attacks.

7. Computing Systems for Data-Intensive Applications

Traditional designs of computing systems have focused on computational speed and power: the more cycles, the faster the application can run. Today, the primary focus of applications, especially in the sciences, is data. Novel special-purpose processors are now commonly found in large data centers.

8. Automating Front-End Stages of the Data Life Cycle

Although the excitement in data science is due largely to the successes of machine learning, and more specifically deep learning, the data must be prepared for analysis before any machine learning algorithm can be used. The early stages in the data life cycle are still labor intensive and tedious. Data scientists, drawing on both computational and statistical tools, need to devise automated methods that address data collection, data cleaning, and data wrangling, without losing other desired properties.

9. Privacy

For many applications, the more data we have, the better the model we can build. One way to get more data is to share data: for example, multiple parties can pool their individual data sets to collectively build a better model than any one party could build alone. However, in many cases, due to regulation or privacy concerns, we need to preserve the confidentiality of each party’s data set.
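One family of techniques for pooling data without revealing it is secure multiparty computation. The sketch below is an illustrative assumption, not a method named in the article: it uses additive secret sharing so that several parties can learn the sum of their private values without any party seeing another's input. The function names, the modulus, and the example counts are hypothetical, and the sketch ignores networking, collusion, and malicious parties.

```python
import secrets

# Assumed modulus; it only needs to exceed the largest possible true total.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split a non-negative value into n_parties shares that sum to it mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate all parties in one process and return the sum of their values."""
    n = len(private_values)
    # Party i splits its value into n shares; share j is sent to party j.
    all_shares = [make_shares(v, n, modulus) for v in private_values]
    # Party j adds the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # The published partial sums add up to the true total.
    return sum(partial_sums) % modulus


# Hypothetical example: three parties each hold a private count.
counts = [120, 45, 310]
total = secure_sum(counts)
print(total)
```

Each individual share is a uniformly random number, so a single share reveals nothing about the value it came from; only the combination of all partial sums yields the total. Building a full model this way requires considerably more machinery, but the same principle applies.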

10. Ethics

Data science raises new ethical issues. They can be framed along three axes: (1) the ethics of data: how data are generated, recorded, and shared; (2) the ethics of algorithms: how artificial intelligence, machine learning, and robots interpret data; and (3) the ethics of practices: devising responsible innovation and professional codes to guide this emerging science and to define institutional review board criteria and processes specific to data.
