Data & Analytics

Data-Driven Methods Present Potential for Success

Data-driven methods offer significant advantages in the industry, under certain conditions, over conventional methods. But reservations still exist about their use. The paper serves to bridge the gap between unclear understanding of these methods and their successful acceptance and implementation.


The advent of big data represents a major development in the industry, but the staggering amounts of data generated by recent methods are often not used efficiently or effectively. The increasing complexity of oil and gas systems, a result of nonlinearity, high levels of uncertainty, and the large volumes of data generated, calls for more-sophisticated models that can turn raw data into actionable knowledge and represent the complex relationships among the system-state (input, internal, and output) variables.

Data-driven methods and computational intelligence are being used increasingly as a complement to or replacement for physics-based models such as numerical reservoir modeling and simulation. Although data-driven methods are effective in solving critical problems in the industry, design challenges exist that could create extra problems and could even render the model useless. Some focal points need to be described clearly, such as the optimum number of layers to solve a problem, the number of units needed for each layer, the generalization ability of the algorithm, and the boundaries of a training set to handle the problem successfully. When data are applied blindly, especially when sufficient data about the problem do not exist or when the modeled system is not stable during the period covered by the model, the risk of algorithmic bias exists.

Data mining, a major component of data-driven methods, can offer significant benefits to the industry. Data mining, in its most fundamental form, is defined as extracting interesting, nontrivial, implicit, previously unknown, and potentially useful information from data. Data mining integrates machine-learning and pattern-recognition algorithms with statistical and visualization techniques to convert data into meaningful and comprehensible knowledge. Data-driven modeling (DDM) provides the methodology for analyzing and discovering the relationships among the system-state variables without the explicit knowledge of the physical behavior of the system. DDM focuses on two empirical models:

  • Computational intelligence, which involves artificial neural networks (ANNs), fuzzy rules-based systems, and genetic algorithms

  • Machine-learning models based on the theoretical foundations used by computational intelligence

In a typical data-analysis cycle, data collection and management is the first step. In this step, data are collected from various sources in different forms and prepared for analysis. Next, exploratory data analysis (EDA) is used to decide which variables the analysis requires and to explore their importance and the connections among them. EDA techniques have been adopted into data-mining processes designed for analyzing large amounts of data.

Data-driven techniques in the oil and gas industry are used in a range of applications, including subsurface characterization and petrophysics, drilling, production, reservoir studies, enhanced oil recovery, facilities management, and pipelines.

In most of these methods, the link between raw data and the generated knowledge is hidden, so transforming data into valuable insights is a challenge. Moreover, it can be difficult to implement only one data-driven method to capture the behavior of the whole system. To overcome these issues, the contemporary trend suggests using hybrid models, a combination of different data- or physics-driven methods, to generate a single solution.

DDM Techniques

Linear Regression. An important modeling technique is linear regression, which can model the relationship between a dependent variable and one or more explanatory (predictor) variables.
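
As a minimal sketch of the idea, the snippet below fits a straight line to one predictor by ordinary least squares. The function name fit_line and the depth and pressure values are hypothetical illustrations, not data or code from the paper.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    # Covariance-like and variance-like sums about the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical illustration data (assumed, not from the paper)
depth = [1.0, 2.0, 3.0, 4.0, 5.0]
pressure = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(depth, pressure)
print(f"pressure ~ {slope:.2f} * depth + {intercept:.2f}")
```

With more than one explanatory variable, the same least-squares principle applies, but the fit is usually solved as a matrix problem rather than with these scalar formulas.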

Principal Component Analysis. Most machine-learning and data-mining techniques may not be effective for high-dimensional data. Therefore, some preprocessing is required to reduce the dimensions of data and deal with what is commonly known as the “curse of dimensionality.” Dimensionality-reduction techniques work by identifying an optimal subset of attributes or features according to an objective function or by reducing the number of features by creating linear combinations of the original attributes. Principal component analysis reduces the dimensions of the data by performing a linear transformation on the data by rotating the feature space, so that the data lie along the directions of maximum variation.
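
The rotation described above can be illustrated for the two-feature case, where the direction of maximum variation has a closed form. This is a sketch under that simplifying assumption; the function name pca_2d and the sample points are hypothetical, and real applications with many features would normally rely on a numerical eigenvalue or singular-value routine instead.

```python
import math

def pca_2d(points):
    """Return the first principal direction and the 1D scores for 2D data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 sample covariance matrix
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Rotation angle of the axis with the largest variance
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # Project each centered point onto that axis (2 features -> 1 feature)
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores

# Hypothetical points lying exactly on the line y = 2x
direction, scores = pca_2d([(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
print(direction, scores)
```

Because the sample points lie on a single line, one component captures all of the variation and the second feature can be dropped without losing information.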

Decision Trees. This common data-analysis tool is applied widely in the software, pharmaceutical, and legal industries. Decision trees are also one of the most visually effective and understandable data-mining techniques. A decision-tree algorithm is a supervised classification and prediction technique that relies on an attribute or a group of attributes to create models of the data.
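
To make the attribute-based splitting concrete, the sketch below finds the best single split on one attribute, which is the building block a tree repeats at every node. Gini impurity is assumed here as the split criterion (the paper does not name one), and the porosity values and class labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split on one attribute."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical attribute (porosity fraction) and class labels
porosity = [0.05, 0.08, 0.12, 0.18, 0.22, 0.25]
rock = ["tight", "tight", "tight", "pay", "pay", "pay"]
print(best_split(porosity, rock))
```

A full tree would apply best_split recursively to each resulting subset and across all available attributes until the nodes are pure or a depth limit is reached.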

Support Vector Machine (SVM). This is an important machine-learning model used for supervised classification of the data. SVM as an algorithm can be viewed as a clustering tool with features of decision trees. SVM is primarily used for classification but is also used for regression.
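
For readers who want to see the classification use in code, the following is a rough sketch of a linear SVM trained by subgradient descent on the regularized hinge loss. This simple trainer is an assumption chosen for brevity and is not the method of the paper; production SVM libraries use more refined solvers. The function names, hyperparameters, and data points are hypothetical.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, lam=0.01):
    """Subgradient descent on the regularized hinge loss (labels are +1 or -1)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink the weights slightly
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Point is misclassified or inside the margin: move the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Classify x by the side of the separating hyperplane it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable training data
points = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(points, labels)
print([predict(w, b, p) for p in points])
```

Only the points that violate the margin trigger a weight update, which reflects the general property that the decision boundary is determined by the samples closest to it.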

ANNs. ANNs are a rough approximation and simplified simulation of biological neural networks. ANNs are capable of developing transformations, associations, and mappings between data. Neural networks are very effective in handling nonlinear relations in data and can perform well in extremely complex functions. ANNs have become the preferred solution for a wide range of problems in the upstream and downstream industries. Neural networks are used when the problem is too complicated for mathematical modeling or when there are missing data.
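
A small sketch can show why layered networks handle nonlinear relations. The two-layer network below reproduces the exclusive-or (XOR) relation, which no single linear model can represent. The weights are hand-picked for illustration, not trained, and all names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hand-picked (assumed) weights: hidden unit 1 acts like OR, hidden unit 2
# like NAND, and the output unit combines them like AND, which yields XOR.
HIDDEN_W = [[20.0, 20.0], [-20.0, -20.0]]
HIDDEN_B = [-10.0, 30.0]
OUTPUT_W = [[20.0, 20.0]]
OUTPUT_B = [-30.0]

def forward(x1, x2):
    """Forward pass through the hidden layer and the output layer."""
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(forward(a, b)))
```

In practice the weights are learned from data by a training algorithm such as backpropagation rather than set by hand.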

Fuzzy-Rule-Based Systems. A fuzzy-logic model consists of fuzzy sets, which are formed by the functions of approximate reasoning as well as nonstatistical uncertainty. Uncertainties are mainly caused by information insufficiency and can be described as the product of excursiveness. The duty of a fuzzy-logic system is to model the uncertainty that creates complexity and imprecision. By contrast, the outcome of an event in a random process depends strongly on chance, and probability theory is an effective tool for handling the problem when uncertainty is the product of the randomness of events.

Fuzzy-set theory is a popular tool in the industry. Hydrocarbon reservoirs are complex systems with a high percentage of uncertainties. Fuzzy-set theory is an effective tool that describes the kind of uncertainty associated with vagueness, imprecision, and a lack of information that can affect these systems.
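
One way to picture a fuzzy set is through its membership function, which assigns each value a degree of belonging between 0 and 1 instead of a hard yes or no. The sketch below uses triangular membership functions; the set names and the porosity breakpoints (in percent) are hypothetical and are not taken from the paper.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at or outside a and c, rising to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical fuzzy sets for porosity, in percent
POROSITY_SETS = {
    "low": (0, 5, 15),
    "medium": (5, 15, 25),
    "high": (15, 25, 35),
}

def fuzzify(porosity_pct):
    """Degree of membership of one measurement in every fuzzy set."""
    return {
        name: triangular(porosity_pct, *abc)
        for name, abc in POROSITY_SETS.items()
    }

print(fuzzify(10))  # {'low': 0.5, 'medium': 0.5, 'high': 0.0}
```

A porosity of 10% belongs partly to "low" and partly to "medium," which is how a fuzzy system expresses vagueness that a crisp cutoff would hide.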

Genetic Algorithm (GA). A GA is a heuristic search technique that tries to emulate natural evolution to solve an optimization problem. A GA mechanism generally consists of steps such as initialization, assignment, selection, crossover, and mutation. The algorithm starts by randomly generating a population of chromosomes, each of which represents a solution. These initial chromosomes create offspring through crossover and mutation, as shown in Fig. 1. In the next step, using a fitness function, the best chromosomes are detected from the generated offspring. This process continues iteratively until the best set of chromosomes that satisfy a threshold is created. GAs are able to deal with complex, multidimensional, and discontinuous problems. Because GAs do not check every possible combination of genes, they can provide a fast and accurate solution for problems with a great deal of data.

Fig. 1—Generating offspring chromosomes by crossover and mutation.
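
The GA steps described above can be sketched on a toy problem. The example below maximizes the number of 1s in a bit string; that objective, the population settings, the truncation selection, and the practice of carrying the best chromosome forward unchanged (elitism) are all assumptions made for illustration and are not details from the paper.

```python
import random

def fitness(chromosome):
    """Toy fitness: count of 1 genes (assumed objective for illustration)."""
    return sum(chromosome)

def crossover(a, b, rng):
    """Single-point crossover: head of parent a joined to tail of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chromosome, rate, rng):
    """Flip each gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def run_ga(n_genes=20, pop_size=30, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    # Initialization: random population of candidate solutions
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Fitness assignment and ranking
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        if history[-1] == n_genes:  # threshold reached
            break
        # Selection: the fitter half become parents
        parents = population[: pop_size // 2]
        children = [population[0]]  # elitism keeps the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rate, rng))
        population = children
    return max(population, key=fitness), history

best, history = run_ga()
print(fitness(best), history)
```

Because the best chromosome is always carried forward, the best fitness recorded in history never decreases from one generation to the next.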


Bayesian Belief Network (BBN). BBNs represent supervised classification and regression algorithms that have a structure similar to that of decision trees. However, the growth of the network is driven by the best a priori probability distribution. A BBN is based on the principle of conditional probability and cause/effect relationships between multiple random variables. A BBN is a probabilistic graphical model consisting of a set of nodes and connections (edges) that represents a decomposition of a large probabilistic domain into weakly connected subsets by conditional independence. The network is usually predesigned by the user, and the system then learns the classification according to Bayes’ theorem.
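
A very small network illustrates how conditional probability drives the inference. The sketch below assumes one cause node with several effect nodes that are conditionally independent given the cause, and applies Bayes' theorem to update belief in the cause. The node names and all probabilities are hypothetical.

```python
def posterior(prior, likelihoods, evidence):
    """P(cause is true | evidence) for a network cause -> each evidence node.

    likelihoods[node] = (P(node true | cause true), P(node true | cause false))
    evidence[node]    = observed truth value of that node
    """
    joint_true = prior          # running P(cause true, evidence so far)
    joint_false = 1.0 - prior   # running P(cause false, evidence so far)
    for node, observed in evidence.items():
        p_true, p_false = likelihoods[node]
        joint_true *= p_true if observed else 1.0 - p_true
        joint_false *= p_false if observed else 1.0 - p_false
    # Bayes' theorem: normalize over both states of the cause
    return joint_true / (joint_true + joint_false)

# Hypothetical network: a pump fault influences two sensor alarms
likelihoods = {
    "pressure_alarm": (0.9, 0.2),
    "vibration_alarm": (0.8, 0.1),
}
print(posterior(0.1, likelihoods, {"pressure_alarm": True}))
print(posterior(0.1, likelihoods,
                {"pressure_alarm": True, "vibration_alarm": True}))
```

With these assumed numbers, one alarm raises the belief in a fault from 0.1 to about 0.33, and two alarms raise it to about 0.8, which shows how evidence from weakly connected nodes accumulates.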

Acceptance of Data-Driven Methods in the Industry

The oil and gas industry is one of the major sensor-based industries, dependent upon many data-collecting sensors installed downhole and on the surface. Companies monitor their assets closely to calculate their reservoir production as well as to predict future performance of their reservoirs. As a result, the petroleum industry has to deal with considerably large volumes of structured and unstructured data from various sources. Despite the numerous advantages that data-driven techniques can offer the industry, many companies are still reluctant to adopt these technologies.

Data-driven models are used increasingly in the industry to analyze structured and unstructured data, in particular to find connections between the input and output state variables without explicit information about the physical behavior of the system. More companies are realizing that they can use these available data to optimize their overall performance in different areas, such as increasing the production capacity of the reservoir, forecasting extreme events, or simulating fluid flow.

This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 190812, “Status of Data-Driven Methods and Their Applications in the Oil and Gas Industry,” by Karthik Balaji, SPE, and Minou Rabiei, SPE, University of North Dakota; Hakan Canbaz, Schlumberger; Zinyat Agharzeyva, Texas A&M University; Suleyman Tek, University of the Incarnate Word; Ummugul Bulut, Texas A&M University–San Antonio; and Cenk Temizel, SPE, Aera Energy, prepared for the 2018 EAGE Annual Conference and Exhibition/SPE Europec, Copenhagen, Denmark, 11–14 June. The paper has not been peer reviewed.