
Enhancing Well-Work Efficiency With Data Mining and Predictive Analytics


Failure to prioritize objectives and improper selection of candidate wells can have significant implications for both derived value and potential risk. This paper addresses the business problem of reducing the uncertainty of well-work-program outcomes so that more-informed choices can be made, enhancing the benefits and value of a well-work program. It illustrates the use of data-driven models to estimate key performance indicators for well-work jobs and to predict the likely outcome using predetermined success criteria.


Well work consists of the complete end-to-end business process covering any operation on an oil or gas well during or at the end of its productive life. Well-by-well reviews, with good support information, remain the best way of spotting large amounts of potential production. Data-driven incremental-learning models provide a set of intelligent tools that synthesize large volumes of data and make timely recommendations on the basis of learned historical behaviors and discovered hidden patterns across scattered heterogeneous sources.

This project examined a wide range of data-mining and machine-learning algorithms capable of dealing with large volumes of data, data-quality issues, and restrictive parameter constraints. The resulting model uses variables already available at the planning stage of workover jobs as input to predict the likely outcome of individual jobs. Reducing uncertainty in the decision-making process for the well-work portfolio helps maximize the overall program value and its yield on investment.

The Data-Driven Predictive-Analytics Approach

Data mining is the science of extracting valuable knowledge from large databases. It involves several iterative steps. This study applied an iterative development process, progressing from clarifying business objectives through data preprocessing, fitting of learning models, validation of the models, and performance testing to examine the consistency between model results and domain-expert knowledge.

Data Collection and Database Construction. One of the challenges of manipulating well-work big data was developing a systematic acquisition and assimilation strategy for the relevant input from multiple heterogeneous sources. To overcome the challenges associated with acquiring relevant data, the project adopted an incremental phased approach by selecting one specific asset from the well-work portfolio.

Data Preparation. Errors must first be removed from collected and archived data. Despite careful safeguard measures, there are still incidents where measurement sensors drift or malfunction. Human data entry is another common source of data-quality issues.

For this project, the raw data extraction had to be processed to prepare data for various modeling algorithms and learning schemes. Several methods were applied, including filtering, spectral decomposition, estimation of dynamic characteristics, and missing-data imputation.
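As a minimal illustration of the missing-data-imputation step only, the sketch below fills gaps in a single column with the median of the observed values. The paper does not say which imputation method was used, so the median rule, the function name impute_median, and the pressure readings are all assumptions made for illustration.

```python
from statistics import median

def impute_median(column):
    """Replace None entries with the median of the observed values.

    Median imputation is an assumed stand-in; the source does not name
    the imputation method that was actually applied.
    """
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

# Hypothetical sensor readings with two missing entries.
tubing_pressure = [310.0, None, 295.0, 305.0, None, 300.0]
cleaned = impute_median(tubing_pressure)
```

The median is used here rather than the mean because it is less affected by the kind of sensor drift or data-entry outliers described above.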

Exploratory Analysis Phase. The exploratory analysis phase determines relationships among the variables. Bivariate cross-correlation matrices help to identify causal variables (process inputs) that are potentially good predictors of response variables of interest. This step involves determining the best noise-resilient predictors for the subsequent learning phase.
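The screening idea can be sketched as ranking candidate inputs by the strength of their correlation with a response variable. The attribute names and numbers below are hypothetical, and Pearson correlation is assumed as the bivariate measure; a high correlation marks a candidate predictor and does not by itself establish causation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def rank_predictors(inputs, response):
    """Sort candidate inputs by absolute correlation with the response."""
    scores = {name: pearson(col, response) for name, col in inputs.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical job attributes and response (illustrative numbers only).
inputs = {
    "water_cut": [0.10, 0.25, 0.40, 0.55, 0.70],
    "well_age_years": [3, 12, 5, 9, 7],
}
response = [95, 80, 62, 41, 30]  # e.g., a production-gain indicator
ranking = rank_predictors(inputs, response)
```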

Multilayer perceptron (MLP) artificial neural networks (ANNs) were used to synthesize nonlinear functions that would optimally fit correlated multivariate data. Sensitivity analysis was also performed to quantify the level of interaction between various input variables.

The exploratory analysis phase involved the determination of natural groupings of well-work job attributes. A robust version of the K-means clustering algorithm was applied to find two sets of groupings in the well-work population that distinguished between high-success jobs and low-success jobs.
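The grouping step can be sketched with the plain (Lloyd's) K-means algorithm for two clusters. This is not the robust variant mentioned above, whose details are not given; the two-attribute points and the choice of the first k points as starting centroids are assumptions made to keep the example deterministic.

```python
def kmeans(points, k, max_iters=100):
    """Plain Lloyd's K-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the fit has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical, already-scaled job attributes forming two obvious groups.
jobs = [(1.0, 1.2), (5.0, 5.1), (0.8, 1.0), (1.1, 0.9), (5.2, 4.8), (4.9, 5.3)]
labels, centroids = kmeans(jobs, k=2)
```

In practice the inputs would be scaled to comparable ranges first, because K-means relies on Euclidean distance.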

Predictive-Modeling Development and Calibration. A predictive model is a virtual process that is developed directly from the data created in data preparation and the correlation structures obtained in the exploratory analysis. It is usually necessary to adjust the process to accommodate changes in uncontrolled variables.

Predictive-Modeling Approach. A hybrid approach was adopted for the project that used both unsupervised and supervised learning techniques. Nine different machine-learning algorithms were applied to training data that had been organized into cases (vectors) of associated inputs and outputs. The algorithms synthesize (learn) a generalized mathematical formula that predicts values of outputs for different input values.

Some supervised learning algorithms, such as MLP ANN, are fitted using regressions similar to ones used for statistical models. However, in statistical modeling, the forms of the functions that are fitted are specified (e.g., a linear fit), whereas, in statistical learning, an MLP ANN uses a flexible mathematical structure that is automatically manipulated to provide a numerically optimized fit that is customized to the shape of the data.

Sensitivity Analysis of Various Models. A sensitivity analysis quantifies how much an output will change for a specified change in an input. In a model having multiple inputs, the sensitivity is the slope of the model’s response surface at coordinates defined by the values of the model’s inputs. In a linear model, the sensitivity of the output to a given input is constant; however, in a nonlinear model, such as support vector machines (SVMs) or ANNs, the sensitivity can change with any change in any input. It is commonly a goal of data-mining projects to determine which parameters most influence an output and under what conditions their influence is greatest.

In this project, mean sensitivities were calculated by incrementing each input across its range. Other inputs were fixed at their midranges, and averages were calculated for how much the model’s output changed across all of the increments. Mean sensitivities were normalized so that their absolute values summed to 1.0.
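One possible reading of this procedure is sketched below. The function name, the number of increments, and the toy linear model are assumptions; the paper does not specify the step count or whether the output change was divided by the step size, so the sketch averages the raw output change per increment.

```python
def mean_sensitivities(model, ranges, steps=10):
    """Step each input across its range with the others held at midrange,
    average the change in model output per increment, then normalize so
    that the absolute values sum to 1.0."""
    midrange = {name: (lo + hi) / 2 for name, (lo, hi) in ranges.items()}
    raw = {}
    for name, (lo, hi) in ranges.items():
        width = (hi - lo) / steps
        changes = []
        for i in range(steps):
            before = {**midrange, name: lo + i * width}
            after = {**midrange, name: lo + (i + 1) * width}
            changes.append(model(after) - model(before))
        raw[name] = sum(changes) / steps
    total = sum(abs(v) for v in raw.values())
    # If the model never responds to any input, return the zeros unchanged.
    return {name: v / total for name, v in raw.items()} if total else raw

def toy_model(x):
    """Hypothetical stand-in for a trained model (not from the paper)."""
    return 0.5 * x["rate"] - 4.0 * x["water_cut"]

ranges = {"rate": (0.0, 10.0), "water_cut": (0.0, 1.0)}
sens = mean_sensitivities(toy_model, ranges)
```

Because the changes are signed, the sign of each normalized value shows the direction of influence, while its magnitude shows the relative share. For a strongly non-monotonic model, signed changes can cancel, so averaging absolute changes may be preferable.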

Model Training and Validation. The modeling data set was partitioned into 75% for model training and 25% for model validation. The partition used stratified sampling so that the prevalence ratio of the two classes was preserved in both subsets. A separate testing set was withheld for final assessment and selection of the best model. An automated process was developed for encoding the supervised learning target class outcome for historical well-work jobs. The data partitions were derived directly from the results of the data-preprocessing step.
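A stratified 75/25 partition can be sketched as follows. The function name, the seed, and the class counts are hypothetical; the sketch covers only the training/validation split and leaves out the separately withheld testing set.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.75, seed=0):
    """Return (train, validation) index lists that keep each class's
    share approximately equal in both partitions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, valid = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize within the class before cutting
        cut = round(len(indices) * train_frac)
        train.extend(indices[:cut])
        valid.extend(indices[cut:])
    return sorted(train), sorted(valid)

# Hypothetical outcomes: 1 = least-successful job, 0 = all other jobs.
outcomes = [1] * 20 + [0] * 60
train_idx, valid_idx = stratified_split(outcomes)
```

With 20 jobs in one class and 60 in the other, both partitions keep the same one-in-four prevalence of the minority class.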

To determine the best model for prediction performance and to overcome some of the learning bias inherent in the algorithms, several different models were trained on the attributes of planned jobs to predict the outcome of well work. The first two in the following list are ensemble models of separately trained underlying models, independent of the other seven learned models:

  • Ensemble of weak learners, boosting model
  • Random-forest model
  • Kernel learner, SVM model
  • Decision-trees model
  • Gradient descent model
  • MLP ANN model
  • Logistic regression model
  • Rule-based induction model
  • Nonlinear additive model

The selection of the best model was based on having a minimal misclassification rate and the largest area under the receiver-operating-characteristic curve, or area under curve (AUC). This curve plots the true-positive rate against the false-positive rate at different threshold levels, showing how correct predictions of well-work outcomes trade off against incorrect ones and highlighting the tradeoffs between sensitivity and specificity. Curves close to the blue 45° diagonal line have AUCs near 0.5 and represent models that are less accurate. An AUC value closer to unity is desired. Model comparisons and performance are shown in Fig. 1. The best-performing model strikes a balance between correctly identifying the least-successful well-work jobs and incorrectly assigning other jobs to the “unsuccessful” category.
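For readers who want to see how such a curve and its area are obtained, the sketch below sweeps a decision threshold over a set of model scores and integrates the resulting curve with the trapezoidal rule. The labels and scores are made up for illustration and are not taken from the paper.

```python
def roc_points(y_true, scores):
    """(false-positive rate, true-positive rate) pairs obtained by lowering
    the decision threshold through every distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical outcomes (1 = positive class) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
auc = area_under_curve(curve)
```

In this toy case, eight of the nine positive-negative pairs are ranked correctly, which matches the computed area of 8/9.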

Fig. 1—Receiver-operating-characteristic-curve analysis, assessing performance of competing models.


The accuracy of the top model was characterized by computing the margins. For an ensemble model with two possible outputs (least-efficient well-work job vs. others), the margin measures the degree to which the average number of votes for the correct output exceeds the average votes for the incorrect output. The larger the margin, the greater the confidence in the model.
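For a two-output ensemble, the margin for a single job can be sketched as the fraction of member votes for the correct output minus the fraction for the incorrect one. The five-member vote lists below are hypothetical, and the paper does not state how margins were aggregated across jobs; a simple mean is assumed here.

```python
def vote_margin(votes, true_label):
    """Fraction of ensemble votes for the correct output minus the
    fraction for the incorrect output (two-output case)."""
    correct = sum(1 for v in votes if v == true_label)
    incorrect = len(votes) - correct
    return (correct - incorrect) / len(votes)

def mean_margin(all_votes, true_labels):
    """Average margin over a set of jobs (assumed aggregation)."""
    margins = [vote_margin(v, t) for v, t in zip(all_votes, true_labels)]
    return sum(margins) / len(margins)

# Hypothetical votes from a five-member ensemble for three jobs
# (1 = least-efficient job, 0 = other).
all_votes = [[1, 1, 1, 0, 1], [0, 0, 0, 0, 0], [1, 0, 0, 1, 0]]
true_labels = [1, 0, 1]
margins = [vote_margin(v, t) for v, t in zip(all_votes, true_labels)]
average = mean_margin(all_votes, true_labels)
```

A margin of 1.0 means a unanimous correct vote, and a negative margin (as for the third job) means the ensemble majority was wrong.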

The top-ranked model, which was the one selected, exhibited stable behavior throughout the training, learning, and validation process. This provided an additional level of confidence in the model's robustness.


Data mining and predictive analytics have been applied widely across many disciplines, but they remain a less-exploited capability in the oil and gas industry. This paper shows that these capabilities allow for prediction of outcomes of planned well work with an adequate level of confidence. This is a vital step toward mitigating the high risk associated with major annual expenditures.

This paper demonstrates the value of applying a data-driven predictive-modeling approach to large historical well-work data and predicting the likely outcome for a new planned job vs. predetermined success criteria. For this project, nine different machine-learning algorithms were trained and their performance was tested against well-work history. The competing models’ performance was evaluated on a separate withheld testing set for best fit and prediction accuracy. Overall, the top model achieved 76% accuracy for predicting well-work outcome before job execution. The model relies on available information without introducing any additional overhead to the established process.

This article, written by Special Publications Editor Adam Wilson, contains highlights of paper SPE 167869, “Enhancing Well-Work Efficiency With Data Mining and Predictive Analytics,” by Mohamed Sidahmed, SPE, Eric Ziegel, SPE, Shahryar Shirzadi, SPE, David Stevens, SPE, and Maria Marcano, SPE, BP, prepared for the 2014 SPE Intelligent Energy Conference and Exhibition, Utrecht, The Netherlands, 1–3 April. The paper has not been peer reviewed.