
Do Data-Mining Methods Matter? A Wolfcamp Shale Case Study

Data mining for production optimization in unconventional reservoirs brings together data from multiple sources with varying levels of aggregation, detail, and quality.

Fig. 1: The gross structure of the Delaware basin and central-basin-platform features of the Permian Basin of west Texas. Study wells are bubble mapped.

The objective of this study was to compare and review the relative utility of several univariate and multivariate statistical and machine-learning methods in predicting the production quality of Permian Basin Wolfcamp shale wells. Methods considered were standard univariate and multivariate linear regression and the advanced machine-learning techniques support vector machine, random forests, and boosted regression trees.


In the last few decades, because of the diminishing availability of conventional oil reserves, unconventional reservoirs have fast become a mainstream source of energy resources. At the same time, with advances in data collection, storage, and processing, the oil and gas industry, along with every other technical industry, is experiencing an era of data explosion. One of the more frequently asked questions is, "How can these massive amounts of data from unconventional operations be used to better understand the relationship between operational parameters and well production?" To tackle this problem, raw data usually must be pulled from multiple sources, cleansed, and merged into a joint data set for subsequent analysis.
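The pull-cleanse-merge step described above can be sketched in pandas as follows. The table layouts and column names ("api", "lateral_ft", "oil_bbl") are hypothetical illustrations, not the schema used in the study.

```python
# Sketch of merging per-well header data with monthly production data;
# all column names and values here are hypothetical.
import pandas as pd

headers = pd.DataFrame({"api": [1, 2, 3],
                        "lateral_ft": [4500, 7200, None]})
production = pd.DataFrame({"api": [1, 2, 2, 3],
                           "month": [1, 1, 2, 1],
                           "oil_bbl": [9000, 15000, 14000, 7000]})

# Aggregate the monthly table to one row per well, then join on the well key.
per_well = production.groupby("api", as_index=False)["oil_bbl"].sum()
joint = headers.merge(per_well, on="api", how="inner")

# Basic cleanse step: flag rows with missing predictors for later handling.
joint["has_missing"] = joint.isna().any(axis=1)
```

The resulting joint data set is what the methods below consume; how each method copes with the flagged missing values is a recurring theme of the study.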

Here, a data set from Permian Basin Wolfcamp shale wells is used as a case study to illustrate the implementation of several popular analytic methods in data mining. The study area is within the Delaware basin in west Texas. Fig. 1 (above) is color-contoured on the top of the Wolfcamp and highlights the basin structure in the greater area, with the deep basin in purple and the shallowest Wolfcamp contours in red. Well locations are bubble mapped on top of the contours. The bubble-map color scheme has the best monthly oil production per completed foot of lateral (log 10 scale) shown in red and the poorest wells shown in purple.

The data set contains 476 Wolfcamp shale horizontal wells with production histories from the Permian Basin. Three production metrics serve as target variables in each method: the cumulative oil production of the first 12 producing months in barrels, the maximum monthly oil production of the first 12 producing months in barrels (MMO12), and a derived production efficiency, MMO12 divided by the total lateral length, in barrels per foot.
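For one well, the three target variables can be computed as below; the monthly volumes and lateral length are made-up numbers for illustration only.

```python
import pandas as pd

# Hypothetical oil volumes for one well's first 12 producing months (bbl).
monthly_oil = pd.Series([8000, 12000, 9000, 7000, 6000, 5500,
                         5000, 4600, 4300, 4000, 3800, 3600])
lateral_ft = 7000.0  # hypothetical completed lateral length (ft)

cum12 = monthly_oil.iloc[:12].sum()   # cumulative oil, first 12 months (bbl)
mmo12 = monthly_oil.iloc[:12].max()   # max monthly oil, first 12 months (bbl)
efficiency = mmo12 / lateral_ft       # MMO12 per lateral foot (bbl/ft)
```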

Well-architecture variables included operator, well azimuth, drift angle, true vertical depth in feet, and cumulative lateral length in feet.

Ordinary Least-Squares Regression

The first method applied was multiple linear regression, also known as ordinary least-squares (OLS) regression. A linear relationship was fitted to minimize the difference between the observed and fitted values. The simplicity of the model comes with a set of strong assumptions. Also, the OLS model requires completeness of the data, and all wells with missing variables were omitted during the fit.

To compare with other methods fairly, the missing values were imputed on the basis of similarity of observations. This procedure is described in more detail in the Random Forests section.
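The listwise-deletion behavior of OLS can be sketched as follows with scikit-learn; the predictor names, synthetic data, and injected missing values are assumptions for illustration, not the study's variables.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: 60 hypothetical wells with two predictors.
rng = np.random.default_rng(0)
df = pd.DataFrame({"tvd_ft": rng.uniform(8000, 11000, 60),
                   "lateral_ft": rng.uniform(4000, 9000, 60)})
df["mmo12"] = 0.5 * df["lateral_ft"] - 0.2 * df["tvd_ft"] + rng.normal(0, 100, 60)
df.loc[::10, "tvd_ft"] = np.nan  # inject some missing predictor values

# OLS requires complete rows: wells with missing variables are omitted.
complete = df.dropna()
ols = LinearRegression().fit(complete[["tvd_ft", "lateral_ft"]],
                             complete["mmo12"])
r2 = ols.score(complete[["tvd_ft", "lateral_ft"]], complete["mmo12"])
```

Note how the fit simply loses the incomplete wells; imputing those values first, as described above, keeps all wells in play for the comparison.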

Support Vector Machines

Support vector regression (SVR) is a technique closely related to the use of support vector machines (SVMs), which are widely used in classification tasks. In the case of SVR, the regression function has the form

f(x) = y = Σ_{i=1}^{N} (α_i* – α_i)(v_i^T x + 1)^p + b,     (1)

where v_1, v_2, …, v_N are N support vectors, and b, p, α_i, and α_i* are the parameters of the model. The parameters are optimized with respect to ε-insensitive loss, which considers any prediction within ε of the true value to be a perfect prediction (i.e., zero loss). During the parameter estimation, the support vectors are also selected from the training data set. Because the model is only specified through the dot product v_i^T x of the support vectors and the predictors, the kernel trick can be used. This replaces the dot product v_i^T x with a kernel function K(v_i, x), which can produce highly nonlinear regression fits.

The SVR model was fit using the radial-basis kernel K(v_i, x) = exp(–γ‖v_i – x‖²). As with the OLS model, missing values were imputed before fitting.

Random Forests

Random forests are ensemble machine-learning methods that are constructed by generating a large collection of uncorrelated decision trees based on a random subset of predictor variables and averaging them. Each decision tree can be treated as a flowchart-like predictive model to map the observed variables of subjects to conclusions about the target variable. Trees capture complex interaction structures in the data and have relatively low bias if grown deep. The averaging step in the algorithm provides a remedy to reduce the variance because of the noisy nature of trees. Unlike standard trees, where each node is split at the best choice among all predictor variables, each node in a random forest is split at the best among a subset of predictor variables randomly chosen at that node.

For those wells with incomplete predictors, random forest imputes the missing values on the basis of proximity. The proximity is a measure of similarity of different data points to one another in the form of a symmetric matrix with unity on the diagonal and values between zero and unity off the diagonal. The imputed value is the average of the nonmissing observations weighted by the corresponding proximities. Other than the imputation, the setup of random forest is quite user friendly and involves only two parameters: the number of predictor variables in the random subset at each node and the total number of trees. Random forests and the gradient boosting method are both decision-tree based. Their predictor variables can be any type—numerical or categorical, continuous or discrete. They automatically include interaction among predictor variables in the model because of the hierarchical structure of trees. They are also insensitive to skewed distributions, outliers, and missing values.
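A random-forest fit with the two parameters named above, plus a rough sketch of the proximity measure, might look like the following. scikit-learn does not expose proximities directly, so the leaf co-occurrence computation here is an assumed approximation of the measure described; the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 100 samples, 5 predictors.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X[:, 0] * 2.0 + X[:, 1] + rng.normal(0, 0.1, 100)

# The two main tuning knobs named above: the size of the random predictor
# subset considered at each split, and the total number of trees.
rf = RandomForestRegressor(n_estimators=300, max_features=2,
                           random_state=0).fit(X, y)

# Proximity of samples i and j: fraction of trees in which they land in
# the same terminal node. Unity on the diagonal, symmetric off it.
leaves = rf.apply(X)  # shape (n_samples, n_trees): leaf index per tree
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```

An imputed value for a well with a missing predictor would then be the proximity-weighted average of the nonmissing observations, as described above.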

Gradient Boosting Model

The idea of boosting is to gain prediction power from a large collection of simple models (decision trees) instead of building a single complex model. The trees in a gradient boosting model (GBM) are constructed sequentially in a stagewise fashion, in contrast to the parallel construction in a random-forest model. Each tree is introduced to compensate for the residuals, or shortcomings, of the previous trees, where the shortcomings are identified by the negative gradients of a squared-error loss function. The final model can be considered a linear-regression model with thousands of terms, where each term is a tree.

The missing-value issue is handled in GBMs with the construction of surrogate splits. The key step in a tree model is to choose which predictor to split on (and where) at each node. Nonmissing observations are used to identify the primary split, including the predictor and the optimal split point, and then a list of surrogate splits is formed that tries to mimic the split of the data by the primary split. Surrogate splits are stored with the nodes and serve as a backup plan in the event of missing data. If the primary splitting predictor is missing in modeling or prediction, the surrogate splits are used in order.
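The stagewise residual-fitting idea can be made concrete with a hand-rolled sketch: under squared-error loss, the negative gradient is simply the residual, so each shallow tree is fit to the current residuals. This is a minimal illustration on synthetic data, not the study's GBM implementation (and scikit-learn's trees do not build the surrogate splits described above).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic smooth target to boost against.
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * X.ravel()

# Stagewise boosting with squared-error loss: each shallow tree fits the
# residuals (negative gradients) of the ensemble built so far.
learning_rate, trees = 0.1, []
pred = np.full_like(y, y.mean())
for _ in range(200):
    residual = y - pred  # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)

mse = float(np.mean((y - pred) ** 2))
```

The final predictor is the sum of 200 small trees scaled by the learning rate, i.e., the "linear-regression model with thousands of terms, where each term is a tree" described above.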

Conclusions and Discussions

Several methods were implemented on the Wolfcamp data set—OLS, SVM, random forest, and GBM. These methods have different tolerances for missing values, one of the more common issues in real-world data sets. OLS and SVM simply omit the data points with missing values. Random forest imputes missing values on the basis of the proximity of the data points, and GBM creates surrogate splits as a workaround in the event of missing data. The two tree-based methods are more resistant to missing values.

In terms of overall fit on the full data measured by average absolute error and mean squared error, random forest performed best among these methods, closely followed by GBM. SVM showed moderate improvement over OLS, which served as a benchmark model.
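The two error metrics used for the comparison are defined as below; the prediction vectors are hypothetical illustrations, not values from the study.

```python
import numpy as np

# Hypothetical true values and predictions from two models (bbl/ft).
y_true = np.array([100.0, 250.0, 400.0, 175.0])
y_rf   = np.array([110.0, 240.0, 390.0, 180.0])
y_ols  = np.array([150.0, 200.0, 350.0, 130.0])

def mae(y, yhat):
    """Average absolute error."""
    return float(np.mean(np.abs(y - yhat)))

def mse(y, yhat):
    """Mean squared error."""
    return float(np.mean((y - yhat) ** 2))
```

Ranking models by both metrics, as the study does, guards against a ranking driven by a few large errors, since MSE penalizes outliers much more heavily than MAE.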

In each model, the predictor variables contribute at different levels to predicting the target variable. To evaluate variable importance in OLS and SVM and compare the rankings with those from random forest and GBM, a metric based on the coefficient-of-determination (R²) measure of goodness of fit was adopted. Each variable was removed from the OLS and SVM models one at a time. The R² of the reduced model was then compared with that of the full model, and the difference, or R² loss, served as a measure of that variable's importance.
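The drop-one R²-loss procedure can be sketched as follows for the OLS case; the predictor names and the synthetic data (in which only lateral length carries signal) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: only lateral_ft drives the target here, by construction.
rng = np.random.default_rng(4)
df = pd.DataFrame({"lateral_ft": rng.uniform(4000, 9000, 80),
                   "tvd_ft": rng.uniform(8000, 11000, 80),
                   "azimuth": rng.uniform(0, 360, 80)})
df["mmo12"] = 0.5 * df["lateral_ft"] + rng.normal(0, 200, 80)

predictors = ["lateral_ft", "tvd_ft", "azimuth"]
full = LinearRegression().fit(df[predictors], df["mmo12"])
full_r2 = full.score(df[predictors], df["mmo12"])

# R2 loss: refit with each predictor dropped in turn; record the fall in R2.
r2_loss = {}
for p in predictors:
    reduced = [q for q in predictors if q != p]
    fit = LinearRegression().fit(df[reduced], df["mmo12"])
    r2_loss[p] = full_r2 - fit.score(df[reduced], df["mmo12"])
```

A large R² loss means the model's fit collapses without that variable, marking it as important; an unimportant variable costs essentially nothing when dropped.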

Do data-mining methods matter? The short answer is yes. In this Wolfcamp case study, the tree-based methods, random forest and GBM, required less preprocessing of the raw data. They were also more resistant to common data-quality issues and provided better predictions than the other methods.

This article, written by Special Publications Editor Adam Wilson, contains highlights of paper SPE 173334, “Do Data-Mining Methods Matter? A Wolfcamp Shale Case Study,” by Ming Zhong, SPE, Baker Hughes; Jared Schuetter and Srikanta Mishra, SPE, Battelle Memorial Institute; and Randy F. LaFollette, SPE, Baker Hughes, prepared for the 2015 SPE Hydraulic Fracturing Technology Conference, The Woodlands, Texas, USA, 3–5 February. The paper has not been peer reviewed.