Data management

Proposed Framework Normalizes Completions Tags

Murphy Oil has created a work flow to normalize the tags it uses when collecting data on its hydraulic fracturing stages. The work flow described here empowers decision makers, who no longer wait for hours to collect data or waste hours cleaning and preparing data for analysis.


Molly Smith, General Manager for Drilling and Completions; Sarah Carr, Senior Completions Engineer; and Huzeifa Ismail, Data Scientist, Murphy Exploration and Production

Murphy Exploration and Production Company’s onshore wells are completed through multistage horizontal hydraulic fracturing. A typical well will have anywhere from 20 to 40 stages. A completions frac data set is composed of one Excel file for each stage of a well. Each stage file consists of 1-second time series data for approximately 150 parameters from the frac van. To date, more than 18,000 stages have been executed, resulting in more than 2.5 million tags (i.e., measurement or calculation variable names or parameters).

The data quality within each stage file is inconsistent. The same tags are referenced differently across files, causing large variances in the data set and making it impossible to run meaningful analytics on the data. This article discusses how a combination of a no-SQL historian (Ruiz et al. 2019), machine learning using fuzzy string matching with the fuzzywuzzy library (Ragkhitwetsagul et al. 2018), and in-house app development was used to normalize the more than 100,000 tags into 58 unique ones.


To further its commitment to operational efficiency and value optimization in oil and gas exploration and production operations, Murphy Exploration and Production, a subsidiary of Murphy Oil (NYSE: MUR), has prioritized data as a key asset. Several data capture solutions (Ruiz et al. 2019) have been leveraged in the field to increase productivity by providing real-time completions data to the engineers in the office (Fig. 1) because completions account for approximately 60% of capital expenditures for an unconventional well. While data capture is only the start, using machine learning and analytics on this data to achieve real-time optimization is the goal. To do this, one must have accurate, quality, detailed information about operations.

MurphyTags Fig1.jpg
Fig. 1—Field operators on site during drilling and completions operations. (Click to enlarge.)
MurphyTags Fig2.jpg
Fig. 2—Completions hydraulic fracturing spread at a Murphy site. (Click to enlarge.)

Completions hydraulic fracturing (Fig. 2) is a prime example of a set of data where the need to capture, clean, and analyze was identified. The company saw the opportunity to clean up more than a decade of onshore completions data and put a process in place to standardize the capture of data going forward. With clean, quality data, it could build a custom, in-house platform to analyze historical data, optimize operations in real time, and link the completions data with other data sources, including geology, drilling, reservoir, and production.

Currently, there is an abundance of data coming from most of the surveillance environments and applications. Identification and filtering of responsive messages from this big data ocean and then processing these informative data sets to gain knowledge are the two real challenges in today’s applications (Mujitha et al. 2015).

Murphy’s onshore wells are completed through multistage horizontal hydraulic fracturing. A typical well will have anywhere from 20 to 40 stages. A completions frac data set is composed of one Excel file for each stage of a well. Each stage file consists of 1-second time-series data for approximately 150 parameters from the frac van. To put this in perspective, on a well with 30 stages, data for 4,500 parameters is captured at a time-series level and stored in a data historian. To date, more than 18,000 stages have been executed, resulting in more than 2.5 million tags (i.e., variable names or parameters).

Murphy has saved the historical stage Excel files in individual well folders on a server. These files were transitioned to a no-SQL historian. Still, no analytics could be performed on the data because of the inconsistent nature of the data in the source files. Each file from frac service providers would consist of a similar data parameter set, but the parameter tag names within files would vary. Murphy has worked with at least nine different pressure pumping companies during the past 10 years, resulting in variability from company to company. More surprising was the inconsistency within a given pressure pumping company; even within a specific well, the data parameter naming conventions could change based on the site-specific engineer who supported that stage. For example, the data for one stage would have an annulus pressure reading under the tag “Backside Pressure,” and the next files could have the same parameter under the tags “Annulus 1” or “Backside.”

Correct and consistent tags are crucial for analyzing historical and real-time data and for applying machine learning or other advanced analytics. Murphy has grown a historian of more than a million tags of just completions data, with 3,526 unique variations in the tag names of the same standard parameters. This makes the task of looking for relevant pieces of information in such a huge record tedious, like looking for a needle in a haystack.

To tackle this issue, several commercial and nonprofit tag normalization solutions were evaluated. However, the nomenclature and schema recommended by these services did not align with the internal data framework required, so an in-house solution using Python fuzzywuzzy machine-learning logic was architected and deployed instead. This solution condensed the 3,526 tags to 58 tags.

Materials and Methods

Several mathematical tools were considered for normalizing these tags, including rough set theory (RST) (Nowak; Zhang et al. 2016; Pawlak 1997), natural language processing (NLP) (Li and Liu 2015), part-of-speech (POS) tagging (Damnati et al. 2018; Li and Liu 2015), and fuzzy logic (Ragkhitwetsagul et al. 2018). Each tool had its own set of advantages and disadvantages. NLP and POS tagging were ruled out as viable options because, even though they can be used to classify text, they are typically stronger tools for automating repetitive tasks than for normalization. RST and fuzzy logic, on the other hand, are very similar; both are primarily used as tag-normalization tools. However, RST works better on crisp sets, in which an element either is or is not part of a well-defined set, while fuzzy logic applies to fuzzy sets, which allow an element to be partially in a set. Because of the high variability of the completions data set, fuzzy logic was the ideal choice.

Fuzzy Logic

Fuzzy logic is a multivalued logic that deals with approximation rather than fixed or exact results; its truth values range between 0 and 1. Because it tends to reflect how people think and attempts to model their decisions, fuzzy logic is increasingly used in expert systems. It allows one to construct an approximate model and make decisions on the basis of that model. Here, fuzzy logic is applied through fuzzy string matching, also known as approximate string matching, which is the process of finding strings that approximately match a pattern.

All the frac data was stored in the form of strings and needed to be normalized into categories. For this, a Python library called fuzzywuzzy was used. Fuzzywuzzy uses the Levenshtein distance (Yujian and Bo 2007) to calculate the differences between sequences and the patterns that were developed. This helped create categories that the algorithm could use as sets, into which the tags were normalized.
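To illustrate the underlying metric, the Levenshtein distance and a normalized 0-to-100 similarity score can be sketched in a few lines of pure Python. This is a simplified stand-in for fuzzywuzzy's scorers, not Murphy's production code, and the `similarity` normalization shown is one common convention among several:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> int:
    """Rough 0-100 similarity score built on the Levenshtein distance,
    after case-folding and trimming whitespace."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 100
    return round(100 * (1 - levenshtein(a, b) / max(len(a), len(b))))
```

With a scorer like this, near-variants such as "Backside Pressure" and "Backside" score well above unrelated tag pairs, which is what lets a threshold separate matches from non-matches.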


A work flow was established for completion tags (Fig. 3). First, completion files were received and transitioned to the historian. A fuzzy machine learning script then processed the newly uploaded data and referenced its library of all known variations of a given tag (native tags) to find a match. If a match existed, the script automatically assigned the normalized and classified tag to the newly uploaded tag based on the normalization logic devised by Murphy. If a match did not exist, the native tag filtered through to the normalization app, where it was normalized manually. The previously unclassified native tag was then added to the machine learning library to match any similar tag names in the future. The manual workload shrinks over time as variations in native tags are gradually captured in the machine learning library and, hence, normalized automatically upon processing in real time.
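The matching step of this work flow can be sketched as follows. For a self-contained example, Python's standard-library difflib stands in for fuzzywuzzy, and the library contents, tag names, and cutoff value are illustrative, not Murphy's actual library or threshold:

```python
from difflib import get_close_matches

# Hypothetical machine learning library: known native-tag variants
# mapped to their normalized tag (contents are illustrative).
LIBRARY = {
    "backside pressure": "ANN_PRES",
    "annulus 1": "ANN_PRES",
    "backside": "ANN_PRES",
    "wellhead proppant conc": "WH-PROP_CONC",
}

def normalize(native_tag: str, cutoff: float = 0.85):
    """Return the normalized tag, or None to route to manual review."""
    key = native_tag.lower().strip()
    if key in LIBRARY:                      # exact match in the library
        return LIBRARY[key]
    close = get_close_matches(key, LIBRARY, n=1, cutoff=cutoff)
    return LIBRARY[close[0]] if close else None

# Unmatched tags queue up for the normalization app.
manual_queue = []
for tag in ["Backside  Pressure", "Annulus 1", "Slurry Rate"]:
    if normalize(tag) is None:
        manual_queue.append(tag)
```

Once a queued tag is normalized manually in the app, adding it to `LIBRARY` makes every future occurrence match automatically, which is the mechanism by which the manual workload shrinks over time.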

MurphyTags Fig3.jpg
Fig. 3—Process work flow to normalize completions data and obtain clean data.

Another challenge during this initiative was ensuring all historical completions files were received from the vendors to have a complete database of completions data. Fig. 4 shows a dashboard that was used to track how many files were received from the vendors, how many more were missing, and whether the files were uploaded to the historian. Both the data collection of missing files and normalization of uploaded files occurred in parallel.

MurphyTags Fig4.jpg
Fig. 4—Dashboard used to keep track of completion files received from vendors and metrics on how many files have been uploaded to the historian broken down by vendor over the years. (Click to enlarge.)

Tag normalization essentially is converting an unnormalized tag, or a native tag, into a standard nomenclature that can be databased and queried for analytics. The normalized tag is accompanied by a classified name that can be referenced easily and understood by the business.

Different normalization methods do not necessarily produce consistent results; the success of the method used here stems primarily from its high scalability. Tag normalization processes a native tag and applies a five-tier classification to it. Each tier classification is separated by an underscore, resulting in a normalized tag that better enables analytics (Sun et al. 2013).

The five tiers are

  • Tier 1—Parameter type and location, if applicable
  • Tier 2—Parameter measurement
  • Tier 3—Number of measurements
  • Tier 4—Parameter units
  • Tier 5—Free text to indicate offset well metrics

This article, however, only dives deeply into the first two tiers.

Tier 1 classification is either a physical property or a chemical type. If it is a physical property and a location is available, those two pieces of information are separated by a hyphen. For instance, a proppant measurement at the wellhead is classified as WH-PROP and the same measurement at a blender is BLD-PROP. For a chemical property, Tier 1 classification will assign the chemical name and type of chemical, separated by a hyphen. For example, high or low pH buffers are classified as HPH-BUFF and LPH-BUFF, respectively.
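Assuming the hyphen convention described above, Tier 1 tag construction can be modeled as a tiny helper (function and argument names are illustrative, not part of Murphy's published scheme):

```python
def tier1(measurement: str, prefix: str = None) -> str:
    """Tier 1 tag: a measurement or chemical abbreviation, prefixed by a
    hyphenated location (or chemical name) when one applies."""
    return f"{prefix}-{measurement}" if prefix else measurement

tier1("PROP", "WH")    # wellhead proppant
tier1("BUFF", "HPH")   # high pH buffer
tier1("FR")            # friction reducer, no prefix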

All tags filter through the first two tiers, while the rest of the tiers are dependent on the amount of information available. Table 1 summarizes Tier 1 classification that all 3,526 unique tags were filtered through.

Tier 1 Tag    Classification
BH-PROP       Bottomhole Proppant
BH-TREAT      Bottomhole Treating
BLD-DISH      Blender Discharge
BLD-PROP      Blender Proppant
BLD-TUB       Blender Tub
DH-XL         Downhole Crosslink
FR            Friction Reducer
FR-BREAK      Friction Reducer Breaker
HPH-BUFF      High pH Buffer
HVFR          High-Viscosity Friction Reducer
HVFR-BRK      High-Viscosity Friction Reducer Breaker
LPH-BUFF      Low pH Buffer
SCALE-INH     Scale Inhibitor
SURF-XL       Surfactant Crosslink
WH-PROP       Wellhead Proppant
WH-TREAT      Wellhead Treating
Table 1—Tier 1 of the normalization process with its respective classifications.

Tier 2 classifies the tag or data point into the measurement it is providing. As summarized in Table 2, types of measurements include pressure, volume, rate, and time.

Tier 2 Tag    Classification
Table 2—Tier 2 of the normalization process with its respective classifications.

Tier 3 aims to classify multiple readings of the same data type. If multiple data points for annulus pressure exist, Tier 3 classification will assign it a 1, 2, or 3 in consecutive order of the readings.
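A minimal sketch of Tier 3 numbering, assuming repeated readings are numbered in the order they are encountered (names are illustrative):

```python
from collections import defaultdict

def assign_tier3(tags):
    """Append 1, 2, 3, ... to each Tier1_Tier2 tag so repeated readings
    of the same data type stay distinguishable."""
    counts = defaultdict(int)
    numbered = []
    for tag in tags:
        counts[tag] += 1
        numbered.append(f"{tag}_{counts[tag]}")
    return numbered

assign_tier3(["ANN_PRES", "ANN_PRES", "WH-TREAT_PRES"])
# -> ["ANN_PRES_1", "ANN_PRES_2", "WH-TREAT_PRES_1"]
```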

Tier 4 classifies tags into respective units of measurement. It will assign a unit of measurement in its native format, whether that is SI metric or imperial units. For example, a Tier 4 tag will assign a unit of mass as kilograms or pounds.

Finally, Tier 5 allows free text to indicate metrics for any offset wells. If a parameter was measured for an offset well, this tier will have the well number to indicate which offset well’s reading is being recorded.

The result of the full classification will be a normalized tag, with the information from Tier 1 through Tier 5 separated by underscores, and a corresponding classified name equal to the concatenation of the Tier 1 and Tier 2 classifications. All normalized tags, at a minimum, will combine one Tier 1 and one Tier 2 classification.
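Putting the tiers together, the normalized tag and classified name could be assembled as follows. The classification lookup shows only an annulus-pressure excerpt, and all names here are illustrative rather than Murphy's actual implementation:

```python
def build_tag(tier1, tier2, tier3=None, tier4=None, tier5=None):
    """Normalized tag: Tiers 1 and 2 are mandatory; Tiers 3-5 are
    appended only when that information is available."""
    parts = [tier1, tier2] + [str(t) for t in (tier3, tier4, tier5) if t is not None]
    return "_".join(parts)

# Excerpt of a hypothetical tier-to-classification lookup.
CLASSIFICATIONS = {"ANN": "Annulus", "PRES": "Pressure"}

def classified_name(tier1, tier2):
    """Business-readable name: Tier 1 and Tier 2 classifications joined."""
    return f"{CLASSIFICATIONS[tier1]} {CLASSIFICATIONS[tier2]}"

build_tag("ANN", "PRES", 1, "PSI")   # -> "ANN_PRES_1_PSI"
classified_name("ANN", "PRES")       # -> "Annulus Pressure"
```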

Data Example

For example, Fig. 5 shows all forms of native tags for annulus pressure across all completed wells in 2019. The size of the tag box indicates the relative number of tags for that given naming convention. This example highlights the problem of unclean, inconsistent data capture. There are many variations under which critical parameters, such as Annulus Pressure, are recorded.

MurphyTags Fig5.jpg
Fig. 5—All variations of the native tag for the annulus pressure parameter. (Click to enlarge.)

These native tags are clearly all in reference to annulus pressure. The Tier 1 tag ANN corresponds to a classification of annulus. The Tier 2 tag PRES corresponds to the pressure classification. Concatenating Tier 1 and Tier 2 tags creates the normalized tag ANN_PRES. Similarly, concatenating Tier 1 and Tier 2 classification results in the classification name of “Annulus Pressure.” Fig. 6 illustrates the resulting normalized tag and classification name.

MurphyTags Fig6.jpg
Fig. 6—Normalized tag and classification name assigned to all variations of the native tags for the annulus pressure parameter. (Click to enlarge.)

The machine learning library, used to normalize tags at the source as they are uploaded to the historian, contains all the variations shown in Fig. 5, along with many more. Using the logic explained above, these variations are mapped to Annulus Pressure and ANN_PRES, which the script automatically assigns to any future tags that match one of these native tags during processing.


After the machine learning script processed the historical data, some native tag data remained unclassified. To clean this remaining troublesome data set, Murphy developed a solution to normalize the large sets of data with the new tier naming convention. The tag normalization process does not overwrite the original data in the historian. Instead, it appends the normalized and classified tag to the existing schema alongside the native tags. The original native tag for the collection of information is assigned to one relevant tag from the tier system, and that tag is subsequently assigned to every well describing that same parameter.

Establishing an efficient work flow for normalizing historical tags was a challenge because of the quantity and quality of the data. It became clear that professionals familiar with the data subsets were needed to complete the task. One of the first solutions involved keeping a manual database up to date by assigning a specific native data column to the normalized tag name through a spreadsheet in Excel as a completion file was received.

This manual database also contributed to the machine learning library. However, this was time-consuming, and the process did not work for all the native data because some data series had repeat names in each stage.

Using agile work processes such as the manual database and machine learning created an opportunity for a tag normalization app. The app created a more user-friendly approach that tied directly to the data to reduce errors.


For each stage under a given well, the app displays all the preset key tier parameters that Murphy has chosen to capture from each data set. Instead of displaying the tags in the technical tier-based jargon, the tags are shown by their classification names, which the engineers can read and decipher easily. For example, “ANN_PRES_1” would read as “Annulus Pressure” in the app, as illustrated in Fig. 7.

MurphyTags Fig7.jpg
Fig. 7—App interface for normalizing historical tags. The dropdown displays all native tags received in the stage file for the specific well.

A computer interpreter of any type would be able to use the location and equipment/system indications to group objects. To make the function of an object clear to a machine, the object type/function portion of the name is composed of standardized abbreviations that a computer can break apart and use to automatically apply metadata tags.

The tag normalization app automates and digitizes the five-tier classification system by using a dropdown menu, as shown in Fig. 8. The user selects the well and stage they want to normalize. The dropdown under each classification name contains all native tags from the completions files. The user can then select or search for a key word to choose the corresponding data set. Once the native tag has been selected and saved, the tag has been normalized. Critically, its native tag form is added to the machine learning library, and all future native tags matching it will be normalized automatically.

MurphyTags Fig8.jpg
Fig. 8—Tag normalization app interface. Classified names appear in the grid, while the native tags are contained in the dropdown menu.

The data in the app is directly linked to the historian. Once a tag is normalized, a blue arrow appears in the box (Fig. 7), which leads the user to a plot of the raw data on the historian. This allows the user to verify the data and ensure the correct data is linked (Fig. 9).

MurphyTags Fig9.jpg
Fig. 9—Annulus pressure trendline in historian. The native tag is displayed as “name,” and the normalized tag is displayed as “description.”

While the process is still relatively labor intensive, it is more efficient and less error prone than the older spreadsheet-based process. Additionally, the app platform allows for further improvements and efficiency gains. One improvement already implemented is that standard units are preset and automatically connected to the key parameters. The app also recognizes patterns while a user works through a set of data: once a user has completed a stage, the app carries forward the previous native tag names if they have not changed. Users are no longer required to search for the native tag name; they can simply plot to verify the data and press save.
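The carry-forward behavior described above might be sketched as follows; this is a simplified model with illustrative names, since the app's actual data structures are not published:

```python
def carry_forward(prev_mapping, new_stage_native_tags):
    """Prefill a new stage with the previous stage's selections when the
    same native tag names appear again; everything else awaits user input."""
    prefilled, pending = {}, []
    for classified, native in prev_mapping.items():
        if native in new_stage_native_tags:
            prefilled[classified] = native   # same channel name as last stage
        else:
            pending.append(classified)       # user must pick a native tag
    return prefilled, pending

prev_stage = {"Annulus Pressure": "Backside", "Wellhead Treating": "WH Press"}
prefilled, pending = carry_forward(prev_stage, {"Backside", "Slurry Rate"})
# prefilled == {"Annulus Pressure": "Backside"}; pending == ["Wellhead Treating"]
```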

The app is the company’s solution for cleaning up historical data and, for new or future wells, Murphy is working with the hydraulic fracturing companies to label raw data channels using the tier-based classification system. In 2020, the hydraulic fracturing service providers collaboratively labeled the data correctly at the source; the native data came to Murphy in the correct format, and no additional normalization was needed. Clean data collection is an integral part of the business, and the company will continue to work with service providers to collect quality data up front, with the anticipation that the industry will adopt this approach as a standard in the future.

Data that originates from the operation of a system typically contains samples that are not within the process limits (e.g., the allowed temperature range or the maximum possible mass flow in a process) but that do not necessarily indicate faulty operation of the system. In case the outliers are singletons or a limited amount of consecutive samples, it is reasonable to sanitize the data by removing the outliers and replacing them by physically plausible data samples. This task can be done by a human operator who requires only limited knowledge on the underlying physical processes (typically knowing the process limits is sufficient).

However, the challenge is to automate this process so that no human intervention is required. Algorithms from the domain of machine learning can be applied to solve this issue by clustering data and identifying the clusters which denote outliers (Zucker et al. 2015).
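A minimal version of this sanitization step replaces out-of-range samples by linear interpolation between valid neighbors. The process limits and function names are assumptions for illustration, and the clustering-based outlier detection described by Zucker et al. is not shown:

```python
def sanitize(series, lo, hi):
    """Replace samples outside the process limits [lo, hi] by linearly
    interpolating between the nearest in-range neighbors; runs of
    outliers at either edge are clamped to the nearest valid sample."""
    valid = [i for i, v in enumerate(series) if lo <= v <= hi]
    if not valid:
        raise ValueError("no in-range samples to interpolate from")
    out = list(series)
    for i, v in enumerate(series):
        if lo <= v <= hi:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:          # outlier run at the start
            out[i] = series[right]
        elif right is None:       # outlier run at the end
            out[i] = series[left]
        else:                     # interior outlier: interpolate
            frac = (i - left) / (right - left)
            out[i] = series[left] + frac * (series[right] - series[left])
    return out
```

For 1-second frac-van channels, limits like an allowed pressure range are usually enough for an operator, or a script, to apply this kind of cleanup without deep knowledge of the underlying physics.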


To date, approximately 100,000 completion tags have been normalized using this process, starting with the most recent wells and working backward. This process is a way to assign several attributes under one relevant category to reduce the clutter and provide an aggregate of all those under one umbrella for engineers and managers, within completions and across functional disciplines. The normalized data is critical to Murphy as it works toward more advanced data analytics. An in-house completions platform, called the RTR, was built to analyze historical data and compare it to live data. This could not be done without the clean, normalized data. Details about RTR will be published in an upcoming article.

This work flow can easily be reconfigured for any business unit. Its systematic separation and planned consolidation of information make it a methodical option for conscientious administrators monitoring the field. The tag normalization work flow empowers decision makers, who no longer wait for hours to collect data or waste hours cleaning and preparing data for analysis.

Easy, accurate, and real-time per-second entries from every oil and gas well are stored in the data historian, normalized, and continuously processed to be accessed at any point of time. Murphy has made data an effective fuel for its growth engine by putting real-time updates and consistent monitoring right in the hands of engineers.


References

Damnati, G., Auguste, J., Nasr, A., et al. 2018. Handling Normalization Issues for Part-of-Speech Tagging of Online Conversational Text. Eleventh International Conference on Language Resources and Evaluation, Miyazaki, Japan.

Li, C., Liu, Y. 2015. Joint POS Tagging and Text Normalization for Informal Text. Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence.

Mujitha, B.K., Jalal, A., Vishnuprasad, V., and Nishad, K.A. 2015. Analytics, Machine Learning & NLP—Use in Biosurveillance and Public Health Practice. Online Journal of Public Health Informatics 7 (1).

Nowak, A. Rough Set Theory in Decision Support Systems.

Pawlak, Z. 1997. Rough Set Approach to Knowledge-Based Decision Support. European Journal of Operational Research 99 (1).

Ragkhitwetsagul, C., Krinke, J., and Clark, D. 2018. A Comparison of Code Similarity Analysers. Empirical Software Engineering 23: 2,464–2,519.

Ruiz, F., Nix, T., and Ismail, H. 2019. Data Historian and Trending Tool Transforms Company Operations. Data Science and Digital Engineering in Upstream Oil and Gas.

Sun, J., Nishiyama, T., Shimizu, K., and Kadota, K. 2013. TCC: An R Package for Comparing Tag Count Data With Robust Normalization Strategies. BMC Bioinformatics 14 (219).

Yujian, L., Bo, L. 2007. A Normalized Levenshtein Distance Metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6): 1,091–1,095.

Zhang, Q., Xie, Q., and Wang, G. 2016. A Survey on Rough Set Theory and Its Applications. CAAI Transactions on Intelligence Technology 1 (4).

Zucker G., Habib U., Blöchle M., et al. 2015. Sanitation and Analysis of Operation Data in Energy Systems. Energies 8 (11): 12,776–12,794.