Data Mining of Hidden Danger in Operational Production

The value of hidden-danger data stored in text can be revealed through an approach that can help sort and interpret information in an ordered way not used previously in safety management. These optimized data then can be used to apply safety-management techniques precisely, centralize operations, and reduce risk levels.


The collection and storage of huge amounts of data have exposed a gap between data-collecting capacity and the means to analyze those data accurately. Some experts use analytical methods to study the relationship between safety factors and accident events to guide equipment maintenance, quality testing, and related work. However, extracting hidden-danger data stored in text format has been a challenge for the petroleum and petrochemical industries.

The data-mining technique discussed in the complete paper uses Chinese word segmentation, Chinese lexical annotation, named entity recognition, and other techniques to extract keywords from the text. Then, a structured hidden-danger database is built through a process of keyword mapping and extraction, data cleansing and integration, and data selection and transformation. Finally, a data-stream sliding-window model is combined with correlation analysis to identify relationships among hidden dangers, supporting the application of the results in enterprise safety management.
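The sliding-window idea can be illustrated with a minimal Python sketch. This is not the model from the complete paper, which this synopsis does not detail; it assumes each hidden-danger record has already been reduced to a set of keywords, and the class name `SlidingWindowCounter` and its methods are hypothetical. The sketch keeps keyword counts over only the most recent records of a stream, so that frequencies reflect current conditions rather than the full history.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Keep factor counts over the most recent `size` records of a stream."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.counts = Counter()

    def add(self, factors):
        """Add one record (an iterable of factors); evict the oldest if full."""
        factors = frozenset(factors)
        self.window.append(factors)
        self.counts.update(factors)
        if len(self.window) > self.size:
            old = self.window.popleft()
            for factor in old:
                self.counts[factor] -= 1
                if self.counts[factor] == 0:
                    del self.counts[factor]

    def frequency(self, factor):
        """Share of records in the current window that mention `factor`."""
        if not self.window:
            return 0.0
        return self.counts[factor] / len(self.window)


# Illustrative stream of keyword sets (made-up data, not from the paper).
stream = [
    {"valve"},
    {"valve", "leakage"},
    {"flange"},
    {"flange"},
    {"flange", "leakage"},
]
win = SlidingWindowCounter(size=3)
for record in stream:
    win.add(record)
```

After the loop, the window holds only the last three records, so "valve" no longer contributes to the frequencies. Comparing frequencies between successive windows is one simple way to notice that the mix of hidden dangers is shifting.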

Theoretical Basis

The mining of hidden-danger data consists mainly of the preprocessing of text data, the construction of a structured database, and data analysis.

Collecting a Professional Vocabulary. Hidden-danger data are stored in textual form, but data-mining and machine-learning models cannot deal with these unstructured (or semistructured) types of information directly. Thus, natural-language processing must be used.

The collection of a professional vocabulary forms the basis of hidden-danger data analysis. Mechanisms by which this is accomplished include the following:

  • Chinese word segmentation. Each hidden-danger description, a sequence of Chinese characters, is segmented into separate words whose meanings the computer can recognize automatically.
  • New-word collection. Chinese word segmentation can be ineffective in identifying professional vocabulary terms in hidden-danger data. For example, the phrase “health, safety, and environment (HSE) management system quantitative audit standards” might be divided into the subphrases and words “HSE,” “management system,” “quantification,” “audit,” and “standard,” none of which on its own may meet problem-analysis requirements. Therefore, in practice, one can use the position of each word in the text to find adjacent words that frequently occur together and merge them to build a vocabulary for analysis purposes.
  • Collation with word meanings and lexical labels. Because different people describe the same problem with different expressions, organizing the meanings of words within the vocabulary and standardizing the vocabulary are essential. To facilitate later analysis, each word should be given a lexical label once its meaning has been collated. At that point, a professional vocabulary can be created.
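The new-word collection step can be sketched in Python as follows. The sketch assumes the descriptions have already been segmented into tokens (the segmenter itself is not shown), and the function name `merge_frequent_pairs` and the threshold `min_count` are hypothetical rather than taken from the paper. It counts adjacent token pairs across all descriptions and merges pairs that recur often enough into single vocabulary terms.

```python
from collections import Counter


def merge_frequent_pairs(docs, min_count=2):
    """Merge adjacent token pairs that occur at least `min_count` times.

    docs: list of token lists (output of a word segmenter, assumed given).
    Returns (merged_docs, new_terms).
    """
    # Count every adjacent pair of tokens across all documents.
    pair_counts = Counter()
    for tokens in docs:
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    # Rewrite each document, joining frequent pairs left to right.
    merged_docs = []
    for tokens in docs:
        out = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                out.append(tokens[i] + " " + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged_docs.append(out)
    new_terms = sorted(" ".join(pair) for pair in frequent)
    return merged_docs, new_terms


# Illustrative, already-segmented descriptions (made-up data).
docs = [
    ["HSE", "management system", "audit", "standard", "missing"],
    ["audit", "standard", "outdated"],
    ["HSE", "management system", "not", "implemented"],
]
merged, terms = merge_frequent_pairs(docs, min_count=2)
```

A single pass merges only pairs; running the function again on its own output would build longer terms from pairs of already-merged tokens. Real Chinese text would be joined without the space separator used here for readability.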

Building a Structured Hidden-Danger Database. This task is achieved through three steps that are detailed in the complete paper.

  • Keyword mapping and extraction
  • Data cleansing and integration
  • Data selection and transformation
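As a rough illustration of how these three steps might fit together, the following Python sketch maps a free-text description onto structured fields. The details of the paper's procedure are not given in this synopsis, so everything here is an assumption: the vocabulary `VOCAB`, the field names, and the function `to_record` are hypothetical, and plain substring matching stands in for the keyword extraction described earlier.

```python
# Hypothetical vocabulary: term -> (field, canonical value).
# Synonyms map to the same canonical value (a simple form of cleansing).
VOCAB = {
    "valve": ("equipment", "valve"),
    "gate valve": ("equipment", "valve"),
    "flange": ("equipment", "flange"),
    "leak": ("problem", "leakage"),
    "leakage": ("problem", "leakage"),
    "distillation unit": ("device", "distillation unit"),
}


def to_record(description, vocab=VOCAB):
    """Map a free-text description to a structured record.

    Keyword mapping: longer vocabulary terms are matched first.
    Cleansing: text is lowercased and duplicate values collapse into sets.
    Selection: only fields present in the vocabulary are kept.
    """
    text = description.lower()
    record = {}
    for term in sorted(vocab, key=len, reverse=True):
        if term in text:
            field, value = vocab[term]
            record.setdefault(field, set()).add(value)
            # Blank out the match so shorter terms do not re-match it.
            text = text.replace(term, " ")
    return {field: sorted(values) for field, values in record.items()}


rec = to_record("Gate valve flange leak found at the distillation unit")
```

Here `rec` holds one entry each for device, equipment, and problem, with "gate valve" and "leak" normalized to their canonical forms. Substring matching is deliberately naive; a production system would match on segmented tokens instead.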

Data Analysis. Mining patterns from the information includes three parts:

  • The correlation-analysis algorithm is used to identify correlations among hidden-danger factors.
  • The changing-mode-of-mining algorithm is applied, as the situation requires, to identify change modes in the hidden-danger data.
  • Visualization processing and analysis of the pattern information is completed, and the hidden-danger data-analysis report is generated.

Correlation-analysis and changing-mode-of-mining algorithms are described in the complete paper.
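Because the paper's algorithms are not reproduced in this synopsis, the following Python sketch uses standard association-rule measures (support, confidence, and lift) as a stand-in to show what correlating hidden-danger factors can look like. The function name `pair_rules`, the threshold `min_support`, and the sample records are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(records, min_support=0.3):
    """Return (a, b, support, confidence of a -> b, lift) for frequent pairs.

    records: list of sets of hidden-danger factors, one set per case.
    """
    n = len(records)
    item_counts = Counter()
    pair_counts = Counter()
    for factors in records:
        item_counts.update(factors)
        # Sort so each unordered pair is always counted under one key.
        pair_counts.update(combinations(sorted(factors), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                       # P(a and b)
        if support < min_support:
            continue
        confidence = count / item_counts[a]       # P(b | a)
        lift = confidence / (item_counts[b] / n)  # P(b | a) / P(b)
        rules.append((a, b, support, confidence, lift))
    # Strongest associations first.
    return sorted(rules, key=lambda r: r[4], reverse=True)


# Illustrative structured records (made-up data, not from the paper).
records = [
    {"valve", "leakage", "distillation unit"},
    {"valve", "leakage"},
    {"flange", "leakage"},
    {"platform", "guardrail"},
]
rules = pair_rules(records, min_support=0.5)
```

With these records, only the pair of "leakage" and "valve" reaches the support threshold. A lift above 1 indicates that the two factors appear together more often than they would if they were independent.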

Data-Mining Analysis

Range of Data. Hidden-danger data covering 29,938 petroleum-refining production-safety cases were collected. The data were processed into a number of fields, including project, location, audit dates, problem descriptions, corresponding professions, and people responsible for rectification. The changing-mode and correlation-analysis algorithms were then applied to the data of one enterprise.

As part of the analysis process, 9,982 keyword items were extracted from 2.65 million words through Chinese word segmentation, word merging, collation of word meanings, and lexical labeling, yielding a list of terms pertaining to HSE management systems in the petroleum-refining industry. On the basis of this vocabulary, a database of hidden dangers in petroleum-refining enterprises was constructed.

Types of Data. The data for Enterprise A cover problems submitted during 2017. In the first half of the year, departments with a high frequency of problem submissions included, among others, storage and transportation, organic synthesis, safety and environmental protection, material procurement, storage and transportation workshops, and the thermoelectric joint workshop. The probability of problem submission for the joint workshop was 43.6%. In the second half of 2017, the departments with a high frequency of problem submissions included mobile equipment, storage and transportation, safety and environmental protection, water supply and drainage, organic synthesis, and the joint workshop. The probability of problem submission for the joint workshop increased by 15.5%.

A total of 25 types of devices appeared in the data in the first half of 2017, mainly catalytic, continuous-reforming, and diesel-hydrogenation devices and distillation units. In the same period, 230 equipment and facilities categories appeared in the data; equipment-related categories included pipelines, valves, flanges, and insulation, and facilities-related categories included platforms, operating rooms, scaffoldings, spherical tanks, guardrails, and pumping rooms. A total of 26 types of devices appeared in the second half of 2017, mainly distillation units and catalytic, continuous-reforming, sulfur, and fractionation devices; the frequency of distillation devices increased significantly. For that same period, equipment- and facilities-related categories numbered 365, including equipment such as valves, pipelines, protection, flanges, pressure gauges, heating furnaces, control valves, guides, and heat exchangers. Facilities-related references in the data for the same period included mainly platforms and operating rooms.

Hidden-Danger Problem Characterization. Hidden-danger problems can be divided into three types. The first type comprises what the authors term simple management problems: safety gaps created by the working environment and job practices that are, at least in theory, simpler to control. Examples of this type of problem include too-brief inspections, inaccurate shift records, incomplete safety signage, and stacking of dangerous materials.

The second type consists of issues that require a more-nuanced degree of management, which the authors term normal. For example, deficiencies may exist in the security-management system, or safety-reporting or operation-ticket-management systems may be ineffective.

The third type consists of the management of complex technical difficulties. Examples include issues related to anticorrosion measures, chemical-process maintenance, large-equipment management, relay-protection tuning, and hazard-and-operability analysis methods.

As the identification and characterization process of hidden-danger data proceeded, several new categories of problems were identified in the latter half of 2017, indicating the efficacy of this targeted, more-precise approach to analyzing textually based hidden-danger data.

This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper IPTC 19485, “Data Mining of Hidden Danger in Enterprise Production Safety and Research of Hidden-Danger-Model Conversion,” by Kun Tian, Hong-Qiao Yan, Ya-Ming Mao, and Shun-Cheng Wu, CNPC, prepared for the 2019 International Petroleum Technology Conference, Beijing, 26–28 March. The paper has not been peer reviewed.