Bring Order Out of Chaos and Put Data in the Center
Organizations interested in realizing the benefits of having data at the center of their firm will need to define numerous specific work scopes for the conversion of unstructured to structured data.
The Open Subsurface Data Universe (OSDU), now a forum of The Open Group, is an initiative started by Shell employees to overcome the industry’s conventional reluctance to share data. The OSDU was created to enable expedited sharing when sharing makes sense. The concept hinges on the idea that some categories of data, when shared by all, benefit all. The OSDU emphasizes subsurface data, but the group recognizes a need to enable sharing generally through the creation of repositories and of shared standards for storage and access. Along these lines, the group touts the benefits of relying on organizations external to the firm to help realize the full potential of digital transformation.
At the end of April 2019, in Amsterdam, Johan Krebbers, chief technology officer at Shell, gave a talk on “Putting Data at the Center of the Oil and Gas Industry.” In the talk, Krebbers declared that the major problem holding the industry back is a data-access problem. Too many resources are devoted to trying to find data, and practitioners’ ability to add value in the firm is reduced by the narrow scope of their data access. The result is that the industry is operating at far less than full capacity. Dramatic boosts in productivity, cost saving, and safety performance become realizable once this problem is overcome. Krebbers emphasized five recurring issues that drive these problems.
- Data currently must be formatted in a particular way to be accepted as input by a given software tool. As a result, people spend a lot of time converting between formats, which invites human error.
- Data currently is stored in silos within the firm. These silos prevent anyone other than the data’s creators from accessing it without considerable hassle. In addition, the silos tend to be recreated by each generation within teams, so decisions are not informed by historical data buried in other silos.
- There is a distinct lack of metadata stored with files in folders. This makes it difficult for practitioners to know in which work flows a file could be useful or what kind of value exists within the file without opening and exploring it exhaustively.
- Energy firms have little or no search capability. As a result, finding files relevant to a search query, and the particular data needed within relevant files, is an arduous, manual process.
- Even the structured data within the firm is set up and stored in ways that prevent machine learning and other data-driven applications from capturing its potential value.
The solution, as Krebbers presents it, is to put data at the center, both within the company and across the industry. He said he believes that a related cultural shift is inevitable. The emerging dominant culture in the industry is exemplified by individuals who rely on full data sets to make transparent decisions and who take the broader needs of the whole company into consideration as they create and share their data. Driving cultural change is always difficult, even with market demand on the side of the shift. There are, however, four specific actions Krebbers and the OSDU community recommend for firms across the industry.
- Separate data from the applications that created it or traditionally use it. The goal is to have data formatted in such a way that any application can potentially access it.
- Move all data, whether structured or unstructured, into a single data platform accessible across the company and usable by powerful data science techniques.
- Develop support methods for metadata and master data such that files are assigned multiple tags, each designating them as relevant to a certain team or work flow.
- Develop search functions that enable real-time data access so that practitioners can make fully informed decisions without delays. Krebbers said he believes this is increasingly important.
The realization of each of these four strategies relies on a common tactic: the conversion of unstructured data into structured data. Considering Krebbers’ first point, one type of unstructured data is actually structured within a file according to an existing protocol or template, but that protocol or template is simply the wrong one for the intended use case. The output structure from one application, in the most common example, is different from the required input structure for the application at hand. If an era is coming in which all applications for oil and gas practitioners share common data input and output formatting, that era is not yet here. In the meantime, systematic machine conversion between formats is essential, lest impractical amounts of employee time be consumed by rote manual data-transfer work with diminishing returns. Some of this automation can be accomplished in house, but there is also a place for external service providers to offer file-format-transition solutions to expedite existing work flows and to make better use of data currently untouched by existing work flows. This holds true whether the applications in question are incumbent software tools or data science tools that are new to the industry, and it holds true whether the data in question is being passed from one work flow to the next or is being accessed from a central repository by tools designed to handle big data.
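As an illustration, this kind of format conversion can be as simple as renaming and filtering columns between one tool’s export and another tool’s expected input. A minimal sketch in Python follows; the column names and mapping are hypothetical, chosen only to show the shape of the problem:

```python
import csv
import io

# Hypothetical mapping from one tool's export columns to another's input schema.
COLUMN_MAP = {
    "WELL_ID": "well_name",
    "MD_FT": "measured_depth_ft",
    "GR_API": "gamma_ray_api",
}

def convert_export(src_text: str) -> str:
    """Rename mapped columns and drop the rest so the downstream tool accepts the file."""
    reader = csv.DictReader(io.StringIO(src_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(COLUMN_MAP.values()))
    writer.writeheader()
    for row in reader:
        writer.writerow({COLUMN_MAP[k]: v for k, v in row.items() if k in COLUMN_MAP})
    return out.getvalue()

sample = "WELL_ID,MD_FT,GR_API,EXTRA\nA-1,5400,85.2,x\n"
print(convert_export(sample))
```

Real conversions involve far messier formats than CSV renames, but the principle scales: encode the mapping once, by machine, rather than re-keying data by hand.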
Another type of unstructured data is buried in textual files and needs to be converted into tabulated data in order to be useful to practitioners attempting to crunch numbers. These unstructured files are often historical, produced at a time when the typed word was the best and most useful way to ensure utility in the future (they may contain tables, but those tables may not reconcile easily with the master tables baked into contemporary work flows). This format becomes obsolete when it can be replaced by automated tabulated reporting, which lends itself much more readily to today’s data-analytics initiatives. For example, walking around a facility checking meters manually and then writing up a report likely has been replaced by automatic tabulated data population from remote sensors. Yet, the old files still exist. In these cases, the data buried in the historical textual files is still useful, but it needs to be scanned and converted into tabulated data in batches far too large for manual processing. There are other instances where unstructured textual data arguably can never be converted into today’s structured data without loss of value. These include the recorded observations of individuals in complex and dynamic environments irreducible to options on a dropdown menu. These also include the distillations of wisdom from industry veterans that deal with gray areas or “the space between cells in the spreadsheet.” Decision trees dictating best practices would be too enormous and dynamic to construct manually, so the typed word prevails in communicating observations and best practices. In these instances, however, the unstructured data scattered across these valuable files still needs to be considered by the applications dealing with structured data. A system for extraction needs to be established and maintained.
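A minimal sketch of the extraction idea: pulling meter readings out of a legacy free-text report into rows fit for a master table. The report snippet and pattern below are invented for illustration; production systems would need far more robust parsing, or natural language processing, to handle real report variety:

```python
import re

# Invented legacy report snippet; the wording and meter names are illustrative.
report = """
Daily inspection 1998-03-14. Walked the facility.
Meter M-101 read 1532 psi at 06:00.
Meter M-102 read 1488 psi at 06:05.
Compressor noise noted near M-102; follow up next shift.
"""

# One rigid sentence pattern; real reports would need many patterns or an NLP model.
PATTERN = re.compile(r"Meter (?P<meter>\S+) read (?P<value>\d+) psi at (?P<time>\d{2}:\d{2})")

def extract_readings(text: str) -> list[dict]:
    """Turn buried meter readings into rows suitable for tabulated storage."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

rows = extract_readings(report)
# rows is a list of dicts with keys: meter, value, time
```

Note what the sketch deliberately leaves behind: the free-text observation about compressor noise does not fit the pattern, which is exactly the category of data the article argues can never be fully tabulated without loss.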
Another way to derive structure from unstructured data is to tag the file containing the unstructured data with a handle that is recognizable by applications and humans alike and that affiliates that file with a work flow, a location, a team, a time period, or another designation of value to the firm. The proliferation of data files is accelerating rapidly across all industries, but the collection of files in oil and gas was large to begin with and was siloed. Vast amounts of structured and unstructured files are stored in file folders that themselves are unstructured. Given its lack of accessibility, all this data, practically speaking, is unstructured. The task, therefore, of manually sorting through all data silos for various files and tagging each based on its affiliation with any of multiple designations is madness. Firms are faced with the option of either starting over or automating the tagging process of historical records. Starting over would involve creating a new and smart system, allowing tags on all new files going forward while either ignoring all the accumulated files from the past or else committing to decades of rote manual processing rife with human error. Automating the tagging would allow the new system to be informed by lessons from historical data before it is fully solidified. Tagging is useful not only for files but also for specific passages of text within files. For example, tags are often used in the categorization of events in the field to allow for optimal resource allocation and predictive maintenance.
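The automated tagging described above can be sketched as keyword rules that assign every matching tag, so one file can surface in multiple work flows. The tag names and keyword lists below are hypothetical; a deployed system would likely use trained classifiers rather than substring matching:

```python
# Hypothetical tag rules: a tag fires when any of its keywords appears in the text.
TAG_RULES = {
    "drilling": ["bit", "mud weight", "rop"],
    "production": ["choke", "flow rate", "separator"],
    "maintenance": ["corrosion", "inspection", "repair"],
}

def tag_document(text: str) -> list[str]:
    """Assign all matching tags so the file is findable by every relevant team."""
    lowered = text.lower()
    return sorted(tag for tag, kws in TAG_RULES.items()
                  if any(kw in lowered for kw in kws))

tag_document("Inspection found corrosion at the separator inlet; repair scheduled.")
```

That one sentence earns both a maintenance tag and a production tag, which is the point: a single historical file often matters to more than one designation, and a machine can apply all of them at once across millions of files.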
Finally, workers already tend to have real-time access to their own structured data through dashboards connected to sensors. If search functionality is to add value beyond this, it will do so by including in search results the relevant unstructured data across all the types mentioned previously. The intent of questions entered into search bars can be broken out using natural language processing, but, if the only answers that come back are complete files with tags matching keywords or are specific passages containing keywords, the application will fall short of functional utility. This is because there are too many recurring keywords across many nuanced contexts within the industry for the list of all the instances matching any given keyword from a search to be relevant, let alone helpful. In addition, if keywords or intents are matched with whole files, the user still needs to search within each of the unstructured files for the buried intelligence they seek. For enterprise search to work effectively for firms in a complex, technical, and dynamic industry such as oil and gas, the search function needs to match the intent of the questions with the intent of the passages throughout all relevant unstructured textual files.
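To illustrate the difference between returning keyword hits and ranking passages by relevance, the sketch below scores passages against a query with bag-of-words cosine similarity and returns the best match rather than every file containing a keyword. The passages are invented, and this is only a stand-in for intent matching; real systems would use richer NLP models such as embeddings:

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    """Crude bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented passages standing in for text buried in unstructured report files.
passages = [
    "choke size increased to stabilize flow rate on well A-1",
    "corrosion found during annual inspection of separator shell",
    "mud weight adjusted after kick detected while drilling",
]

def search(query: str, docs: list[str]) -> list[str]:
    """Rank passages by similarity to the query instead of listing raw keyword hits."""
    q = vectorize(query)
    return sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)

top = search("separator corrosion inspection", passages)[0]
```

Ranking at the passage level, rather than the file level, is what spares the user from opening each matching file and hunting for the buried intelligence by hand.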
Overall, the creation of a data-centric work culture in firms across the industry seems inevitable, and the internal efforts to drive the necessary changes to realize this culture are complex and will take considerable time and effort. One recurring need that sits at the beginning of all identified categories of effort toward this goal is the conversion of unstructured to structured data, whether by reformatting, tagging, or finding. Organizations interested in realizing the benefits of having data at the center of their firm and their industry will need to define numerous specific work scopes for the conversion of unstructured to structured data, and then they will need to execute those work scopes, either making use of current internal resources or leveraging external knowhow. Given that a large part of this conversion is a batch process rather than an ongoing need—and given the industry’s lack of experience with unstructured-data management—the devotion of major resources to building internal capability would seem ill advised.
Alec Walker is the cofounder and chief executive officer of the natural-language-processing and data-analytics firm DelfinSia in Houston. Delfin helps the oil and gas industry extract value from unstructured data. He also founded and runs Inly, which teaches oil and gas practitioners computer and data science. Walker has led digital-transformation and internal entrepreneurship projects for a variety of leading organizations including Intel, Inditex, AECOM, and General Motors. He has worked for Shell as a technical service engineer in refining, a tech tools software product manager, and a reservoir engineer in unconventional oil and gas. Walker holds a BS degree in chemical engineering from Rice University and an MBA degree from the Stanford Graduate School of Business.