Workshop Cracks Open Subsurface Data
SPE’s 2021 Open Subsurface workshop tackled the ins and outs of open source, open data, and open access.
The Society of Petroleum Engineers held a virtual workshop 18–21 May 2021 on the topic of open subsurface data. The purpose of the workshop was to bring together professionals to discuss the topics of open source, open data, open education, and open access in the context of the oil and gas industry in general and the subsurface portion specifically. The workshop included presentations about existing open projects as well as panel discussions on topics including legal matters and how to start, scale, and maintain an open project.
What Is Open?
When we speak about “open,” we usually mean that something is being shared with the public that is typically not shared or not shared for free. In “open source,” the source code of software is shared. In “open data,” the data itself is shared. In “open access,” the contents of publications are shared without charging an access fee. In “open education,” educational content is shared without charge.
We should note that, while open projects are usually available without charge, they are not necessarily free of restrictions. Use is governed by a license, and others can then build on the project within its terms. For example, the operating system Linux is an open-source project and available for free, but Red Hat, which builds on top of it, offers paid products and services. Open projects can range from a mere upload of a master’s degree thesis code to a large, sophisticated community project such as Linux or a large corporate-sponsored project such as PyTorch. Making any kind of blanket quality claims on open projects, therefore, is difficult, and each must be assessed on its own merits.
While open projects are usually distributed without charge, they are not free to create or maintain. Projects receive contributions in three primary ways: labor, hardware, and funds. Labor could be donated by individuals volunteering their time; this is the best-known aspect of open projects, with Linux as a prime example. Companies may choose to offer their employees’ work toward an open project, which has happened with popular artificial-intelligence frameworks such as PyTorch and TensorFlow. Some companies may donate data-center time and hardware to the open community, with GitHub being an example. Large, successful open projects eventually need full-time leadership, and even small projects need to spend some money on products and services. Financial donations are therefore important to most open projects. While receiving small donations from many sources is great, it is typically necessary to have one or more major sponsors who can contribute significant sums.
In addition to the financial aspects, open projects both create and need a community to function. Open projects grow through content contributions and obtain a purpose by a community of users who consume the offering. Both sides working together creates a successful project. The need to grow both communities side by side has been discussed in recent years around ecosystem businesses such as Uber and Airbnb, which depend on a balance of suppliers and consumers. Maintaining that balance while growing is the central challenge of scaling a project.
What Is the Value of Open?
By and large, people will do what they can. If they have access to data and code, they research and develop them further. If they do not have access, they cannot. If important problems are closed, they will not receive attention. Open projects, therefore, foster innovation.
Community depends upon community. Many companies and people are happy to use projects such as Linux or Apache for free, yet they are unwilling to contribute to open projects, or they complain about their quality. Investment in open projects is repaid by the community in research and contributions.
Two decades ago, every company had to write its own solvers. Now, we build on top of commonly available, well-tested libraries and can accomplish a lot more with lower budgets. Code developed internally requires more time, resources, and budget to fix the bugs and make it robust. Developing code in a distributed ecosystem with more contributors and testing institutions allows the bugs to be fixed more quickly, leading to better and more-robust software with less time, resources, and budget. This is the value of open projects in terms of internal cost savings.
Sharing subsurface and similar data with the public can be helpful to the oil and gas industry in its relationship with the public, given the current sentiment around climate change.
General Observations on Participating in Open Projects
Basis. The principal challenge of an open project is scaling, which entails growing both the contributor group and the user group. When both groups are small, the project usually is of limited utility and influence. Growth on one side of the ecosystem, however, requires growth on the other. More contributors are attracted to a project that is used. Users are attracted to a project that meets their needs, is updated frequently, and is maintained regularly. In a commercial project, the two dimensions are usually the number of users and the price, while in an open project, the price dimension is replaced by the number of contributors. This desire for scale is a fundamental objective of any open project and underlies much of the advice provided here.
Quality. The project and its contents must be of high quality to be used. Projects of less than high quality do not attract contributors or users and, thus, are effectively nonexistent. The project must have a quality-assurance process to make sure that a publication or release is of sufficient standard.
Documentation. To allow newcomers to enter the project or to allow users to make efficient and effective use of the project, it must be documented well and in an understandable manner.
Barriers. For new people to join the open project—whether as contributors or users—they must invest in the project by learning about it. This investment can be so sizeable that a person cannot join several projects but rather must choose (before getting deeply into its ecosystem) which one to join. The higher this entry barrier is, the harder it will be for new members to join. Designing a low entry barrier is in the interest of any open project and should be considered from the start. This may include certain conventions in the code, but it certainly includes noncode items such as documentation, tutorials, videos, and other resources.
Independence. The project must become independent of the original founding personalities for it to scale. The founders themselves should be able to let go of both ego and personal attachment.
Process. Any project will have a diversity of stakeholders. People who contribute code have different needs and wishes than people who use the code and the management of user companies. The project must address them all. In addition, contributors will have opinions about what to do and how to do it, and the project needs to have a process to deal with disagreements.
Organization. Organizing the work takes a lot of time and sensitivity because the team is composed (often mostly) of volunteers. These volunteers (both people and organizations) have their own interests and so are difficult to organize into a coherent project roadmap. The roadmap and plan will need much more dynamic adjustment than those of a commercial project.
Skill Sets. It is important to have nondeveloper contributors for an open project. Such people could offer legal, design, graphical, and documentation help. Source code or data alone may not be helpful enough to be useful and, thus, scale. To be useful to a multitude of users, an open project needs these other (nondata and noncode) contents.
Metrics. Measuring project success is important for the internal ecosystem and all stakeholders, so it helps to build in some measurements. The numbers of downloads, users, and publications that cite the project are popular metrics. Projects function like the platform economy in that they require scaling on two fronts simultaneously: contributors and users. Critical mass in the project’s community is needed before the project becomes successful and scales. In practice, it may be difficult to determine what that critical mass is and when it is reached.
Marketing. The team must inform the field that the new resource exists. The effort of marketing the existence of the project is a serious task that requires organization, management, and time. Only by making the community aware will the growth of the ecosystem happen.
Standards. An open project should be widely usable and, thus, interoperable with other open or commercial projects in the same space. As such, open projects should investigate the standards and data formats that are in common use and support them. Being able to receive input and provide output in various standard file formats or interact automatically with other software is a major benefit to users. Open projects should resist the temptation to set new standards or reinvent the wheel.
Community. Projects create communities of people who work together on a common goal driven by personal and professional interest. The sense of community is a powerful motivator and driver and should be fostered. Large projects have regular user-group conferences where the community meets in person and interacts—just like large commercial projects. These events foster a community spirit that grows the community and enables it to attain longevity. An open project should plan on integrating community building events as soon as is feasible.
Credit. A major sticking point of open projects is attribution and crediting. Because contributors usually are not paid for their work, they expect to derive satisfaction, visibility, and career benefits from being credited for their contributions. It is essential for a project to have policies and methods for crediting, both to attract contributors and to avoid legal problems of copyright, ownership, and origination. This is especially true for junior contributors who are still building their careers and must be given a sense that their contribution will further their career.
Observations About Open Data
Open-data projects are, in some ways, more complex than open-source projects because the data belongs to an entity that must release the data to the public, whereas, with open source, the code is generated while the contributors are already aware of the project’s open nature.
Process. An organization needs an established decision-making process to determine if a data set can be made public or under what conditions it can be made public. Procedures need to be in place to allow someone to raise the question and eventually for a decision to be made and implemented. This process must also watch over the implementation of any conditions or prerequisites to publication.
Provenance. Users of the data will want to know where the data came from, under what conditions it was generated, and a host of other identifying information. This must be documented carefully and made available alongside the data itself. Ideally, enough metadata should be present to allow someone to reproduce the data set if they choose.
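As an illustration of keeping provenance alongside the data, the following minimal Python sketch stores the metadata in a small record that can be written next to the data file. The field names, the helper functions, and the idea of a JSON sidecar are assumptions made for this example; the workshop did not prescribe a schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ProvenanceRecord:
    """Hypothetical metadata record published next to a data set."""
    title: str
    source: str               # who generated or owns the data
    acquired_on: str          # ISO 8601 date, e.g. "2019-06-30"
    acquisition_method: str   # conditions under which the data was generated
    license: str
    processing_steps: List[str] = field(default_factory=list)


def missing_fields(record: ProvenanceRecord) -> List[str]:
    """Return the names of fields that were left empty."""
    return [name for name, value in asdict(record).items() if value in ("", [])]


def to_sidecar_json(record: ProvenanceRecord) -> str:
    """Serialize the record so it can be stored alongside the data file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

A release process could call missing_fields before publication and hold back any data set whose record is incomplete.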
Anonymization. Some data may need to be removed from a proprietary data set to make it public. For example, any information regarding a person generally needs to be scrubbed.
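A minimal Python sketch of such scrubbing follows, assuming the data set is a list of records and that the personal information sits in known columns. The column names are hypothetical, and a real data set would need its own review, because personal details can also hide in free-text fields that this simple approach does not inspect.

```python
from typing import Dict, FrozenSet, List

# Hypothetical column names that identify a person.
PERSONAL_FIELDS: FrozenSet[str] = frozenset({"engineer_name", "email", "phone"})


def scrub_record(
    record: Dict[str, object],
    personal_fields: FrozenSet[str] = PERSONAL_FIELDS,
) -> Dict[str, object]:
    """Return a copy of the record without the personal fields."""
    return {key: value for key, value in record.items() if key not in personal_fields}


def scrub_dataset(records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Scrub every record; the proprietary originals are left unchanged."""
    return [scrub_record(record) for record in records]
```

Because the functions return new records instead of editing in place, the internal data set stays intact while the scrubbed copy is prepared for release.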
Publication. The publishing of a data set occurs when the data is uploaded to an internet service and made available to the public. Many research projects that are funded by taxpayers are required to make their data sets public, at least after the end of the research project. What is helpful in this context is for the platform to offer a time-delayed release (i.e., the authors would upload the data, but it would be available to the public only at some specified future date). The publishing platform must provide assurance that the data set will remain available over the long term to allow reproducible research.
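The time-delayed release described above can be sketched in a few lines of Python. The Upload class and both functions are hypothetical names used only for illustration; an actual platform would implement this rule in its own access-control layer.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Upload:
    """Hypothetical platform entry: the data is stored now, released later."""
    name: str
    uploaded_on: date
    release_on: date


def is_public(upload: Upload, today: date) -> bool:
    """A data set becomes visible on its release date and stays visible."""
    return today >= upload.release_on


def public_uploads(uploads: List[Upload], today: date) -> List[Upload]:
    """Return only the uploads that the public may see on the given day."""
    return [upload for upload in uploads if is_public(upload, today)]
```

Passing the current date as an argument, instead of reading the clock inside the function, keeps the rule easy to test.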
Digital Object Identifier (DOI). Data sets should be given a DOI so they can be referred to unambiguously and cited in academic literature.
Observations About Open Source
Duplication. Open-source projects need to observe the global community carefully to determine—on a regular basis—to what extent their development efforts duplicate existing projects. Sometimes, reproducing existing scientific results can be valuable and represent a contribution in its own right, especially if it is done in a different way.
Interoperability. It is important for open-source projects to work seamlessly with other software packages (both commercial and open) so that users can design a work flow that incorporates the project alongside other tools. This necessitates common file formats, data standards, and automation protocols.
Focus. Just like any project, open-source projects should focus on a core mission and not try to do everything from scratch. Functionality that is ancillary to the core mission should be integrated via interoperability with existing tools.
Observations About Open Access
Funding. Many projects that are funded by public money are required to publish their findings in a way that is freely accessible to the public. This requires either publication in open-access journals or, in addition to publication in a restricted-access journal, publication of the research on an open portal.
Paywall. Many papers exist behind a paywall. Often, a single paper may cost as much as a book. While this is not prohibitive for a single paper, it is effectively prohibitive for anyone doing thorough research needing access to dozens if not hundreds of papers. Paywalls stifle research.
Financial and Legal Considerations
Cost. Open projects are not free to create or maintain. While some people and organizations may donate some of their time and intellectual property to the project, even the smallest projects are not entirely devoid of financial cost. As a project gets larger, more resources are needed and more professional management is needed. Many of the large projects have at least one major sponsor if they were not created by a corporate patron in the first place. For an open project to be sustainable and to scale, it must find a source of revenue. The main source of revenue for open projects is in the form of donations, but direct revenues may be generated also.
Revenue. Even though open projects are usually thought to be free, this is not always the case. The business model of open projects often involves the desire to sell setup, maintenance, or consulting services on top of the free offering or to upsell additional proprietary software extensions on the basis of the free foundation. In this way, the user enters the ecosystem for free and is charged later to get help and more-advanced features.
License. All open projects are released subject to license terms. Many ready-made licenses are available. Many of these are well known to corporate lawyers, and choosing one of these options instead of writing one’s own terms is sensible.
Infringement. Every open project is available subject to a license agreement. If a user infringes on that license, damages may be owed. Such damages could be significant if the project is awarded indirect damages (i.e., a share in the profits of the infringing entity). Thus, it is worthwhile for an open project to carefully consider the license terms that it will adopt and to carefully observe potential infringements. It may be helpful to plant certain artifacts inside the source code of an open-source project to help identify or prove possible cases.
Ownership. Even early in a project, it is not always clear who owns a contribution. It may be the individual (if contributed privately) or the employer (if contributed as part of employment) or a group (if several parties contributed to a single whole). It can be important to have a record of all correspondence and contributions so that attribution can be made, even long after the fact.
Recommendations to Oil Companies
Benefits. Oil companies should think carefully about the potential benefits of open projects. Using open projects is common, but there are concerns over quality and maintenance. Large open projects typically have a strong community that is willing and able to provide maintenance and support, albeit at a price. Contributing to open projects is less popular but promises large rewards. Publishing open data can yield research results at no cost to the publisher. Contributing to open-source projects can lead to faster results and more-robust programs than developing them in house. Consider how to leverage open projects rather than how to box them out.
Process. Companies should institutionalize processes that lead to the open publication of data and code. Even though this may not always lead to a decision to publish, it makes it clear to all in the organization what is required to get there and whose permission is needed. This process typically has a committee that meets at regular intervals to review proposals and has authority to make decisions.
Consortia. Assets are sometimes owned by more than one oil company. The issue of open projects, particularly open data, should be discussed at the start of such consortia with the aim of putting an intercompany process in place to decide on the publication of the jointly acquired data. While publishing may not be the default, it should at least be an option. Publication has value to the consortium internally as well as in its relationship to the public.
Recommendations to Professional Societies
Platform. Create a publication platform across professional societies for open-access papers, open source, and open data. This will require standards and work flows to help authors release their content. It will also require staff to answer the questions that will invariably be asked by the community.
Policies. Professional societies should work on common policies and standards so that the community is encouraged to adopt them instead of going where the policy is most lax. These policies should cover prerequisites for publication, specifications of metadata for open-data releases, documentation for processes, and quality standards, among others.
Education. Educate oil companies, their employees, and students about the open ideas and the benefits of open projects. Actively engage them in the ecosystem by promoting the ideas, the platform, and the policies.
Licenses. Offer curated license options and access to legal advice for choosing the right licensing models as well as pursuing legal action against infringement.
Competitions. Run competitions or hackathons that are based on open data and yield open-source or open-access results.
Peer Review. Improve the peer-review process by allowing and encouraging authors to submit their data and code as supporting material for their papers and make them available for the long term.