TWA's Discover a Career section editors interviewed Dr. Satyam Priyadarshy, a technology fellow and chief data scientist at Halliburton. He is also the managing director of Halliburton's India Center. He is often recognized as the first chief data scientist of the oil and gas industry. Forbes India named him one of the top 10 outstanding business leaders. His work and profile have appeared in Chemical & Engineering News, The Scientist, Silicon India, Oil Review Middle East, Petroleum Review, Rigzone, and Forbes, among others. He is the recipient of many industry accolades and is one of the most sought-after speakers in areas such as emerging technologies, quantum mechanics, cybersecurity and blockchain, big data, and digital transformation. The Society of Petroleum Engineers named him a 2021–2022 SPE Distinguished Lecturer. He serves as adjunct/visiting faculty at several universities and institutes in the US and India and sits on the advisory board of several emerging entrepreneurial companies. He holds a PhD in quantum/theoretical chemistry from IIT Bombay and an MBA from Virginia Tech.
Dr. Priyadarshy shares his expertise and advice in the following transcript of the Q&A conducted by Aman Srivastava, Mohamed Mehana, Prithvi Singh Chauhan, Oyedotun Dokun, and Mrigya Fogat.
Mrigya: We are all fortunate to live in a time when information and knowledge are so accessible to us. We are also witnessing a transition in the energy industry, especially digitalization, artificial intelligence, and data science. Dr. Satyam, what are some resources you would suggest for young professionals to keep up with the trends in artificial intelligence and the advancements in data science?
Dr. Satyam: Great question. You see, the field has been around for a long time. First and foremost, especially for oil and gas, people already in the industry or who want to get in should learn to read SPE papers. There is a lot of knowledge there, hidden in various shapes and forms. There are two reasons for it:
1. The applications that have been done, and why they have been successful or not. When you are trying to formulate your problem, you can avoid making the same mistakes. We live today in the era of so-called scalable computing. If somebody addressed a problem in our industry 10 or 20 years ago that could not be scaled at the time, we can now scale that same problem. And that is the value of AI and data science.
2. Another thing to remember: data science basically means "science on the data." Oil and gas is a science- and engineering-based industry.
Now, how do you keep yourself updated about the latest and greatest in other industries? That's where the gap often happens, because we are in the weeds trying to solve one problem, forgetting what's happening in the outside world. For example, look at data integration in the medical space. Drug discovery in earlier days was based on two kinds of data: the molecule they were trying to attach and the disease they were trying to address. But today they can seamlessly connect multiple datasets, like proteomics and bioinformatics, including the EHR (electronic health record), which you can think of as the equivalent of our daily drilling report (DDR) or daily briefing report. What that industry has done in terms of integrating data is what is missing for us in many cases. The second area is to really understand what is happening in terms of algorithms. Do we really need to apply the most complex algorithm to a problem? Why are people using it? Not just saying, "Oh, I read this paper on deep learning, let's go apply it to reservoir simulation." What are we going to gain? Understanding the foundation of the algorithm and the problem at hand is key.
So, in terms of resources:
1. SPE papers (and those of other associations such as SEG, AAPG, etc.; they all have publications). JPT does a nice monthly summary of articles, including good review articles. OnePetro, an online multisociety source of technical papers, contains contributions from 20 publishing partners and provides access to over 200,000 items related to upstream oil and gas.
2. The second set of papers you want to read is from IEEE (Institute of Electrical and Electronics Engineers). They give you a wider range of applications of a given algorithm to an optimization problem, or an innovative solution to a complex problem.
3. Avoid some of the popular magazines that stay at an abstract layer and do not talk about the science. They can create confusion by saying artificial intelligence can increase production by 99%, and we as engineers ask, "99% with respect to what?" Those are nice headlines, but not a way to learn.
Mohamed: Hi, Dr. Satyam. Among young professionals, data science is sometimes perceived as a Rosetta Stone, something that can solve any and all of our problems: just apply data science or machine learning techniques and everything will be okay. What is your advice to young professionals about the different techniques of data science and the approaches they should take when applying any of them?
Dr. Satyam: Great question. As I said earlier, data science is equal to “science on the data.” All of us cannot be experts on everything. Or we may not have an aptitude for everything in this scheme of things. For example, some people may be very good at looking at data itself. Like how do I get the data from the field? Aman can really talk about it more. How is data stored in the field? How does it connect to it? How do I gather it and how do I actually bring it into shape and form so that I can actually learn and run some applications or algorithms on top of it? Other people might say let us go build our models on it. And then some people may just love using one or two models because they are so expert at it. But other people may want to explore more than one model or hundreds of models and spend all their lifetime on it. So, there are different aptitudes.
And then of course there is actually looking at the patterns that come out of the data, which is the hardest part of all. You can see something (in charts), but you can't always interpret it. And that is the skill all of us really need in this industry, because it is what tells us where we can improve. It doesn't come just by reading books; it comes by practice and by discussing with the domain experts, especially if you don't have a background in the domain (going and showing them charts, time-series charts, or very complex three-dimensional charts). Discussing with the domain experts why you see this from this kind of data or this kind of algorithm gives you the interpretive power you need. Then of course, once you learn to interpret these patterns, you can start converting the problem into a solution and telling a story. That is how to address a problem.
One of the most confusing things you will hear in our industry is everybody chasing model accuracy. What does that accuracy mean? Do I want a 100% accurate model and spend years building it, while in the meantime all the failures we have in the industry go undetected? Let's take a simple example: an operator with 100 pumps running, where typically, say, 50% of the pumps fail every year. Now, assume you have all the data for those failures for the last several years, and you build a very nice algorithm to predict failure based on what you have, but your accuracy is only, say, 60%. Would you implement that solution, or would you wait and keep improving until you get to 100%? If my model is 60% accurate in predicting failures, that means out of those 50 failures, we can predict at least 30. Going after 100% means predicting all 50. Now, if I can prevent those 30 failures, have I not created value? So we should look at data science in terms of value generation, not accuracy.
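Editor's note: a back-of-the-envelope sketch of the value argument above, in Python. The pump counts and recall follow Dr. Satyam's numbers; the cost per failure is an illustrative assumption.

```python
# Value of an imperfect failure-prediction model.
# Pump counts and recall follow the example above; the cost figure
# is an illustrative assumption.
pumps = 100
annual_failures = pumps * 0.50      # 50 pumps fail in a typical year
model_recall = 0.60                 # model catches 60% of those failures
cost_per_failure = 100_000          # assumed cost of one unplanned failure, USD

prevented = annual_failures * model_recall       # 30 failures caught
value = prevented * cost_per_failure

print(f"Failures prevented per year: {prevented:.0f}")
print(f"Value created per year: ${value:,.0f}")
# Waiting for a "perfect" model forfeits this value every year it sits unused.
```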
Another way to think of it: with clothes, you can buy one-size-fits-all, but if you really want a custom fit for the same t-shirt, you are going to pay a lot more. How much do you want to invest?
Some of you may have heard me in other forums. I always say that you have to do data science in phases.
● Data. Whatever data we already have, we can't do much about it; learn how to really look at that data and come up with a plan in terms of value. One part is the value of actually addressing the problem; the second is how to improve the quality or quantity of the data going forward. We can't change the past, but we can change the future. In fact, I say let's convert the so-called black or dark data into smart data: put in more sensors, or collect the data at a higher frequency. But you can't change the past.
● Models. Refine your model as you get more and more smart data; the data part is complex, and model building is again iterative. Remember, as you iterate on models, you increase your value proposition.
● Interpretation. Learn to interpret with the domain experts.
● Storytelling. Address the problem we started with: have we understood it? Have we found a solution? To what extent, and how much value is it going to generate for us?
So, these four components are very important for us. Remember, one of the things that I always say: You can think of data science problems in three ways.
1. Optimize something using the data.
2. Automate.
3. Or build a completely innovative solution (innovate).
In our industry, lots of optimization can be done in almost everything, from a simple workflow to a very complex workflow. And when you optimize a workflow, you are optimizing the data as well, because workflows run on data.
Then automate. That means you have created a great amount of smart data. If you are on social media, whatever you type goes to your friend immediately; we assume that what we're typing is correct and that the other person will interpret it correctly. Can we do that kind of work for our industry? Write an algorithm that will just run, give us the insights we want in a nice layout, and give us a solution for the problem. Automation means you must address the predictive part, and you will have addressed a little of the prescriptive part too; for the prescriptive part, you need domain experts. Now, if you have automated something and it is not going to grow, that means you are not really able to automate, because there are gaps. That's when you start building innovative solutions.
Mohamed: I really like the way you summarize it in three items: optimize, automate, or innovate. That leads to the next question. In your opinion, what are the biggest problems in the oil and gas industry that could use machine learning or data science? Can you give us one or two examples from our industry?
Dr. Satyam: I don't know what the single biggest problem is. At the highest level, it is producing oil and gas at the lowest cost possible with the highest safety. We have to look at all aspects of it: the exploration side, the drilling side, completions, production, management, the supply chain. Pretty much every area is open. The reason it's open is that the industry claims to have data, but even today that data is not completely and effectively used. I'll come back to the first question you asked: where can you learn? This is where the industry has to come together. Companies talk about getting the talent and teaching these skills, but they have to give the data too. Most companies will say to go work on a project with synthetic data, and we all know that synthetic data does not tell the real story. If you really want talent that is ready to work, the industry-academic partnership must become stronger. So that itself is a problem that needs to be addressed.
One of the best examples that I love talking about is fault interpretation. We are all very happy with our smartphones because, when we take a picture of a group, all the faces are recognized and boxed. Twenty years ago, that was not the case. A system could recognize only five people in the world, because its data library held only 20 people from California, and all the models were trained on those 20 pictures. Today, the data behind face recognition technology is significant. Now, some of you have worked in the industry a long time. This industry must have done fault detection and fault interpretation for 50 years. Where is the library? If the industry is smart, it would create a library; it doesn't have to give out all the data related to the subsurface, but it can give the shape and size of the faults. Then, when you have a new volume, you can run it against the library and ask how many of these faults look like fault number 1234. Automating fault detection could reduce the time of that workflow. So you can see they're all interrelated: optimize, automate, and innovate.
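Editor's note: a toy illustration of the fault-library idea, assuming each interpreted fault has been reduced to a small descriptor vector (say, dip, strike, length, throw). The names, features, and values here are hypothetical; a real library would normalize features and use far richer geometry, but the retrieval pattern is the same.

```python
import numpy as np

# Hypothetical fault library: each fault reduced to a descriptor vector,
# e.g., [dip (deg), strike (deg), length (km), throw (km)].
library = {
    "fault_1234": np.array([65.0, 120.0, 3.2, 0.40]),
    "fault_2071": np.array([30.0,  95.0, 1.1, 0.10]),
    "fault_3356": np.array([70.0, 118.0, 2.9, 0.35]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Crude resemblance score between two descriptor vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A newly detected fault from a fresh seismic volume:
new_fault = np.array([68.0, 119.0, 3.0, 0.38])

# Rank library faults by resemblance to the new one.
for name, vec in sorted(library.items(),
                        key=lambda kv: cosine_similarity(new_fault, kv[1]),
                        reverse=True):
    print(name, round(cosine_similarity(new_fault, vec), 4))
```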
At a higher level: how fast can I go from the planet to the pore? Or think of this problem: how can a geophysicist be a pilot? The whole interpretation is done by the algorithms, but at the final stage the geophysicist says, "I will take these three recommendations." Human cognition doesn't go away, but the algorithms can help.
There are so many problems in drilling. Some of you may have heard me speak about how, despite the number of papers published on drilling performance, we still don't have a good predictive model for it, let alone the prescriptive part. If you think about it, if we looked at historical data, and if people really democratized that data, we could build a very powerful model, because you can learn from what was missed in previous projects. It doesn't impact any operation going on right now; when you work with historical data, you're not adding any risk to a project. But it gives you lots of insights, and those are insights you can apply to current or future operations. This learning part is a big component of the challenge we have. Most of us are engaged in projects where everybody wants an answer today, but nobody has looked at the historical data.
Aman: That's very true. We are coming close to the 25-minute mark, so I want to take a moment here to hear others' thoughts as well. Dokun, do you have any comments or questions?
Dokun: Thanks for such an insightful session. It has been more of a learning point for me, so at this point I don't have any questions; my colleagues can ask theirs. What I have is a commendation for the insights you shared. I'm sure our readers and listeners will be tremendously helped by the points you've made today. Well done, and thank you so much.
Mrigya: Thank you so much, Dr. Satyam. This was so informative. It's always good to hear from people who see the big picture, because as YPs we are each involved in a particular project at this stage and don't get to see the whole picture of what everyone is thinking across the different domains of the oil and gas industry. In that way this was really insightful. While hearing you speak, I think everyone was having ideas: yes, we should work on these things. The moment you spoke of the library, a lot of people thought, yes, a library! Thank you so much.
Mohamed: I just want to echo what Mrigya said about all the insights. I really enjoyed what you shared, and I wish we could continue talking, but unfortunately we have limited time.
Prithvi: I don't have a question, but a thought. As a student (I did my bachelor's and am now doing my master's), until I get into the industry, there is a big problem with not just the lack of open-source data, but also of open-source applications, code, and algorithms. For example, elsewhere you have GitHub repositories with all the documentation of how something is implemented at the code level. This is something I find difficult as a student, particularly in the oil and gas industry: we might have data like the SPE10 and Bruges models, but the application part, how it's built from scratch, is missing. It's very difficult to find such applications where we can learn line by line and build something on top of them. That's probably because of the confidentiality around everything. But I think that ecosystem grew because it was open source with everything, not just data.
Dr. Satyam: That is true. We see this explosion of algorithms, tools, and technology because of the open-source revolution, starting with Linux and the large number of libraries that are available. Has the industry adopted it? Yes, to an extent. Is it enough? Maybe not. One of the classic studies that all of you should read is about Rio Tinto, a mining company. They took their seismic data and opened it to the crowd (experts, of course); people could look at that data even though it's proprietary. Another is the well-known case study of the Netflix competition, though that was a competition: they extracted the data, released it for a specific task, and it became very popular.
But for our oil and gas industry: take some well, in some country, that has been completely abandoned by now. That data exists in some shape and form. If somebody were to look at it and find something interesting, say, that the well was abandoned 5 years before it should have been, that would be interesting information. And at least people would have access to what happens in the field. So I am on a mission to educate people about that, and you can be part of it. Democratizing information within the right governance rules is possible, to an extent. I'm not saying that everything can be democratized. But we can have strong governance rules and signed NDAs (nondisclosure agreements); if we give data to professors or students, they sign the NDA. There are checks and balances, and within those checks and balances, I think the industry can move a little bit faster.
Aman: I think Dr. Satyam has covered pretty much everything I wanted to comment on from my side. Whenever a data science task comes up, it goes to someone hired as a data science engineer, to a domain person asked to work with the data science team, or to someone like Mrigya, who already has domain knowledge and is working as a data scientist. There should be a very clear understanding of what we are trying to achieve by the end (optimize, automate, or innovate). You should not synthesize the data so much that it becomes extremely idealistic and makes no sense anymore. Or you end up with a 'duh' answer: all the HSE reports are compiled together, an NLP (natural language processing) algorithm is run on them, and it concludes that we should always wear hard hats on the rig site! That has been known for the past 50 years. You spend $4 million to come up with an answer people have known for the past 50 years.
Dr. Satyam: Let me make two comments here. One of the other resources that a budding data scientist who wants to grow in this area should regularly look at is Kaggle competitions. It's a great place; sometimes very complex problems are posed there. They may not directly relate to us, but they help you understand what's happening in the world. This didn't exist a few years ago; it started as a competition platform, and now you can actually play with the problems. It's a great place to see how people formulate a problem statement with the data. One of the challenges is how to formulate the problem; that is what we were talking about earlier: why are we doing it? Looking at how other industries formulate their problems can help us a lot.
The other comment is on NLP. In our industry, most of the unstructured data has nothing to do with NLP; there is no natural language. You have to be really careful. Look at SPE papers titled "NLP": which part is NLP? NLP presumes sentences. How many of our reports contain a sentence with English grammar? We're not really doing NLP; we are taking the best practices of NLP and applying them to a completely fragmented technical language full of abbreviations, which only makes things worse. Hence I have always said that data science, at least in our industry, is a combination of three people: a typical data scientist, a domain expert, and the businessperson; then you get some output and value. Otherwise, Mrigya can run NLP algorithms on drilling reports and she will say, "What is TRQ (torque), which is written 25,000 times?" We have to be really careful about using these terms, and I keep making these suggestions to SPE as well, because when the title of a paper says NLP but there is no NLP, it can be misleading.
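Editor's note: a minimal sketch of the preprocessing this implies, expanding domain abbreviations before any off-the-shelf text tooling is applied. The glossary and report fragment are hypothetical.

```python
import re
from collections import Counter

# Hypothetical glossary of drilling-report abbreviations.
glossary = {"TRQ": "torque", "WOB": "weight on bit",
            "ROP": "rate of penetration", "SPP": "standpipe pressure"}

report = "TRQ high at 2300 hrs. WOB 25 klbs, ROP dropped. TRQ spiking again."

# Fragmented report language is dominated by tokens like TRQ,
# not grammatical English sentences; count them first.
tokens = re.findall(r"[A-Za-z]+", report)
print(Counter(t for t in tokens if t in glossary))

# Expand abbreviations so downstream text tools see real words.
pattern = r"\b(" + "|".join(glossary) + r")\b"
expanded = re.sub(pattern, lambda m: glossary[m.group(1)], report)
print(expanded)
```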
Mohamed: Dr. Satyam, since you touched on two points, I want to hear more about your opinion on data-driven versus physics-based models and your take on using synthetic data to build a data-driven model.
Dr. Satyam: In the very beginning I said that we are a science- and engineering-based industry. So when you do science on the data, you can't ignore the science and engineering. Take a data scientist like Mrigya: you give her the data and tell her only high-level information, these are the feature sets. She may not be told the applicable ranges of values for a given feature A. So she's going to build a model, look at it, and get some results, and all of you can look at them. But Aman, who is a domain expert, says, "Oh, this is not possible, because feature A cannot have these kinds of values."
But if you flip the process, Aman looks at the data and says, "In this dataset, I don't like the values in feature A; delete them." Now, when Mrigya builds a model, it will be what Aman wants. That is not right, because the data we get from the field is the real data. When we look at that data and there are points that are not right according to the domain expert, you tag them: "Aman says this is not correct." You now have two models, one with and one without the data. Compare both to what happened in the real world. You should keep that value for feature A even though theoretically it is not possible.
You can play with completely data-driven models, but you don't take that learning straight to the field. You work with domain experts, understand the underlying science, and then refine the model to reflect it. In fact, the term I always use is "augmented data science": you augment the data science with science, or physics, on the data, especially in our industry. This is very important.
A lot of the hype and progress in data science came from internet companies, because they were the ones generating data. I used to work at one of them, so I know where the origins of scalable AI came from. There is no physics there; it's all behavioral models, with large amounts of data of the same kind. We don't have that in the oil and gas industry. For example (Aman can probably correct me), one sensor might send temperature data every 10 minutes, while in the same process the pressure sensor might send data every 3 minutes. Now you have to make sense of that. And we all know the problem: in the field, all of this data is collected from various places and aggregated on a server, and that server's clock may not match the sensors' clocks. If you literally go by a purely data-driven approach, this is a hodgepodge of data. You have to understand the physics behind it; only then can you say, for example, that after 10:03 a.m. the pressure values should be taken twice and the temperature three times. Without understanding the physics, you are just connecting random numbers. This is why it takes more time in our industry: you don't have a server log that you can simply run through a model. That's why I say you can't produce results in the short time some people expect, because understanding the data is critical when we don't have a catalog of the data and its properties.
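Editor's note: a sketch of the alignment problem described above, assuming a temperature channel logged every 10 minutes and a pressure channel every 3 minutes. Column names, rates, and the resampling choices are illustrative; the point is that the choices come from the physics, not from the data alone.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical field channels with mismatched clocks and rates.
t_idx = pd.date_range("2023-01-01 10:00", periods=7, freq="10min")
p_idx = pd.date_range("2023-01-01 10:03", periods=21, freq="3min")
temperature = pd.Series(rng.normal(80, 2, len(t_idx)), index=t_idx)
pressure = pd.Series(rng.normal(3000, 50, len(p_idx)), index=p_idx)

# A naive timestamp join is mostly NaNs: the clocks rarely coincide.
naive = pd.concat({"temp_degC": temperature, "pres_psi": pressure}, axis=1)
print(naive.isna().mean())

# Physics-informed alternative: resample onto a common grid, deciding
# with a domain expert whether each channel should be averaged,
# held, or interpolated over the window.
grid = pd.concat({
    "temp_degC": temperature.resample("10min").mean().interpolate(),
    "pres_psi": pressure.resample("10min").mean(),
}, axis=1)
print(grid.head())
```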
Now to your question on synthetic data. I am using names here because all of you are in front of me. Aman can generate a perfect dataset for you: he can say that, for this process, pressure can range from one to ten, and do the same for temperature and the other variables. When you build a model on this dataset, the model is going to look really great, but what have you learned? Nothing, because he generated data with the output you wanted. It only made the model look good.
This is where the challenge is for industries. They have data; it's not that they don't. Some people might say, "We have bad-quality data" or "We don't have complete data." How does that matter, as long as it is real data? We can learn from it. The model may be only 40% accurate for something, but now you understand what the gaps are. With synthetic data, you build a 100% accurate model, take it to the field, and get 10% accuracy. How will you compare? So synthetic data is good for learning, but not for understanding the science.
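Editor's note: a toy demonstration of the synthetic-data trap, under assumed numbers. A model fit on a perfectly clean generated dataset scores near 1.0 on that data, then degrades on "field-like" data containing noise and a regime the generator never imagined.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# "Synthetic" data: generated from the very relationship we then fit.
X_syn = rng.uniform(1, 10, size=(200, 1))
y_syn = 3.0 * X_syn.ravel() + 5.0          # perfectly clean response

model = LinearRegression().fit(X_syn, y_syn)
print("R^2 on synthetic data:", model.score(X_syn, y_syn))   # ~1.0

# "Field-like" data: sensor noise plus an unmodeled regime shift
# (both illustrative assumptions).
X_fld = rng.uniform(1, 10, size=(200, 1))
y_fld = (3.0 * X_fld.ravel() + 5.0
         + rng.normal(0, 8, 200)
         + 15.0 * (X_fld.ravel() > 8))
print("R^2 on field-like data:", model.score(X_fld, y_fld))  # much lower
```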
I'm not talking about the e-commerce industry or marketing. It was very easy to generate synthetic data there: how many things Mrigya is buying, how many things I am buying; I can just generate it. That's a different industry, a different problem.
What we have to look at is: what is my subsurface like, what is my drill bit, what is my fluid? The fluid used in Kuwait, for example, is different from the one used in Oman. Will synthetic data give you that? We have to be a little careful about these things. People use these terms without, I would say, thinking seriously: "Let's go take synthetic data." Academics do that quite often, and that's why it's very difficult for people coming out of academia to think, now what? This is why I keep promoting that the industry has to come together so that the students and people who are learning get in touch with real data.
Aman: We touched on this data science part in the last panel discussion as well. I think it was Mrigya who said that many times, when she looks at data for reservoir engineering, being a domain person she can tell that certain pressure values make no sense. She can say that because she's from that background, and having that background knowledge (the physics) is essential. So for all the YPs out there, and this is just my opinion: before you step into the world of data science, make sure your understanding of your own domain is solid.
Dr. Satyam: One of the things I always say is that real data is the only single source of truth. If any person, any engineer, any scientist modifies that data, even a comma or a full stop, the data is not true anymore. What you can do, when the data has values that are unacceptable or not possible, is tag them. Tagging technology is very old: say, "column C, value 51 is unsafe or not possible." When you build your model, you can exclude the tagged values. You're not modifying the data; you're only putting a condition on it: this value has been tagged by such-and-such engineer, who thinks it is not possible. We can have one model where Aman has tagged the data and a second model where Prithvi has tagged some data. Now you can compare what happens with those two sets of values removed, and you have some learning. If, for some reason, you think a value was not possible, but the model that includes it actually matches what happened in the real world, then the question becomes: why is that happening? Why did you think this value was not possible?
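Editor's note: a minimal sketch of the tagging pattern Dr. Satyam describes, with hypothetical column names and values. The raw data is never modified; models are built over views of it.

```python
import pandas as pd

# Hypothetical readings; one value a domain expert flags as impossible.
df = pd.DataFrame({"feature_A": [3.1, 2.9, 51.0, 3.3, 3.0],
                   "target":    [10.2, 9.8, 60.5, 10.9, 10.1]})

# Tag, don't delete: the raw data stays the single source of truth.
df["tag"] = ""
df.loc[df["feature_A"] == 51.0, "tag"] = "not_physical:tagged_by_aman"

full_view = df                        # model 1: everything included
filtered_view = df[df["tag"] == ""]   # model 2: tagged rows excluded

# Fit the same model on both views and compare against field outcomes;
# if the "impossible" rows improve agreement with reality, revisit the
# assumption that they are impossible.
print(len(full_view), "rows vs", len(filtered_view), "rows")
```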
A classic example: let us say you have a pump, and typically the manufacturer says the pump will operate under certain weather conditions (a temperature range of -20℃ to +50℃). Now imagine that pump has been deployed in Nevada, the temperature goes toward 50℃, and the pump has already failed. By the manufacturer's definition, the pump was supposed to work fine; it failed within the range it was rated for. Someone might say the temperature was really 51℃, but that's not what was recorded. Say it failed at 48℃. As engineers, we might claim that is not possible, because the pump was supposed to operate up to 50℃; there must be something else going on. But if it happens regularly, every day or every week, and as an engineer you ask to ignore that data, the model will never pick it up. If instead you look at the data closely, you may say, "Look at that: every Monday, whenever the temperature reaches 48℃, the pump fails. There is something wrong with the pump, because it fails 2℃ below the manufacturer's rated range." You can tag values as possible or not possible, but build the model with the raw data.
Aman: I guess there has to be intelligence even if you are getting data from a real source. What you're excluding has to have a reason.
Dr. Satyam: It has to have some tacit knowledge from the field. So, you excluded it and then you build models with and without exclusion and then compare.
Aman: In WITSML, for example, if data is wrong, the system writes a large negative value (a very strange, large negative value). If it is RPM we are talking about and the data comes in negative, it is easy: wherever there is a negative, tag it as a bad data point so you can ignore that part of the data. But then the logic comes in: what do you do with the ignored part?
Dr. Satyam: But in this case the ignoring has been forced on us, because that's how the technology existed. There is an improvement area: going forward, do we really have to write in that negative number? That's a different value proposition. For a particular process you may find that 40% of the data is negative; we assume it will be negative, and we force that value in at the point of creation or ingestion. Maybe the future is that you analyze six or seven projects and observe that most projects done in south Texas have 40% negative values, while in Oklahoma it is only 10%. Now you can say there may be something else going on here. So let's try to collect the real data rather than write that negative number in at the point of ingestion. Then you change the quality of the data that's coming in.
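Editor's note: a small sketch of handling a WITSML-style negative sentinel in an RPM channel, and of treating the bad-data rate itself as a signal, as suggested above. The sentinel value and readings are illustrative.

```python
import pandas as pd

# Hypothetical RPM channel in which the acquisition system writes a
# large negative sentinel for bad reads (illustrative value).
rpm = pd.Series([120.0, 118.5, -999.25, 121.0, -999.25, 119.8])

bad = rpm < 0                       # RPM can never be negative
print(f"Sentinel/bad samples: {bad.mean():.0%}")

# Mask rather than silently drop: positions are kept, and the bad-data
# rate can be compared across fields, rigs, or vendors.
rpm_clean = rpm.mask(bad)           # sentinels become NaN
print(rpm_clean)
```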
Remember, we have to do all these things in terms of investment. The business side is always there. As a scientist, you want to optimize everything and perfect the data quality and volume, but the business will say, "I don't want to invest in that." It is a balancing act.
Aman: How much time are you willing to spend on every line of data? How much time are you willing to spend on preprocessing before you actually move ahead with modeling?
Dr. Satyam: We have to spend a lot of time looking at the data and understanding the data. Modeling comes almost at the end of the process.
Mrigya: I would like to add to what you said about the values that subject matter experts remove from the dataset, or that the data scientist ends up removing. This is something I learned while working on a project with more-experienced data scientists. In that process, I was removing the outliers, but it was a very uncertain and complex dataset, like the subsurface. That's where the subject matter expert advised me that sometimes these outliers and anomalies are exactly what we're looking for; that's where the resources are. So as a data scientist, don't remove them without consulting a subject matter expert who knows the field.
Mrigya: We have to close the session, but I think we all still have a lot of questions. Thank you so much, Dr. Satyam, for sharing your opinions and your thoughts with us. I'm sure that for all young professionals like us who look up to you, this gives a good perspective. We see that there's a lot to be done, because this was a holistic discussion and we touched on many topics: technical ones, workflows, and the principles of democratization of knowledge, which was really important. We will be releasing the transcript of this interview and the recorded video. Once again, thank you for your time, and thank you to the entire Discover a Career section team. We look forward to many more such fruitful sessions. Till then, stay happy and keep learning.
TWA's Discover a Career Team
Aman Srivastava is a product owner at Landmark Halliburton. He holds a bachelor's degree in mechanical engineering and a master’s degree in petroleum engineering.
Mohamed Mehana is a staff scientist in the computational earth science group at Los Alamos National Laboratory. He holds master's and PhD degrees from the University of Oklahoma.
Prithvi Singh Chauhan is pursuing his master's degree in petroleum engineering at Texas A&M University. He holds a bachelor's degree in petroleum engineering from IIT-ISM Dhanbad.
Mrigya Fogat is a data scientist with Halliburton. She holds a BTech degree in petroleum engineering from Rajiv Gandhi Institute of Petroleum Technology in India.
Oyedotun Dokun is a business development professional at OES Energy Services. He is a mechanical engineer and an International Well Control Forum-certified (Level 4) drilling well control supervisor.