
YP’s Guide to Data Science: How to Push Past “Where Do I Start” and Learn To Excel

This column is intended to provide a starting point and a roadmap for professionals who want to learn data science and are struggling with the question, “Where do I start?”

Time to Learn

When I learned on 1 April last year that I had been selected as a 2020 Energy Influencer, I first thought I had been a victim of a great April Fool’s prank. However, I later discovered to my delight that it was indeed true. In my live interview with The Way Ahead, I was asked if I had any recommendations for young professionals who are looking to go into the field of data science. Although I am admittedly not an expert in the field—if anything like an expert were to exist in such a rapidly changing domain—I am a data scientist who learned about the field without a formal degree and I come from a chemical/petroleum engineering background. I guessed I was at the same starting point as most of the young professionals in the oil and gas industry. My advice was to focus on applying data science in practice, specifically in developing industry cases.

After the interview, I received a lot of follow-up questions from young professionals all over the world with the recurring theme of “Where do I start?” I realized that there was a deluge of articles and resources online which, while broadening the potential avenues for learning, also induced decision paralysis. I suspected that young professionals would easily be overwhelmed.

I went through this journey myself as a student when I started learning about data science and discovered some resources, techniques, and books that were helpful. I have also recommended these resources to friends and colleagues and received positive feedback. This encouraged me to compile a library for young professionals to help them start their own data science journey.

This article is intended to provide a starting point and a roadmap for professionals who are struggling with the question, “Where do I start?”

I will start with a brief introduction to artificial intelligence (AI), machine learning, and data science. After that I will dive into the resources and the best ways to use them. My intent is to provide a short, useful inventory of resources, not a long list that would defeat the purpose of this article. All the resources I’ll mention are free (except the books) and none are affiliated in any way with me.

Quick Basics

AI, according to Webster’s, is the science of simulating intelligent behavior in computers. AI in its purest form is a distant target and a topic of ethical and technical debate; still, we as a society are living in the middle of an all-encompassing experiment to reach this goal. The effects of AI are everywhere: from the moment you wake up to the moment you go to sleep (and, with wearables, in some cases even while you sleep), you are surrounded by devices powered by some component of AI.

There are various ways to classify AI, but I will stick to the branch most relevant to professionals: machine learning. Machine learning enables machines to learn from data, identify patterns, and make predictions.
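To make "learn from data, identify patterns, and make predictions" concrete, here is a minimal sketch in plain Python. It fits a straight line to a handful of points by least squares and then uses that line to predict an unseen value. The data set (hours studied vs. exam score) and the function names are invented purely for illustration; real projects would normally rely on a library rather than hand-written formulas.

```python
# Toy data, invented for illustration: hours studied and the resulting exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Learning": estimate the pattern from the data.
slope, intercept = fit_line(hours, scores)


def predict(x):
    """Apply the learned line to a new input."""
    return slope * x + intercept


print(f"learned pattern: score = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted score for 6 hours: {predict(6.0):.1f}")
```

The same three steps (collect data, fit a model, predict) underlie far more sophisticated algorithms; only the model and the fitting procedure change.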

Data are the omnipresent fuel that powers AI and its components, but data in their raw form are not useful. The art of interacting with data to make them useful, create models, and interpret the results of these models is known as data science. Data analytics, by comparison, is more focused on analyzing the data and presenting them in an insightful way.

Tools of Data Domination

Three main skills create the backbone of data science:

  • Data storage and management
  • Data manipulation and modeling
  • Data visualization

These skills are used routinely by data scientists all over the world. For example, in my job as a data scientist, I regularly use a SQL Server database (data storage and management), queried with Structured Query Language (SQL), to access the data, and the Python programming language (data manipulation and modeling) to create models with those data. Based on the results of those models, I create visualizations and dashboards in Spotfire software (data visualization) that our customers use to make decisions. A good data scientist should be proficient in at least one tool from each of these three categories.

Data Storage and Management. In the modern workplace, data are stored in databases, which are organized collections of data. The systems required to interact with databases are known as database management systems (DBMS). Databases can be relational (SQL Server, MySQL, PostgreSQL, etc.) or nonrelational (MongoDB, Cassandra, etc.). Relational databases are table-based and have a fixed format, i.e., columns are predefined in a table and rows correspond to data points. Most scientific data can be stored in relational databases because of this structure. Nonrelational databases, on the other hand, are often document-based and well suited to unstructured data where there is no predefined relationship. For example, text-related data (chats, articles, news) can contain text, pictures, numbers, etc., so nonrelational databases are used to store these data.
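The difference between a fixed, table-based format and a flexible, document-based one can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not a recommendation of specific products: it uses SQLite (through Python's built-in `sqlite3` module) as a stand-in for a relational database, and plain JSON documents as a stand-in for what a document store holds. The table, column names, and records are hypothetical.

```python
import json
import sqlite3

# Relational: columns are declared up front and every row must follow them.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (sample_id INTEGER, depth_m REAL, porosity REAL)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [(1, 1500.0, 0.18), (2, 1520.0, 0.21)],
)
rows = conn.execute(
    "SELECT sample_id, depth_m, porosity FROM measurements ORDER BY sample_id"
).fetchall()

# A row carrying an extra, undeclared field does not fit the table and is rejected.
try:
    conn.execute(
        "INSERT INTO measurements VALUES (?, ?, ?, ?)", (3, 1540.0, 0.20, "note")
    )
    extra_field_accepted = True
except sqlite3.Error:
    extra_field_accepted = False

# Document-style: each record is self-describing and may carry different fields.
documents = [
    {"type": "chat", "text": "Pump 3 is vibrating again", "author": "operator_a"},
    {"type": "article", "title": "Field notes", "images": ["site.png"], "word_count": 850},
]
serialized = [json.dumps(doc) for doc in documents]  # what would be stored
restored = [json.loads(text) for text in serialized]  # what would be read back

print(rows)
print("extra field accepted by table:", extra_field_accepted)
print([sorted(doc) for doc in restored])
```

The table rejects anything that does not match its predefined columns, while the two documents coexist even though they share only the `type` field.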

Figure 1 simplifies the criteria for selecting a suitable database. The conventional and widely used databases are relational; nonrelational databases (also known as NoSQL) are growing in popularity, but they require more advanced skills. Therefore, relational databases are a good starting point for data storage and management tools. SQL is the language of relational databases. For young professionals, a working knowledge of SQL is a must-have skill.

Fig. 1—How to choose the right database for your data.
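For readers who have never seen SQL, the following is a small sketch of what a "working knowledge" covers: selecting columns, filtering rows, aggregating by group, and sorting. It runs against an in-memory SQLite database through Python's built-in `sqlite3` module, so nothing needs to be installed. The `production` table, its columns, and all values are hypothetical, and other database systems may differ in minor syntax details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE production (well TEXT, month TEXT, oil_bbl REAL)")
conn.executemany(
    "INSERT INTO production VALUES (?, ?, ?)",
    [
        ("A-1", "2021-01", 1200.0),
        ("A-1", "2021-02", 1100.0),
        ("B-7", "2021-01", 800.0),
        ("B-7", "2021-02", 950.0),
        ("C-3", "2021-01", 300.0),
    ],
)

# Total production per well, keeping only wells above 500 bbl, largest first.
query = """
    SELECT well, SUM(oil_bbl) AS total_bbl
    FROM production
    WHERE month >= '2021-01'
    GROUP BY well
    HAVING SUM(oil_bbl) > 500
    ORDER BY total_bbl DESC
"""
totals = conn.execute(query).fetchall()

for well, total_bbl in totals:
    print(well, total_bbl)
```

`WHERE` filters individual rows before grouping, while `HAVING` filters the grouped totals afterward; that distinction is one of the first things worth practicing.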

Data Manipulation and Modeling. These tools are the heart of data science. The two most widely used programming languages in data science are R and Python; both are open source and have very active developer communities that regularly add new functionality. In both languages, a variety of packages enable users to apply complex machine-learning algorithms without complex programming. Python has been widely used for the development of machine-learning applications, while R is very popular for statistical analysis. I have listed some of the popular packages in Figure 2.

Fig. 2—Popular packages in Python and R.
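As an illustration of how packages let users apply machine-learning algorithms without complex programming, the sketch below trains and uses a linear regression model in a few lines. It assumes the scikit-learn package is installed (it is not part of the Python standard library), and the choice of scikit-learn here is an assumption made for the example. The data are invented.

```python
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration: one input feature per row, one target per row.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 6.0, 8.0]

model = LinearRegression()
model.fit(X, y)  # the package handles the fitting math internally

prediction = model.predict([[5.0]])[0]
print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}")
print(f"prediction for x=5: {prediction:.2f}")
```

Swapping `LinearRegression` for another scikit-learn estimator generally keeps the same `fit` and `predict` pattern, which is a large part of why such packages lower the barrier for beginners.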

For both of these programming languages, a general text editor such as Notepad or Notepad++ is enough to write the code; however, applications known as integrated development environments (IDEs) are designed to make programming and troubleshooting easier. For Python, I would recommend the Anaconda software, which comes with two IDEs: Spyder and Jupyter Notebook. Spyder is a conventional IDE, but Jupyter Notebook has revolutionized the way people write and share code. Jupyter Notebook is a web-based application where you can combine code with text and instructions. It allows for cell-by-cell execution, which is useful for beginners. Anaconda also provides support for R, but I would recommend the RStudio IDE and the RStudio Resource page. In this article I will focus on Python-related resources.

Data Visualization. Data visualization tools, also known as business intelligence (BI) tools, are critical because they help data scientists convey their discoveries in an easily digestible way. There is a variety of packages available in R and Python to visualize data, such as matplotlib (Python) and ggplot2 (R), but some commercial tools are preferred in this domain because of their ease of use and performance. The most well-known BI tools are Tableau, Spotfire, and Power BI. Students can take advantage of free 1-year licenses for Tableau and Spotfire.

Where Do I Start? Recommended Resources

This section will help you answer, “Where do I start?” The resources are categorized based on various skill sets and competence levels.

Basic Programming Methods and Logic

For programming in general, I would recommend the lecture series by Professor John Guttag at MIT where he teaches the basics of programming in Python. You can find his video lecture series, Introduction to Computer Science and Programming, on YouTube, along with transcripts, online textbooks, assignments, and exams.

Python Programming

  • Corey Schafer covers basics such as how to install Python using Anaconda as well as advanced concepts like object-oriented programming. He also covers tutorials about various packages and GitHub. Highly Recommended.
  • freeCodeCamp.org provides a comprehensive guide to Python programming and a lot of data science-related topics. The group also offers tutorials for cloud computing and SQL programming.

Theoretical Machine Learning

Applied Data Science and Machine Learning

These resources require a basic understanding of Python programming.

  • MLCourse.ai–This machine-learning course by OpenDataScience (led by Yury Kashnitsky) features videos, practice sessions, and assignments. Kashnitsky shows the workflow that you can practice in assignments with real data sets. Highly Recommended.
  • Python Engineer Courses–Very good collection of Machine Learning from Scratch videos.
  • freeCodeCamp.org–This group also has a collection of videos on the application of machine learning.

Neural Networks and Deep Learning

  • Deeplearning.ai–Another very good course by Andrew Ng focused on neural networks and deep learning. Use this resource once you have mastered Python programming and understand machine learning basics.

Data Visualization

Python-related visualization resources can be found in the resources listed above. Commercial software vendors maintain good documentation and user communities; those are the first places to check for specific questions. I’d recommend the following materials for beginners in Spotfire and Tableau. I have not used Power BI, so I’ll refrain from recommending resources for that BI tool.

Tableau

  • Edureka–Comprehensive video for beginners in Tableau.

Spotfire

Inspirations and Quick Fixes

The following subreddits on Reddit.com offer very good recommendations for self-learning.

If you have time to invest, go to kaggle.com and create an account. Kaggle is a very good practice and learning tool with free courses, and it hosts data science competitions with prize money.

If you are in the middle of coding and need to find a quick fix, I would recommend these two resources.

Books

  • Automate the Boring Stuff with Python by Al Sweigart is a great resource that teaches how to use Python to execute mindless tasks.
  • Everybody Lies by Seth Stephens-Davidowitz shows how Google search data can be used to unearth insights that are not evident. This book illustrates the application of data science in social science.
  • Weapons of Math Destruction by Cathy O'Neil identifies the drawbacks of machine-learning models that were developed without proper understanding of the problem and models. This book highlights how AI and ML should be used responsibly.

Conclusion

Recent trends suggest that machine learning is going to be an integral part of the industry. Data science skills will equip young professionals to lead this transformation. Once you have mastered the basic data science skills, consider how you can apply them in our industry. The key is to identify where and when to apply machine learning. These resources should help young professionals start their journey and shape their careers.