I am glad to have you back. In Part 1, we explored the linear regression model, which uses a linear activation function. In Part 2, we will dive deep into nonlinear activation functions and understand why we need them. By the time you complete this part and Part 1, you will have the background needed to understand deep learning from scratch, which will be presented in Part 3 of this series. Let’s begin.
Previously, we covered the linear activation function, ŷ = wx + b, which can only capture linear relationships and performs poorly when applied to nonlinear functions. That’s why we need nonlinear activation functions, which help machine learning models fit complex, real-world data.
Before delving deeper into the different types of activation functions, it's important to note that in Part 1, we discussed linear regression models with one layer, as illustrated in Fig. 1.
This model has an input, X, and its corresponding output layer. A circle denotes a neuron, so a single circle means a single neuron. Here, ŷ is the output of the last layer (the final output). However, when we have hidden layers (layers between the input and the output), it's important to understand the notation used to represent multiple hidden layers. In a model with multiple hidden layers, the output of the first hidden layer is denoted by a1, which then becomes the input to the second hidden layer. Similarly, the output of the second hidden layer is denoted by a2, which becomes the input to the third hidden layer, and so on. The output of the last layer is the final output and is denoted by ŷ (Fig. 2).
Additionally, we previously denoted the linear activation function by ŷ (ŷ = wx + b), but from now on we will denote it by z (z = wx + b). We then pass this linear output through a nonlinear activation function, denoted by g(z) and read as "g of z." Here, g represents the nonlinear activation function and z represents the linear activation function. The result is the output of the layer, denoted by a (specifically a1, a2, a3, etc., according to the corresponding hidden layer). That’s a lot of new notation to take in. To summarize, the new notations introduced in this section are:
- z is the output of the linear activation function.
- g is any nonlinear activation function (discussed below).
- g(z) is the output after the nonlinear activation function g is applied to the linear output z.
- a(n) is the output of the nth hidden layer. Mathematically, it is a = g(z).
- ŷ is the final output (output of the last layer).
We will discuss this in more detail in Part 3. For now, take a moment to visualize Fig. 3 and familiarize yourself with the concept of multiple hidden layers.
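To make this notation concrete, here is a minimal sketch in Python (the input, weights, and biases below are made-up numbers, and ReLU is used only as a placeholder for g, which we cover next):

```python
import numpy as np

def g(z):
    # Placeholder nonlinear activation function (ReLU, discussed below)
    return np.maximum(0, z)

# Made-up input and parameters for a model with two hidden layers
x = np.array([1.5])             # input X
w1, b1 = np.array([0.8]), 0.1   # first hidden layer parameters
w2, b2 = np.array([1.2]), -0.3  # second hidden layer parameters
w3, b3 = np.array([0.5]), 0.2   # output layer parameters

z1 = w1 * x + b1    # linear part of the first hidden layer
a1 = g(z1)          # output of the first hidden layer, a1 = g(z1)

z2 = w2 * a1 + b2   # a1 becomes the input to the second hidden layer
a2 = g(z2)          # output of the second hidden layer, a2 = g(z2)

z3 = w3 * a2 + b3   # a2 becomes the input to the output layer
y_hat = z3          # final output, ŷ

print(a1, a2, y_hat)  # [1.3] [1.26] [0.83]
```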
With a solid understanding of the notation used in machine learning, it is now time to explore the various types of activation functions and understand when to use each one. Activation functions play a crucial role in machine learning by introducing nonlinearity, allowing the model to fit complex, real-world data. In this section, we will take a closer look at several popular activation functions and examine their unique properties and suitability for different types of data and model architectures.
ReLU Activation Function
The rectified linear unit (ReLU) is a widely used nonlinear activation function in machine learning. It is represented in Fig. 4.
In Fig. 4, we can see that for zero and all negative values of z, the output g(z), or a, is equal to zero, while for all positive values of z, g(z), or a, is equal to z itself (a linear function). This piecewise combination makes ReLU nonlinear overall and suitable for fitting nonlinear functions. Therefore, if we are certain that our final output, ŷ, cannot be negative, ReLU is a great choice. Mathematically, it can be represented as:
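g(z) = a = max(0, z)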
Leaky ReLU Activation Function
The Leaky ReLU (LReLU) activation function is an improved version of the ReLU activation function. It differs from ReLU in that, instead of setting all negative values to zero, it applies a small slope to negative inputs so that some information from them is retained. This is represented in Fig. 5.
The mathematical representation of LReLU is defined as follows:
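g(z) = a = z,    if z > 0
g(z) = a = αz,   if z ≤ 0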
Where α is a small slope parameter, typically set to 0.01. This makes LReLU a suitable choice when the output can be positive, zero, or slightly negative.
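As a quick sketch, both functions are a single line each in NumPy (the sample inputs and the 0.01 slope below are illustrative, not taken from the figures):

```python
import numpy as np

def relu(z):
    # ReLU: zero for negative inputs, identity for positive inputs
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    # Leaky ReLU: keeps a small slope (alpha) for negative inputs
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # made-up sample inputs
print(relu(z))        # approximately [0, 0, 0, 0.5, 2]
print(leaky_relu(z))  # approximately [-0.02, -0.005, 0, 0.5, 2]
```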
Tanh Activation Function
The Tanh activation function (also known as the hyperbolic tangent function) is commonly used when the output ranges between -1 and +1. It produces an S-shaped curve, as can be seen in Fig. 6.
Mathematically, it is defined as:
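g(z) = a = (e^z - e^-z) / (e^z + e^-z)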
Here, z is a linear function equal to wx + b, so e^z can be written as e^(wx + b). For large values of z, such as 10, e^z (e^10) will be a very large number and e^-z (e^-10) will be a very small number. This means the numerator and denominator will be approximately equal, resulting in a final value of 1. Similarly, for small values of z, such as -10, the final result will be -1.
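A quick NumPy check of those limits (using the same example values of 10 and -10):

```python
import numpy as np

# tanh saturates at +1 for large positive z and at -1 for large negative z
print(np.tanh(10))   # approximately 1
print(np.tanh(-10))  # approximately -1
print(np.tanh(0))    # 0.0
```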
Sigmoid Activation Function
The sigmoid activation function, also known as the logistic function, is used for binary classification (logistic regression) problems. Its output is a probability, ranging from 0 to 1, and, like Tanh, it produces an S-shaped curve (Fig. 7).
Mathematically, it is:
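g(z) = a = 1 / (1 + e^-z)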
When the input value, z, is a large number such as 10, the value of e^-z becomes very small (about 0.000045), making the denominator approximately equal to 1 and the final output close to 1. On the other hand, when the input value is a small number, such as -10, the value of e^-z becomes very large (about 22,026), making the denominator a very large number and the final output close to 0. At z = 0, the output is 0.5. To use the sigmoid function for binary classification, we can set a threshold such that values equal to or greater than 0.5 are classified as one class (for example, oil), and values less than 0.5 are classified as the other class (for example, gas).
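As a minimal sketch of that thresholding step (the z values and the oil/gas labels below are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([10.0, 0.0, -10.0])   # made-up linear outputs, z = wx + b
probabilities = sigmoid(z)          # approximately [1.0, 0.5, 0.0]
labels = np.where(probabilities >= 0.5, "oil", "gas")

print(probabilities)  # approximately [0.99995, 0.5, 0.000045]
print(labels)         # ['oil' 'oil' 'gas']
```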
Softmax Activation Function
The softmax function, which extends logistic regression to more than two classes (multiclass logistic regression), is used for multiclass classification problems where we need to predict more than two classes. It provides the probability of each category, and the probabilities of all categories sum to 1. Similar to sigmoid, each output of the softmax function lies between 0 and 1; the function maps the input values to the probabilities of each category, as shown in Fig. 8.
Mathematically, it is:
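ai = e^(zi) / (e^(z1) + e^(z2) + … + e^(zN)), for i = 1, 2, …, N classes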
Where zi is the i-th element of the vector z and the denominator is the sum of the exponentials of all the elements of z. This ensures that the outputs sum to 1 and that each output lies between 0 and 1, so it can be interpreted as the probability of the corresponding class. The class with the highest probability can be selected as the prediction.
For example, for three classes, we have:
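a1 = e^(z1) / (e^(z1) + e^(z2) + e^(z3))
a2 = e^(z2) / (e^(z1) + e^(z2) + e^(z3))
a3 = e^(z3) / (e^(z1) + e^(z2) + e^(z3))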
Let’s say a1 is oil, a2 is gas, and a3 is water, and we calculate a1 = 0.2, a2 = 0.25, and a3 = 0.55 for a specific region. In this case, a3 has the largest value, so the model predicts that the region is water.
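A small sketch of that calculation in NumPy (the z values are made up so that the resulting probabilities roughly match the example above):

```python
import numpy as np

def softmax(z):
    # Exponentiate each element, then normalize so the outputs sum to 1
    exp_z = np.exp(z - np.max(z))  # subtracting max(z) improves numerical stability
    return exp_z / np.sum(exp_z)

classes = ["oil", "gas", "water"]
z = np.array([-1.61, -1.39, -0.60])  # made-up linear outputs for one region
a = softmax(z)

print(a)                      # approximately [0.20, 0.25, 0.55]
print(np.sum(a))              # approximately 1.0 (the probabilities sum to one)
print(classes[np.argmax(a)])  # water -- the class with the highest probability
```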
Conclusion and Key Takeaways
This concludes our discussion of the five commonly used activation functions. While there are other activation functions, these five are the most widely used. When choosing an activation function for your model, it is important to consider the specific requirements of your problem and the properties of the different functions. Consider the following.
For the output layer:
- If you have a binary classification problem, use a sigmoid activation function.
- If your problem involves multiclass classification, use the softmax activation function.
- For outputs that are zero or positive, use the ReLU or linear activation function.
- For outputs that can be negative, zero, or positive, use the Leaky ReLU or linear activation function.
- If the output ranges from -1 to +1, use the Tanh activation function.
For hidden layers:
ReLU and its variants (such as Leaky ReLU) are the most commonly used activation functions in hidden layers. They are computationally efficient and tend to make training faster and easier. Tanh and sigmoid are also sometimes used in hidden layers, although they are less common because they can make the training process slow and difficult.
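As a rough sketch of how these recommendations fit together, here is a tiny network with ReLU in the hidden layers and a sigmoid output for a binary classification problem (the layer sizes and random weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up layer sizes: 3 inputs -> 4 hidden -> 4 hidden -> 1 output
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)

x = np.array([0.5, -1.2, 3.0])  # one made-up input sample

a1 = relu(W1 @ x + b1)          # hidden layer 1: ReLU
a2 = relu(W2 @ a1 + b2)         # hidden layer 2: ReLU
y_hat = sigmoid(W3 @ a2 + b3)   # output layer: sigmoid for binary classification

print(y_hat)  # a single probability between 0 and 1
```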
In Part 3 of our machine learning journey, we will delve into the mathematical concepts behind deep learning, such as artificial neural networks.
See you there!
Read Part 1 and Part 3.