Activation Functions : Sigmoid, tanh, ReLU, Leaky ReLU, PReLU, ELU, Threshold ReLU and Softmax basics for Neural Networks and Deep Learning

Himanshu S
Jan 19, 2019 · 10 min read


Activation Functions in Kryptonese ?

In the fascinating realm of artificial intelligence, Neural Networks stand as a cornerstone, revolutionizing technologies from speech recognition to real-time translation. The journey of Neural Networks began in 1944, thanks to the pioneering work of Warren McCullough and Walter Pitts. These networks, inspired by the intricate workings of the human brain, consist of numerous interconnected nodes that mimic the firing of biological neurons.

Imagine a vast network where each node, akin to a neuron, processes incoming signals by multiplying them with specific weights. These signals, akin to synaptic inputs, are then summed to produce an output. This process, known as Forward Propagation, is fundamental to how Neural Networks learn and make decisions. (Note: The gradient calculation and backpropagation, which come into play as we add more layers, will be discussed in a subsequent post.)

But what truly breathes life into these networks are Activation Functions. These functions determine whether a neuron should be activated or not more or less like how our own Biological brains work, essentially deciding the neuron’s output based on the input it receives. Activation Functions are crucial for introducing non-linearity into the network, enabling it to learn complex patterns and perform sophisticated tasks.

Below is a sample code snippet illustrating how a simple digital neuron, mimicking a biological neuron, might be defined with a Python class:

Neuron.py
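The original gist is not reproduced here; what follows is a minimal sketch of what such a Neuron class might look like (the class and method names are illustrative, not the article's original code):

import math


class Neuron:
    """A minimal artificial neuron: a weighted sum of inputs plus a bias,
    passed through an activation function (sigmoid here)."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def activate(self, inputs):
        # Forward propagation: S = w1*i1 + w2*i2 + ... + wN*iN + bias
        s = sum(w * i for w, i in zip(self.weights, inputs)) + self.bias
        # Squash the summation into the range (0, 1) with a sigmoid
        return 1.0 / (1.0 + math.exp(-s))


neuron = Neuron(weights=[0.5, -0.6, 0.1])
print(neuron.activate([1.0, 2.0, 3.0]))  # a value between 0 and 1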
Synapses and Neurons in Neural Networks both Biological and Computational

For example (see D in above figure), if the weights are w1, w2, w3 … wN and inputs are i1, i2, i3 … iN, we get a summation of S = w1*i1 + w2*i2 + w3*i3 + … + wN*iN

Across several layers of a neural network, the weights wX and inputs iX can take widely varying values, and so can the summation S, depending on whether a particular neuron is activated. To normalize this and prevent drastically different ranges of values, we use an Activation Function, which squashes these values into an equivalent range such as (0, 1) or (-1, 1), keeping the whole process statistically balanced. This is not just to preserve the sanity of the code; it also reduces the complexity and computing power required, which would otherwise be spent on inactivated inputs, and it adds stability to the network by mitigating the impact of vastly divergent input values.
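As a quick illustration of this squashing (a sketch with arbitrary example weights and inputs, not taken from the article), the raw summation S can be any magnitude, while the activated value stays bounded:

import math

weights = [0.9, -1.7, 2.3]
inputs = [4.0, 12.0, 7.5]

# Raw summation S = w1*i1 + w2*i2 + ... + wN*iN can be any magnitude
s = sum(w * i for w, i in zip(weights, inputs))

# Activation functions squash S into a bounded, comparable range
sigmoid = 1.0 / (1.0 + math.exp(-s))   # range (0, 1)
tanh = math.tanh(s)                    # range (-1, 1)

print(s, sigmoid, tanh)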

Input, Weights and Output

Introduction >

Commonly used activation functions share a few desirable properties:

  • Nonlinear — The hallmark of a potent activation function lies in its nonlinearity. By introducing nonlinearities, even a two-layer neural network can morph into a universal function approximator. Notably, the identity activation function falls short in this regard, rendering networks employing it akin to mere single-layer models.
  • Range — The range of an activation function plays a critical role in shaping the stability and efficiency of gradient-based training methods. Finite ranges foster stability, as pattern presentations exert significant influence over only a subset of weights. Conversely, infinite ranges enhance efficiency, amplifying the impact of pattern presentations across most weights. However, in the latter scenario, smaller learning rates are typically warranted to navigate the heightened variability.
  • Continuously differentiable — For seamless integration with gradient-based optimization methods, activation functions should be continuous and differentiable. The ReLU function deviates from this criterion: it is continuous everywhere but not differentiable at zero, yet it remains viable in practice, albeit with some optimization challenges. In contrast, the binary step activation function presents insurmountable hurdles: it is not differentiable at zero and its derivative is zero everywhere else, so gradient-based methods can make no meaningful progress with it.

Derivative or Differential or Slope: the change along the y-axis with respect to a change along the x-axis.

  • Monotonic — A crucial characteristic of an activation function lies in its monotonicity. A monotonic activation function ensures that the error surface associated with a single-layer model remains convex, facilitating more predictable and efficient optimization processes.

Monotonic function: A function which is either entirely non-increasing or non-decreasing.

  • Smooth functions with a monotonic derivative — Activation functions that exhibit smoothness with monotonic derivatives have demonstrated superior generalization abilities in certain scenarios. This property enhances the network’s capacity to generalize patterns beyond the training data, leading to more robust performance in real-world applications.
  • Approximates identity near the origin — Activation functions that approximate the identity function near the origin offer a distinct advantage during the training phase. In such cases, neural networks can efficiently learn when initialized with small random weights. Conversely, when activation functions deviate from approximating identity near the origin, careful consideration is required during weight initialization to ensure optimal learning dynamics.
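To make that last property concrete, here is a quick numerical check (a sketch added for illustration, not from the article): tanh approximates the identity near the origin, while the sigmoid does not, since near zero it behaves like 0.5 + x/4:

import math

for x in [0.001, 0.01, 0.1]:
    tanh = math.tanh(x)                      # roughly equal to x near the origin
    sigmoid = 1.0 / (1.0 + math.exp(-x))     # roughly 0.5 + x/4 near the origin
    print(f"x={x}: tanh={tanh:.6f}, sigmoid={sigmoid:.6f}")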

Table of Activation Functions >

Activation Function for Neural Networks

Breaking down some Activation functions:

There are some simple, straightforward linear functions, such as the Binary Step Function or the Linear Function (f(x) = ax), but since they are undesirable as activation functions and not widely used, we won't be discussing them.

1. The Sigmoid Function >

Sigmoid Function (Logistic Function) is a classic choice characterized by its smooth, S-shaped curve. It maps input values to a range between 0 and 1, making it particularly suitable for binary classification tasks where outputs need to be interpreted as probabilities. However, it suffers from the vanishing gradient problem, limiting its efficacy in deep neural networks.

Sigmoid Function

Although the sigmoid function and its derivative are simple and help reduce the time required to build models, there is a major drawback: information is lost during backpropagation because the derivative has a short range (its maximum value is only 0.25).
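As a minimal sketch (not from the original post), here are the sigmoid and its derivative in NumPy, showing how quickly the gradient shrinks:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))             # outputs squashed into (0, 1)
print(sigmoid_derivative(x))  # peaks at 0.25 and vanishes for large |x|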

Sigmoid and its derivative function

As our neural network grows deeper, information compression increases, leading to potential data loss known as the vanishing and exploding gradient problem. Sigmoid functions, biased towards positive outputs, are less ideal for early layers due to this positive skew. However, they find utility in the final layer for output interpretation. Understanding these nuances aids in optimizing neural network performance.

Besides the logistic function, sigmoid functions include the ordinary arctangent, the hyperbolic tangent, the Gudermannian function, and the error function, as well as the generalised logistic function and certain algebraic functions.

2. Tanh Function >

The tanh function addresses (though not entirely) the drawback we saw in the sigmoid function; the only difference from the sigmoid is that its curve is symmetric about the origin, with values ranging from -1 to 1.

blue dotted line shows tanh function

The formula for the hyperbolic tangent (tanh) can be given as follows:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

While tanh mitigates some of the issues present in the sigmoid function, such as asymmetry, it doesn’t completely eliminate the vanishing or exploding gradient problem. However, being centered at zero, it offers advantages over the sigmoid function. Hence, other activation functions are often preferred in machine learning, as we’ll explore below.
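As a quick sketch (an addition for illustration, not from the article), tanh is just a rescaled, zero-centered sigmoid, tanh(x) = 2*sigmoid(2x) − 1, which is easy to verify numerically:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
print(np.tanh(x).mean(), sigmoid(x).mean())  # tanh outputs are centered at 0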

3. ReLU (Rectified Linear Units) and Leaky ReLU >

The rectifier is, as of 2018, the most popular activation function for deep neural networks.

Most contemporary Deep Learning applications, spanning Computer Vision, Speech Recognition, Natural Language Processing, and Deep Neural Networks, predominantly utilize ReLU instead of logistic activation functions. ReLU not only exhibits a superior convergence rate compared to tanh or sigmoid functions but also offers manifold advantages in various applications.

Additionally, ReLU boasts several variants, such as Softplus (SmoothReLU), Noisy ReLU, Leaky ReLU, Parametric ReLU, and ExponentialReLU (ELU), each tailored to specific requirements. We’ll delve into some of these variants in the subsequent discussion.

ReLU

ReLU, short for Rectified Linear Unit, functions by setting its output to 0 if the input is less than 0, and retaining the input value otherwise. In other words, if the input is greater than 0, the output equals the raw input itself. This operation closely mirrors the behavior of biological neurons, making ReLU a popular choice in neural network architectures.
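As a minimal sketch (not from the original article), this behavior is one line in NumPy:

import numpy as np

def relu(x):
    # f(x) = max(0, x): negatives become 0, positives pass through unchanged
    return np.maximum(0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0, 7.0])))  # [0. 0. 0. 2. 7.]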


ReLU f(x)

ReLU is non-linear and, unlike the sigmoid function, does not saturate for positive inputs, so gradients propagate through it without vanishing during backpropagation. For larger Neural Networks, models built on ReLU also train much faster than those using Sigmoids:

  • Biological plausibility: One-sided, compared to the antisymmetry of tanh.
  • Sparse activation: For example, in a randomly initialized network, only about 50% of hidden units are activated (having a non-zero output).
  • Better gradient propagation: Fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions.
  • Efficient computation: Only comparison, addition and multiplication.
  • Scale-invariant: max(0, ax) = a · max(0, x) for a ≥ 0

While ReLUs offer numerous benefits, they are not without drawbacks. Notably, ReLU is not zero-centered and is non-differentiable at 0, although it is differentiable everywhere else. Another limitation is that ReLU is generally restricted to hidden layers; it is rarely used elsewhere, owing to the potential issue mentioned below.

Sigmoid Vs ReLU

Another issue encountered with ReLU is the Dying ReLU problem, wherein certain ReLU neurons become inactive for all inputs, resulting in a complete cessation of gradient flow. When a large number of neurons suffer from this condition within a neural network, it significantly impacts the network’s performance.

To address this challenge, Leaky ReLU comes into play. Leaky ReLU introduces a small slope for inputs left of x = 0, as depicted in the figure below, effectively creating a “leak” and extending the range of ReLU. This alteration ensures that even neurons with negative inputs contribute to the gradient flow, mitigating the risk of neuron death and enhancing the robustness of the network.

Leaky ReLU and Parametric ReLU

With Leaky ReLU there is a small slope on the negative side, so instead of not firing at all for negative inputs (and passing no gradient back), those neurons output some value, which keeps them trainable and makes the layer easier to optimize.
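A minimal sketch of Leaky ReLU, assuming the conventional fixed slope of 0.01 mentioned in the next section:

import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # Positive inputs pass through; negative inputs keep a small, non-zero gradient
    return np.where(x > 0, x, negative_slope * x)

print(leaky_relu(np.array([-4.0, -1.0, 0.0, 2.0])))  # [-0.04 -0.01  0.    2.  ]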

4. PReLU (Parametric ReLU) Function >

In Parametric ReLU, depicted in the figure above, the slope for x < 0 is not fixed, as in Leaky ReLU with a fixed slope of 0.01. Instead, a parameter ′a′ is introduced, which varies depending on the model.

By employing weights and biases, we adjust this parameter through backpropagation across multiple layers, allowing the network to learn the optimal slope for negative inputs.

PReLU: f(x) = x for x ≥ 0, and f(x) = a*x for x < 0; for a ≤ 1 this can equivalently be written as f(x) = max(x, a*x)

Hence, through this maximum formulation, PReLU is related to the family of architectures known as “maxout” networks.
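A sketch of the PReLU forward pass; here a is a plain float for illustration, whereas in a real framework it would be a parameter updated by backpropagation:

import numpy as np

def prelu(x, a):
    # The slope `a` on the negative side is learned per layer (or per channel)
    return np.where(x >= 0, x, a * x)

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(prelu(x, a=0.25))          # [-0.75 -0.25  0.5   2.  ]
print(np.maximum(x, 0.25 * x))   # identical result, since a <= 1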

5. ELU (Exponential LU) Function >

Exponential Linear Units (ELUs) are utilized to expedite the deep learning process by shifting the mean activations closer to zero. This is achieved by introducing an alpha constant, which must be a positive number.

ELU blue ReLU brown

ELU has been demonstrated to yield more accurate results and faster convergence compared to ReLU. While both ELU and ReLU produce identical outputs for positive inputs, ELU smoothly saturates negative inputs towards -alpha, whereas ReLU cuts them off abruptly at zero.

alpha is a hyperparameter, constrained to be positive
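A minimal sketch of ELU, assuming the common default alpha = 1.0:

import numpy as np

def elu(x, alpha=1.0):
    # Positive inputs pass through; negative inputs saturate smoothly towards -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

print(elu(np.array([-10.0, -1.0, 0.0, 3.0])))  # negatives approach -1.0 instead of being clipped to 0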

6. Threshold ReLU Function >

Threshold ReLU (TReLU) was introduced by combining ideas from ReLU and FTSwish. It resembles ReLU with one significant modification: instead of activating at zero, it stays at zero until the input exceeds a threshold theta, which can substantially enhance accuracy. Its formulation is: f(x) = x for x > theta, f(x) = 0 otherwise, where theta is a float >= 0 (the threshold location of activation).
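A minimal sketch of this thresholded behavior; theta = 1.0 is an arbitrary illustrative value:

import numpy as np

def thresholded_relu(x, theta=1.0):
    # Outputs stay at zero until the input exceeds the threshold theta
    return np.where(x > theta, x, 0.0)

print(thresholded_relu(np.array([-2.0, 0.5, 1.0, 3.0])))  # [0. 0. 0. 3.]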

7. Softmax Function >

Softmax is a fascinating activation function because it not only scales the output to a range between 0 and 1 but also ensures that the sum of all outputs equals 1. Consequently, the output of Softmax forms a probability distribution.

The softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.

Softmax Graphed

Mathematically, Softmax is the following function, where z is the vector of inputs to the output layer and j indexes the output units from 1, 2, 3, …, K:

softmax(z)_j = e^(z_j) / ( e^(z_1) + e^(z_2) + … + e^(z_K) )
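A sketch of softmax in NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick (an addition here, not something the article discusses):

import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the shift cancels out in the ratio
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # approximately [0.659 0.242 0.099], summing to 1.0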

In conclusion, activation functions are crucial components in neural network architectures, each serving specific purposes. Softmax activation excels in multi-class classification tasks, while Sigmoid activation is fundamental for binary classification. Tanh provides symmetric, zero-centered outputs, while ReLU and its variants ensure robustness and efficiency. ELU facilitates faster convergence, and Threshold ReLU lets the activation threshold be tuned. Experimenting with these functions at different layers is essential for optimizing neural networks, empowering practitioners to effectively tackle diverse challenges in machine learning.

For citing, please use the format:

@article{himanshuxd,
author = {Himanshu S},
title = { Activation Functions : Sigmoid, tanh, ReLU, Leaky ReLU, PReLU, ELU, Threshold ReLU and Softmax basics for Neural Networks and Deep Learning },
howpublished = {\url{https://himanshuxd.medium.com/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e}},
year = {2019}
}

See you in my next article!

If you found this useful and informative, please let me know by clapping or commenting! For any queries regarding the above, ask me in the comments or tweet @himanshuxd
