# Activation Functions : Sigmoid, tanh, ReLU, Leaky ReLU, PReLU, ELU, Threshold ReLU and Softmax basics for Neural Networks and Deep Learning

Let’s start with the basics of Neurons and Neural Networks, what an **Activation Function** is, and why we would need it :

First proposed in 1943 by Warren McCulloch and Walter Pitts, Neural Networks are the techniques powering the best speech recognizers and translators on our smartphones, through something called “deep learning”, which employs several layers of neural nets.

Neural Nets are modeled loosely on the human brain, where thousands or even millions of nodes are densely connected with each other. Just like how the brain “fires” up neurons, in **Artificial Neural Networks** (ANN) an *Artificial Neuron* is fired up by sending it a signal from the incoming node multiplied by some weight. This node can be visualized as something that holds a number coming from the ending branches (*Synapses*) supplied at that *Neuron*. For a *Layer* of a *Neural Network* (NN), we *multiply* each *input* to the Neuron by the *weight* held by that synapse and *sum* all of those up to get our output.
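The multiply-and-sum step for a single neuron can be sketched in a few lines of Python. This is only a minimal illustration: the function name `neuron_weighted_sum` and the example numbers are our own assumptions, and no bias term or activation is applied yet.

```python
def neuron_weighted_sum(inputs, weights):
    """Multiply each input by the weight on its synapse and sum the products."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * i for w, i in zip(weights, inputs))


# Hypothetical example values, chosen only for illustration.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 2.0]

# 0.5*1.0 + (-1.0)*2.0 + 2.0*3.0
total = neuron_weighted_sum(inputs, weights)
print(total)
```

The raw sum is unbounded, which is exactly why an activation function is applied to it afterwards.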

For example (see D in the figure above), if the *weights* are **w1, w2, w3 …. wN** and the *inputs* are **i1, i2, i3 …. iN**, we get a *summation* of :

`w1*i1 + w2*i2 + w3*i3 …. wN*iN`

Across several layers of *Neural Networks* and connections, we can have varied values of **wX** and **iX**, and the summation *S* varies according to whether the particular *Neuron* is *activated* or *not*. To *normalize* this and prevent drastically different ranges of values, we use what is called an *Activation Function* for *Neural Networks*, which turns these values into something equivalent between **0,1** or **-1,1** to make the whole process *statistically balanced*. This is not just to preserve the sanity of the code but also to reduce the complexity and computing power required, which would be harder to manage on inputs that have not been passed through an activation.

# Introduction >

**Activation functions** are commonly chosen based on a few desirable properties, like :

- *Nonlinear*: When the activation function is non-linear, a two-layer neural network can be proven to be a universal function approximator. The identity activation function does not satisfy this property. When multiple layers use the identity activation function, the entire network is equivalent to a single-layer model.
- *Range*: When the range of the activation function is finite, gradient-based training methods tend to be more stable, because pattern presentations significantly affect only limited weights. When the range is infinite, training is generally more efficient because pattern presentations significantly affect most of the weights. In the latter case, smaller learning rates are typically necessary.
- *Continuously differentiable*: This property is desirable for enabling gradient-based optimization methods (ReLU is not continuously differentiable and has some issues with gradient-based optimization, but it is still usable). The binary step activation function is not differentiable at 0, and it differentiates to 0 for all other values, so gradient-based methods can make no progress with it.

  Derivative (or differential, or slope): the change along the y-axis for a given change along the x-axis.

- *Monotonic*: When the activation function is monotonic, the error surface associated with a single-layer model is guaranteed to be convex.

  Monotonic function: a function which is either entirely non-increasing or entirely non-decreasing.

- *Smooth functions with a monotonic derivative*: These have been shown to generalize better in some cases.
- *Approximates identity near the origin*: When activation functions have this property, the neural network will learn efficiently when its weights are initialized with small random values. When the activation function does not approximate identity near the origin, special care must be used when initializing the weights.
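To see why the *Nonlinear* property matters, here is a small sketch of the claim that layers with the identity activation collapse into a single-layer model. It uses a scalar "network" with hypothetical parameters `w1, b1, w2, b2`; the function names are our own and only illustrative.

```python
def identity(x):
    """The identity activation: it passes its input through unchanged."""
    return x


def two_layer(x, w1, b1, w2, b2, activation):
    """Scalar two-layer network with the activation applied to the hidden unit."""
    hidden = activation(w1 * x + b1)
    return w2 * hidden + b2


def collapsed_single_layer(x, w1, b1, w2, b2):
    """The single linear model that the identity-activated network reduces to."""
    return (w2 * w1) * x + (w2 * b1 + b2)


# Hypothetical parameters, chosen only for illustration.
w1, b1, w2, b2 = 2.0, 1.0, -3.0, 0.5

for x in (-2.0, 0.0, 1.5):
    stacked = two_layer(x, w1, b1, w2, b2, identity)
    single = collapsed_single_layer(x, w1, b1, w2, b2)
    print(x, stacked, single)  # the last two columns match for every x
```

Swapping `identity` for any non-linear activation breaks this equality, which is what gives deeper networks their extra expressive power.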

**Table of Activation Functions** >

Breaking down some Activation functions :

(There are some linear and simple functions, such as the Binary Step Function or the Linear Function *f = ax*, but since they are not widely used and are undesirable as activation functions, we won’t be discussing them.)

# 1. **The Sigmoid Function** >

Sigmoid functions are used in machine learning for logistic regression and basic neural network implementations, and they are the introductory activation units. For advanced Neural Networks, however, Sigmoid functions are not preferred due to various drawbacks (notably the vanishing gradient problem). It is one of the most used activation functions for beginners in Machine Learning and Data Science when starting out.

Although the sigmoid function and its derivative are *simple* and help reduce the time required for making models, there is a major drawback of *info loss* due to the derivative having a short range (it never exceeds 0.25).

So the more layers there are in our Neural Network (or the *deeper* our Neural Network is), the more our information gets compressed and lost at each layer; this compounds at each step and causes major data loss overall. The vanishing gradient problem is therefore present with sigmoid functions. Also, since the sigmoid is always positive in output, all our output neurons have a positive output too, which is not ideal. Not being centered at 0 makes the sigmoid function a poor choice for the early layers, although it can still be used in the last layer.
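As a rough sketch of why depth hurts here, the snippet below defines the logistic sigmoid and its derivative and computes an upper bound on the gradient factor contributed by `n` stacked sigmoid layers. The helper names are our own, and the bound ignores the weights, so treat it as an illustration rather than a full analysis.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_derivative(x):
    """Derivative s(x) * (1 - s(x)); its largest value is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


# Each sigmoid layer multiplies the backpropagated gradient by at most 0.25,
# so (ignoring the weights) n layers contribute a factor of at most 0.25 ** n.
bounds = {n: 0.25 ** n for n in (1, 5, 10)}
for n, bound in bounds.items():
    print(n, bound)
```

After ten layers the bound is already below one in a million, which is the "vanishing" in the vanishing gradient problem.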

Besides the logistic function, sigmoid functions include the ordinary arctangent, the hyperbolic tangent, the Gudermannian function, and the error function, but also the generalised logistic function and algebraic functions.

# 2. **Tanh Function** >

In the **tanh** function, the drawback we saw in the sigmoid function is addressed (though not entirely). The main difference from the sigmoid function is that the curve is symmetric about the origin, with values ranging from -1 to 1.

The formula for hyperbolic tangent (tanh) can be given as follows :

`tanh(x) = (e^x - e^-x) / (e^x + e^-x)`

This does not, however, mean that tanh is free of the vanishing gradient problem; it persists even with tanh. But because tanh is centered at zero, unlike Sigmoid, it is usually a better choice than the Sigmoid function. Other functions, which we will see below, are nevertheless employed more often in machine learning.
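A minimal sketch of tanh, its derivative, and its relationship to the sigmoid is given below. The function names are our own; in real code the built-in `math.tanh` is the robust choice, since the hand-written version here overflows for very large inputs.

```python
import math


def tanh(x):
    """Hyperbolic tangent from its exponential definition (fine for moderate x)."""
    e_pos, e_neg = math.exp(x), math.exp(-x)
    return (e_pos - e_neg) / (e_pos + e_neg)


def tanh_derivative(x):
    """Derivative 1 - tanh(x)^2; it peaks at 1 for x = 0 and decays toward 0."""
    t = math.tanh(x)
    return 1.0 - t * t


def sigmoid(x):
    """Plain logistic sigmoid, used only to show the relationship below."""
    return 1.0 / (1.0 + math.exp(-x))


x = 0.7
print(tanh(x), math.tanh(x))        # the two values agree
print(2.0 * sigmoid(2.0 * x) - 1.0)  # tanh is a rescaled, recentered sigmoid
print(tanh_derivative(5.0))         # small gradient once tanh saturates
```

The derivative still shrinks toward 0 for large |x|, so saturation (and with it the vanishing gradient) has not gone away; only the centering has improved.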

# 3. ReLU **(Rectified Linear Units)** and Leaky ReLU >

The rectifier is, as of 2018, the most popular activation function for deep neural networks.

Most Deep Learning applications right now make use of **ReLU** instead of *Logistic Activation functions* for Computer Vision, Speech Recognition, Natural Language Processing, Deep Neural Networks etc. ReLU has also been found to converge many times faster than tanh or sigmoid functions in practice.

Some of the ReLU variants include : Softplus (SmoothReLU), Noisy ReLU, Leaky ReLU, Parametric ReLU and Exponential ReLU (ELU), some of which we will discuss below.

ReLU : A **Rectified Linear Unit** (a unit employing the rectifier is also called a rectified linear unit, or ReLU) has output 0 if the input is less than 0, and the *raw* input as output otherwise. That is, if the input is greater than 0, the output is equal to the input. The operation of ReLU is closer to the way our *biological neurons* work.
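In code, the rule just described is a one-liner. The sketch below also includes the slope used during backpropagation; returning 0 for the slope exactly at x = 0 is a common convention that we assume here, since the true derivative is undefined at that point.

```python
def relu(x):
    """Rectified linear unit: 0 for negative input, the raw input otherwise."""
    return max(0.0, x)


def relu_derivative(x):
    """Slope of ReLU: 0 for x < 0 and 1 for x > 0.

    At x = 0 the derivative is undefined; returning 0 there is an assumed convention.
    """
    return 1.0 if x > 0 else 0.0


for x in (-3.0, 0.0, 2.5):
    print(x, relu(x), relu_derivative(x))
```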

ReLU is **non-linear** and, *unlike* the *sigmoid function*, does not saturate for positive inputs, so it suffers far less from vanishing gradients during *backpropagation*. Also, for larger Neural Networks, the *speed* of building models based on ReLU is very fast compared to using Sigmoids :

- *Biological plausibility*: One-sided, compared to the antisymmetry of tanh.
- *Sparse activation*: For example, in a randomly initialized network, only about 50% of hidden units are activated (having a non-zero output).
- *Better gradient propagation*: Fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions.
- *Efficient computation*: Only comparison, addition and multiplication.
- *Scale-invariant*: `max(0, a*x) = a * max(0, x)` for `a ≥ 0`.
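Two of the properties above, sparse activation and scale invariance, are easy to check numerically. The sketch below assumes that the pre-activations of a randomly initialized layer can be modeled as zero-mean Gaussian noise, which is our own simplification; the helper names are likewise hypothetical.

```python
import random


def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def active_fraction(num_units=10_000, seed=0):
    """Fraction of units with non-zero ReLU output.

    Assumption: pre-activations are modeled as zero-mean Gaussian noise.
    """
    rng = random.Random(seed)
    pre_activations = [rng.gauss(0.0, 1.0) for _ in range(num_units)]
    return sum(1 for z in pre_activations if relu(z) > 0.0) / num_units


def is_scale_invariant(a, x):
    """Check max(0, a*x) == a * max(0, x) for a scale factor a >= 0."""
    return relu(a * x) == a * relu(x)


print(active_fraction())              # close to one half
print(is_scale_invariant(3.0, 2.0))   # positive input
print(is_scale_invariant(3.0, -2.0))  # negative input
```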

ReLUs aren’t without drawbacks: ReLU is **not zero-centered** and is *non-differentiable at zero*, though it is differentiable everywhere else.

One of the conditions on ReLU is its usage: it is generally used only in hidden layers and not elsewhere. This is due to the limitation described below.

Another problem we see in ReLU is the **Dying ReLU problem**, where some ReLU Neurons essentially *die* for all inputs and remain *inactive* no matter what input is supplied. Here *no gradient flows*, and if a large number of dead neurons are present in a Neural Network, its performance is affected. This can be corrected by making use of what is called *Leaky ReLU*, where the slope is changed to the left of x=0 in the figure above, thus causing a *leak* and *extending* the *range* of ReLU.

With Leaky ReLU there is a small positive slope for negative inputs, so instead of not firing at all for negative inputs, our neurons do output some value. This keeps a gradient flowing and makes our layer much better optimized too.
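A minimal sketch of Leaky ReLU follows. The default slope of 0.01 is the commonly quoted value; the parameter name `negative_slope` is our own choice for illustration.

```python
def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: the raw input for x > 0, a small fraction of it otherwise."""
    return x if x > 0 else negative_slope * x


def leaky_relu_derivative(x, negative_slope=0.01):
    """Slope of Leaky ReLU: 1 for x > 0, negative_slope otherwise (never zero)."""
    return 1.0 if x > 0 else negative_slope


for x in (-10.0, -1.0, 2.0):
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```

Because the derivative is never exactly zero, a neuron that ends up on the negative side still receives a gradient and can recover, which is the fix for the dying ReLU problem.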

# 4. PReLU (Parametric ReLU) Function >

In Parametric ReLU, as seen from the figure above, instead of using a fixed slope like the 0.01 used in Leaky ReLU, a parameter ‘a’ is introduced for x < 0 that changes depending on the model.

The parameter is learned by backpropagation, tuned alongside the weights and biases across multiple layers.

For a ≤ 1, PReLU is equivalent to `f(x) = max(x, ax)`, so it is related to what are called “maxout” networks too.
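The sketch below shows PReLU, the gradient of its output with respect to the slope parameter, and a single gradient-descent update of that parameter. The function names, the learning rate, and the `upstream_grad` argument (the loss gradient arriving from the next layer) are assumptions made for illustration, not part of any particular library.

```python
def prelu(x, a):
    """Parametric ReLU: x for x > 0, a * x otherwise, where a is learned."""
    return x if x > 0 else a * x


def prelu_grad_a(x):
    """Gradient of the output with respect to a: 0 for x > 0, x otherwise."""
    return 0.0 if x > 0 else x


def sgd_step_on_a(a, x, upstream_grad, learning_rate=0.1):
    """One gradient-descent update of a for a single input (hypothetical helper)."""
    return a - learning_rate * upstream_grad * prelu_grad_a(x)


a = 0.25
print(prelu(-4.0, a), max(-4.0, a * -4.0))  # same value, since a <= 1
print(sgd_step_on_a(a, -4.0, upstream_grad=1.0))  # a moves: the input was negative
print(sgd_step_on_a(a, 2.0, upstream_grad=1.0))   # a unchanged: the input was positive
```

Only negative inputs contribute to the update of the slope, because the slope has no effect on the output for positive inputs.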

# 5. ELU (Exponential LU) Function >

Exponential Linear Units are used to speed up the deep learning process. This is done by pushing the mean activations closer to zero; an alpha constant is used, which must be a positive number.

ELUs have been reported to produce more accurate results than ReLU and to converge faster. ELU and ReLU are the same for positive inputs, but for negative inputs ELU saturates smoothly (toward -alpha), whereas ReLU cuts off sharply to zero.
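A minimal sketch of ELU, using the standard definition `x` for `x > 0` and `alpha * (e^x - 1)` otherwise, is shown below. The default `alpha=1.0` is an assumed, commonly used value.

```python
import math


def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (e^x - 1) otherwise."""
    if alpha <= 0:
        raise ValueError("alpha must be a positive number")
    # math.expm1(x) computes e^x - 1 accurately even for x close to 0.
    return x if x > 0 else alpha * math.expm1(x)


for x in (-10.0, -1.0, 0.0, 2.0):
    print(x, elu(x))
```

For very negative inputs the output levels off near `-alpha` instead of staying at 0, which is what pulls the mean activation toward zero.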

# 6. Threshold ReLU Function >

As a result of combining ReLU and FTSwish, Threshold ReLU, or simply TReLU, was made. TReLU is similar to ReLU but with important changes: negative values are allowed but they are capped, which is reported to greatly improve accuracy. It follows `f(x) = x` for `x > theta` and `f(x) = 0` otherwise, where theta is a float >= 0 (the threshold location of activation).
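The thresholded rule `f(x) = x` for `x > theta`, `f(x) = 0` otherwise translates directly into code. In the sketch below the default `theta=1.0` is an assumed value chosen for illustration.

```python
def thresholded_relu(x, theta=1.0):
    """Thresholded ReLU: the raw input when x > theta, 0 otherwise."""
    if theta < 0:
        raise ValueError("theta must be >= 0")
    return x if x > theta else 0.0


for x in (-3.0, 0.5, 1.0, 2.0):
    print(x, thresholded_relu(x))
```

With `theta=0.0` this reduces to plain ReLU, so the threshold simply moves the point at which the unit switches on.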

# 7. **Softmax** Function >

*Softmax* is a very interesting activation function because it not only **maps our output to a [0,1] range** but also maps each output in such a way that the *total sum is 1*. The output of Softmax is therefore a *probability distribution*.

The softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
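Here is a small sketch of softmax together with the log loss (cross-entropy) for a single example. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the result unchanged; the function names and the example scores are our own.

```python
import math


def softmax(z):
    """Exponentiate each score and divide by the total so the outputs sum to 1."""
    shift = max(z)  # does not change the result, but avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Log loss for one example: minus the log of the true class probability."""
    return -math.log(probs[target_index])


scores = [1.0, 2.0, 3.0]  # hypothetical output-layer inputs
probs = softmax(scores)
print(probs, sum(probs))          # each value lies in (0, 1), the sum is 1
print(cross_entropy(probs, 2))    # low loss: class 2 has the highest probability
print(cross_entropy(probs, 0))    # higher loss: class 0 has the lowest probability
```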

Mathematically, Softmax is the following function, where **z** is the vector of inputs to the output layer and *j* indexes the output units from *1, 2, 3 …. k* :

`softmax(z)_j = e^(z_j) / (e^(z_1) + e^(z_2) + …. + e^(z_k))`

In conclusion, Softmax is used for **multi-classification in logistic regression model (multivariate)** whereas Sigmoid is used for *binary classification in logistic regression model.*

For citing do use the format :

    @article{himanshuxd,
      author = {Himanshu S},
      title = {Activation Functions : Sigmoid, tanh, ReLU, Leaky ReLU, PReLU, ELU, Threshold ReLU and Softmax basics for Neural Networks and Deep Learning},
      howpublished = {\url{https://himanshuxd.medium.com/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e}},
      year = {2019}
    }

