AIWiki
Malaysia

Activation Function

A mathematical function applied to a neuron's output in a neural network that introduces non-linearity, enabling models to learn complex patterns beyond simple linear relationships.

7 min read · Last updated June 2026 · Foundations

An activation function is a mathematical function applied to the output of a neuron in an artificial neural network before that output is passed to the next layer. Activation functions serve a fundamental purpose: they introduce non-linearity into the network. Without them, a multi-layer neural network would behave identically to a single-layer linear model regardless of how many layers it has, because a composition of linear maps is itself a linear map. Such a network could only solve linearly separable problems. By introducing non-linearity at each layer, activation functions allow networks to approximate any continuous function on a bounded domain to arbitrary accuracy, given enough hidden units, a property formalised by the Universal Approximation Theorem.
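The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch using two small, hypothetical weight matrices (`W1`, `W2`) and omitting biases for brevity; it is an illustration, not a full network.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(v):
    """Element-wise ReLU, used here as the example non-linearity."""
    return [max(0.0, t) for t in v]

# Hypothetical weights chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 3.0]]
W2 = [[2.0, 1.0], [-1.0, 4.0]]
x = [3.0, -1.0]

# Two linear layers with no activation in between.
two_layer = matvec(W2, matvec(W1, x))

# One linear layer whose weight matrix is the product W2 · W1.
one_layer = matvec(matmul(W2, W1), x)

# The same two layers with a ReLU inserted between them.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_layer, one_layer, with_relu)
```

The first two results coincide, while the version with ReLU differs, which is why the non-linearity is what lets extra layers add representational power.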

The choice of activation function affects how quickly a network learns, how stable training is, and how well gradients propagate through many layers during backpropagation. Different activation functions are suited to different tasks and positions within a network architecture.

Early Activation Functions

Sigmoid

The sigmoid function, also called the logistic function, maps any real-valued input to an output in the range (0, 1), producing a characteristic S-shaped curve. Because its outputs can be read directly as probabilities, sigmoid was the dominant activation function in early neural networks from the 1980s onward and remains widely used in the output layer of binary classifiers.

The sigmoid function is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Despite its intuitive properties, sigmoid suffers from the vanishing gradient problem. Its gradient never exceeds 0.25 (the value at an input of zero) and approaches zero for large positive or large negative inputs, where the function saturates. During backpropagation, these small gradients are multiplied across layers, causing earlier layers in a deep network to receive almost no training signal. This made deep networks extremely difficult to train, a barrier that limited progress in the field until the 2010s.
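The effect can be sketched with the standard derivative of the logistic function, σ'(x) = σ(x)(1 − σ(x)). The snippet below is illustrative only: it ignores the weight terms that also appear in the backpropagated product, and the depth of 10 is an arbitrary assumption.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative of the logistic function: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)        # largest possible gradient
saturated = sigmoid_grad(10.0)  # gradient deep in the saturated region

# Best case for a chain of sigmoid layers: every layer sits at its peak gradient.
depth = 10  # hypothetical number of layers
best_case_product = peak ** depth

print(peak, saturated, best_case_product)
```

Even in the best case, the product of activation gradients shrinks geometrically with depth, which is the core of the vanishing gradient problem.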

Hyperbolic Tangent (Tanh)

The hyperbolic tangent function maps inputs to the range (-1, 1), making it zero-centred, unlike sigmoid. Zero-centred activations generally lead to more stable gradient updates. Tanh has been widely used in recurrent neural networks, including LSTM architectures, and outperforms sigmoid in many hidden-layer applications. However, it shares sigmoid's vanishing gradient problem: its gradient approaches zero for large positive or large negative inputs.
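A small sketch of both properties, using the standard derivative tanh'(x) = 1 − tanh²(x). The sample inputs are arbitrary and chosen to be symmetric around zero.

```python
import math

def sigmoid(x):
    """Logistic function (inputs here are small, so overflow is not a concern)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric sample, for illustration only

# Mean activation over symmetric inputs: about 0 for tanh, about 0.5 for sigmoid.
tanh_mean = sum(math.tanh(x) for x in inputs) / len(inputs)
sigmoid_mean = sum(sigmoid(x) for x in inputs) / len(inputs)

grad_at_zero = tanh_grad(0.0)  # steepest point of the curve
grad_at_five = tanh_grad(5.0)  # saturated region

print(tanh_mean, sigmoid_mean, grad_at_zero, grad_at_five)
```

The peak gradient of tanh is 1 rather than sigmoid's 0.25, but it still collapses toward zero once the input saturates.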

Modern Activation Functions

Rectified Linear Unit (ReLU)

The Rectified Linear Unit (ReLU) transformed practical deep learning when it was popularised by Nair and Hinton (2010) and notably deployed in AlexNet (2012). ReLU is defined simply as:

$$f(x) = \max(0, x)$$

That is, ReLU outputs the input directly for positive values and outputs zero for negative values. Its computational simplicity, combined with the absence of the vanishing gradient problem for positive inputs, allowed researchers to train much deeper networks than were previously feasible. ReLU remains the most widely used activation function in convolutional neural networks and feed-forward layers.
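As a minimal sketch, ReLU and its gradient can be written in a few lines. The gradient at exactly zero is undefined mathematically; returning 0 there is a common convention and is assumed here.

```python
def relu(x):
    """ReLU: pass positive inputs through, clamp negatives to zero."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; the value 0 at x == 0 is a convention, not a derivation."""
    return 1.0 if x > 0 else 0.0

# For a positive pre-activation, the gradient is exactly 1 at every layer,
# so the product of activation gradients does not shrink with depth.
depth = 50  # hypothetical number of layers
product = 1.0
for _ in range(depth):
    product *= relu_grad(2.0)

print(relu(-3.0), relu(2.5), product)
```

This contrasts with sigmoid, where the same product is bounded by 0.25 raised to the depth.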

ReLU does suffer from the dying ReLU problem: if a neuron receives negative inputs for every training example, its gradient is always zero, its weights stop updating, and it ceases to learn. Several variants address this:

  • Leaky ReLU allows a small, non-zero gradient for negative inputs by multiplying them by a small constant (typically 0.01).
  • Parametric ReLU (PReLU) treats the negative-slope coefficient as a learnable parameter.
  • ELU (Exponential Linear Unit) uses an exponential function for negative inputs, producing smooth negative outputs and faster convergence.
  • SELU (Scaled ELU) introduces self-normalising properties that maintain stable activations across layers without batch normalisation.
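The variants above can be sketched as plain functions. The default slope of 0.01 comes from the text; `alpha=1.0` for ELU is a commonly used default and is an assumption here, and the SELU constants are the published values, truncated to double precision. PReLU has the same form as Leaky ReLU, except that the slope would be updated by the optimiser rather than fixed.

```python
import math

def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: small fixed slope for negative inputs."""
    return x if x > 0 else negative_slope * x

def prelu(x, a):
    """PReLU: same form as Leaky ReLU, but `a` is a learnable parameter."""
    return x if x > 0 else a * x

def elu(x, alpha=1.0):
    """ELU: smooth exponential curve for negative inputs, bounded below by -alpha."""
    return x if x > 0 else alpha * math.expm1(x)

# SELU constants from the self-normalising networks paper.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    """SELU: a scaled ELU with fixed alpha and scale."""
    return SELU_SCALE * elu(x, SELU_ALPHA)

for name, fn in [("leaky_relu", leaky_relu), ("elu", elu), ("selu", selu)]:
    print(name, [round(fn(x), 4) for x in (-2.0, -0.5, 0.0, 1.0)])
```

Unlike ReLU, each of these keeps a non-zero gradient for negative inputs, so a neuron that drifts into the negative region can still recover.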

GELU

The Gaussian Error Linear Unit (GELU) has become a standard activation function in transformer architectures, including GPT, BERT, and many later large language models. GELU can be derived as the expected value of a stochastic regulariser that randomly zeroes inputs, and it is differentiable everywhere, making it smooth and well-suited to gradient-based optimisation. Its formulation, $\mathrm{GELU}(x) = x \, \Phi(x)$, weights each input by the standard Gaussian cumulative distribution function $\Phi$, effectively gating neurons softly rather than with a hard zero threshold.
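A minimal sketch of GELU in its exact form, x · Φ(x) with Φ expressed through the error function, alongside the tanh-based approximation given by Hendrycks and Gimpel (2016). The function names are illustrative.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation of GELU from Hendrycks and Gimpel (2016)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, round(gelu(x), 4), round(gelu_tanh(x), 4))
```

For large positive inputs GELU behaves like the identity, and for large negative inputs it approaches zero, with a smooth transition in between instead of ReLU's kink at zero.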

Softmax

The softmax function is used in the output layer of multi-class classifiers. It converts a vector of real-valued scores into a probability distribution that sums to 1, with larger scores receiving higher probabilities. Softmax is not normally used as a hidden-layer activation; its role is to produce interpretable class probabilities at a network's final output.
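A minimal sketch of softmax, exp(s_i) divided by the sum of exp(s_j) over all scores. Subtracting the maximum score before exponentiating is a common numerical-stability trick that leaves the result unchanged; the example scores are arbitrary.

```python
import math

def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical class scores
print([round(p, 3) for p in probs], sum(probs))

# Very large scores would overflow a naive exp(), but not the shifted version.
large = softmax([1000.0, 1000.0])
print(large)
```

Because only differences between scores matter, adding the same constant to every score leaves the output distribution unchanged.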

Swish

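Swish, introduced by Google researchers in 2017 (Ramachandran et al.), is defined as $f(x) = x \cdot \sigma(x)$, where $\sigma$ is the sigmoid function. It is smooth and non-monotonic, and in the authors' experiments it matched or outperformed ReLU on several deep network benchmarks. Swish is used in various computer vision architectures, including EfficientNet.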
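A minimal sketch of Swish, assuming the form x · sigmoid(βx) from Ramachandran et al. (2017). The `beta` parameter is included for generality and defaults to 1, which is an assumption about the variant intended here.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x, beta=1.0):
    """Swish: the input multiplied by a sigmoid gate of itself."""
    return x * sigmoid(beta * x)

# Non-monotonic behaviour: the output dips below zero and then rises back toward it.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, round(swish(x), 4))
```

The dip for moderately negative inputs, followed by a return toward zero for very negative inputs, is what makes the function non-monotonic.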

Selecting an Activation Function

The choice of activation function depends on the network architecture and task:

| Activation | Range | Common Use |
|---|---|---|
| Sigmoid | (0, 1) | Binary classification output |
| Tanh | (-1, 1) | Recurrent layers, historical hidden layers |
| ReLU | [0, inf) | CNN hidden layers, feed-forward networks |
| Leaky ReLU | (-inf, inf) | Alternative to ReLU to avoid dead neurons |
| GELU | (-0.17, inf) approx | Transformer hidden layers |
| Softmax | (0, 1) summing to 1 | Multi-class classification output |
| Swish | (-0.28, inf) approx | Deep vision networks |
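The approximate lower bounds listed for GELU and Swish can be checked with a simple grid search. This is a rough numerical sketch, assuming the exact GELU form x · Φ(x) and Swish with β = 1; the grid range and step size are arbitrary choices.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    """Swish with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# Scan x from -5 to 0 in steps of 0.001 and record the smallest output of each function.
grid = [-5.0 + i * 0.001 for i in range(5001)]
gelu_min = min(gelu(x) for x in grid)
swish_min = min(swish(x) for x in grid)

print(round(gelu_min, 3), round(swish_min, 3))
```

Both functions are bounded below and unbounded above, in contrast to Leaky ReLU, which is unbounded in both directions.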

For hidden layers in feed-forward and convolutional networks, ReLU or one of its variants is the standard starting point. For transformer-based language models, GELU is most commonly used. Output layers use sigmoid for binary tasks, softmax for multi-class tasks, and a linear output (no activation) for regression.

Importance for Deep Learning

The development of effective activation functions is closely linked to the practical success of deep learning. The introduction of ReLU is widely credited as one of the key enablers of the deep learning renaissance in the early 2010s, alongside better weight initialisation schemes and the availability of GPU computing. More recently, smoother functions like GELU and Swish have become standard components of very large language models and vision transformers.

Research into activation functions continues, with ongoing interest in learnable activation functions where the shape of the function itself is a parameter optimised during training.

References

  1. Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML).
  2. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
  3. Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv:1606.08415.
  4. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. arXiv:1710.05941.
  5. Google for Developers. (2024). Neural networks: Activation functions. Machine Learning Crash Course.