Activation Function

A mathematical function applied to a neuron's output in a neural network that introduces non-linearity, enabling models to learn complex patterns beyond simple linear relationships.

Last updated June 2026 | Foundations

An activation function is a mathematical function applied to the output of a neuron in an artificial neural network before that output is passed to the next layer. Activation functions serve a fundamental purpose: they introduce non-linearity into the network. Without them, a stack of layers composes into a single linear (affine) transformation, so a multi-layer network would behave identically to a single-layer linear model regardless of depth and could only solve linearly separable problems. With a non-linearity at each layer, a sufficiently large network can approximate any continuous function on a closed, bounded input domain to arbitrary accuracy, a property formalised by the Universal Approximation Theorem.
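The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch in plain Python; the weight matrices `W1` and `W2`, the input `x`, and the helper names are hypothetical values chosen for illustration, and biases are omitted for brevity.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrices A and B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def relu(v):
    """Element-wise ReLU on a vector."""
    return [max(0.0, vi) for vi in v]


# Hypothetical weights for two bias-free layers and one input vector.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

# Two layers with no activation in between ...
stacked_linear = matvec(W2, matvec(W1, x))
# ... give the same result as one layer whose weight matrix is W2 @ W1.
collapsed = matvec(matmul(W2, W1), x)
# Placing a non-linearity between the layers breaks this equivalence.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(stacked_linear, collapsed, with_relu)
```

For this input the first two results coincide, while the ReLU version differs because the negative intermediate value is clipped to zero before the second layer.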

The choice of activation function affects how quickly a network learns, how stable training is, and how well gradients propagate through many layers during backpropagation. Different activation functions are suited to different tasks and positions within a network architecture.

Early Activation Functions

Sigmoid

The sigmoid function, also called the logistic function, maps any real-valued input to an output in the range (0, 1), producing a characteristic S-shaped curve. Because its outputs can be interpreted as probabilities, sigmoid was the dominant activation function in early neural networks from the 1980s onward and remains widely used in the output layer of binary classifiers.

The sigmoid function is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Despite its intuitive properties, sigmoid suffers from the vanishing gradient problem. Its derivative peaks at 0.25 (at an input of zero) and approaches zero for inputs of large magnitude, whether positive or negative. During backpropagation, these small gradients are multiplied across layers, causing earlier layers in a deep network to receive almost no training signal. This made deep networks extremely difficult to train, a barrier that limited progress in the field until the 2010s.
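The shrinking gradient can be illustrated with a short sketch. It uses the standard logistic function and its derivative, sigmoid(x) · (1 − sigmoid(x)); the function names and the choice of ten layers are illustrative assumptions, and the product below considers only the activation derivatives, ignoring the weights.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), which never exceeds 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


peak = sigmoid_grad(0.0)        # largest possible value of the derivative
saturated = sigmoid_grad(10.0)  # far from zero the derivative is tiny

# Best case for the activation derivatives alone across 10 sigmoid layers:
# even at the peak, the product shrinks geometrically with depth.
bound_10_layers = peak ** 10

print(peak, saturated, bound_10_layers)
```

Even in the most favourable case, ten sigmoid layers scale the gradient by less than one millionth, which is why early layers barely learn.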

Hyperbolic Tangent (Tanh)

The hyperbolic tangent function maps inputs to the range (-1, 1), making it zero-centred, unlike sigmoid. Zero-centred activations generally lead to more stable gradient updates. Tanh is widely used in recurrent neural networks, including LSTM architectures, and outperforms sigmoid in many hidden-layer applications. However, it saturates in the same way as sigmoid: for inputs of large magnitude, positive or negative, its gradient approaches zero.
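A brief sketch comparing the two functions on a symmetric set of inputs makes the zero-centring and the saturation concrete. The input values are arbitrary illustrative choices, and the derivative used is the standard identity 1 − tanh²(x).

```python
import math


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Inputs symmetric around zero (illustrative values).
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Tanh outputs average to zero; sigmoid outputs average to 0.5.
tanh_mean = sum(math.tanh(x) for x in inputs) / len(inputs)
sigmoid_mean = sum(sigmoid(x) for x in inputs) / len(inputs)

# The gradient is 1 at the origin but collapses once the input saturates.
grad_at_zero = tanh_grad(0.0)
grad_saturated = tanh_grad(5.0)

print(tanh_mean, sigmoid_mean, grad_at_zero, grad_saturated)
```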

Modern Activation Functions

Rectified Linear Unit (ReLU)

The Rectified Linear Unit (ReLU) transformed practical deep learning when it was popularised by Nair and Hinton (2010) and notably deployed in AlexNet (2012). ReLU is defined simply as:

$$f(x) = \max(0, x)$$

That is, ReLU outputs the input directly for positive values and outputs zero for negative values. Its computational simplicity, combined with the absence of the vanishing gradient problem for positive inputs, allowed researchers to train much deeper networks than were previously feasible. ReLU remains the most widely used activation function in convolutional neural networks and feed-forward layers.
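As a minimal sketch, ReLU and its gradient can be written in a few lines. The derivative is undefined at exactly zero; returning 0 there is a common convention and is an assumption of this example, as are the function names.

```python
def relu(x):
    """Pass positive inputs through unchanged; clip everything else to zero."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise.

    The value at exactly x == 0 is a convention (0 here), not a true derivative.
    """
    return 1.0 if x > 0 else 0.0


samples = (-3.0, -0.5, 0.0, 2.0)
activations = [relu(x) for x in samples]
gradients = [relu_grad(x) for x in samples]

# For positive inputs the activation derivative is exactly 1, so multiplying
# it across many layers does not shrink the signal (unlike sigmoid or tanh).
deep_signal = relu_grad(2.0) ** 10

print(activations, gradients, deep_signal)
```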

ReLU does suffer from the dying ReLU problem: if a neuron's pre-activation is negative for every training input, its output and gradient are both zero, its weights stop updating, and it may never recover. Several variants address this:

Leaky ReLU allows a small, non-zero gradient for negative inputs by multiplying them by a small constant (typically 0.01).
Parametric ReLU (PReLU) treats the negative-slope coefficient as a learnable parameter.
ELU (Exponential Linear Unit) uses an exponential function for negative inputs, producing smooth negative outputs and faster convergence.
SELU (Scaled ELU) introduces self-normalising properties that maintain stable activations across layers without batch normalisation.
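The variants listed above can be sketched as scalar functions. The default slope of 0.01 comes from the text; the ELU default `alpha=1.0` is a common choice assumed here, and the two SELU constants are the rounded values usually quoted for the self-normalising case. Function names are illustrative.

```python
import math

# Constants commonly quoted for SELU (rounded to double precision).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def leaky_relu(x, slope=0.01):
    """Small fixed slope for non-positive inputs instead of a hard zero."""
    return x if x > 0 else slope * x


def prelu(x, alpha):
    """Same shape as Leaky ReLU, but alpha would be learned during training."""
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    """Exponential curve for non-positive inputs, saturating at -alpha."""
    return x if x > 0 else alpha * math.expm1(x)


def selu(x):
    """Scaled ELU with fixed alpha and scale."""
    return SELU_SCALE * elu(x, SELU_ALPHA)


for fn in (leaky_relu, elu, selu):
    print(fn.__name__, fn(-2.0), fn(2.0))
print("prelu", prelu(-2.0, 0.25), prelu(2.0, 0.25))
```

Unlike plain ReLU, each of these returns a non-zero value for negative inputs, so some gradient continues to flow.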

GELU

The Gaussian Error Linear Unit (GELU) is a common choice of activation function in transformer architectures, including BERT, the GPT family, and many other large language models. It is defined as $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. GELU can be interpreted as the expected value of a stochastic regulariser and is differentiable everywhere, making it smooth and well-suited to gradient-based optimisation. Because each input is weighted by $\Phi(x)$, the probability that a standard Gaussian variable falls below it, neurons are gated softly rather than with a hard zero threshold.
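A minimal sketch of GELU, assuming the definition x · Φ(x) from Hendrycks and Gimpel (2016), where Φ is the standard normal CDF. The second function is the tanh-based approximation given in the same paper; the function names are illustrative.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi written in terms of the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, gelu(x), gelu_tanh(x))
```

Small negative inputs produce small negative outputs rather than a hard zero, which is the soft gating behaviour described above.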

Softmax

The softmax function is used in the output layer of multi-class classifiers. It converts a vector of real-valued scores into a probability distribution that sums to 1, with larger scores receiving higher probabilities. Softmax is generally not used as a hidden-layer activation (attention weights in transformers are a notable exception); its main role is to produce interpretable class probabilities at a network's final output.
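The conversion from scores to probabilities can be sketched as follows. Subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result; it is an implementation choice of this example, and the scores are made-up values.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    top = max(scores)
    # Shifting by the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical class scores
large = softmax([1000.0, 1000.0])  # would overflow without the shift

print(probs, sum(probs), large)
```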

Swish

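Swish, introduced by Google researchers in 2017, is defined as $f(x) = x \cdot \sigma(\beta x)$, where $\sigma$ is the sigmoid function and $\beta$ is a constant or learnable parameter (often set to 1). It is smooth and non-monotonic, and its authors reported that it matched or outperformed ReLU on a range of deep network benchmarks. Swish is used in various computer vision architectures including EfficientNet.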
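A short sketch of Swish, assuming the definition x · sigmoid(βx) from Ramachandran et al. (2017) with β defaulting to 1. It also shows the non-monotonic dip on the negative side; the sample points are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swish(x, beta=1.0):
    """Swish: the input scaled by a sigmoid gate of itself."""
    return x * sigmoid(beta * x)


# Non-monotonic behaviour: moving from -5 towards -1 the output *decreases*,
# then rises back to 0 at the origin.
far_negative = swish(-5.0)
near_negative = swish(-1.0)
at_zero = swish(0.0)

print(far_negative, near_negative, at_zero)
```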

Selecting an Activation Function

The choice of activation function depends on the network architecture and task:

| Activation | Range | Common Use |
|---|---|---|
| Sigmoid | (0, 1) | Binary classification output |
| Tanh | (-1, 1) | Recurrent layers, historical hidden layers |
| ReLU | [0, inf) | CNN hidden layers, feed-forward networks |
| Leaky ReLU | (-inf, inf) | Alternative to ReLU to avoid dead neurons |
| GELU | (-0.17, inf) approx | Transformer hidden layers |
| Softmax | (0, 1) summing to 1 | Multi-class classification output |
| Swish | (-0.28, inf) approx | Deep vision networks |
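The approximate lower bounds quoted for GELU and Swish can be checked numerically. This sketch assumes the definitions x · Φ(x) for GELU and x · sigmoid(x) for Swish (β = 1), and uses a simple grid search over [-5, 0]; the grid spacing is an arbitrary choice.

```python
import math


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x):
    return x / (1.0 + math.exp(-x))


# Evaluate both functions on a fine grid of non-positive inputs,
# where their minima lie, and keep the smallest value seen.
grid = [i / 1000.0 for i in range(-5000, 1)]
gelu_min = min(gelu(x) for x in grid)
swish_min = min(swish(x) for x in grid)

print(round(gelu_min, 3), round(swish_min, 3))
```

The two minima round to the lower bounds listed in the table.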

For hidden layers in feedforward and convolutional networks, ReLU or one of its variants is the standard starting point. For transformer-based language models, GELU is most commonly used. Output layers use sigmoid for binary tasks, softmax for multi-class tasks, and linear (no activation) for regression.

Importance for Deep Learning

The development of effective activation functions is closely linked to the practical success of deep learning. The introduction of ReLU is widely credited as one of the key enablers of the deep learning renaissance in the early 2010s, alongside better weight initialisation schemes and the availability of GPU computing. More recently, smoother functions like GELU and Swish have become standard components of very large language models and vision transformers.

Research into activation functions continues, with ongoing interest in learnable activation functions where the shape of the function itself is a parameter optimised during training.
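PReLU, mentioned earlier, is perhaps the simplest case of a learnable activation function. The sketch below fits its negative-slope coefficient by gradient descent on a toy problem; the data, the target slope of 0.2, the learning rate, and the step count are all hypothetical values chosen so the example converges quickly.

```python
def prelu(x, alpha):
    """PReLU with negative-slope coefficient alpha."""
    return x if x > 0 else alpha * x


def fit_alpha(xs, targets, alpha=0.01, lr=0.1, steps=200):
    """Fit alpha by gradient descent on mean squared error."""
    n = len(xs)
    for _ in range(steps):
        grad = 0.0
        for x, t in zip(xs, targets):
            error = prelu(x, alpha) - t
            # d prelu / d alpha is x for non-positive inputs and 0 otherwise.
            d_alpha = x if x <= 0 else 0.0
            grad += 2.0 * error * d_alpha / n
        alpha -= lr * grad
    return alpha


# Toy data generated by a PReLU whose true slope is 0.2 (hypothetical).
xs = [-2.0, -1.0, -0.5, 1.0, 2.0]
targets = [prelu(x, 0.2) for x in xs]

learned = fit_alpha(xs, targets)
print(learned)
```

Only the negative inputs contribute to the gradient, so the coefficient is shaped entirely by the region where PReLU differs from ReLU.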

Malaysian Context — AI Education and Research

Understanding activation functions is fundamental to any AI curriculum, and Malaysian universities offering data science and AI programmes cover this topic extensively. Universiti Malaya (UM), Universiti Kebangsaan Malaysia (UKM), Universiti Teknologi Malaysia (UTM), and Universiti Sains Malaysia (USM) all include neural network fundamentals — including activation functions — in their undergraduate and postgraduate AI programmes.

The Human Resource Development Corporation (HRD Corp), previously known as the Human Resources Development Fund (HRDF), funds training programmes covering deep learning fundamentals for Malaysian industry professionals. Multiple accredited training providers offer courses on neural network design, where the selection of appropriate activation functions for different tasks is a practical focus area.

Malaysian AI practitioners working on computer vision applications in manufacturing — particularly in the Penang electronics and semiconductor sector — regularly apply ReLU-based convolutional networks in inspection systems. Companies such as ViTrox Corporation, a Penang-based machine vision firm listed on Bursa Malaysia, deploy custom neural network architectures where activation function choice directly affects inference speed and accuracy on production lines.

In the financial sector, institutions such as Maybank and CIMB that are developing internal credit scoring and fraud detection models use feed-forward networks where ReLU variants are standard in hidden layers. GELU has become increasingly prominent as Malaysian AI teams adopt transformer architectures for NLP tasks in Bahasa Malaysia.

MDEC's AI programmes, including the Malaysia Digital Acceleration Grant for AI (MDAG-AI), support SMEs in adopting AI solutions, and the technical guidance materials produced by MDEC reference modern activation function practices as part of best-practice neural network design.

References

Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). arXiv:1606.08415.
Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. arXiv:1710.05941.
Google for Developers. (2024). Neural networks: Activation functions. Machine Learning Crash Course.

Tags: neural network, deep learning, ReLU, sigmoid, non-linearity

Type	Neural network component
Purpose	Introduce non-linearity
Common types	ReLU, Sigmoid, Tanh, Softmax, GELU
Key property	Differentiability for backpropagation
Related	Neural network, Backpropagation, Deep learning