AIWiki

Softmax Function

The softmax function converts a vector of real-valued scores into a probability distribution, and is widely used as the output layer of neural network classifiers and in attention mechanisms.

4 min read · Last updated June 2026 · Foundations

The softmax function takes a vector of arbitrary real numbers, often called logits, and transforms it into a vector of values between 0 and 1 that sum to 1, thereby forming a valid probability distribution. For an input vector z with components z_1, ..., z_K, the i-th output is softmax(z)_i = exp(z_i) / sum_j exp(z_j): the exponential of that component divided by the sum of the exponentials of all components. The function is a smooth, differentiable approximation of the arg-max operation (viewed as a one-hot indicator of the largest element), which is why it is sometimes described as a soft version of selecting the largest element.
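As a minimal sketch of the definition above, the following Python transcribes the formula literally using only the standard library. The function name `softmax` and the example logits are illustrative choices, not part of any particular framework, and this direct form is not protected against overflow for large inputs.

```python
import math

def softmax(logits):
    """Literal transcription of exp(z_i) / sum_j exp(z_j)."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits (an assumption for the example).
probs = softmax([2.0, 1.0, 0.1])
print(probs)       # three values strictly between 0 and 1
print(sum(probs))  # 1.0 up to floating-point rounding
```

The largest logit (2.0) receives roughly 0.66 of the probability mass here, noticeably more than its share of the raw scores.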

Properties

Softmax has several properties that make it well suited to machine learning. Its outputs are strictly positive and always sum to one, so they can be interpreted directly as class probabilities. It preserves the order of its inputs, so the largest logit yields the largest probability. The exponential exaggerates differences between inputs: a logit that is moderately larger than the others receives a disproportionately large share of the probability mass, while logits that are close together receive nearly equal probabilities.
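The properties in this section can be checked numerically. The sketch below is illustrative only: the helper `ranking` and the example logits are assumptions made for the demonstration, and the comparison against a plain linear share is just one way to show the exaggerating effect of the exponential.

```python
import math

def softmax(logits):
    # Subtracting the maximum does not change the result; it only avoids overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ranking(values):
    """Indices of `values` sorted from smallest to largest (hypothetical helper)."""
    return sorted(range(len(values)), key=lambda i: values[i])

logits = [1.0, 3.0, 2.0]
probs = softmax(logits)

# Order preservation: logits and probabilities rank the classes identically.
same_order = ranking(logits) == ranking(probs)

# Exaggeration: the top logit holds 3/6 of the raw scores but more of the probability.
linear_share = max(logits) / sum(logits)
softmax_share = max(probs)

# Nearly equal logits give nearly equal probabilities.
close = softmax([1.0, 1.01])

print(same_order, linear_share, softmax_share, close)
```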

A useful feature is that softmax is invariant to adding a constant to every input, because the common factor exp(c) cancels between the numerator and the denominator. Subtracting the maximum logit from each component before exponentiating therefore changes nothing mathematically, but it ensures that every exponent is at most zero, so no exponential can overflow. This trick is standard in numerically stable implementations.
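The max-subtraction trick can be sketched as follows. The names `softmax_naive` and `softmax_stable` are hypothetical, chosen for this illustration, and the large logits are picked deliberately so that Python's `math.exp` raises `OverflowError` in the unshifted version.

```python
import math

def softmax_naive(logits):
    """Direct formula; overflows once a logit exceeds roughly 709."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    """Same function, computed after shifting so the largest exponent is 0."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 1001.0, 1002.0]

try:
    softmax_naive(big)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable_big = softmax_stable(big)
stable_small = softmax_stable([0.0, 1.0, 2.0])  # same logits shifted by -1000

print(naive_overflowed)  # True
print(stable_big)
print(stable_small)      # matches stable_big, illustrating shift invariance
```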

Relationship to classification

Softmax is the canonical output layer for multi-class classification networks. The raw scores produced by the final linear layer are passed through softmax to obtain a probability for each class, and the class with the highest probability is taken as the prediction. During training, the softmax output is compared against the true class using cross-entropy loss, a pairing so common that many frameworks fuse the two operations for efficiency and numerical stability. The gradient of the cross-entropy loss with respect to the logits takes a simple form, the vector of predicted probabilities minus the one-hot encoding of the true label, which makes optimisation by gradient descent straightforward.
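The gradient identity can be verified numerically. The sketch below assumes a single example whose label is given as a class index `target`; the function names and the finite-difference step `eps` are illustrative choices, not taken from any framework.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    """Predicted probabilities minus the one-hot encoding of the target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        diff = cross_entropy(up, target) - cross_entropy(down, target)
        grads.append(diff / (2 * eps))
    return grads

logits, target = [2.0, 1.0, 0.1], 0
g_analytic = analytic_grad(logits, target)
g_numeric = numeric_grad(logits, target)
print(g_analytic)
print(g_numeric)  # agrees with g_analytic to within finite-difference error
```

The gradient components sum to zero, and the entry for the true class is negative, so a gradient-descent step raises the true-class logit relative to the others.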

For two-class problems, softmax reduces to the logistic sigmoid function applied to the difference between the two logits. Softmax classification can therefore be seen as the multi-class extension of logistic regression.
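The two-class reduction follows from dividing numerator and denominator by exp(z_1): exp(z_1) / (exp(z_1) + exp(z_2)) = 1 / (1 + exp(-(z_1 - z_2))), which is the sigmoid of z_1 - z_2. A small numerical check, with arbitrarily chosen logit pairs as an assumption of the example:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Arbitrary two-class logit pairs (illustrative values).
pairs = [(0.0, 0.0), (2.0, -1.0), (-3.5, 0.5)]

# Largest gap between softmax's first output and sigmoid of the logit difference.
max_gap = max(abs(softmax([z1, z2])[0] - sigmoid(z1 - z2)) for z1, z2 in pairs)
print(max_gap)  # zero up to floating-point rounding
```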

Beyond classification

The function appears throughout modern deep learning beyond the final classification layer. In the attention mechanism at the heart of transformer architectures, softmax converts raw compatibility scores between tokens into attention weights that sum to one, determining how much each token attends to every other. In reinforcement learning, softmax over action values produces a stochastic policy. A temperature parameter is often introduced to control sharpness, with the logits divided by the temperature before softmax is applied: a high temperature flattens the distribution toward uniform, encouraging exploration or diversity, while a low temperature concentrates mass on the top choice. This temperature control is one of the main settings governing the randomness of text produced by large language models during sampling.
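A minimal sketch of temperature scaling, assuming the common convention of dividing each logit by the temperature before applying softmax. The function name `softmax_with_temperature` and the example values are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.1)   # low temperature: mass on the top choice
base = softmax_with_temperature(logits, 1.0)    # ordinary softmax
flat = softmax_with_temperature(logits, 10.0)   # high temperature: close to uniform

print(sharp)
print(base)
print(flat)
```

With these inputs, the low-temperature distribution puts almost all of its mass on the first entry, while the high-temperature one is within a few percentage points of uniform.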

| Setting | Role of softmax |
| --- | --- |
| Image or text classifier | Produces class probabilities |
| Transformer attention | Normalises attention weights |
| Language model sampling | Shapes next-token distribution |
| Reinforcement learning | Defines a stochastic policy |
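To illustrate the transformer attention row, the sketch below computes attention weights for a toy sequence. It makes several simplifying assumptions: the token vectors are used directly as both queries and keys with no learned projections, the compatibility score is a dot product scaled by the square root of the dimension as in Vaswani et al. (2017), and the name `attention_weights` is hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Row i holds the weights with which query i attends to every key."""
    d = len(keys[0])
    weights = []
    for q in queries:
        # Scaled dot-product compatibility score between this query and each key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights

# Three toy token vectors of dimension 2 (illustrative values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(tokens, tokens)

for row in weights:
    print(row, sum(row))  # each row sums to 1 up to rounding
```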

References

  1. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
  2. Bridle, J. S. (1990). Probabilistic Interpretation of Feedforward Classification Network Outputs. Neurocomputing. Springer.
  3. Vaswani, A. et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems.