Cross-Entropy Loss
Cross-entropy loss is the standard objective function for training classification models, measuring the divergence between a predicted probability distribution and the true distribution of labels.
Cross-entropy loss is the objective function most commonly minimised when training classification models, including the vast majority of neural network classifiers and large language models. It quantifies how far a model's predicted probability distribution lies from the true distribution of the labels, returning a small value when the model assigns high probability to the correct class and a large value when it is confidently wrong. Minimising cross-entropy is equivalent to maximising the likelihood of the observed data under the model.
Definition
For a single example with a true label expressed as a one-hot vector y and a predicted distribution p, cross-entropy loss is L = -sum_i y_i * log(p_i): the negative sum over classes of the true probability times the logarithm of the predicted probability. Because the true distribution is usually one-hot, with a single correct class c, the expression simplifies to L = -log(p_c), the negative logarithm of the probability the model assigned to the correct class. If the model assigns probability close to 1 to the right class, the logarithm is near zero and the loss is small; if it assigns a tiny probability, the logarithm is a large negative number, so its negation, the loss, is large.
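The definition can be illustrated with a minimal sketch in plain Python. The function name `cross_entropy` and the example probabilities are assumptions chosen for illustration, not part of any particular library.

```python
import math

def cross_entropy(y, p):
    """Return -sum_i y_i * log(p_i) for a true distribution y and a prediction p."""
    # Classes with y_i == 0 contribute nothing, so they are skipped;
    # this also avoids evaluating log(0) for classes that do not matter.
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, p) if y_i > 0)

y = [0.0, 1.0, 0.0]                    # one-hot label: class 1 is correct (illustrative)
confident_right = [0.05, 0.90, 0.05]   # most probability on the correct class
confident_wrong = [0.90, 0.05, 0.05]   # most probability on a wrong class

loss_right = cross_entropy(y, confident_right)   # equals -log(0.90)
loss_wrong = cross_entropy(y, confident_wrong)   # equals -log(0.05)
print(loss_right, loss_wrong)
```

With natural logarithms the first loss is roughly 0.105 and the second roughly 3.0, which matches the behaviour described above: a confident wrong prediction is penalised far more heavily than a confident correct one.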
For binary classification the formula reduces to binary cross-entropy, L = -(y * log(p) + (1 - y) * log(1 - p)), where y is the true label (0 or 1) and p is the single output probability the model assigns to the positive class.
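As a rough check, the sketch below computes binary cross-entropy and compares it with the general formula applied to the two-class distribution [1 - p, p]. The helper names and the value p = 0.8 are assumptions made for this example.

```python
import math

def binary_cross_entropy(y, p):
    """Return -(y*log(p) + (1-y)*log(1-p)) for a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def cross_entropy(y, p):
    """General form: -sum_i y_i * log(p_i), skipping zero-weight classes."""
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, p) if y_i > 0)

p = 0.8  # illustrative predicted probability of the positive class

bce_pos = binary_cross_entropy(1, p)             # true label is positive
bce_neg = binary_cross_entropy(0, p)             # true label is negative
cce_pos = cross_entropy([0.0, 1.0], [1 - p, p])  # same case as bce_pos, two-class form
cce_neg = cross_entropy([1.0, 0.0], [1 - p, p])  # same case as bce_neg, two-class form

print(bce_pos, cce_pos)
print(bce_neg, cce_neg)
```

Each pair printed should agree up to floating-point rounding, showing that the binary formula is the two-class special case rather than a separate loss.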
Connection to information theory and likelihood
The name comes from information theory, where cross-entropy measures the average number of bits (when logarithms are taken to base 2) needed to encode events from one distribution using a code optimised for another. Cross-entropy equals the entropy of the true distribution plus the Kullback-Leibler divergence from the true distribution to the predicted one; since the entropy term does not depend on the model, minimising cross-entropy minimises the divergence and so drives the predicted distribution toward the true one. From a statistical viewpoint, the cross-entropy summed over a training set with one-hot labels is the negative log-likelihood of the labels under the model, so minimising it is exactly maximum likelihood estimation, which is one reason the loss is so well motivated.
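The decomposition into entropy plus Kullback-Leibler divergence can be checked numerically. The sketch below uses base-2 logarithms so that every quantity is in bits; the two distributions and all function names are assumptions chosen so the arithmetic is easy to follow.

```python
import math

def entropy(p):
    """H(p) = -sum_i p_i * log2(p_i), in bits."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

def cross_entropy_bits(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i), in bits."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log2(p_i / q_i), in bits."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p_true = [0.5, 0.25, 0.25]    # illustrative "true" distribution
q_model = [0.25, 0.5, 0.25]   # illustrative model distribution

h = entropy(p_true)                       # 1.5 bits
ce = cross_entropy_bits(p_true, q_model)  # 1.75 bits
kl = kl_divergence(p_true, q_model)       # 0.25 bits

print(h, kl, ce)  # ce equals h + kl
```

Here the model wastes 0.25 bits per event relative to the optimal code for the true distribution. That excess is the only part of the cross-entropy that training can reduce.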
Why it pairs with softmax
In multi-class models, cross-entropy is almost always applied to the output of a softmax layer. This pairing is favoured because the gradient of the combined operation with respect to the network's raw scores (logits) is simply the predicted probability minus the true label. This clean gradient avoids the saturation problems that arise when squared error is used with sigmoid or softmax outputs, where gradients can become vanishingly small, even for confidently wrong predictions, and stall learning. Because the gradient stays proportional to the prediction error, backpropagation receives a useful signal whenever the model is wrong. Deep-learning frameworks typically fuse softmax and cross-entropy into one numerically stable operation computed directly from the raw scores.
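The gradient identity can be verified with a small sketch in plain Python. It compares the analytic gradient (softmax output minus the one-hot label) against a finite-difference estimate and uses the log-sum-exp shift that fused implementations commonly rely on. All names, the logits, and the step size are assumptions made for this illustration; real frameworks implement the fused operation differently in detail.

```python
import math

def log_softmax(z):
    """Log of softmax(z), computed with the max-shift (log-sum-exp) trick."""
    m = max(z)
    lse = m + math.log(sum(math.exp(z_i - m) for z_i in z))
    return [z_i - lse for z_i in z]

def softmax_cross_entropy(z, target):
    """Cross-entropy of softmax(z) against the integer class index `target`."""
    return -log_softmax(z)[target]

def analytic_grad(z, target):
    """Gradient with respect to the logits: softmax(z) minus the one-hot label."""
    p = [math.exp(v) for v in log_softmax(z)]
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(z, target, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        diff = softmax_cross_entropy(up, target) - softmax_cross_entropy(down, target)
        grad.append(diff / (2 * eps))
    return grad

logits = [2.0, -1.0, 0.5]   # illustrative raw scores
target = 0                  # illustrative correct class

g_analytic = analytic_grad(logits, target)
g_numeric = numeric_grad(logits, target)
max_gap = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))
print(g_analytic, max_gap)

# A naive exp(1000.0) would overflow; the shifted version stays finite.
big_loss = softmax_cross_entropy([1000.0, 0.0], 0)
print(big_loss)
```

The two gradients should agree to within the finite-difference error, and the components of the gradient sum to zero because both the prediction and the label sum to one.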
| Variant | Setting |
| --- | --- |
| Categorical cross-entropy | Multi-class, one-hot labels |
| Binary cross-entropy | Two-class problems |
| Sparse categorical cross-entropy | Multi-class, integer labels |
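The categorical and sparse categorical variants differ only in how the label is encoded, not in the value they compute. A minimal sketch, with hypothetical helper names and an illustrative prediction:

```python
import math

def categorical_cross_entropy(one_hot, p):
    """Label given as a one-hot vector, e.g. [0, 1, 0]."""
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(one_hot, p) if y_i > 0)

def sparse_categorical_cross_entropy(class_index, p):
    """Label given as an integer class index, e.g. 1."""
    return -math.log(p[class_index])

p = [0.1, 0.7, 0.2]   # illustrative predicted distribution

dense = categorical_cross_entropy([0, 1, 0], p)
sparse = sparse_categorical_cross_entropy(1, p)
print(dense, sparse)  # both equal -log(0.7)
```

The integer form avoids materialising a one-hot vector, which matters mainly when the number of classes is large.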
Role in language models
In large language models, next-token prediction is a classification problem over the vocabulary, and cross-entropy is the loss used during pre-training and fine-tuning. The exponential of the average per-token cross-entropy, with the cross-entropy measured in nats, is the perplexity, a standard measure of how well a language model predicts text. Lower cross-entropy means lower perplexity and a better fit to the text on which it is measured.
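The relationship between cross-entropy and perplexity can be sketched as follows. The function name and the per-token probabilities are assumptions for illustration; in practice the probabilities would be those a model assigns to each actual next token in a text.

```python
import math

def perplexity(token_probs):
    """exp of the average per-token cross-entropy (natural log, i.e. nats)."""
    nll = [-math.log(p) for p in token_probs]   # per-token cross-entropy
    avg_ce = sum(nll) / len(nll)
    return math.exp(avg_ce)

# A model that always spreads probability evenly over 4 candidates.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that is more confident on some tokens (illustrative values).
sharper = perplexity([0.5, 0.5, 0.25, 0.25])

print(uniform, sharper)
```

The first value comes out at 4 up to rounding, matching the intuition that perplexity is the effective number of equally likely choices the model is weighing per token. The second is roughly 2.83, reflecting the lower average cross-entropy.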