Softmax, Argmax, and Cross Entropy
Quick Overview
This tutorial explains argmax, softmax, and cross-entropy, covering definitions, mathematical formulas, properties of softmax probabilities, loss computation, and worked numerical examples that demonstrate how logits convert to probabilities and are evaluated in neural network classification and Q-learning contexts.

Here’s a clean, intuitive explanation of argmax, softmax, and cross-entropy.
1. Argmax
Argmax = “which class has the highest score?”
If your model outputs something like:
scores = [2.1, 0.4, 5.3]
Then:
argmax(scores) = 2 # because 5.3 is the largest (indices start at 0)
- It does NOT give the value (5.3)
- It gives the index of the maximum value
Used in:
- Classification prediction
- Q-learning (choose best action = argmax(Q-values))
- Choosing the most probable class after softmax
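The idea can be sketched in a few lines of plain Python. The helper name `argmax` and the `q_values` numbers are illustrative assumptions, not taken from any particular library or environment.

```python
def argmax(values):
    """Return the index of the largest element (the first one on ties)."""
    if not values:
        raise ValueError("argmax of an empty sequence is undefined")
    return max(range(len(values)), key=lambda i: values[i])


scores = [2.1, 0.4, 5.3]
best_class = argmax(scores)      # the index of the maximum, not the value

# Hypothetical Q-values for three actions in one state (made-up numbers).
q_values = [0.5, 1.7, -0.2]
best_action = argmax(q_values)   # greedy action choice

print(best_class, scores[best_class])
print(best_action)
```

To get the value itself, index back into the list, as `scores[best_class]` does here.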
2. Softmax
Softmax turns raw model scores (logits) into probabilities:
pᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)
Properties:
- All probabilities are between 0 and 1
- Sum to 1
- Larger scores → larger probabilities: a gap of d between two logits becomes a factor of e^d between their probabilities
Example:
Scores:
z = [2, 1, 0]
Softmax:
e^2 = 7.389
e^1 = 2.718
e^0 = 1
sum = 11.107
softmax = [7.389/11.107, 2.718/11.107, 1/11.107]
≈ [0.665, 0.245, 0.090]
So the model thinks class 0 is most likely.
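The same calculation as a minimal Python sketch. Subtracting the largest logit before exponentiating is an addition not mentioned above: it leaves the result unchanged, because the common factor cancels in the ratio, and it is commonly done to keep `exp` from overflowing on large scores.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Shifting by the max does not change the ratios, only the scale
    # of the intermediate exponentials.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2, 1, 0])
print([round(p, 3) for p in probs])
print(sum(probs))
```

Because only the differences between logits matter, `softmax([1002, 1001, 1000])` gives the same probabilities as `softmax([2, 1, 0])`.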
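3. Cross-Entropy Loss
Cross-entropy measures how well the predicted probability distribution matches the true label.
For classification, if the true class is y and the model assigns it probability p_y, the loss is -log(p_y), where log is the natural logarithm:
- If the model gives the true class high probability, the loss is small (it reaches 0 when p_y = 1).
- If the model gives the true class low probability, the loss is large (it grows without bound as p_y → 0).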
Examples:
If true class = 0
Model predicts:
p = [0.8, 0.1, 0.1]
Loss = -log(0.8) = 0.223 (very good)
Bad prediction:
p = [0.2, 0.3, 0.5]
Loss = -log(0.2) = 1.609 (bad)
Cross-entropy punishes wrong, confident predictions the most.
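A small sketch that reproduces the two losses above. The function name `cross_entropy` is an assumption for illustration, and the third distribution is a made-up case of a confident but wrong prediction.

```python
import math


def cross_entropy(probs, true_class):
    """Negative natural log of the probability given to the true class."""
    return -math.log(probs[true_class])


good = cross_entropy([0.8, 0.1, 0.1], true_class=0)
bad = cross_entropy([0.2, 0.3, 0.5], true_class=0)

# Illustrative extra case: the model is very sure of the wrong class.
confident_wrong = cross_entropy([0.01, 0.98, 0.01], true_class=0)

print(round(good, 3), round(bad, 3), round(confident_wrong, 3))
```

Note that `math.log(0)` raises a `ValueError`, so a probability of exactly 0 for the true class would need special handling in this sketch.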
🎯 How these three work together
In neural networks:
- Model outputs scores (logits), e.g. [2.1, 0.4, 5.3]
- Softmax converts them to probabilities: ≈ [0.04, 0.01, 0.95]
- Cross-entropy checks how good the probability for the correct class is, e.g. if true class = 2 → loss = -log(0.95) ≈ 0.05
- At prediction time, use argmax to pick the most likely class.
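The whole pipeline can be sketched end to end in plain Python, using the example logits from this list. The variable names are illustrative, and the max shift inside the softmax step is an added numerical precaution rather than something the steps require.

```python
import math

logits = [2.1, 0.4, 5.3]
true_class = 2

# Softmax: logits -> probabilities (shifted by the max before exp).
shift = max(logits)
exps = [math.exp(z - shift) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Training-time signal: cross-entropy on the true class.
loss = -math.log(probs[true_class])

# Prediction time: index of the most likely class.
predicted = max(range(len(probs)), key=lambda i: probs[i])

print([round(p, 2) for p in probs], round(loss, 3), predicted)
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits directly would pick the same class.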
🔥 Concrete Numerical Example (Everything together)
Suppose:
- Model outputs logits z = [1.0, 3.0, 2.0]
- True label = class 1
Step 1 — Softmax
e^1 = 2.718
e^3 = 20.086
e^2 = 7.389
sum = 30.193
probabilities =
[2.718/30.193, 20.086/30.193, 7.389/30.193]
≈ [0.090, 0.665, 0.245]
Step 2 — Cross-entropy
True class = 1 → p_y = e^3 / sum ≈ 0.665
Loss:
CE = -log(0.665) ≈ 0.408
Step 3 — Prediction (argmax)
argmax(z) = 1 because "3.0" is the largest logit.
→ model predicts class 1.
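As a cross-check on the steps above, the loss can also be computed straight from the logits, since -log(p_y) = log(Σⱼ e^(zⱼ)) - z_y. The sketch below uses that identity. The function name `cross_entropy_from_logits` is hypothetical, and this shortcut is an alternative route to the same number, not a step the walkthrough itself takes.

```python
import math


def cross_entropy_from_logits(logits, true_class):
    """Compute -log(softmax(logits)[true_class]) without building probabilities."""
    # log(sum(exp(z))) evaluated with a max shift to avoid overflow.
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[true_class]


logits = [1.0, 3.0, 2.0]
loss = cross_entropy_from_logits(logits, true_class=1)

# Argmax on the logits gives the predicted class.
predicted = max(range(len(logits)), key=lambda i: logits[i])

print(round(loss, 3), predicted)
```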
Summary Table
| Concept | What it does | Formula | Example |
|---|---|---|---|
| Argmax | Returns the index of the largest score | argmax(z) | [1,5,3] → 1 |
| Softmax | Converts logits → probabilities | e^(zᵢ) / Σⱼ e^(zⱼ) | [2,1,0] → [0.665,0.245,0.090] |
| Cross-Entropy | Measures how wrong the predicted probability is | −log(p_y) | true=1, p=0.665 → loss=0.408 |