
Backpropagation Through Matrix Multiplication

TL;DR
In this long and technical blog post, I explain how forward propagation and backpropagation work in matrix form: from a single weighted sum to batched multi-layer networks, the softmax + cross-entropy gradient, and the repeating pattern of transposes and element-wise products that drives the backward pass.


Backpropagation appears quite straightforward when working with scalars or even simple vectors. However, once we step into the world of matrices, things quickly become more complex and difficult to follow. There are extra details and notations that make it less intuitive.

Personally, although I managed to understand this concept while preparing for deep learning exams, I usually lose my intuition a few months later. Reviewing it again requires extra effort to bring everything back together. That's why I decided to write this technical blog post, both as a reminder for myself and as a guide for anyone trying to understand backpropagation in matrix form.

1 - Forward Propagation

When training a deep learning model, the first step is to compute a linear combination of the input features. Given an input vector $x \in \mathbb{R}^{M}$ with $M$ features and a corresponding weight vector $w \in \mathbb{R}^{M}$, we calculate a weighted sum and add a bias term $b \in \mathbb{R}$ to produce a single output $z$:

$$z = w_1 \cdot x_1 + w_2 \cdot x_2 + \cdots + w_M \cdot x_M + b$$

This is simply the dot product between w and x with an added bias term:

$$z = w^T x + b$$

To visualize this computation, we can represent it in matrix form:

$$[z] = \begin{bmatrix} w_1 & w_2 & \cdots & w_M \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} + [b]$$

or in image representation:

[Figure: visual representation of the weighted sum with the bias term]

We can simplify our notation by incorporating the bias term directly into the weight vector. This involves adding the bias b as the first element of the weight vector and including a constant input x0=1:

$$[z] = \begin{bmatrix} b & w_1 & w_2 & \cdots & w_M \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix}$$

For notational clarity, we'll denote the bias term as $w_0 = b$ and the constant input as $x_0 = 1$. Under this convention, both vectors have dimension $M+1$: $w \in \mathbb{R}^{M+1}$ and $x \in \mathbb{R}^{M+1}$. Our equation becomes:

$$z = w^T x$$

or in matrix form:

$$[z] = \begin{bmatrix} w_0 & w_1 & w_2 & \cdots & w_M \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix}$$

or in image representation:

[Figure: visual representation of the weighted sum with the bias absorbed into the weight vector]
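To make the notation concrete, here is a minimal NumPy sketch (the feature count `M = 4`, the random values, and the variable names are my own illustrative choices, not part of the derivation above) showing that absorbing the bias into the weight vector gives exactly the same output $z$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                               # number of input features (illustrative)
x = rng.standard_normal(M)          # input vector x with M features
w = rng.standard_normal(M)          # weight vector w with M weights
b = 0.5                             # scalar bias (arbitrary value)

# Plain form: z = w^T x + b
z_plain = w @ x + b

# Augmented form: prepend x_0 = 1 to x and w_0 = b to w, then z = w^T x
x_aug = np.concatenate(([1.0], x))  # now (M+1,)
w_aug = np.concatenate(([b], w))    # now (M+1,)
z_aug = w_aug @ x_aug

print(np.isclose(z_plain, z_aug))   # True: both forms give the same z
```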

1.1 - Adding Non-Linearity

The linear combination alone is insufficient for learning complex patterns. To add non-linearity, we apply an activation function σ(·) to the output z.

$$a = \sigma(z) = \sigma(w^T x)$$

Although σ often denotes the Sigmoid function, here it represents a general activation function. For this explanation, we'll use the ReLU activation function in hidden layers due to its simplicity, computational efficiency, resistance to vanishing gradients, and widespread popularity:

$$a = \sigma(z) = \max(0, z)$$

We can show this with a visual representation:

[Figure: visual representation of the ReLU activation applied to the output z]

1.2 - Multi-class Classification

In practice, deep learning models often solve multi-class problems where we need to predict one of $C$ possible classes. This requires the output $a$ to be a $C$-dimensional vector, with each dimension representing the score for a particular class. The predicted class corresponds to the entry with the highest activation value.

To generate $C$ outputs, we need a separate weight vector for each class, each of dimension $M+1$. Stacking them gives a weight matrix $W \in \mathbb{R}^{C \times (M+1)}$:

$$a_{[C \times 1]} = \sigma\left(W_{[C \times (M+1)]}\, x_{[(M+1) \times 1]}\right)$$

or in expanded form:

$$\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_C \end{bmatrix}
= \sigma\!\left(
\begin{bmatrix}
W_{10} & W_{11} & W_{12} & \cdots & W_{1M} \\
W_{20} & W_{21} & W_{22} & \cdots & W_{2M} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
W_{C0} & W_{C1} & W_{C2} & \cdots & W_{CM}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix}
\right)$$

or in visual form:

[Figure: the same network drawn once per class, highlighting each output path and its own set of output weights]

Each figure represents the same network, but highlights different output paths from the same inputs, and each path is computed by its own set of output weights.
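As a quick sanity check on the shapes, here is a small sketch of the single-example multi-class computation $a = \sigma(Wx)$; the class count `C = 3` and the use of ReLU as $\sigma$ here are my own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
M, C = 4, 3                                          # features and classes (illustrative)
x = np.concatenate(([1.0], rng.standard_normal(M)))  # x has shape (M+1,), with x_0 = 1
W = rng.standard_normal((C, M + 1))                  # W has shape (C, M+1): one row per class

z = W @ x                                            # C raw class scores, shape (C,)
a = np.maximum(0.0, z)                               # ReLU applied element-wise
print(a.shape, a.argmax())                           # (3,) and the index of the predicted class
```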

1.3 - Batch Processing

The computations described so far process only a single input example (batch size = 1). In practice, we process multiple inputs in parallel to improve computational efficiency.

If we process a batch of $N$ inputs simultaneously, our input becomes $X \in \mathbb{R}^{N \times (M+1)}$ (note the uppercase $X$, since we now have a matrix rather than a vector). To handle the $N$ outputs while maintaining proper matrix dimensions, we transpose our weight matrix to $W \in \mathbb{R}^{(M+1) \times C}$:

$$A_{[N \times C]} = \sigma\left(X_{[N \times (M+1)]}\, W_{[(M+1) \times C]}\right)$$

or in matrix form:

$$\begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1C} \\
A_{21} & A_{22} & \cdots & A_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
A_{N1} & A_{N2} & \cdots & A_{NC}
\end{bmatrix}
= \sigma\!\left(
\begin{bmatrix}
X_{10} & X_{11} & X_{12} & \cdots & X_{1M} \\
X_{20} & X_{21} & X_{22} & \cdots & X_{2M} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
X_{N0} & X_{N1} & X_{N2} & \cdots & X_{NM}
\end{bmatrix}
\begin{bmatrix}
W_{01} & W_{02} & \cdots & W_{0C} \\
W_{11} & W_{12} & \cdots & W_{1C} \\
W_{21} & W_{22} & \cdots & W_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
W_{M1} & W_{M2} & \cdots & W_{MC}
\end{bmatrix}
\right)$$

When dealing with batches, it can sometimes be hard to picture how forward propagation works. We can visualize the process like this:

[Figure: forward propagation for a batch of N inputs]

Note that in this illustration, the non-linearity (activation function) is not explicitly shown. The final output A actually corresponds to the activated values.
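In code, the batched version only changes the shapes: the examples are stacked into the rows of $X$ and the weight matrix is used in its $(M+1) \times C$ orientation. A minimal sketch (the batch size `N = 8` is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, C = 8, 4, 3                                    # batch size, features, classes (illustrative)
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, M))])  # X has shape (N, M+1), first column is x_0 = 1
W = rng.standard_normal((M + 1, C))                  # W has shape (M+1, C)

Z = X @ W                                            # raw scores, shape (N, C)
A = np.maximum(0.0, Z)                               # activation applied element-wise
print(Z.shape, A.shape)                              # (8, 3) (8, 3)
```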

1.4 - Multi-layer Networks

The calculations presented so far describe a single layer. Deep learning models stack multiple layers to learn increasingly complex representations. We can denote each layer with a superscript [l] to represent the layer number:

$$\begin{aligned}
Z^{[l]}_{[N \times H_l]} &= A^{[l-1]}_{[N \times H_{l-1}]}\, W^{[l]}_{[H_{l-1} \times H_l]} \\
A^{[l]}_{[N \times H_l]} &= \sigma\left(Z^{[l]}_{[N \times H_l]}\right)
\end{aligned}$$

Here $H_l$ represents the number of hidden units in layer $l$. The input to layer $l$ is the activated output from the previous layer $l-1$. For simplicity, we usually define $A^{[0]} = X$ (the input layer).

[Figure: a multi-layer network with the layer-wise computations]
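Stacking layers is just repeating the same matrix multiplication with one weight matrix per layer. Here is a rough sketch of the forward loop; the hidden sizes and the small random initialization are arbitrary choices of mine, and the bias trick is omitted to keep it short:

```python
import numpy as np

def relu(Z):
    return np.maximum(0.0, Z)

rng = np.random.default_rng(3)
N, M = 8, 4                             # batch size and input features (illustrative)
layer_sizes = [M, 16, 8, 3]             # H_0 = M, two hidden layers, 3 outputs

# One weight matrix per layer: W^[l] has shape (H_{l-1}, H_l)
Ws = [0.1 * rng.standard_normal((h_in, h_out))
      for h_in, h_out in zip(layer_sizes[:-1], layer_sizes[1:])]

A = rng.standard_normal((N, M))         # A^[0] = X
for W in Ws:
    Z = A @ W                           # Z^[l] = A^[l-1] W^[l]
    A = relu(Z)                         # A^[l] = σ(Z^[l])

print(A.shape)                          # (8, 3): activations of the last layer
```

In practice the final layer would feed into the softmax described next rather than another ReLU.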

1.5 - Softmax for Classification

For classification problems, we apply the softmax activation function to the final layer's output to convert the raw scores into a probability distribution. The softmax function ensures that all outputs sum to 1, allowing us to interpret them as class probabilities.

The softmax function is applied row-wise (for each input example) across the C classes:

$$\hat{Y}_{ij} = \mathrm{softmax}(Z_{ij}) = \frac{e^{Z_{ij}}}{\sum_{l=1}^{C} e^{Z_{il}}} \qquad i = 1, 2, \ldots, N \ \text{ and } \ j = 1, 2, \ldots, C$$

Intuitively, for input example i, this computes the probability of class j by normalizing the exponential of its score by the sum of exponentials across all C possible classes.

The matrix representation shows this row-wise operation:

$$\begin{bmatrix}
\hat{Y}_{11} & \hat{Y}_{12} & \cdots & \hat{Y}_{1C} \\
\hat{Y}_{21} & \hat{Y}_{22} & \cdots & \hat{Y}_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{Y}_{N1} & \hat{Y}_{N2} & \cdots & \hat{Y}_{NC}
\end{bmatrix}
=
\begin{bmatrix}
\mathrm{softmax}(Z_{11} \; Z_{12} \; \cdots \; Z_{1C}) \\
\mathrm{softmax}(Z_{21} \; Z_{22} \; \cdots \; Z_{2C}) \\
\vdots \\
\mathrm{softmax}(Z_{N1} \; Z_{N2} \; \cdots \; Z_{NC})
\end{bmatrix}
=
\begin{bmatrix}
\frac{e^{Z_{11}}}{\sum_{l=1}^{C} e^{Z_{1l}}} & \frac{e^{Z_{12}}}{\sum_{l=1}^{C} e^{Z_{1l}}} & \cdots & \frac{e^{Z_{1C}}}{\sum_{l=1}^{C} e^{Z_{1l}}} \\
\frac{e^{Z_{21}}}{\sum_{l=1}^{C} e^{Z_{2l}}} & \frac{e^{Z_{22}}}{\sum_{l=1}^{C} e^{Z_{2l}}} & \cdots & \frac{e^{Z_{2C}}}{\sum_{l=1}^{C} e^{Z_{2l}}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{e^{Z_{N1}}}{\sum_{l=1}^{C} e^{Z_{Nl}}} & \frac{e^{Z_{N2}}}{\sum_{l=1}^{C} e^{Z_{Nl}}} & \cdots & \frac{e^{Z_{NC}}}{\sum_{l=1}^{C} e^{Z_{Nl}}}
\end{bmatrix}$$

If we denote the last layer as L, then the model’s output for N inputs and C classes can be represented as follows:

[Figure: the model output for N inputs and C classes at the last layer L]
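A row-wise softmax is only a few lines of NumPy. One practical detail not discussed above: subtracting the row maximum before exponentiating avoids overflow and does not change the result, since the shift cancels in the ratio.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax: each row of Z becomes a probability distribution over C classes."""
    Z_shift = Z - Z.max(axis=1, keepdims=True)   # numerical stability; cancels in the ratio
    expZ = np.exp(Z_shift)
    return expZ / expZ.sum(axis=1, keepdims=True)

Z = np.array([[2.0, 1.0, 0.1],
              [1.0, 3.0, 0.2]])                  # toy scores for N = 2 examples, C = 3 classes
Y_hat = softmax_rows(Z)
print(Y_hat.sum(axis=1))                         # [1. 1.]: each row sums to one
```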

1.6 - Cross-Entropy Loss Function

After obtaining the predicted probabilities $\hat{Y}$, we measure the model's performance by comparing these predictions with the ground-truth labels $Y \in \mathbb{R}^{N \times C}$ using a loss function. The ground truth is represented as one-hot encoded vectors, where each input example has exactly one correct class. For multi-class classification, we typically use the Cross-Entropy (CE) loss:

$$\mathcal{L} = CE(Y, \hat{Y}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} Y_{ij} \log \hat{Y}_{ij}$$

This formula computes the element-wise product between the true labels $Y$ and the logarithm of the predicted probabilities $\log \hat{Y}$, then averages across all examples to produce a single scalar loss value.

We can express this using matrix operations with the element-wise (Hadamard) product $\odot$:

$$\mathcal{L}_{[1 \times 1]} = CE(Y, \hat{Y}) = -\frac{1}{N}\, \mathbf{1}^{T}_{[1 \times N]} \left( Y \odot \log \hat{Y} \right)_{[N \times C]} \mathbf{1}_{[C \times 1]}$$

Expanding this matrix operation:

$$\mathcal{L} = -\frac{1}{N}
\begin{bmatrix} 1_1 & 1_2 & \cdots & 1_N \end{bmatrix}
\left(
\begin{bmatrix}
Y_{11} & Y_{12} & \cdots & Y_{1C} \\
Y_{21} & Y_{22} & \cdots & Y_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
Y_{N1} & Y_{N2} & \cdots & Y_{NC}
\end{bmatrix}
\odot
\begin{bmatrix}
\log \hat{Y}_{11} & \log \hat{Y}_{12} & \cdots & \log \hat{Y}_{1C} \\
\log \hat{Y}_{21} & \log \hat{Y}_{22} & \cdots & \log \hat{Y}_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
\log \hat{Y}_{N1} & \log \hat{Y}_{N2} & \cdots & \log \hat{Y}_{NC}
\end{bmatrix}
\right)
\begin{bmatrix} 1_1 \\ 1_2 \\ \vdots \\ 1_C \end{bmatrix}$$

We can simplify this by recognizing that multiplying by vectors of ones simply sums all elements in the matrix. Therefore:

$$\mathcal{L} = -\frac{1}{N}\, \mathrm{sum}\!\left(
\begin{bmatrix}
Y_{11}\log \hat{Y}_{11} & Y_{12}\log \hat{Y}_{12} & \cdots & Y_{1C}\log \hat{Y}_{1C} \\
Y_{21}\log \hat{Y}_{21} & Y_{22}\log \hat{Y}_{22} & \cdots & Y_{2C}\log \hat{Y}_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
Y_{N1}\log \hat{Y}_{N1} & Y_{N2}\log \hat{Y}_{N2} & \cdots & Y_{NC}\log \hat{Y}_{NC}
\end{bmatrix}
\right)$$

We can show this summation with the following image (excluding the scaling factor $-\frac{1}{N}$):

[Figure: summation of the element-wise products of Y and log Ŷ]

Although this is the general representation, the per-element loss reduces to $-\log \hat{Y}$ at the true class. Since the ground truth $Y$ is one-hot (1 for the true class and 0 for the others in each row), the element-wise terms become:

$$\mathcal{L}_{ij} = \begin{cases} -\log \hat{Y}_{ij} & \text{if } Y_{ij} = 1 \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } i = 1, 2, \ldots, N$$
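Here is a small sketch of the loss in both forms: the full sum over $Y \odot \log \hat{Y}$ and the one-hot shortcut that only reads the predicted probability of the true class (the toy numbers and the tiny `eps` guard against $\log 0$ are my own additions):

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """CE loss: -1/N * sum(Y ⊙ log Ŷ) for one-hot Y of shape (N, C)."""
    N = Y.shape[0]
    return -np.sum(Y * np.log(Y_hat + eps)) / N

Y = np.array([[0, 1, 0],
              [1, 0, 0]], dtype=float)           # one-hot ground truth
Y_hat = np.array([[0.2, 0.7, 0.1],
                  [0.6, 0.3, 0.1]])              # predicted probabilities

loss_full = cross_entropy(Y, Y_hat)

# One-hot shortcut: only -log Ŷ at the true class of each row survives the sum
true_idx = Y.argmax(axis=1)
loss_shortcut = -np.mean(np.log(Y_hat[np.arange(Y.shape[0]), true_idx]))

print(np.isclose(loss_full, loss_shortcut))      # True
```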

This completes our mathematical framework for forward propagation in deep learning models, from single predictions to batch processing with multi-class classification and loss computation.

2 - A Simple Computation Graph

Before diving into lots of matrix gradient calculations, I would like to show a computation graph for forward and backward propagation, so we can see the cases where we need to be careful when computing gradients. I will first show only scalar calculations instead of thinking about matrices, so that we can later adapt this to matrix backpropagation.

$$\begin{aligned}
a^{[0]} &= x \\
z^{[1]} &= a^{[0]} \cdot w^{[1]} \\
a^{[1]} &= \sigma(z^{[1]}) \\
z^{[2]} &= a^{[1]} \cdot w^{[2]} \\
a^{[2]} &= \sigma(z^{[2]}) \\
&\;\;\vdots \\
z^{[l]} &= a^{[l-1]} \cdot w^{[l]} \\
a^{[l]} &= \sigma(z^{[l]}) \\
&\;\;\vdots \\
z^{[L-1]} &= a^{[L-2]} \cdot w^{[L-1]} \\
a^{[L-1]} &= \sigma(z^{[L-1]}) \\
z^{[L]} &= a^{[L-1]} \cdot w^{[L]} \\
\hat{y} &= \mathrm{softmax}(z^{[L]}) \\
\mathcal{L} &= CE(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log \hat{y}_i
\end{aligned}$$

where $[l]$ and $[L]$ denote an arbitrary layer $l$ and the last layer $L$, respectively, and $\sigma(\cdot)$ is the activation function (ReLU in this case). In this representation everything is a scalar except $w^{[L]}$ (and therefore $z^{[L]}$ and $\hat{y}$), because the output must be multi-class for the softmax, so I only adjusted the last layer to show a meaningful example.

We can visualize the computation graph of this network as follows:

[Figure: the computation graph of the scalar network]

2.1 - Chain Rule

Backpropagation relies on the chain rule of calculus to compute gradients efficiently. For a composite function f(g(h(x))), the chain rule states:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}$$

In neural networks, we apply this principle to decompose the gradient of the loss with respect to any parameter into a product of simpler derivatives.

2.2 - Scalar Backward Propagation

After computing the loss, we can start computing gradients with respect to it. Since there are $C$ classes, we need to calculate the derivative with respect to each prediction $\hat{y}_i$:

$$\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{\partial \left(-y_i \log \hat{y}_i\right)}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$$

We can visualize this gradient flow like this:

[Figure: the gradient flow from the loss to the predictions ŷ_i]

Since the predicted output $\hat{y}_i$ is computed using $z_i^{[L]}$ in both the numerator and the denominator

$$\hat{y}_i = \mathrm{softmax}\!\left(z_i^{[L]}\right) = \frac{e^{z_i^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}},$$

the derivative with respect to $z_k^{[L]}$ depends not only on the numerator but also on the denominator:

$$\frac{\partial \mathcal{L}}{\partial z_k^{[L]}} = \sum_{i=1}^{C} \frac{\partial \mathcal{L}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_k^{[L]}}$$

It might be easier to understand when you see the visual representation:

[Figure: how each prediction ŷ_i depends on z_k through the softmax]

Since we already know $\frac{\partial \mathcal{L}}{\partial \hat{y}_i}$, the gradient expression simplifies to:

$$\frac{\partial \mathcal{L}}{\partial z_k^{[L]}} = \sum_{i=1}^{C} -\frac{y_i}{\hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_k^{[L]}}$$

The most confusing part begins here, because we now need to carefully compute the local gradient $\frac{\partial \hat{y}_i}{\partial z_k^{[L]}}$. To do this, we apply the quotient rule:

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \frac{\partial}{\partial z_k^{[L]}}\!\left( \frac{e^{z_i^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}} \right) = \frac{\frac{\partial e^{z_i^{[L]}}}{\partial z_k^{[L]}} \cdot \sum_{j=1}^{C} e^{z_j^{[L]}} - e^{z_i^{[L]}} \cdot \frac{\partial \left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)}{\partial z_k^{[L]}}}{\left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)^2}$$

For this derivative, there are two distinct cases to consider:

$$i = k \quad \text{or} \quad i \neq k$$

Case 1: $i = k$

$$\frac{\partial e^{z_i^{[L]}}}{\partial z_k^{[L]}} = e^{z_i^{[L]}} \qquad \frac{\partial \left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)}{\partial z_k^{[L]}} = e^{z_k^{[L]}}$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \frac{e^{z_i^{[L]}} \cdot \sum_{j=1}^{C} e^{z_j^{[L]}} - e^{z_i^{[L]}} \cdot e^{z_k^{[L]}}}{\left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)^2}$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \frac{e^{z_i^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}} \cdot \left(1 - \frac{e^{z_k^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}}\right)$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \hat{y}_i \cdot \left(1 - \hat{y}_k\right)$$

Case 2: $i \neq k$

$$\frac{\partial e^{z_i^{[L]}}}{\partial z_k^{[L]}} = 0 \qquad \frac{\partial \left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)}{\partial z_k^{[L]}} = e^{z_k^{[L]}}$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \frac{0 \cdot \sum_{j=1}^{C} e^{z_j^{[L]}} - e^{z_i^{[L]}} \cdot e^{z_k^{[L]}}}{\left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)^2}$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = -\frac{e^{z_i^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}} \cdot \frac{e^{z_k^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}}$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = -\hat{y}_i \cdot \hat{y}_k$$

Finally, we can summarize these two cases as

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \begin{cases} \hat{y}_i \cdot (1 - \hat{y}_k) & \text{if } i = k \\ -\hat{y}_i \cdot \hat{y}_k & \text{if } i \neq k \end{cases}$$

In the literature, it is pretty common to use the following form:

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \hat{y}_i \cdot \left(\delta_{ik} - \hat{y}_k\right)$$

where $\delta_{ik}$ is 1 if $i = k$ and 0 otherwise. When we combine the gradient of the cross-entropy with the local gradient, we have:

$$\frac{\partial \mathcal{L}}{\partial z_k^{[L]}} = \sum_{i=1}^{C} -\frac{y_i}{\hat{y}_i} \cdot \hat{y}_i \cdot \left(\delta_{ik} - \hat{y}_k\right)$$

where the $\hat{y}_i$ terms cancel each other out:

$$\frac{\partial \mathcal{L}}{\partial z_k^{[L]}} = -\sum_{i=1}^{C} y_i \cdot \left(\delta_{ik} - \hat{y}_k\right) = -y_k + \hat{y}_k \sum_{i=1}^{C} y_i$$

Since the ground truth $y$ is a one-hot vector (1 for the true class and 0 for the others), $\sum_{i=1}^{C} y_i = 1$ and the equation simplifies to:

$$\frac{\partial \mathcal{L}}{\partial z_k^{[L]}} = \hat{y}_k - y_k$$

This compact form is what makes softmax combined with cross-entropy loss so useful. Instead of dealing with complicated fractions, we can directly use this result to propagate gradients to the next layers.
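If the cancellation feels too good to be true, it is easy to check numerically. The sketch below (single example, toy logits of my own choosing) builds the softmax Jacobian $\hat{y}_i(\delta_{ik} - \hat{y}_k)$ explicitly and compares $\hat{y} - y$ against a finite-difference estimate of the gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))           # cross-entropy for a single example

z = np.array([1.5, -0.3, 0.8])                       # toy logits (C = 3)
y = np.array([0.0, 0.0, 1.0])                        # one-hot ground truth
y_hat = softmax(z)

# Analytic results from the derivation above
jacobian = np.diag(y_hat) - np.outer(y_hat, y_hat)   # ∂ŷ_i/∂z_k = ŷ_i (δ_ik - ŷ_k)
grad_analytic = y_hat - y                            # ∂L/∂z = ŷ - y

# Central finite differences for ∂L/∂z
eps = 1e-6
grad_numeric = np.array([
    (loss(z + eps * np.eye(3)[k], y) - loss(z - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))   # True
```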


From this point, gradient calculations across layers will start to follow a recurring pattern. Each layer essentially repeats the same process: we compute gradients with respect to its inputs and its weights.

For the layer output $z^{[l]} = a^{[l-1]} \cdot w^{[l]}$, we have two key variables to differentiate with respect to: the weights $w^{[l]}$, which are what we actually update during optimization, and the previous activation $a^{[l-1]}$, which carries the gradient back to earlier layers.

Thus, even though only the weight gradients directly influence optimization, the activation gradients play an important role in keeping backpropagation alive throughout the entire network:

[Figure: the gradient flow through z^[l] towards w^[l] and a^[l-1]]
$$\frac{\partial \mathcal{L}}{\partial w^{[l]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot \frac{\partial \left(a^{[l-1]} \cdot w^{[l]}\right)}{\partial w^{[l]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot a^{[l-1]}$$

$$\frac{\partial \mathcal{L}}{\partial a^{[l-1]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot \frac{\partial \left(a^{[l-1]} \cdot w^{[l]}\right)}{\partial a^{[l-1]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot w^{[l]}$$

Lastly, we need the derivative of $a^{[l]} = \sigma(z^{[l]})$ with respect to $z^{[l]}$ (assuming $\sigma(z^{[l]}) = \mathrm{ReLU}(z^{[l]})$), so that the gradient can pass through the activation:

$$\frac{\partial \mathcal{L}}{\partial z^{[l]}} = \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} = \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \mathbb{1}\left(z^{[l]} > 0\right)$$

where $\mathbb{1}(z^{[l]} > 0)$ is an indicator function that outputs 1 if $z^{[l]} > 0$ and 0 otherwise.


Therefore, we can summarize backpropagation for scalar values with the following pattern:

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial z^{[L]}} &= \hat{y} - y \\
&\;\;\vdots \\
\frac{\partial \mathcal{L}}{\partial z^{[l]}} &= \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} = \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \mathbb{1}\left(z^{[l]} > 0\right) \\
\frac{\partial \mathcal{L}}{\partial w^{[l]}} &= \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot \frac{\partial z^{[l]}}{\partial w^{[l]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot a^{[l-1]} \\
\frac{\partial \mathcal{L}}{\partial a^{[l-1]}} &= \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot \frac{\partial z^{[l]}}{\partial a^{[l-1]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot w^{[l]} \\
\frac{\partial \mathcal{L}}{\partial z^{[l-1]}} &= \frac{\partial \mathcal{L}}{\partial a^{[l-1]}} \cdot \frac{\partial a^{[l-1]}}{\partial z^{[l-1]}} = \frac{\partial \mathcal{L}}{\partial a^{[l-1]}} \cdot \mathbb{1}\left(z^{[l-1]} > 0\right) \\
&\;\;\vdots \\
\frac{\partial \mathcal{L}}{\partial z^{[1]}} &= \frac{\partial \mathcal{L}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}} = \frac{\partial \mathcal{L}}{\partial a^{[1]}} \cdot \mathbb{1}\left(z^{[1]} > 0\right) \\
\frac{\partial \mathcal{L}}{\partial w^{[1]}} &= \frac{\partial \mathcal{L}}{\partial z^{[1]}} \cdot \frac{\partial z^{[1]}}{\partial w^{[1]}} = \frac{\partial \mathcal{L}}{\partial z^{[1]}} \cdot a^{[0]}
\end{aligned}$$

3 - Backward Propagation

After computing the loss through forward propagation, we need to update the model's weights to minimize this loss. Backpropagation is the algorithm that computes the gradients of the loss function with respect to each parameter in the network. We'll derive these gradients step by step, working backwards from the loss to the input layer.

3.1 - Gradient of Cross-Entropy with Softmax

Starting with the loss function, we compute the relevant gradients and move backwards step by step to calculate the remaining ones. Since the model makes predictions with a Softmax activation and the loss is computed with cross-entropy, the combination of the two has a nice property which simplifies the gradient:

$$\frac{\partial \mathcal{L}}{\partial Z^{[L]}} = \frac{1}{N} \left(\hat{Y} - Y\right)_{[N \times C]}$$

or in matrix representation:

$$\frac{\partial \mathcal{L}}{\partial Z^{[L]}} = \begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial Z_{11}^{[L]}} & \frac{\partial \mathcal{L}}{\partial Z_{12}^{[L]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{1C}^{[L]}} \\
\frac{\partial \mathcal{L}}{\partial Z_{21}^{[L]}} & \frac{\partial \mathcal{L}}{\partial Z_{22}^{[L]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{2C}^{[L]}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{L}}{\partial Z_{N1}^{[L]}} & \frac{\partial \mathcal{L}}{\partial Z_{N2}^{[L]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{NC}^{[L]}}
\end{bmatrix}
= \frac{1}{N} \begin{bmatrix}
\hat{Y}_{11} - Y_{11} & \hat{Y}_{12} - Y_{12} & \cdots & \hat{Y}_{1C} - Y_{1C} \\
\hat{Y}_{21} - Y_{21} & \hat{Y}_{22} - Y_{22} & \cdots & \hat{Y}_{2C} - Y_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{Y}_{N1} - Y_{N1} & \hat{Y}_{N2} - Y_{N2} & \cdots & \hat{Y}_{NC} - Y_{NC}
\end{bmatrix}$$

or in visual representation:

[Figure: the gradient of the loss with respect to Z^[L]]

3.2 - Gradient of Arbitrary Layer l

At this stage, we can begin calculating the gradients of the weights in the corresponding layer. Since the forward propagation is computed as

$$Z^{[l]} = A^{[l-1]} W^{[l]}$$

there are two possible local gradients we might consider:

$$\frac{\partial Z^{[l]}}{\partial W^{[l]}} \quad \text{or} \quad \frac{\partial Z^{[l]}}{\partial A^{[l-1]}}$$

You may wonder why we would need the gradient with respect to $A^{[l-1]}$, since our goal is to update the weight matrix $W^{[l]}$. The reason is that the derivative with respect to $A^{[l-1]}$ becomes necessary for updating the weights of the previous layer, because

$$A^{[l-1]} = \sigma\left(A^{[l-2]} W^{[l-1]}\right)$$

When we want to calculate the gradient of the weight matrix $W^{[l]}$ with respect to the loss function $\mathcal{L}$, we use the chain rule as follows:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \underbrace{\frac{\partial \mathcal{L}}{\partial Z^{[l]}}}_{\text{Upstream Gradient}} \; \underbrace{\frac{\partial Z^{[l]}}{\partial W^{[l]}}}_{\text{Local Gradient}}$$

The local gradient gives us:

$$\frac{\partial Z^{[l]}}{\partial W^{[l]}} = \frac{\partial \left(A^{[l-1]} W^{[l]}\right)}{\partial W^{[l]}} = A^{[l-1]}_{[N \times H_{l-1}]}$$

Since there are $H_{l-1} \times H_l$ values in the weight matrix $W^{[l]}$, we need an individual gradient for each of them, so the shape of $\frac{\partial \mathcal{L}}{\partial W^{[l]}}$ should be $H_{l-1} \times H_l$. However, the matrix dimensions do not match when we try to carry out the matrix multiplication:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}}_{[H_{l-1} \times H_l]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]}\; A^{[l-1]}_{[N \times H_{l-1}]}$$

If we take the transpose of $A^{[l-1]}$ and place it on the left side of the upstream gradient, the dimensions match correctly:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}}_{[H_{l-1} \times H_l]} = A^{[l-1]T}_{[H_{l-1} \times N]}\; \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]}$$

At first glance, this transformation may seem arbitrary. Why do we transpose? How can we be sure this result is correct? To clarify, let’s check the matrix representation:

$$\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial W_{11}^{[l]}} & \frac{\partial \mathcal{L}}{\partial W_{12}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial W_{1H_l}^{[l]}} \\
\frac{\partial \mathcal{L}}{\partial W_{21}^{[l]}} & \frac{\partial \mathcal{L}}{\partial W_{22}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial W_{2H_l}^{[l]}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{L}}{\partial W_{H_{l-1}1}^{[l]}} & \frac{\partial \mathcal{L}}{\partial W_{H_{l-1}2}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial W_{H_{l-1}H_l}^{[l]}}
\end{bmatrix}
=
\begin{bmatrix}
A_{11}^{[l-1]T} & A_{12}^{[l-1]T} & \cdots & A_{1N}^{[l-1]T} \\
A_{21}^{[l-1]T} & A_{22}^{[l-1]T} & \cdots & A_{2N}^{[l-1]T} \\
\vdots & \vdots & \ddots & \vdots \\
A_{H_{l-1}1}^{[l-1]T} & A_{H_{l-1}2}^{[l-1]T} & \cdots & A_{H_{l-1}N}^{[l-1]T}
\end{bmatrix}
\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial Z_{11}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{12}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{1H_l}^{[l]}} \\
\frac{\partial \mathcal{L}}{\partial Z_{21}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{22}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{2H_l}^{[l]}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{L}}{\partial Z_{N1}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{N2}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{NH_l}^{[l]}}
\end{bmatrix}$$

If we look closely at just one gradient, $\frac{\partial \mathcal{L}}{\partial W_{11}^{[l]}}$:

$$\frac{\partial \mathcal{L}}{\partial W_{11}^{[l]}} = \begin{bmatrix} A_{11}^{[l-1]T} & A_{12}^{[l-1]T} & \cdots & A_{1N}^{[l-1]T} \end{bmatrix} \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial Z_{11}^{[l]}} \\ \frac{\partial \mathcal{L}}{\partial Z_{21}^{[l]}} \\ \vdots \\ \frac{\partial \mathcal{L}}{\partial Z_{N1}^{[l]}} \end{bmatrix}$$

Since a single weight value is shared across all inputs, the gradient of this weight is just the sum of the derivatives contributed by all $N$ inputs:

$$\frac{\partial \mathcal{L}}{\partial W_{11}^{[l]}} = A_{11}^{[l-1]T} \frac{\partial \mathcal{L}}{\partial Z_{11}^{[l]}} + A_{12}^{[l-1]T} \frac{\partial \mathcal{L}}{\partial Z_{21}^{[l]}} + \cdots + A_{1N}^{[l-1]T} \frac{\partial \mathcal{L}}{\partial Z_{N1}^{[l]}}$$

In the image below, you can see the visualization of this calculation for 4 weights in the network:

[Figure: backpropagation of the gradient for 4 weights in the network]
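We can convince ourselves that the transpose trick really performs this per-weight summation by comparing $A^{[l-1]T}\,\frac{\partial \mathcal{L}}{\partial Z^{[l]}}$ against the explicit sum for a single weight. A small sketch with arbitrary shapes and random values of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
N, H_prev, H = 5, 3, 2                       # batch size and layer widths (illustrative)
A_prev = rng.standard_normal((N, H_prev))    # A^[l-1]
dZ = rng.standard_normal((N, H))             # upstream gradient ∂L/∂Z^[l]

dW = A_prev.T @ dZ                           # ∂L/∂W^[l], shape (H_prev, H)

# Explicit sum over the batch for the single weight W_11 (index [0, 0])
dW_11 = sum(A_prev[n, 0] * dZ[n, 0] for n in range(N))

print(np.isclose(dW[0, 0], dW_11))           # True
```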

At the same time, we need to calculate the derivative with respect to $A^{[l-1]}$, namely $\frac{\partial \mathcal{L}}{\partial A^{[l-1]}}$, so the upstream gradients can flow to the previous layer and the gradient of the weight matrix can be calculated in that layer:

$$\frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}} \frac{\partial Z^{[l]}}{\partial A^{[l-1]}}$$

The local gradient is:

$$\frac{\partial Z^{[l]}}{\partial A^{[l-1]}} = \frac{\partial \left(A^{[l-1]} W^{[l]}\right)}{\partial A^{[l-1]}} = W^{[l]}_{[H_{l-1} \times H_l]}$$

Once again, the dimensions of matrices do not match:

$$\frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]}\; W^{[l]}_{[H_{l-1} \times H_l]}$$

We can transpose the weight matrix to align the matrices:

$$\frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]}\; W^{[l]T}_{[H_l \times H_{l-1}]}$$

We can see the reasoning more easily if we check the matrix representation again:

$$\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial A_{11}^{[l-1]}} & \frac{\partial \mathcal{L}}{\partial A_{12}^{[l-1]}} & \cdots & \frac{\partial \mathcal{L}}{\partial A_{1H_{l-1}}^{[l-1]}} \\
\frac{\partial \mathcal{L}}{\partial A_{21}^{[l-1]}} & \frac{\partial \mathcal{L}}{\partial A_{22}^{[l-1]}} & \cdots & \frac{\partial \mathcal{L}}{\partial A_{2H_{l-1}}^{[l-1]}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{L}}{\partial A_{N1}^{[l-1]}} & \frac{\partial \mathcal{L}}{\partial A_{N2}^{[l-1]}} & \cdots & \frac{\partial \mathcal{L}}{\partial A_{NH_{l-1}}^{[l-1]}}
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial Z_{11}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{12}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{1H_l}^{[l]}} \\
\frac{\partial \mathcal{L}}{\partial Z_{21}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{22}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{2H_l}^{[l]}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{L}}{\partial Z_{N1}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{N2}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{NH_l}^{[l]}}
\end{bmatrix}
\begin{bmatrix}
W_{11}^{[l]T} & W_{12}^{[l]T} & \cdots & W_{1H_{l-1}}^{[l]T} \\
W_{21}^{[l]T} & W_{22}^{[l]T} & \cdots & W_{2H_{l-1}}^{[l]T} \\
\vdots & \vdots & \ddots & \vdots \\
W_{H_l1}^{[l]T} & W_{H_l2}^{[l]T} & \cdots & W_{H_lH_{l-1}}^{[l]T}
\end{bmatrix}$$

By focusing on only the first value, $\frac{\partial \mathcal{L}}{\partial A_{11}^{[l-1]}}$, we see that the gradient of $A_{11}^{[l-1]}$ depends on the values it affected during forward propagation (the same logic as for $W_{11}^{[l]}$):

$$\frac{\partial \mathcal{L}}{\partial A_{11}^{[l-1]}} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial Z_{11}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{12}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{1H_l}^{[l]}} \end{bmatrix} \begin{bmatrix} W_{11}^{[l]T} \\ W_{21}^{[l]T} \\ \vdots \\ W_{H_l1}^{[l]T} \end{bmatrix}$$

Now we want to compute the gradient $\frac{\partial \mathcal{L}}{\partial Z^{[l-1]}}$. Recall that $A^{[l-1]}$ is obtained by applying an activation function to $Z^{[l-1]}$:

$$A^{[l-1]}_{[N \times H_{l-1}]} = \sigma\left(Z^{[l-1]}_{[N \times H_{l-1}]}\right)$$

In our examples, we use the ReLU activation function. Its derivative with respect to each element is given by:

$$\frac{\partial\, \mathrm{ReLU}\left(Z_{ij}^{[l-1]}\right)}{\partial Z_{ij}^{[l-1]}} = \begin{cases} 1 & \text{if } Z_{ij}^{[l-1]} > 0 \\ 0 & \text{otherwise} \end{cases}$$

For the $N \times H_{l-1}$ entries, we must compute a gradient for each element, producing a matrix of 0s and 1s. The derivative of ReLU applied element-wise can be written as the indicator (mask) of positive entries:

$$\frac{\partial\, \mathrm{ReLU}\left(Z^{[l-1]}\right)}{\partial Z^{[l-1]}} = \mathbb{1}\left(Z^{[l-1]} > 0\right)$$

where 1(·) is the element-wise indicator function (1 when the condition is true, 0 otherwise). In matrix form:

$$\frac{\partial\, \mathrm{ReLU}\left(Z^{[l-1]}\right)}{\partial Z^{[l-1]}} = \begin{bmatrix}
\mathbb{1}(Z_{11}^{[l-1]} > 0) & \mathbb{1}(Z_{12}^{[l-1]} > 0) & \cdots & \mathbb{1}(Z_{1H_{l-1}}^{[l-1]} > 0) \\
\mathbb{1}(Z_{21}^{[l-1]} > 0) & \mathbb{1}(Z_{22}^{[l-1]} > 0) & \cdots & \mathbb{1}(Z_{2H_{l-1}}^{[l-1]} > 0) \\
\vdots & \vdots & \ddots & \vdots \\
\mathbb{1}(Z_{N1}^{[l-1]} > 0) & \mathbb{1}(Z_{N2}^{[l-1]} > 0) & \cdots & \mathbb{1}(Z_{NH_{l-1}}^{[l-1]} > 0)
\end{bmatrix}$$

Because the derivative is applied element-wise, the upstream gradient $\frac{\partial \mathcal{L}}{\partial A^{[l-1]}}$ is combined with this mask element-wise ($\odot$), not by matrix multiplication:

$$\frac{\partial \mathcal{L}}{\partial Z^{[l-1]}}_{[N \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} \odot \frac{\partial A^{[l-1]}}{\partial Z^{[l-1]}}_{[N \times H_{l-1}]}$$

or

$$\frac{\partial \mathcal{L}}{\partial Z^{[l-1]}}_{[N \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} \odot \mathbb{1}\left(Z^{[l-1]} > 0\right)_{[N \times H_{l-1}]}$$
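Putting the three pieces together, the backward pass through one hidden layer is only a few lines of NumPy. This is a sketch under the same assumptions as above (ReLU activation, bias absorbed into the weights); the function name and shapes are my own:

```python
import numpy as np

def layer_backward(dZ, A_prev, W, Z_prev):
    """Backward pass through Z^[l] = A^[l-1] W^[l], with ReLU producing A^[l-1].

    dZ:     ∂L/∂Z^[l],   shape (N, H_l)
    A_prev: A^[l-1],     shape (N, H_{l-1})
    W:      W^[l],       shape (H_{l-1}, H_l)
    Z_prev: Z^[l-1],     shape (N, H_{l-1})
    """
    dW = A_prev.T @ dZ                  # ∂L/∂W^[l],   shape (H_{l-1}, H_l)
    dA_prev = dZ @ W.T                  # ∂L/∂A^[l-1], shape (N, H_{l-1})
    dZ_prev = dA_prev * (Z_prev > 0)    # ∂L/∂Z^[l-1]: element-wise ReLU mask
    return dW, dA_prev, dZ_prev

# Shape check with arbitrary sizes
rng = np.random.default_rng(5)
N, H_prev, H = 8, 5, 4
dW, dA_prev, dZ_prev = layer_backward(
    rng.standard_normal((N, H)),
    rng.standard_normal((N, H_prev)),
    rng.standard_normal((H_prev, H)),
    rng.standard_normal((N, H_prev)),
)
print(dW.shape, dA_prev.shape, dZ_prev.shape)   # (5, 4) (8, 5) (8, 5)
```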

3.3 - Repetition

At this point, we have completed the backpropagation through the matrix multiplication step. From here on, the process is essentially a repetition: each layer applies the same logic and propagates the gradients backwards through the weights (and, implicitly, the biases) and activations.

The only real exception was the output layer, where we combined the softmax function with the cross-entropy loss. That required a more detailed derivation, but the procedure for the hidden layers is consistent and is repeated across the entire network. You can see the pattern as follows:

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial Z^{[L]}}_{[N \times C]} &= \frac{1}{N}\left(\hat{Y} - Y\right)_{[N \times C]} \\
\frac{\partial \mathcal{L}}{\partial W^{[L]}}_{[H_{L-1} \times C]} &= \frac{\partial \left(A^{[L-1]} W^{[L]}\right)}{\partial W^{[L]}}_{[H_{L-1} \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[L]}}_{[N \times C]} = A^{[L-1]T}_{[H_{L-1} \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[L]}}_{[N \times C]} \\
\frac{\partial \mathcal{L}}{\partial A^{[L-1]}}_{[N \times H_{L-1}]} &= \frac{\partial \mathcal{L}}{\partial Z^{[L]}}_{[N \times C]} \frac{\partial \left(A^{[L-1]} W^{[L]}\right)}{\partial A^{[L-1]}}_{[C \times H_{L-1}]} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}}_{[N \times C]} W^{[L]T}_{[C \times H_{L-1}]} \\
\frac{\partial \mathcal{L}}{\partial Z^{[L-1]}}_{[N \times H_{L-1}]} &= \frac{\partial \mathcal{L}}{\partial A^{[L-1]}}_{[N \times H_{L-1}]} \odot \frac{\partial\, \sigma\!\left(Z^{[L-1]}\right)}{\partial Z^{[L-1]}}_{[N \times H_{L-1}]} = \frac{\partial \mathcal{L}}{\partial A^{[L-1]}}_{[N \times H_{L-1}]} \odot \mathbb{1}\left(Z^{[L-1]} > 0\right)_{[N \times H_{L-1}]} \\
&\;\;\vdots \\
\frac{\partial \mathcal{L}}{\partial W^{[l]}}_{[H_{l-1} \times H_l]} &= \frac{\partial \left(A^{[l-1]} W^{[l]}\right)}{\partial W^{[l]}}_{[H_{l-1} \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]} = A^{[l-1]T}_{[H_{l-1} \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]} \\
\frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} &= \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]} \frac{\partial \left(A^{[l-1]} W^{[l]}\right)}{\partial A^{[l-1]}}_{[H_l \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]} W^{[l]T}_{[H_l \times H_{l-1}]} \\
\frac{\partial \mathcal{L}}{\partial Z^{[l-1]}}_{[N \times H_{l-1}]} &= \frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} \odot \frac{\partial\, \sigma\!\left(Z^{[l-1]}\right)}{\partial Z^{[l-1]}}_{[N \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} \odot \mathbb{1}\left(Z^{[l-1]} > 0\right)_{[N \times H_{l-1}]} \\
&\;\;\vdots \\
\frac{\partial \mathcal{L}}{\partial Z^{[1]}}_{[N \times H_1]} &= \frac{\partial \mathcal{L}}{\partial A^{[1]}}_{[N \times H_1]} \odot \frac{\partial\, \sigma\!\left(Z^{[1]}\right)}{\partial Z^{[1]}}_{[N \times H_1]} = \frac{\partial \mathcal{L}}{\partial A^{[1]}}_{[N \times H_1]} \odot \mathbb{1}\left(Z^{[1]} > 0\right)_{[N \times H_1]} \\
\frac{\partial \mathcal{L}}{\partial W^{[1]}}_{[H_0 \times H_1]} &= \frac{\partial \left(A^{[0]} W^{[1]}\right)}{\partial W^{[1]}}_{[H_0 \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[1]}}_{[N \times H_1]} = A^{[0]T}_{[H_0 \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[1]}}_{[N \times H_1]}
\end{aligned}$$

I know it looks ugly, but if you pay some attention, you will see the pattern behind the gradient calculation. There are just a few rules that you need to follow for each block.
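To see the whole pattern hang together, here is a compact sketch of a full forward and backward pass through a small two-layer network, with a finite-difference check on one weight entry. The architecture, initialization, and random labels are arbitrary choices of mine for illustration, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, H, C = 8, 4, 5, 3                      # batch, input features, hidden units, classes

X = rng.standard_normal((N, M))
Y = np.eye(C)[rng.integers(0, C, size=N)]    # random one-hot labels
W1 = 0.1 * rng.standard_normal((M, H))
W2 = 0.1 * rng.standard_normal((H, C))

def forward(W1, W2):
    Z1 = X @ W1                              # Z^[1] = A^[0] W^[1]
    A1 = np.maximum(0.0, Z1)                 # A^[1] = ReLU(Z^[1])
    Z2 = A1 @ W2                             # Z^[2] = A^[1] W^[2]
    expZ = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
    Y_hat = expZ / expZ.sum(axis=1, keepdims=True)
    loss = -np.sum(Y * np.log(Y_hat)) / N    # cross-entropy
    return loss, (Z1, A1, Y_hat)

loss, (Z1, A1, Y_hat) = forward(W1, W2)

# Backward pass, following the pattern above
dZ2 = (Y_hat - Y) / N                        # ∂L/∂Z^[2]
dW2 = A1.T @ dZ2                             # ∂L/∂W^[2]
dA1 = dZ2 @ W2.T                             # ∂L/∂A^[1]
dZ1 = dA1 * (Z1 > 0)                         # ∂L/∂Z^[1] (ReLU mask)
dW1 = X.T @ dZ1                              # ∂L/∂W^[1]

# Finite-difference check on a single entry of W1
eps = 1e-6
W1_pert = W1.copy()
W1_pert[0, 0] += eps
loss_plus, _ = forward(W1_pert, W2)
W1_pert[0, 0] -= 2 * eps
loss_minus, _ = forward(W1_pert, W2)
print(np.isclose(dW1[0, 0], (loss_plus - loss_minus) / (2 * eps), atol=1e-6))  # True
```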


For comments, please send me an email.