
Backpropagation Through Matrix Multiplication

TL;DR
In this long and technical blog post, I explain how forward propagation and backpropagation work in matrix form: from a single weighted sum to batched multi-layer networks, the softmax + cross-entropy gradient, and the repeating pattern of transposes and element-wise products that drives the backward pass.


Backpropagation appears quite straightforward when working with scalars or even simple vectors. However, once we step into the world of matrices, things quickly become more complex and difficult to follow. There are extra details and notations that make it less intuitive.

Personally, although I managed to understand this concept while preparing for deep learning exams, I usually lose my intuition a few months later. Reviewing it again requires extra effort to bring everything back together. That's why I decided to write this technical blog post, both as a reminder for myself and as a guide for anyone trying to understand backpropagation in matrix form.

1 - Forward Propagation

When training a deep learning model, the first step is to compute a linear combination of the input features. Given an input vector $x \in \mathbb{R}^{M}$ with $M$ features and a corresponding weight vector $w \in \mathbb{R}^{M}$, we calculate a weighted sum and add a bias term $b \in \mathbb{R}$ to produce a single output $z$:

$$z = w_1 \cdot x_1 + w_2 \cdot x_2 + \cdots + w_M \cdot x_M + b$$

This is simply the dot product between w and x with an added bias term:

$$z = w^T x + b$$

To visualize this computation, we can represent it in matrix form:

$$[z] = \begin{bmatrix} w_1 & w_2 & \cdots & w_M \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} + [b]$$

or in image representation:

[Figure: visual representation of the weighted sum with the bias term]

We can simplify our notation by incorporating the bias term directly into the weight vector. This involves adding the bias b as the first element of the weight vector and including a constant input x0=1:

$$[z] = \begin{bmatrix} b & w_1 & w_2 & \cdots & w_M \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix}$$

For notational clarity, we'll denote the bias term as $w_0 = b$ and the constant input as $x_0 = 1$. Under this convention, both vectors have dimension $M+1$: $w \in \mathbb{R}^{M+1}$ and $x \in \mathbb{R}^{M+1}$. Our equation becomes:

$$z = w^T x$$

or in matrix form:

$$[z] = \begin{bmatrix} w_0 & w_1 & w_2 & \cdots & w_M \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix}$$

or in image representation:

[Figure: visual representation of the weighted sum with the bias absorbed into the weight vector]
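To make the notation concrete, here is a minimal NumPy sketch (the feature count `M = 4`, the random values, and the variable names are my own illustrative choices, not part of the derivation above) showing that absorbing the bias into the weight vector gives exactly the same output $z$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                               # number of input features (illustrative)
x = rng.standard_normal(M)          # input vector x with M features
w = rng.standard_normal(M)          # weight vector w with M weights
b = 0.5                             # scalar bias (arbitrary value)

# Plain form: z = w^T x + b
z_plain = w @ x + b

# Augmented form: prepend x_0 = 1 to x and w_0 = b to w, then z = w^T x
x_aug = np.concatenate(([1.0], x))  # now (M+1,)
w_aug = np.concatenate(([b], w))    # now (M+1,)
z_aug = w_aug @ x_aug

print(np.isclose(z_plain, z_aug))   # True: both forms give the same z
```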

1.1 - Adding Non-Linearity

The linear combination alone is insufficient for learning complex patterns. To add non-linearity, we apply an activation function σ(·) to the output z.

$$a = \sigma(z) = \sigma(w^T x)$$

Although σ often denotes the Sigmoid function, here it represents a general activation function. For this explanation, we'll use the ReLU activation function in hidden layers due to its simplicity, computational efficiency, resistance to vanishing gradients, and widespread popularity:

$$a = \sigma(z) = \max(0, z)$$

We can show this with a visual representation:

[Figure: visual representation of the ReLU activation applied to the output z]

1.2 - Multi-class Classification

In practice, deep learning models often solve multi-class problems where we need to predict one of $C$ possible classes. This requires the output $a$ to be a $C$-dimensional vector, with each dimension representing the score for a particular class. The predicted class corresponds to the entry with the highest activation value.

To generate $C$ outputs, we need a separate weight vector for each class, each of dimension $M+1$. Stacking them gives a weight matrix $W \in \mathbb{R}^{C \times (M+1)}$:

$$a_{[C \times 1]} = \sigma\left(W_{[C \times (M+1)]}\, x_{[(M+1) \times 1]}\right)$$

or in expanded form:

$$\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_C \end{bmatrix}
= \sigma\!\left(
\begin{bmatrix}
W_{10} & W_{11} & W_{12} & \cdots & W_{1M} \\
W_{20} & W_{21} & W_{22} & \cdots & W_{2M} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
W_{C0} & W_{C1} & W_{C2} & \cdots & W_{CM}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix}
\right)$$

or in visual form:

[Figure: the same network drawn once per class, highlighting each output path and its own set of output weights]

Each figure represents the same network, but highlights different output paths from the same inputs, and each path is computed by its own set of output weights.
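As a quick sanity check on the shapes, here is a small sketch of the single-example multi-class computation $a = \sigma(Wx)$; the class count `C = 3` and the use of ReLU as $\sigma$ here are my own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
M, C = 4, 3                                          # features and classes (illustrative)
x = np.concatenate(([1.0], rng.standard_normal(M)))  # x has shape (M+1,), with x_0 = 1
W = rng.standard_normal((C, M + 1))                  # W has shape (C, M+1): one row per class

z = W @ x                                            # C raw class scores, shape (C,)
a = np.maximum(0.0, z)                               # ReLU applied element-wise
print(a.shape, a.argmax())                           # (3,) and the index of the predicted class
```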

1.3 - Batch Processing

The computations described so far process only a single input example (batch size = 1). In practice, we process multiple inputs in parallel to improve computational efficiency.

If we process a batch of $N$ inputs simultaneously, our input becomes $X \in \mathbb{R}^{N \times (M+1)}$ (note the uppercase $X$, since we now have a matrix rather than a vector). To handle the $N$ outputs while maintaining proper matrix dimensions, we transpose our weight matrix to $W \in \mathbb{R}^{(M+1) \times C}$:

$$A_{[N \times C]} = \sigma\left(X_{[N \times (M+1)]}\, W_{[(M+1) \times C]}\right)$$

or in matrix form:

$$\begin{bmatrix}
A_{11} & A_{12} & \cdots & A_{1C} \\
A_{21} & A_{22} & \cdots & A_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
A_{N1} & A_{N2} & \cdots & A_{NC}
\end{bmatrix}
= \sigma\!\left(
\begin{bmatrix}
X_{10} & X_{11} & X_{12} & \cdots & X_{1M} \\
X_{20} & X_{21} & X_{22} & \cdots & X_{2M} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
X_{N0} & X_{N1} & X_{N2} & \cdots & X_{NM}
\end{bmatrix}
\begin{bmatrix}
W_{01} & W_{02} & \cdots & W_{0C} \\
W_{11} & W_{12} & \cdots & W_{1C} \\
W_{21} & W_{22} & \cdots & W_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
W_{M1} & W_{M2} & \cdots & W_{MC}
\end{bmatrix}
\right)$$

When dealing with batches, it can sometimes be hard to picture how forward propagation works. We can visualize the process like this:

[Figure: forward propagation for a batch of N inputs]

Note that in this illustration, the non-linearity (activation function) is not explicitly shown. The final output A actually corresponds to the activated values.
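In code, the batched version only changes the shapes: the examples are stacked into the rows of $X$ and the weight matrix is used in its $(M+1) \times C$ orientation. A minimal sketch (the batch size `N = 8` is an arbitrary choice of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, C = 8, 4, 3                                    # batch size, features, classes (illustrative)
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, M))])  # X has shape (N, M+1), first column is x_0 = 1
W = rng.standard_normal((M + 1, C))                  # W has shape (M+1, C)

Z = X @ W                                            # raw scores, shape (N, C)
A = np.maximum(0.0, Z)                               # activation applied element-wise
print(Z.shape, A.shape)                              # (8, 3) (8, 3)
```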

1.4 - Multi-layer Networks

The calculations presented so far describe a single layer. Deep learning models stack multiple layers to learn increasingly complex representations. We can denote each layer with a superscript [l] to represent the layer number:

$$\begin{aligned}
Z^{[l]}_{[N \times H_l]} &= A^{[l-1]}_{[N \times H_{l-1}]}\, W^{[l]}_{[H_{l-1} \times H_l]} \\
A^{[l]}_{[N \times H_l]} &= \sigma\left(Z^{[l]}_{[N \times H_l]}\right)
\end{aligned}$$

Here $H_l$ represents the number of hidden units in layer $l$. The input to layer $l$ is the activated output from the previous layer $l-1$. For simplicity, we usually define $A^{[0]} = X$ (the input layer).

[Figure: a multi-layer network with the layer-wise computations]
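Stacking layers is just repeating the same matrix multiplication with one weight matrix per layer. Here is a rough sketch of the forward loop; the hidden sizes and the small random initialization are arbitrary choices of mine, and the bias trick is omitted to keep it short:

```python
import numpy as np

def relu(Z):
    return np.maximum(0.0, Z)

rng = np.random.default_rng(3)
N, M = 8, 4                             # batch size and input features (illustrative)
layer_sizes = [M, 16, 8, 3]             # H_0 = M, two hidden layers, 3 outputs

# One weight matrix per layer: W^[l] has shape (H_{l-1}, H_l)
Ws = [0.1 * rng.standard_normal((h_in, h_out))
      for h_in, h_out in zip(layer_sizes[:-1], layer_sizes[1:])]

A = rng.standard_normal((N, M))         # A^[0] = X
for W in Ws:
    Z = A @ W                           # Z^[l] = A^[l-1] W^[l]
    A = relu(Z)                         # A^[l] = σ(Z^[l])

print(A.shape)                          # (8, 3): activations of the last layer
```

In practice the final layer would feed into the softmax described next rather than another ReLU.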

1.5 - Softmax for Classification

For classification problems, we apply the softmax activation function to the final layer's output to convert the raw scores into a probability distribution. The softmax function ensures that all outputs sum to 1, allowing us to interpret them as class probabilities.

The softmax function is applied row-wise (for each input example) across the C classes:

$$\hat{Y}_{ij} = \mathrm{softmax}(Z_{ij}) = \frac{e^{Z_{ij}}}{\sum_{l=1}^{C} e^{Z_{il}}} \qquad i = 1, 2, \ldots, N \ \text{ and } \ j = 1, 2, \ldots, C$$

Intuitively, for input example i, this computes the probability of class j by normalizing the exponential of its score by the sum of exponentials across all C possible classes.

The matrix representation shows this row-wise operation:

$$\begin{bmatrix}
\hat{Y}_{11} & \hat{Y}_{12} & \cdots & \hat{Y}_{1C} \\
\hat{Y}_{21} & \hat{Y}_{22} & \cdots & \hat{Y}_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{Y}_{N1} & \hat{Y}_{N2} & \cdots & \hat{Y}_{NC}
\end{bmatrix}
=
\begin{bmatrix}
\mathrm{softmax}(Z_{11} \; Z_{12} \; \cdots \; Z_{1C}) \\
\mathrm{softmax}(Z_{21} \; Z_{22} \; \cdots \; Z_{2C}) \\
\vdots \\
\mathrm{softmax}(Z_{N1} \; Z_{N2} \; \cdots \; Z_{NC})
\end{bmatrix}
=
\begin{bmatrix}
\frac{e^{Z_{11}}}{\sum_{l=1}^{C} e^{Z_{1l}}} & \frac{e^{Z_{12}}}{\sum_{l=1}^{C} e^{Z_{1l}}} & \cdots & \frac{e^{Z_{1C}}}{\sum_{l=1}^{C} e^{Z_{1l}}} \\
\frac{e^{Z_{21}}}{\sum_{l=1}^{C} e^{Z_{2l}}} & \frac{e^{Z_{22}}}{\sum_{l=1}^{C} e^{Z_{2l}}} & \cdots & \frac{e^{Z_{2C}}}{\sum_{l=1}^{C} e^{Z_{2l}}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{e^{Z_{N1}}}{\sum_{l=1}^{C} e^{Z_{Nl}}} & \frac{e^{Z_{N2}}}{\sum_{l=1}^{C} e^{Z_{Nl}}} & \cdots & \frac{e^{Z_{NC}}}{\sum_{l=1}^{C} e^{Z_{Nl}}}
\end{bmatrix}$$

If we denote the last layer as L, then the model’s output for N inputs and C classes can be represented as follows:

[Figure: the model output for N inputs and C classes at the last layer L]
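A row-wise softmax is only a few lines of NumPy. One practical detail not discussed above: subtracting the row maximum before exponentiating avoids overflow and does not change the result, since the shift cancels in the ratio.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax: each row of Z becomes a probability distribution over C classes."""
    Z_shift = Z - Z.max(axis=1, keepdims=True)   # numerical stability; cancels in the ratio
    expZ = np.exp(Z_shift)
    return expZ / expZ.sum(axis=1, keepdims=True)

Z = np.array([[2.0, 1.0, 0.1],
              [1.0, 3.0, 0.2]])                  # toy scores for N = 2 examples, C = 3 classes
Y_hat = softmax_rows(Z)
print(Y_hat.sum(axis=1))                         # [1. 1.]: each row sums to one
```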

1.6 - Cross-Entropy Loss Function

After obtaining the predicted probabilities $\hat{Y}$, we measure the model's performance by comparing these predictions with the ground-truth labels $Y \in \mathbb{R}^{N \times C}$ using a loss function. The ground truth is represented as one-hot encoded vectors, where each input example has exactly one correct class. For multi-class classification, we typically use the Cross-Entropy (CE) loss:

$$\mathcal{L} = CE(Y, \hat{Y}) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} Y_{ij} \log \hat{Y}_{ij}$$

This formula computes the element-wise product between the true labels $Y$ and the logarithm of the predicted probabilities $\log \hat{Y}$, then averages across all examples to produce a single scalar loss value.

We can express this using matrix operations with the element-wise (Hadamard) product $\odot$:

$$\mathcal{L}_{[1 \times 1]} = CE(Y, \hat{Y}) = -\frac{1}{N}\, \mathbf{1}^{T}_{[1 \times N]} \left( Y \odot \log \hat{Y} \right)_{[N \times C]} \mathbf{1}_{[C \times 1]}$$

Expanding this matrix operation:

$$\mathcal{L} = -\frac{1}{N}
\begin{bmatrix} 1_1 & 1_2 & \cdots & 1_N \end{bmatrix}
\left(
\begin{bmatrix}
Y_{11} & Y_{12} & \cdots & Y_{1C} \\
Y_{21} & Y_{22} & \cdots & Y_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
Y_{N1} & Y_{N2} & \cdots & Y_{NC}
\end{bmatrix}
\odot
\begin{bmatrix}
\log \hat{Y}_{11} & \log \hat{Y}_{12} & \cdots & \log \hat{Y}_{1C} \\
\log \hat{Y}_{21} & \log \hat{Y}_{22} & \cdots & \log \hat{Y}_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
\log \hat{Y}_{N1} & \log \hat{Y}_{N2} & \cdots & \log \hat{Y}_{NC}
\end{bmatrix}
\right)
\begin{bmatrix} 1_1 \\ 1_2 \\ \vdots \\ 1_C \end{bmatrix}$$

We can simplify this by recognizing that multiplying by vectors of ones simply sums all elements in the matrix. Therefore:

$$\mathcal{L} = -\frac{1}{N}\, \mathrm{sum}\!\left(
\begin{bmatrix}
Y_{11}\log \hat{Y}_{11} & Y_{12}\log \hat{Y}_{12} & \cdots & Y_{1C}\log \hat{Y}_{1C} \\
Y_{21}\log \hat{Y}_{21} & Y_{22}\log \hat{Y}_{22} & \cdots & Y_{2C}\log \hat{Y}_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
Y_{N1}\log \hat{Y}_{N1} & Y_{N2}\log \hat{Y}_{N2} & \cdots & Y_{NC}\log \hat{Y}_{NC}
\end{bmatrix}
\right)$$

We can show this summation with the following image (excluding the scaling factor $-\frac{1}{N}$):

[Figure: summation of the element-wise products of Y and log Ŷ]

Although this is the general representation, the per-element loss reduces to $-\log \hat{Y}$ at the true class. Since the ground truth $Y$ is one-hot (1 for the true class and 0 for the others in each row), the element-wise terms become:

$$\mathcal{L}_{ij} = \begin{cases} -\log \hat{Y}_{ij} & \text{if } Y_{ij} = 1 \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } i = 1, 2, \ldots, N$$
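Here is a small sketch of the loss in both forms: the full sum over $Y \odot \log \hat{Y}$ and the one-hot shortcut that only reads the predicted probability of the true class (the toy numbers and the tiny `eps` guard against $\log 0$ are my own additions):

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """CE loss: -1/N * sum(Y ⊙ log Ŷ) for one-hot Y of shape (N, C)."""
    N = Y.shape[0]
    return -np.sum(Y * np.log(Y_hat + eps)) / N

Y = np.array([[0, 1, 0],
              [1, 0, 0]], dtype=float)           # one-hot ground truth
Y_hat = np.array([[0.2, 0.7, 0.1],
                  [0.6, 0.3, 0.1]])              # predicted probabilities

loss_full = cross_entropy(Y, Y_hat)

# One-hot shortcut: only -log Ŷ at the true class of each row survives the sum
true_idx = Y.argmax(axis=1)
loss_shortcut = -np.mean(np.log(Y_hat[np.arange(Y.shape[0]), true_idx]))

print(np.isclose(loss_full, loss_shortcut))      # True
```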

This completes our mathematical framework for forward propagation in deep learning models, from single predictions to batch processing with multi-class classification and loss computation.

2 - A Simple Computation Graph

Before diving into lots of matrix gradient calculations, I would like to show a computation graph for forward and backward propagation, so we can see the cases where we need to be careful when computing gradients. I will first show only scalar calculations instead of thinking about matrices, so that we can later adapt this to matrix backpropagation.

$$\begin{aligned}
a^{[0]} &= x \\
z^{[1]} &= a^{[0]} \cdot w^{[1]} \\
a^{[1]} &= \sigma(z^{[1]}) \\
z^{[2]} &= a^{[1]} \cdot w^{[2]} \\
a^{[2]} &= \sigma(z^{[2]}) \\
&\;\;\vdots \\
z^{[l]} &= a^{[l-1]} \cdot w^{[l]} \\
a^{[l]} &= \sigma(z^{[l]}) \\
&\;\;\vdots \\
z^{[L-1]} &= a^{[L-2]} \cdot w^{[L-1]} \\
a^{[L-1]} &= \sigma(z^{[L-1]}) \\
z^{[L]} &= a^{[L-1]} \cdot w^{[L]} \\
\hat{y} &= \mathrm{softmax}(z^{[L]}) \\
\mathcal{L} &= CE(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log \hat{y}_i
\end{aligned}$$

where $[l]$ and $[L]$ denote an arbitrary layer $l$ and the last layer $L$, respectively, and $\sigma(\cdot)$ is the activation function (ReLU in this case). In this representation everything is a scalar except $w^{[L]}$ (and therefore $z^{[L]}$ and $\hat{y}$), because the output must be multi-class for the softmax, so I only adjusted the last layer to show a meaningful example.

We can visualize the computation graph of this network as follows:

[Figure: the computation graph of the scalar network]

2.1 - Chain Rule

Backpropagation relies on the chain rule of calculus to compute gradients efficiently. For a composite function f(g(h(x))), the chain rule states:

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}$$

In neural networks, we apply this principle to decompose the gradient of the loss with respect to any parameter into a product of simpler derivatives.

2.2 - Scalar Backward Propagation

After computing the loss, we can start computing gradients with respect to it. Since there are $C$ classes, we need to calculate the derivative with respect to each prediction $\hat{y}_i$:

$$\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{\partial \left(-y_i \log \hat{y}_i\right)}{\partial \hat{y}_i} = -\frac{y_i}{\hat{y}_i}$$

We can visualize this gradient flow like this:

[Figure: the gradient flow from the loss to the predictions ŷ_i]

Since the predicted output $\hat{y}_i$ is computed using $z_i^{[L]}$ in both the numerator and the denominator

$$\hat{y}_i = \mathrm{softmax}\!\left(z_i^{[L]}\right) = \frac{e^{z_i^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}},$$

the derivative with respect to $z_k^{[L]}$ depends not only on the numerator but also on the denominator:

$$\frac{\partial \mathcal{L}}{\partial z_k^{[L]}} = \sum_{i=1}^{C} \frac{\partial \mathcal{L}}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_k^{[L]}}$$

It might be easier to understand when you see the visual representation:

[Figure: how each prediction ŷ_i depends on z_k through the softmax]

Since we already know $\frac{\partial \mathcal{L}}{\partial \hat{y}_i}$, the gradient expression simplifies to:

$$\frac{\partial \mathcal{L}}{\partial z_k^{[L]}} = \sum_{i=1}^{C} -\frac{y_i}{\hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_k^{[L]}}$$

The most confusing part begins here, because we now need to carefully compute the local gradient $\frac{\partial \hat{y}_i}{\partial z_k^{[L]}}$. To do this, we apply the quotient rule:

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \frac{\partial}{\partial z_k^{[L]}}\!\left( \frac{e^{z_i^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}} \right) = \frac{\frac{\partial e^{z_i^{[L]}}}{\partial z_k^{[L]}} \cdot \sum_{j=1}^{C} e^{z_j^{[L]}} - e^{z_i^{[L]}} \cdot \frac{\partial \left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)}{\partial z_k^{[L]}}}{\left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)^2}$$

For this derivative, there are two distinct cases to consider:

$$i = k \quad \text{or} \quad i \neq k$$

Case 1: $i = k$

$$\frac{\partial e^{z_i^{[L]}}}{\partial z_k^{[L]}} = e^{z_i^{[L]}} \qquad \frac{\partial \left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)}{\partial z_k^{[L]}} = e^{z_k^{[L]}}$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \frac{e^{z_i^{[L]}} \cdot \sum_{j=1}^{C} e^{z_j^{[L]}} - e^{z_i^{[L]}} \cdot e^{z_k^{[L]}}}{\left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)^2}$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \frac{e^{z_i^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}} \cdot \left(1 - \frac{e^{z_k^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}}\right)$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \hat{y}_i \cdot \left(1 - \hat{y}_k\right)$$

Case 2: $i \neq k$

$$\frac{\partial e^{z_i^{[L]}}}{\partial z_k^{[L]}} = 0 \qquad \frac{\partial \left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)}{\partial z_k^{[L]}} = e^{z_k^{[L]}}$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \frac{0 \cdot \sum_{j=1}^{C} e^{z_j^{[L]}} - e^{z_i^{[L]}} \cdot e^{z_k^{[L]}}}{\left(\sum_{j=1}^{C} e^{z_j^{[L]}}\right)^2}$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = -\frac{e^{z_i^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}} \cdot \frac{e^{z_k^{[L]}}}{\sum_{j=1}^{C} e^{z_j^{[L]}}}$$

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = -\hat{y}_i \cdot \hat{y}_k$$

Finally, we can summarize these two cases as

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \begin{cases} \hat{y}_i \cdot (1 - \hat{y}_k) & \text{if } i = k \\ -\hat{y}_i \cdot \hat{y}_k & \text{if } i \neq k \end{cases}$$

In the literature, it is pretty common to use the following form:

$$\frac{\partial \hat{y}_i}{\partial z_k^{[L]}} = \hat{y}_i \cdot \left(\delta_{ik} - \hat{y}_k\right)$$

where $\delta_{ik}$ is 1 if $i = k$ and 0 otherwise. When we combine the gradient of the cross-entropy with the local gradient, we have:

$$\frac{\partial \mathcal{L}}{\partial z_k^{[L]}} = \sum_{i=1}^{C} -\frac{y_i}{\hat{y}_i} \cdot \hat{y}_i \cdot \left(\delta_{ik} - \hat{y}_k\right)$$

where the $\hat{y}_i$ terms cancel each other out:

$$\frac{\partial \mathcal{L}}{\partial z_k^{[L]}} = -\sum_{i=1}^{C} y_i \cdot \left(\delta_{ik} - \hat{y}_k\right) = -y_k + \hat{y}_k \sum_{i=1}^{C} y_i$$

Since the ground truth $y$ is a one-hot vector (1 for the true class and 0 for the others), $\sum_{i=1}^{C} y_i = 1$ and the equation simplifies to:

$$\frac{\partial \mathcal{L}}{\partial z_k^{[L]}} = \hat{y}_k - y_k$$

This compact form is what makes softmax combined with cross-entropy loss so useful. Instead of dealing with complicated fractions, we can directly use this result to propagate gradients to the next layers.
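If the cancellation feels too good to be true, it is easy to check numerically. The sketch below (single example, toy logits of my own choosing) builds the softmax Jacobian $\hat{y}_i(\delta_{ik} - \hat{y}_k)$ explicitly and compares $\hat{y} - y$ against a finite-difference estimate of the gradient:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))           # cross-entropy for a single example

z = np.array([1.5, -0.3, 0.8])                       # toy logits (C = 3)
y = np.array([0.0, 0.0, 1.0])                        # one-hot ground truth
y_hat = softmax(z)

# Analytic results from the derivation above
jacobian = np.diag(y_hat) - np.outer(y_hat, y_hat)   # ∂ŷ_i/∂z_k = ŷ_i (δ_ik - ŷ_k)
grad_analytic = y_hat - y                            # ∂L/∂z = ŷ - y

# Central finite differences for ∂L/∂z
eps = 1e-6
grad_numeric = np.array([
    (loss(z + eps * np.eye(3)[k], y) - loss(z - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))   # True
```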


From this point, gradient calculations across layers will start to follow a recurring pattern. Each layer essentially repeats the same process: we compute gradients with respect to its inputs and its weights.

For the layer output $z^{[l]} = a^{[l-1]} \cdot w^{[l]}$, we have two key variables to differentiate with respect to: the weights $w^{[l]}$, which are what we actually update during optimization, and the previous activation $a^{[l-1]}$, which carries the gradient back to earlier layers.

Thus, even though only the weight gradients directly influence optimization, the activation gradients play an important role in keeping backpropagation alive throughout the entire network:

[Figure: the gradient flow through z^[l] towards w^[l] and a^[l-1]]
$$\frac{\partial \mathcal{L}}{\partial w^{[l]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot \frac{\partial \left(a^{[l-1]} \cdot w^{[l]}\right)}{\partial w^{[l]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot a^{[l-1]}$$

$$\frac{\partial \mathcal{L}}{\partial a^{[l-1]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot \frac{\partial \left(a^{[l-1]} \cdot w^{[l]}\right)}{\partial a^{[l-1]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot w^{[l]}$$

Lastly, we need the derivative of $a^{[l]} = \sigma(z^{[l]})$ with respect to $z^{[l]}$ (assuming $\sigma(z^{[l]}) = \mathrm{ReLU}(z^{[l]})$), so that the gradient can pass through the activation:

$$\frac{\partial \mathcal{L}}{\partial z^{[l]}} = \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} = \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \mathbb{1}\left(z^{[l]} > 0\right)$$

where $\mathbb{1}(z^{[l]} > 0)$ is an indicator function that outputs 1 if $z^{[l]} > 0$ and 0 otherwise.


Therefore, we can summarize backpropagation for scalar values with the following pattern:

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial z^{[L]}} &= \hat{y} - y \\
&\;\;\vdots \\
\frac{\partial \mathcal{L}}{\partial z^{[l]}} &= \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} = \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \mathbb{1}\left(z^{[l]} > 0\right) \\
\frac{\partial \mathcal{L}}{\partial w^{[l]}} &= \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot \frac{\partial z^{[l]}}{\partial w^{[l]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot a^{[l-1]} \\
\frac{\partial \mathcal{L}}{\partial a^{[l-1]}} &= \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot \frac{\partial z^{[l]}}{\partial a^{[l-1]}} = \frac{\partial \mathcal{L}}{\partial z^{[l]}} \cdot w^{[l]} \\
\frac{\partial \mathcal{L}}{\partial z^{[l-1]}} &= \frac{\partial \mathcal{L}}{\partial a^{[l-1]}} \cdot \frac{\partial a^{[l-1]}}{\partial z^{[l-1]}} = \frac{\partial \mathcal{L}}{\partial a^{[l-1]}} \cdot \mathbb{1}\left(z^{[l-1]} > 0\right) \\
&\;\;\vdots \\
\frac{\partial \mathcal{L}}{\partial z^{[1]}} &= \frac{\partial \mathcal{L}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}} = \frac{\partial \mathcal{L}}{\partial a^{[1]}} \cdot \mathbb{1}\left(z^{[1]} > 0\right) \\
\frac{\partial \mathcal{L}}{\partial w^{[1]}} &= \frac{\partial \mathcal{L}}{\partial z^{[1]}} \cdot \frac{\partial z^{[1]}}{\partial w^{[1]}} = \frac{\partial \mathcal{L}}{\partial z^{[1]}} \cdot a^{[0]}
\end{aligned}$$

3 - Backward Propagation

After computing the loss through forward propagation, we need to update the model's weights to minimize this loss. Backpropagation is the algorithm that computes the gradients of the loss function with respect to each parameter in the network. We'll derive these gradients step by step, working backwards from the loss to the input layer.

3.1 - Gradient of Cross-Entropy with Softmax

Starting with the loss function, we compute the relevant gradients and move backwards step by step to calculate the remaining ones. Since the model makes predictions with a Softmax activation and the loss is computed with cross-entropy, the combination of the two has a nice property which simplifies the gradient:

$$\frac{\partial \mathcal{L}}{\partial Z^{[L]}} = \frac{1}{N} \left(\hat{Y} - Y\right)_{[N \times C]}$$

or in matrix representation:

$$\frac{\partial \mathcal{L}}{\partial Z^{[L]}} = \begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial Z_{11}^{[L]}} & \frac{\partial \mathcal{L}}{\partial Z_{12}^{[L]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{1C}^{[L]}} \\
\frac{\partial \mathcal{L}}{\partial Z_{21}^{[L]}} & \frac{\partial \mathcal{L}}{\partial Z_{22}^{[L]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{2C}^{[L]}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{L}}{\partial Z_{N1}^{[L]}} & \frac{\partial \mathcal{L}}{\partial Z_{N2}^{[L]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{NC}^{[L]}}
\end{bmatrix}
= \frac{1}{N} \begin{bmatrix}
\hat{Y}_{11} - Y_{11} & \hat{Y}_{12} - Y_{12} & \cdots & \hat{Y}_{1C} - Y_{1C} \\
\hat{Y}_{21} - Y_{21} & \hat{Y}_{22} - Y_{22} & \cdots & \hat{Y}_{2C} - Y_{2C} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{Y}_{N1} - Y_{N1} & \hat{Y}_{N2} - Y_{N2} & \cdots & \hat{Y}_{NC} - Y_{NC}
\end{bmatrix}$$

or in visual representation:

[Figure: the gradient of the loss with respect to Z^[L]]

3.2 - Gradient of Arbitrary Layer l

At this stage, we can begin calculating the gradients of the weights in the corresponding layer. Since the forward propagation is computed as

$$Z^{[l]} = A^{[l-1]} W^{[l]}$$

there are two possible local gradients we might consider:

$$\frac{\partial Z^{[l]}}{\partial W^{[l]}} \quad \text{or} \quad \frac{\partial Z^{[l]}}{\partial A^{[l-1]}}$$

You may wonder why we would need the gradient with respect to $A^{[l-1]}$, since our goal is to update the weight matrix $W^{[l]}$. The reason is that the derivative with respect to $A^{[l-1]}$ becomes necessary for updating the weights of the previous layer, because

$$A^{[l-1]} = \sigma\left(A^{[l-2]} W^{[l-1]}\right)$$

When we want to calculate the gradient of the weight matrix $W^{[l]}$ with respect to the loss function $\mathcal{L}$, we use the chain rule as follows:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \underbrace{\frac{\partial \mathcal{L}}{\partial Z^{[l]}}}_{\text{Upstream Gradient}} \; \underbrace{\frac{\partial Z^{[l]}}{\partial W^{[l]}}}_{\text{Local Gradient}}$$

The local gradient gives us:

$$\frac{\partial Z^{[l]}}{\partial W^{[l]}} = \frac{\partial \left(A^{[l-1]} W^{[l]}\right)}{\partial W^{[l]}} = A^{[l-1]}_{[N \times H_{l-1}]}$$

Since there are $H_{l-1} \times H_l$ values in the weight matrix $W^{[l]}$, we need an individual gradient for each of them, so the shape of $\frac{\partial \mathcal{L}}{\partial W^{[l]}}$ should be $H_{l-1} \times H_l$. However, the matrix dimensions do not match when we try to carry out the matrix multiplication:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}}_{[H_{l-1} \times H_l]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]}\; A^{[l-1]}_{[N \times H_{l-1}]}$$

If we take the transpose of $A^{[l-1]}$ and place it on the left side of the upstream gradient, the dimensions match correctly:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}}_{[H_{l-1} \times H_l]} = A^{[l-1]T}_{[H_{l-1} \times N]}\; \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]}$$

At first glance, this transformation may seem arbitrary. Why do we transpose? How can we be sure this result is correct? To clarify, let’s check the matrix representation:

$$\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial W_{11}^{[l]}} & \frac{\partial \mathcal{L}}{\partial W_{12}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial W_{1H_l}^{[l]}} \\
\frac{\partial \mathcal{L}}{\partial W_{21}^{[l]}} & \frac{\partial \mathcal{L}}{\partial W_{22}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial W_{2H_l}^{[l]}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{L}}{\partial W_{H_{l-1}1}^{[l]}} & \frac{\partial \mathcal{L}}{\partial W_{H_{l-1}2}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial W_{H_{l-1}H_l}^{[l]}}
\end{bmatrix}
=
\begin{bmatrix}
A_{11}^{[l-1]T} & A_{12}^{[l-1]T} & \cdots & A_{1N}^{[l-1]T} \\
A_{21}^{[l-1]T} & A_{22}^{[l-1]T} & \cdots & A_{2N}^{[l-1]T} \\
\vdots & \vdots & \ddots & \vdots \\
A_{H_{l-1}1}^{[l-1]T} & A_{H_{l-1}2}^{[l-1]T} & \cdots & A_{H_{l-1}N}^{[l-1]T}
\end{bmatrix}
\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial Z_{11}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{12}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{1H_l}^{[l]}} \\
\frac{\partial \mathcal{L}}{\partial Z_{21}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{22}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{2H_l}^{[l]}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{L}}{\partial Z_{N1}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{N2}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{NH_l}^{[l]}}
\end{bmatrix}$$

If we look closely at just one gradient, $\frac{\partial \mathcal{L}}{\partial W_{11}^{[l]}}$:

$$\frac{\partial \mathcal{L}}{\partial W_{11}^{[l]}} = \begin{bmatrix} A_{11}^{[l-1]T} & A_{12}^{[l-1]T} & \cdots & A_{1N}^{[l-1]T} \end{bmatrix} \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial Z_{11}^{[l]}} \\ \frac{\partial \mathcal{L}}{\partial Z_{21}^{[l]}} \\ \vdots \\ \frac{\partial \mathcal{L}}{\partial Z_{N1}^{[l]}} \end{bmatrix}$$

Since a single weight value is shared across all inputs, the gradient of this weight is just the sum of the derivatives contributed by all $N$ inputs:

$$\frac{\partial \mathcal{L}}{\partial W_{11}^{[l]}} = A_{11}^{[l-1]T} \frac{\partial \mathcal{L}}{\partial Z_{11}^{[l]}} + A_{12}^{[l-1]T} \frac{\partial \mathcal{L}}{\partial Z_{21}^{[l]}} + \cdots + A_{1N}^{[l-1]T} \frac{\partial \mathcal{L}}{\partial Z_{N1}^{[l]}}$$

In the image below, you can see the visualization of this calculation for 4 weights in the network:

[Figure: backpropagation of the gradient for 4 weights in the network]
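We can convince ourselves that the transpose trick really performs this per-weight summation by comparing $A^{[l-1]T}\,\frac{\partial \mathcal{L}}{\partial Z^{[l]}}$ against the explicit sum for a single weight. A small sketch with arbitrary shapes and random values of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
N, H_prev, H = 5, 3, 2                       # batch size and layer widths (illustrative)
A_prev = rng.standard_normal((N, H_prev))    # A^[l-1]
dZ = rng.standard_normal((N, H))             # upstream gradient ∂L/∂Z^[l]

dW = A_prev.T @ dZ                           # ∂L/∂W^[l], shape (H_prev, H)

# Explicit sum over the batch for the single weight W_11 (index [0, 0])
dW_11 = sum(A_prev[n, 0] * dZ[n, 0] for n in range(N))

print(np.isclose(dW[0, 0], dW_11))           # True
```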

At the same time, we need to calculate the derivative with respect to $A^{[l-1]}$, namely $\frac{\partial \mathcal{L}}{\partial A^{[l-1]}}$, so the upstream gradients can flow to the previous layer and the gradient of the weight matrix can be calculated in that layer:

$$\frac{\partial \mathcal{L}}{\partial A^{[l-1]}} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}} \frac{\partial Z^{[l]}}{\partial A^{[l-1]}}$$

The local gradient is:

$$\frac{\partial Z^{[l]}}{\partial A^{[l-1]}} = \frac{\partial \left(A^{[l-1]} W^{[l]}\right)}{\partial A^{[l-1]}} = W^{[l]}_{[H_{l-1} \times H_l]}$$

Once again, the dimensions of matrices do not match:

$$\frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]}\; W^{[l]}_{[H_{l-1} \times H_l]}$$

We can transpose the weight matrix to align the matrices:

$$\frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]}\; W^{[l]T}_{[H_l \times H_{l-1}]}$$

We can see the reasoning more easily if we check the matrix representation again:

$$\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial A_{11}^{[l-1]}} & \frac{\partial \mathcal{L}}{\partial A_{12}^{[l-1]}} & \cdots & \frac{\partial \mathcal{L}}{\partial A_{1H_{l-1}}^{[l-1]}} \\
\frac{\partial \mathcal{L}}{\partial A_{21}^{[l-1]}} & \frac{\partial \mathcal{L}}{\partial A_{22}^{[l-1]}} & \cdots & \frac{\partial \mathcal{L}}{\partial A_{2H_{l-1}}^{[l-1]}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{L}}{\partial A_{N1}^{[l-1]}} & \frac{\partial \mathcal{L}}{\partial A_{N2}^{[l-1]}} & \cdots & \frac{\partial \mathcal{L}}{\partial A_{NH_{l-1}}^{[l-1]}}
\end{bmatrix}
=
\begin{bmatrix}
\frac{\partial \mathcal{L}}{\partial Z_{11}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{12}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{1H_l}^{[l]}} \\
\frac{\partial \mathcal{L}}{\partial Z_{21}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{22}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{2H_l}^{[l]}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial \mathcal{L}}{\partial Z_{N1}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{N2}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{NH_l}^{[l]}}
\end{bmatrix}
\begin{bmatrix}
W_{11}^{[l]T} & W_{12}^{[l]T} & \cdots & W_{1H_{l-1}}^{[l]T} \\
W_{21}^{[l]T} & W_{22}^{[l]T} & \cdots & W_{2H_{l-1}}^{[l]T} \\
\vdots & \vdots & \ddots & \vdots \\
W_{H_l1}^{[l]T} & W_{H_l2}^{[l]T} & \cdots & W_{H_lH_{l-1}}^{[l]T}
\end{bmatrix}$$

By focusing on only the first value, $\frac{\partial \mathcal{L}}{\partial A_{11}^{[l-1]}}$, we see that the gradient of $A_{11}^{[l-1]}$ depends on the values it affected during forward propagation (the same logic as for $W_{11}^{[l]}$):

$$\frac{\partial \mathcal{L}}{\partial A_{11}^{[l-1]}} = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial Z_{11}^{[l]}} & \frac{\partial \mathcal{L}}{\partial Z_{12}^{[l]}} & \cdots & \frac{\partial \mathcal{L}}{\partial Z_{1H_l}^{[l]}} \end{bmatrix} \begin{bmatrix} W_{11}^{[l]T} \\ W_{21}^{[l]T} \\ \vdots \\ W_{H_l1}^{[l]T} \end{bmatrix}$$

Now we want to compute the gradient $\frac{\partial \mathcal{L}}{\partial Z^{[l-1]}}$. Recall that $A^{[l-1]}$ is obtained by applying an activation function to $Z^{[l-1]}$:

$$A^{[l-1]}_{[N \times H_{l-1}]} = \sigma\left(Z^{[l-1]}_{[N \times H_{l-1}]}\right)$$

In our examples, we use the ReLU activation function. Its derivative with respect to each element is given by:

$$\frac{\partial\, \mathrm{ReLU}\left(Z_{ij}^{[l-1]}\right)}{\partial Z_{ij}^{[l-1]}} = \begin{cases} 1 & \text{if } Z_{ij}^{[l-1]} > 0 \\ 0 & \text{otherwise} \end{cases}$$

For the $N \times H_{l-1}$ entries, we must compute a gradient for each element, producing a matrix of 0s and 1s. The derivative of ReLU applied element-wise can be written as the indicator (mask) of positive entries:

$$\frac{\partial\, \mathrm{ReLU}\left(Z^{[l-1]}\right)}{\partial Z^{[l-1]}} = \mathbb{1}\left(Z^{[l-1]} > 0\right)$$

where 1(·) is the element-wise indicator function (1 when the condition is true, 0 otherwise). In matrix form:

$$\frac{\partial\, \mathrm{ReLU}\left(Z^{[l-1]}\right)}{\partial Z^{[l-1]}} = \begin{bmatrix}
\mathbb{1}(Z_{11}^{[l-1]} > 0) & \mathbb{1}(Z_{12}^{[l-1]} > 0) & \cdots & \mathbb{1}(Z_{1H_{l-1}}^{[l-1]} > 0) \\
\mathbb{1}(Z_{21}^{[l-1]} > 0) & \mathbb{1}(Z_{22}^{[l-1]} > 0) & \cdots & \mathbb{1}(Z_{2H_{l-1}}^{[l-1]} > 0) \\
\vdots & \vdots & \ddots & \vdots \\
\mathbb{1}(Z_{N1}^{[l-1]} > 0) & \mathbb{1}(Z_{N2}^{[l-1]} > 0) & \cdots & \mathbb{1}(Z_{NH_{l-1}}^{[l-1]} > 0)
\end{bmatrix}$$

Because the derivative is applied element-wise, the upstream gradient $\frac{\partial \mathcal{L}}{\partial A^{[l-1]}}$ is combined with this mask element-wise ($\odot$), not by matrix multiplication:

$$\frac{\partial \mathcal{L}}{\partial Z^{[l-1]}}_{[N \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} \odot \frac{\partial A^{[l-1]}}{\partial Z^{[l-1]}}_{[N \times H_{l-1}]}$$

or

$$\frac{\partial \mathcal{L}}{\partial Z^{[l-1]}}_{[N \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} \odot \mathbb{1}\left(Z^{[l-1]} > 0\right)_{[N \times H_{l-1}]}$$
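Putting the three pieces together, the backward pass through one hidden layer is only a few lines of NumPy. This is a sketch under the same assumptions as above (ReLU activation, bias absorbed into the weights); the function name and shapes are my own:

```python
import numpy as np

def layer_backward(dZ, A_prev, W, Z_prev):
    """Backward pass through Z^[l] = A^[l-1] W^[l], with ReLU producing A^[l-1].

    dZ:     ∂L/∂Z^[l],   shape (N, H_l)
    A_prev: A^[l-1],     shape (N, H_{l-1})
    W:      W^[l],       shape (H_{l-1}, H_l)
    Z_prev: Z^[l-1],     shape (N, H_{l-1})
    """
    dW = A_prev.T @ dZ                  # ∂L/∂W^[l],   shape (H_{l-1}, H_l)
    dA_prev = dZ @ W.T                  # ∂L/∂A^[l-1], shape (N, H_{l-1})
    dZ_prev = dA_prev * (Z_prev > 0)    # ∂L/∂Z^[l-1]: element-wise ReLU mask
    return dW, dA_prev, dZ_prev

# Shape check with arbitrary sizes
rng = np.random.default_rng(5)
N, H_prev, H = 8, 5, 4
dW, dA_prev, dZ_prev = layer_backward(
    rng.standard_normal((N, H)),
    rng.standard_normal((N, H_prev)),
    rng.standard_normal((H_prev, H)),
    rng.standard_normal((N, H_prev)),
)
print(dW.shape, dA_prev.shape, dZ_prev.shape)   # (5, 4) (8, 5) (8, 5)
```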

3.3 - Repetition

At this point, we have completed the backpropagation through the matrix multiplication step. From here on, the process is essentially a repetition: each layer applies the same logic and propagates the gradients backwards through the weights (and, implicitly, the biases) and activations.

The only real exception was the output layer, where we combined the softmax function with the cross-entropy loss. That required a more detailed derivation, but the procedure for the hidden layers is consistent and is repeated across the entire network. You can see the pattern as follows:

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial Z^{[L]}}_{[N \times C]} &= \frac{1}{N}\left(\hat{Y} - Y\right)_{[N \times C]} \\
\frac{\partial \mathcal{L}}{\partial W^{[L]}}_{[H_{L-1} \times C]} &= \frac{\partial \left(A^{[L-1]} W^{[L]}\right)}{\partial W^{[L]}}_{[H_{L-1} \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[L]}}_{[N \times C]} = A^{[L-1]T}_{[H_{L-1} \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[L]}}_{[N \times C]} \\
\frac{\partial \mathcal{L}}{\partial A^{[L-1]}}_{[N \times H_{L-1}]} &= \frac{\partial \mathcal{L}}{\partial Z^{[L]}}_{[N \times C]} \frac{\partial \left(A^{[L-1]} W^{[L]}\right)}{\partial A^{[L-1]}}_{[C \times H_{L-1}]} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}}_{[N \times C]} W^{[L]T}_{[C \times H_{L-1}]} \\
\frac{\partial \mathcal{L}}{\partial Z^{[L-1]}}_{[N \times H_{L-1}]} &= \frac{\partial \mathcal{L}}{\partial A^{[L-1]}}_{[N \times H_{L-1}]} \odot \frac{\partial\, \sigma\!\left(Z^{[L-1]}\right)}{\partial Z^{[L-1]}}_{[N \times H_{L-1}]} = \frac{\partial \mathcal{L}}{\partial A^{[L-1]}}_{[N \times H_{L-1}]} \odot \mathbb{1}\left(Z^{[L-1]} > 0\right)_{[N \times H_{L-1}]} \\
&\;\;\vdots \\
\frac{\partial \mathcal{L}}{\partial W^{[l]}}_{[H_{l-1} \times H_l]} &= \frac{\partial \left(A^{[l-1]} W^{[l]}\right)}{\partial W^{[l]}}_{[H_{l-1} \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]} = A^{[l-1]T}_{[H_{l-1} \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]} \\
\frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} &= \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]} \frac{\partial \left(A^{[l-1]} W^{[l]}\right)}{\partial A^{[l-1]}}_{[H_l \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial Z^{[l]}}_{[N \times H_l]} W^{[l]T}_{[H_l \times H_{l-1}]} \\
\frac{\partial \mathcal{L}}{\partial Z^{[l-1]}}_{[N \times H_{l-1}]} &= \frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} \odot \frac{\partial\, \sigma\!\left(Z^{[l-1]}\right)}{\partial Z^{[l-1]}}_{[N \times H_{l-1}]} = \frac{\partial \mathcal{L}}{\partial A^{[l-1]}}_{[N \times H_{l-1}]} \odot \mathbb{1}\left(Z^{[l-1]} > 0\right)_{[N \times H_{l-1}]} \\
&\;\;\vdots \\
\frac{\partial \mathcal{L}}{\partial Z^{[1]}}_{[N \times H_1]} &= \frac{\partial \mathcal{L}}{\partial A^{[1]}}_{[N \times H_1]} \odot \frac{\partial\, \sigma\!\left(Z^{[1]}\right)}{\partial Z^{[1]}}_{[N \times H_1]} = \frac{\partial \mathcal{L}}{\partial A^{[1]}}_{[N \times H_1]} \odot \mathbb{1}\left(Z^{[1]} > 0\right)_{[N \times H_1]} \\
\frac{\partial \mathcal{L}}{\partial W^{[1]}}_{[H_0 \times H_1]} &= \frac{\partial \left(A^{[0]} W^{[1]}\right)}{\partial W^{[1]}}_{[H_0 \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[1]}}_{[N \times H_1]} = A^{[0]T}_{[H_0 \times N]} \frac{\partial \mathcal{L}}{\partial Z^{[1]}}_{[N \times H_1]}
\end{aligned}$$

I know it looks ugly, but if you pay some attention, you will see the pattern behind the gradient calculation. There are just a few rules that you need to follow for each block.
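To see the whole pattern hang together, here is a compact sketch of a full forward and backward pass through a small two-layer network, with a finite-difference check on one weight entry. The architecture, initialization, and random labels are arbitrary choices of mine for illustration, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, H, C = 8, 4, 5, 3                      # batch, input features, hidden units, classes

X = rng.standard_normal((N, M))
Y = np.eye(C)[rng.integers(0, C, size=N)]    # random one-hot labels
W1 = 0.1 * rng.standard_normal((M, H))
W2 = 0.1 * rng.standard_normal((H, C))

def forward(W1, W2):
    Z1 = X @ W1                              # Z^[1] = A^[0] W^[1]
    A1 = np.maximum(0.0, Z1)                 # A^[1] = ReLU(Z^[1])
    Z2 = A1 @ W2                             # Z^[2] = A^[1] W^[2]
    expZ = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
    Y_hat = expZ / expZ.sum(axis=1, keepdims=True)
    loss = -np.sum(Y * np.log(Y_hat)) / N    # cross-entropy
    return loss, (Z1, A1, Y_hat)

loss, (Z1, A1, Y_hat) = forward(W1, W2)

# Backward pass, following the pattern above
dZ2 = (Y_hat - Y) / N                        # ∂L/∂Z^[2]
dW2 = A1.T @ dZ2                             # ∂L/∂W^[2]
dA1 = dZ2 @ W2.T                             # ∂L/∂A^[1]
dZ1 = dA1 * (Z1 > 0)                         # ∂L/∂Z^[1] (ReLU mask)
dW1 = X.T @ dZ1                              # ∂L/∂W^[1]

# Finite-difference check on a single entry of W1
eps = 1e-6
W1_pert = W1.copy()
W1_pert[0, 0] += eps
loss_plus, _ = forward(W1_pert, W2)
W1_pert[0, 0] -= 2 * eps
loss_minus, _ = forward(W1_pert, W2)
print(np.isclose(dW1[0, 0], (loss_plus - loss_minus) / (2 * eps), atol=1e-6))  # True
```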


For comments, please send me an email.