Gradient Descent in PyTorch

Michele Di Fazio
Jul 19, 2020

All you need to succeed is 10,000 “epochs” of practice. (Malcolm Gladwell)


Introduction

The goal of this article is to walk the reader through all the steps of the gradient descent optimisation process. First we will define the algorithm and explain why it is so important for a Machine Learning model. To make things clearer, examples will be provided along the way.

Note to the reader

This article requires some familiarity with the definition and scope of a Machine Learning model. The code examples are written in Python, using PyTorch. More complex facets of optimisation algorithms, such as momentum or cyclical learning rates, are beyond the scope of this article. But if you want a more comprehensive outlook on the topic, I strongly suggest reading An overview of gradient descent optimization algorithms by Sebastian Ruder. He gives a thorough explanation of all the most important aspects of the algorithm.

What is gradient descent?

Gradient descent is an optimisation algorithm that minimises a differentiable function by iteratively subtracting from its weights a small fraction of their partial derivatives, thus moving them towards the minimum of the function.

To put it in simpler words, gradient descent is the process through which a Machine Learning model learns.
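
Before diving into the details, here is a minimal, self-contained sketch of that idea. The toy one-parameter function loss(w) = (w - 5)² is my own illustration, not part of the article's model: we repeatedly subtract the gradient, scaled by a learning rate, and the parameter slides towards the minimum.

import torch
# a toy differentiable function: loss(w) = (w - 5)^2, minimised at w = 5
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # a learning rate chosen arbitrarily for this toy example
for _ in range(50):
    loss = (w - 5) ** 2
    loss.backward()  # compute d(loss)/dw and store it in w.grad
    with torch.no_grad():
        w -= lr * w.grad  # move w a small step downhill
        w.grad.zero_()  # reset the gradient for the next step
print(w.item())  # approaches 5.0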

How does it work?

We just said that gradient descent optimises some differentiable function, which in our case is the MSE (mean squared error) loss, defined as MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², where yᵢ is the actual target and ŷᵢ the model’s prediction for the i-th example.

The job of the loss function is to assess how far the predictions of the model are from the actual targets: the farther they are, the greater the loss. And since we usually start with a model whose weights are initialised randomly, at the beginning the value of the loss function is likely to be very high. This is where the optimisation process steps in!
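
To make the formula concrete, here is a small sketch (the preds and targets tensors are made up for illustration) showing that F.mse_loss is just the mean of the squared differences:

import torch
import torch.nn.functional as F
preds = torch.tensor([2.0, 4.0, 6.0])    # hypothetical predictions
targets = torch.tensor([1.0, 5.0, 9.0])  # hypothetical targets
manual_mse = ((preds - targets) ** 2).mean()
library_mse = F.mse_loss(preds, targets)
print(manual_mse.item(), library_mse.item())  # both 3.6667: (1 + 1 + 9) / 3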

The steps of the gradient descent algorithm are the following:

  1. Generate predictions
  2. Calculate the loss
  3. Compute the gradients with respect to the weights and biases
  4. Adjust the weights and biases by subtracting a small quantity proportional to the gradients; the proportionality factor is called the learning rate
  5. Reset all the gradients back to zero

To keep things brief I won’t walk through the steps where I initialise the weights and the biases, but you can still find them on my GitHub.
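
For completeness, here is a plausible sketch of that initialisation, assuming a linear model with three inputs (matching the 1×3 weight shape shown later in the article); the exact code in the notebook may differ.

import torch
# random initialisation: inputs and targets are assumed to be pre-built
# tensors of shape (n, 3) and (n, 1) respectively
weight = torch.randn(1, 3, requires_grad=True)
bias = torch.randn(1, requires_grad=True)
def model(x):
    # a simple linear model: x @ w.T + b
    return x @ weight.t() + bias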

Once the predictions are computed, the next step is to calculate the loss.

# the functional module provides mse_loss
import torch.nn.functional as F
# in this way we compute the predictions
preds = model(inputs)
# we define the loss function, MSE loss
loss_fn = F.mse_loss
loss = loss_fn(preds, targets)
print(loss)  # tensor(8620795.)

This is how we measure how far off the predictions are from the actual targets. As expected, the loss value is quite high, over 86*10⁵, meaning that the predictions are very far from the actual values. Once more, this is because in the first step we compute the predictions using a set of weights and biases which are randomly initialised. It is a stab in the dark!

And this is why gradient descent is so crucially important, and at the heart of ML models. In fact, after having computed the loss, the following step is to calculate its gradients with respect to each weight and bias.

# in PyTorch we compute the gradients w.r.t. the weights and biases by calling backward()
loss.backward()

The gradient is the vector whose components are the partial derivatives of a differentiable function. In very simple, non-technical words, it is the partial derivative of the loss with respect to a single weight (or bias) while keeping all the others frozen.
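
A tiny autograd example makes this tangible (the function y = x² is my own illustration): its derivative is 2x, and PyTorch recovers exactly that.

import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2        # y = x^2, so dy/dx = 2x
y.backward()      # autograd fills in x.grad
print(x.grad)     # tensor(4.), the slope of x^2 at x = 2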

But why is the gradient necessary? How is it going to be used? Let me show you how.

The graph below plots the loss as a quadratic function of any single weight or bias. Keeping in mind what we said at the beginning, that gradient descent is the optimisation process that looks for the bottom of the function (the place where the loss is the lowest), the gradient can be seen as the rate of change of the loss: its slope. Gradient descent, then, is the way we decrease the loss, by adjusting those weights and biases that had been initialised randomly at the start. In the following steps they won’t be random anymore; they will be adjusted according to the gradients of the loss function.

Gradient descent can be interpreted as the way we teach the model to be better at predicting.

So, if the gradient (and so the slope) is positive, decreasing the weight’s value will decrease the loss. Conversely, if the gradient is negative (a negative slope), increasing the weight’s value will decrease the loss. In both cases, subtracting the gradient moves the weight in the right direction.
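
A quick numeric check of this rule, using the toy loss L(w) = w² (again my own illustration): subtracting the gradient moves the weight towards the minimum from either side.

import torch
for start in (3.0, -3.0):
    w = torch.tensor(start, requires_grad=True)
    loss = w ** 2          # gradient is 2w: positive at w=3, negative at w=-3
    loss.backward()
    with torch.no_grad():
        w -= 0.1 * w.grad  # subtracting the gradient
    print(start, "->", w.item())  # 3.0 -> 2.4 and -3.0 -> -2.4, both closer to 0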

The GIF is from “Optimizers Explained — Adam, Momentum and Stochastic Gradient Descent” by Casper Hansen, 2019.

To go back to our example: we previously got a loss value of 86*10⁵, so now let’s try to subtract from the original, random weights and biases their gradients (which were computed in the foregoing step with loss.backward()).

# I call torch.no_grad() to tell PyTorch not to track these operations for gradients
with torch.no_grad():
    # subtracting the (scaled) gradient from the weight
    weight -= weight.grad * 1e-7
    # subtracting the (scaled) gradient from the bias
    bias -= bias.grad * 1e-7
    # setting the gradients back to zero
    weight.grad.zero_()
    bias.grad.zero_()

Adjusted loss = 26*10⁵

By applying gradient descent only once we reduced the loss from 86*10⁵ to 26*10⁵. Amazing, isn’t it?

One more thing: you may have noticed that when I adjusted the weights and the biases, I multiplied their gradients (partial derivatives) by 1e-7. This number is called the learning rate. In fact, the gradients may turn out to be very big numbers, and if they were subtracted from the weights as they are, the step would be far too big.

We can verify this by calling model.parameters(), a method that returns the weights and the biases of the model (we could also inspect model.weight to check the weights, or model.bias to check the biases), and compare those values with their gradients (accessed through weight.grad and bias.grad).
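
The output below could be produced by something like this hedged sketch, assuming model is an nn.Linear(3, 1) whose gradients have just been computed:

print("Model weight:", model.weight)
print("Model bias:", model.bias)
print("Weights gradients:", model.weight.grad)
print("Bias gradients:", model.bias.grad)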

Model weight: Parameter containing:
tensor([[ 0.4463, -0.1423, -0.3371]], requires_grad=True)
Model bias: Parameter containing:
tensor([0.1331], requires_grad=True)
Weights gradients: tensor([[-3831077.7500, -6185880.5000, -3233904.2500]])
Bias gradients: tensor([-5086.3018])

As we can easily notice, the first weight has a value of 0.4463, while its gradient has a value of -3831077.7500. So, if we were to subtract this value from the weight as it is, it would be of no help, since we want to take small steps towards the bottom of the function, and not risk jumping to the opposite end of it, where the loss might be even higher. This is why we multiply the gradient by a learning rate: a small amount that we get to pick, thus avoiding risky and unstable moves.

Now you might be wondering: how do I pick the correct learning rate? The short answer is through continuous, small tweaks. My advice is to start with a small value and see what effect it has on the loss. But this is a much more complicated topic that goes beyond the scope of this article; if you want to go deeper into it, I recommend reading the article “Estimating an Optimal Learning Rate For a Deep Neural Network” by Pavel Surmenok.
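
As an illustration of that trial-and-error approach, here is a hedged sketch of a crude learning-rate sweep; the candidate rates are arbitrary, and each candidate trains a fresh copy of the parameters for a few steps so we can compare the resulting losses.

# a crude sweep: try a few candidate rates on fresh copies of the parameters
for lr in (1e-5, 1e-6, 1e-7, 1e-8):
    w = weight.detach().clone().requires_grad_(True)
    b = bias.detach().clone().requires_grad_(True)
    for _ in range(100):
        loss = F.mse_loss(inputs @ w.t() + b, targets)
        loss.backward()
        with torch.no_grad():
            w -= w.grad * lr
            b -= b.grad * lr
            w.grad.zero_()
            b.grad.zero_()
    print(lr, loss.item())  # pick the rate with the lowest stable loss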

Going back to our example, all this was achieved with just one round of optimisation. But what would happen if we repeated this learning process, let’s say, 10,000 times? Well, it’s time to find out.

for i in range(10000):
    # compute the predictions
    preds = model(inputs)
    # compute the loss
    loss = loss_fn(preds, targets)
    # calculate the gradients
    loss.backward()
    with torch.no_grad():
        weight -= weight.grad * 1e-7
        bias -= bias.grad * 1e-7
        weight.grad.zero_()
        bias.grad.zero_()

Final loss: 12767.9688

The loss drops from 86*10⁵ down to 12,767.97. Since the MSE is a squared quantity, this means that the predictions of the model are, on average, roughly 113 (the square root of the loss) away from the actual values.
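
As a side note, PyTorch’s built-in optimisers package steps 4 and 5 for us. Here is a hedged sketch of the same loop rewritten with torch.optim.SGD; with the same learning rate it should behave like the manual version.

# the optimiser performs the update (opt.step) and the reset (opt.zero_grad)
opt = torch.optim.SGD([weight, bias], lr=1e-7)
for i in range(10000):
    preds = model(inputs)
    loss = loss_fn(preds, targets)
    loss.backward()
    opt.step()       # parameter -= parameter.grad * lr, for every parameter
    opt.zero_grad()  # reset all gradients for the next iteration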

Congratulations, you taught your first model how to learn!

Conclusions

First of all, thank you for having reached the end of the article. I hope you found it stimulating, since I enjoyed writing it so much. And if you are interested in the topic, stay tuned for more articles on ML models!


References

[1] Machine Learning Glossary, “Gradient Descent”. Available: https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html

[2] Ruder S., “An overview of gradient descent optimization algorithms”, 2016. Available: https://ruder.io/optimizing-gradient-descent/

[3] Surmenok P., “Estimating an Optimal Learning Rate For a Deep Neural Network”, 2017. Available: https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0

[4] Aakash NS, Linear regression with PyTorch, part 2 “PyTorch: Zero to GANs”, 2020. Available: https://jovian.ml/aakashns/02-linear-regression

[5] Hansen C., “Optimizers Explained — Adam, Momentum and Stochastic Gradient Descent”, 2019. Available: https://mlfromscratch.com/optimizers-explained/#/
