Facial expression recognition using PyTorch

Tumin Sharma
10 min read · Jul 3, 2020


Computer Vision is a very well-known keyword today. There is real excitement around it: research teams are trying to make computers see and understand the world just as humans do. One feature of human vision is reading the facial expressions of friends, family, and strangers. Today, we will build a project that teaches a computer to do the same!

The main idea
Computer Vision is a field of Artificial Intelligence that trains computers to understand the visual world by replicating the complexity of the human visual system. Using digital images acquired from cameras and videos, machines can be ‘trained’ to accurately identify and classify objects and even react to what they perceive.

Facial expression recognition is one of the classification tasks that falls under computer vision. The job of our project will be to look through a camera, which acts as the machine's eyes, detect the face of a person (if any), and classify it according to their current expression/mood.

The planning

So, let us plan a checklist that we will follow from the start to the completion of the project. First, we will find a dataset on which to train the model. Then we will explore the dataset to gain insights about how we can design the model later.

Secondly, we will get the data ready so that we can work on the model, and then we will create a model pipeline. After the model is ready, we will train it and tweak the hyperparameters as needed. Once the model is fully trained, we will test it on the test data and save it.

Finally, we will write a Python script that uses the Haar Cascade classifier from the OpenCV module to detect our face through the camera, and then uses the saved model to classify the expression.

Planning for the project: first, collect the data and make the model; then, train the model; finally, script the program.

Our planning of the roadmap is ready. Now we can get on with the project!

Working with the data

After searching the internet for some time, I found a dataset that interested me and is perfect for the problem we are going to solve in this project!

The dataset is called FER2013, which I found here. Unfortunately, there was no information about which class index corresponds to which expression, so we will have to work that out ourselves to make the project fruitful.

After taking a look at the data frame, we can see clearly that there are three columns. The first holds the index number of the expression class (0, 1, 2, and so on), but we do not yet know which index corresponds to which expression, so we will categorize it ourselves later.

The second column contains the pixel values of the photo of a face with the corresponding expression. The Usage column states which rows are for testing and which are for training.
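To get a first look at the raw data, here is a minimal sketch, assuming the CSV is saved as fer2013.csv and uses the standard FER2013 column names (emotion, pixels, Usage):

import pandas as pd

# load the FER2013 csv (file name assumed)
df = pd.read_csv('fer2013.csv')
print(df.columns.tolist())   # expected: ['emotion', 'pixels', 'Usage']
print(df.head())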

After a little exploration of the dataset, this is what I found:

The total number of samples is 35,887, which is a good amount for training. There are actually three kinds of values in the Usage column: Training, PrivateTest, and PublicTest. Most likely they ran a contest: the Training rows were for the contestants to train on, PublicTest was what they had to predict, and PrivateTest was used by the online leaderboard to score accuracy.

We can see that each of the test splits takes up about 10 percent of the total data frame, while the training split takes up 80 percent.

We can also see that the total number of expression classes is seven. So, let us find out which class represents which mood by looking at the photos alongside their expression class numbers.
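A minimal sketch of that exploration, continuing from the dataframe loaded above (the emotion and pixels column names are my assumption of the standard FER2013 layout):

import numpy as np
import matplotlib.pyplot as plt

# how the splits are distributed
print(df['Usage'].value_counts())

# number of pixel values per sample
print(len(df['pixels'][0].split(' ')))

# show one sample face per emotion index to figure out the mapping
fig, axes = plt.subplots(1, 7, figsize=(14, 2))
for cls, ax in enumerate(axes):
    row = df[df['emotion'] == cls].iloc[0]
    img = np.array(row['pixels'].split(' '), dtype=np.uint8).reshape(48, 48)
    ax.imshow(img, cmap='gray')
    ax.set_title(str(cls))
    ax.axis('off')
plt.show()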

While running the exploration code, I found that the total number of pixel values per image is 2,304, which means each picture is 48x48. After looking at the photos, I also concluded that 0 represents anger, 1 disgust, 2 fear, 3 happiness, 4 sadness, 5 surprise, and 6 neutral.

classes = {
    0: 'Angry', 1: 'Disgust', 2: 'Fear', 3: 'Happy', 4: 'Sad', 5: 'Surprise', 6: 'Neutral'
}

Since we have relatively little data for a large deep learning model, we will apply some transformations to the images. This helps avoid overfitting and effectively gives us more training samples. I am going to apply a random 48x48 crop after padding each image by 4 pixels on every side, and a random horizontal flip with 50 percent probability. I will also normalize the pixel values with a mean and standard deviation of 0.5.

# imports used throughout the data pipeline
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

# this is for the transforms
train_trfm = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Grayscale(num_output_channels=1),
    transforms.RandomCrop(48, padding=4, padding_mode='reflect'),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,), inplace=True)
])

val_trfm = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

Now, we will create the dataset class, which will transform the pixel strings into properly shaped tensors for the model. I am going to use the test split as the validation dataset, since the labels for the test images are available.

# Creating the class for our dataset for the FER
class FERDataset(Dataset):

    def __init__(self, images, labels, transforms):
        self.X = images
        self.y = labels
        self.transforms = transforms

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        data = [int(m) for m in self.X[i].split(' ')]
        data = np.asarray(data).astype(np.uint8).reshape(48, 48, 1)
        data = self.transforms(data)
        label = self.y[i]
        return (data, label)

# assigning the transformed data
train_data = FERDataset(train_images, train_labels, train_trfm)
val_data = FERDataset(test_images, test_labels, val_trfm)
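The last two lines assume that train_images, train_labels, test_images, and test_labels already exist. They are not defined in this post, so here is one way they might be split out of the dataframe using the Usage column (a sketch; the variable names and the choice to use both test splits for validation are my assumptions):

# split the dataframe by the Usage column
train_df = df[df['Usage'] == 'Training']
test_df = df[df['Usage'] != 'Training']   # PublicTest + PrivateTest, used here for validation

train_images = train_df['pixels'].values
train_labels = train_df['emotion'].values
test_images = test_df['pixels'].values
test_labels = test_df['emotion'].values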

When the dataset is fully ready, we create our data loaders; the code below uses a batch size of 400 for training and twice that for validation. Having a separate validation data loader is good practice: it lets us know when we may be overfitting the model and helps us judge the accuracy better.

batch_num = 400
train_dl = DataLoader(train_data, batch_num, shuffle=True, num_workers=4, pin_memory=True)
val_dl = DataLoader(val_data, batch_num*2, num_workers=4, pin_memory=True)

Since everything is ready now with the data, let us take a look at the photos of the first batch of the training data loader:
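The original post shows a grid of sample faces at this point. A minimal sketch of how such a grid can be plotted with torchvision's make_grid (my choice of visualization, not necessarily what the notebook uses):

import matplotlib.pyplot as plt
from torchvision.utils import make_grid

# grab the first batch from the training data loader and plot it as a grid
for images, labels in train_dl:
    grid = make_grid(images[:64], nrow=8, normalize=True)   # rescale for display
    plt.figure(figsize=(8, 8))
    plt.imshow(grid.permute(1, 2, 0).numpy())
    plt.axis('off')
    plt.show()
    break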

Looks like all is well with the dataset and we can begin with our model!

Getting the model ready

Since the data is ready, we will now create the model. First, we will create a base classifier class that computes the training loss, the validation loss and accuracy, and the aggregated loss and accuracy at the end of each epoch.

import torch.nn as nn
import torch.nn.functional as F

def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))

class FERBase(nn.Module):

    # this takes in a batch from the training dl
    def training_step(self, batch):
        images, labels = batch
        out = self(images)                   # calls the model and generates predictions
        loss = F.cross_entropy(out, labels)  # calculates loss against the real labels using cross entropy
        return loss

    # this takes in a batch from the validation dl
    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        loss = F.cross_entropy(out, labels)
        acc = accuracy(out, labels)          # calls the accuracy function to measure the accuracy
        return {'val_loss': loss.detach(), 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # mean loss over the epoch's batches

        batch_accs = [x['val_acc'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()      # mean accuracy over the epoch's batches

        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}

    def epoch_end(self, epoch, result):
        print("Epoch [{}], last_lr: {:.5f}, train_loss: {:.4f}, val_loss: {:.4f}, val_acc: {:.4f}".format(
            epoch, result['lrs'][-1], result['train_loss'], result['val_loss'], result['val_acc']))

Next, we will create a convolutional neural network with residual blocks that gradually increases the number of channels while decreasing the spatial dimensions of the face data. This is followed by a fully connected layer that outputs 7 raw scores (logits), one per facial expression class; applying softmax to these scores turns them into probabilities.

def conv_block(in_chnl, out_chnl, pool=False, padding=1):
    layers = [
        nn.Conv2d(in_chnl, out_chnl, kernel_size=3, padding=padding),
        nn.BatchNorm2d(out_chnl),
        nn.ReLU(inplace=True)]
    if pool: layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class FERModel(FERBase):
    def __init__(self, in_chnls, num_cls):
        super().__init__()

        self.conv1 = conv_block(in_chnls, 64, pool=True)                           # 64x24x24
        self.conv2 = conv_block(64, 128, pool=True)                                # 128x12x12
        self.resnet1 = nn.Sequential(conv_block(128, 128), conv_block(128, 128))   # residual block 1: 2 conv blocks

        self.conv3 = conv_block(128, 256, pool=True)                               # 256x6x6
        self.conv4 = conv_block(256, 512, pool=True)                               # 512x3x3
        self.resnet2 = nn.Sequential(conv_block(512, 512), conv_block(512, 512))   # residual block 2: 2 conv blocks

        self.classifier = nn.Sequential(nn.MaxPool2d(3),            # 512x1x1
                                        nn.Flatten(),
                                        nn.Linear(512, num_cls))    # one output per class

    def forward(self, xb):
        out = self.conv1(xb)
        out = self.conv2(out)
        out = self.resnet1(out) + out

        out = self.conv3(out)
        out = self.conv4(out)
        out = self.resnet2(out) + out

        return self.classifier(out)

The network architecture used here is popularly known as ResNet9.

The model is now in place. Next, let us set up PyTorch so that it can use the GPU for training, and then we can happily head on to training our model!
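The GPU setup itself is not shown in the post, but the prediction script later calls get_default_device(), so the notebook presumably defines helpers along these lines. A minimal sketch of that common pattern:

# device-handling helpers (a sketch of the usual pattern, including the
# get_default_device function that the prediction script relies on later)
def get_default_device():
    return torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

def to_device(data, device):
    # move a tensor (or a list/tuple of tensors) to the chosen device
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader():
    # wraps a data loader so that every batch is moved to the device
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        for b in self.dl:
            yield to_device(b, self.device)

    def __len__(self):
        return len(self.dl)

device = get_default_device()
train_dl = DeviceDataLoader(train_dl, device)
val_dl = DeviceDataLoader(val_dl, device)
model = to_device(FERModel(1, 7), device)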

Training the model

For training, I am going to use the One Cycle learning rate schedule, so the learning rate is not set by hand for each stage: it starts very low, increases to a maximum, and is then reduced again.

@torch.no_grad()    # disable gradient tracking while evaluating
def evaluate(model, val_loader):
    # This function evaluates the model and returns the validation loss and accuracy
    model.eval()
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

# getting the current learning rate
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

# this fit function follows the intuition of the 1cycle lr policy
def fit(epochs, max_lr, model, train_loader=train_dl, val_loader=val_dl,
        weight_decay=0, grad_clip=None, opt_func=torch.optim.Adam):
    torch.cuda.empty_cache()
    history = []    # keeps track of the evaluation results

    # setting up a custom optimizer including weight decay
    optimizer = opt_func(model.parameters(), max_lr, weight_decay=weight_decay)
    # setting up the 1cycle lr scheduler
    sched = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr, epochs=epochs,
                                                steps_per_epoch=len(train_loader))

    for epoch in range(epochs):
        # training
        model.train()
        train_losses = []
        lrs = []
        for batch in train_loader:
            loss = model.training_step(batch)
            train_losses.append(loss)
            loss.backward()

            # gradient clipping
            if grad_clip:
                nn.utils.clip_grad_value_(model.parameters(), grad_clip)

            optimizer.step()
            optimizer.zero_grad()

            # record the lr
            lrs.append(get_lr(optimizer))
            sched.step()

        # validation
        result = evaluate(model, val_loader)
        result['train_loss'] = torch.stack(train_losses).mean().item()
        result['lrs'] = lrs
        model.epoch_end(epoch, result)
        history.append(result)
    return history

Let us use gradient clipping at 0.1, so gradient values are clipped to the range [-0.1, 0.1] and gradient descent cannot take excessively large steps. After a lot of experimentation, I settled on 30 epochs with a maximum One Cycle learning rate of 0.001, which currently gives me about 70% accuracy.
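Put together, the training call would look roughly like this (a sketch using the hyperparameters mentioned above; the exact call is in the notebook):

# train for 30 epochs with a max lr of 0.001 and gradient clipping at 0.1
history = fit(30, 0.001, model, train_dl, val_dl, grad_clip=0.1)
evaluate(model, val_dl)   # final validation loss and accuracy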

Please note that you can find the log of all the model architectures I tried, along with their corresponding accuracies, in the Jupyter notebook in the completed repo linked at the end of this article.

Saving the model

Since training the model is successfully done, we can now go on and save the model so that we can access it without re-training it each time we use the facial expression recognition script.

torch.save(model.state_dict(), 'FER2013-Resnet9.pth')

Getting the scripts ready

First things first: FERModel.py, which holds the model definition, is ready. Going through that file here would take too long, so I suggest you open it from the repo link and look at it yourself. Now, it's time for the main script!

Since we need to take the ROI (region of interest) of the face, transform it into a tensor, and feed it to the model, we first define a helper that loads the saved model and makes predictions.

# function to turn photos to tensor
def img2tensor(x):
    transform = transforms.Compose(
        [transforms.ToPILImage(),
         transforms.Grayscale(num_output_channels=1),
         transforms.ToTensor(),
         transforms.Normalize((0.5,), (0.5,))])
    return transform(x)

# the model for predicting
model = FERModel(1, 7)
softmax = torch.nn.Softmax(dim=1)
model.load_state_dict(torch.load('FER2013-Resnet9.pth', map_location=get_default_device()))
model.eval()   # switch to inference mode

def predict(x):
    # x is an already transformed tensor; add a batch dimension before the forward pass
    out = model(x[None])
    scaled = softmax(out)
    prob = torch.max(scaled).item()
    label = classes[torch.argmax(scaled).item()]
    return {'label': label, 'probability': prob}

Next, let us write the script that connects to our camera and loads the Haar cascade XML file to detect faces. Each face it finds is fed to the predict function to get a predicted label.

import cv2

# load the Haar cascade face detector and open the camera
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
vid = cv2.VideoCapture(0)   # 0 = default camera

while vid.isOpened():
    _, frame = vid.read()
    # grayscale copy of the frame
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # detect faces in the grayscale frame
    faces = face_cascade.detectMultiScale(gray)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2)
        # take the region of interest of the face only, in gray
        roi_gray = gray[y:y+h, x:x+w]
        resized = cv2.resize(roi_gray, (48, 48))   # resize to a 48x48 image
        # predict the mood
        img = img2tensor(resized)
        prediction = predict(img)
        cv2.putText(frame, f"{prediction['label']}", (x, y-10), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0))
    cv2.imshow('video', frame)
    if cv2.waitKey(1) == 27:   # Esc key stops the loop
        break

vid.release()
cv2.destroyAllWindows()

With all done, it’s time for the model to run!

End Note

The code is done! Setting up the data was a bit tiring. Some of the data was also ambiguous: the fear, sad, and angry faces looked pretty much the same, and even the final model found it difficult to tell them apart.

Here is the link to the repository. You can download it and try it out yourself or take reference from it.

This was one of the most interesting Deep Learning projects that I have ever worked on. Thanks to Jovian and zerotogans.com for the inspiration!!
