The Covid-19 pandemic has taught us to see a perspective on life that often stays hidden under layers of stress and work-related anxiety amongst employees. While organizations are marshalling their resources to tackle this issue within their culture, such a multi-dimensional problem needs more than psychological approaches and therapy to eradicate its root cause.

Image Source: www.lattice.com

The ever-evolving culture at the finest and biggest organizations puts forward ample opportunities: empowering technologies, life-changing innovations, better pay, and much more. With a vision to change the world, some of the finest and brightest minds from across the globe pool their resources and bring forward products with huge potential.

At this pace of change, every organization scrambles its best resources to hunt for the talent that can actually turn a company’s vision into reality. However, this bright and enthusiastic picture often hides a darker side: performance anxiety, work-related stress, a strained balance between work and personal life, family issues, and more.

With this context in mind, we aim to provide a multi-dimensional route to overcome this persistent issue: a comprehensive, machine-learning-based approach that predicts stress amongst employees, so that pre-emptive measures can be taken before the problem takes root!

Before we begin, here are some key points to remember:

  1. Code and Data for the study can be found at this GitHub Repository.
  2. Five different algorithms have been used in the study (Logistic Regression, Decision Tree, Support Vector Machines, k-Nearest Neighbors, and Random Forest).
  3. Of these, Logistic Regression, Support Vector Machines, and Random Forest perform the best and have therefore been optimized further using ‘Grid Search’ to fine-tune their hyperparameters.
  4. The flow of this study can be applied to any similar scenario, even if your data differs from the dataset used here.

Import some useful libraries, tools, and metrics that will be used throughout the study.

You can also set standardized display options for visualizing the data in graphs, as we have done in this study.
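As a reference point, here is a minimal sketch of what these imports and display settings might look like; the exact set in the repository may differ.

```python
# Library choices below are assumptions based on the steps described later.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import (accuracy_score, classification_report,
                             jaccard_score, log_loss)

# Standardized display options for all plots in the study.
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
```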

Once we are done with the foundational structure, it’s time to load the dataset and get a brief insight into it: the attributes of the data, statistical information about its nature, and any trends or patterns present.
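A sketch of this step, assuming the data ships as a CSV file (the file name `employee_data.csv` is a hypothetical placeholder for the one in the repository):

```python
# Load the dataset; the file name is a placeholder for the one in the repo.
df = pd.read_csv('employee_data.csv')

df.info()                # attributes, dtypes, and non-null counts
print(df.describe())     # basic statistical description
print(df.head())         # first few rows for a quick look
```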

A brief overview of the attributes of the dataset and the type of data present.
A brief statistical description of the dataset.

Before proceeding further, it is best practice to analyze how much data is missing from the dataset, whether it contains null values, and whether any entries fail to match the data type of the attribute they belong to.

For this, we use a heatmap. A more practical scenario would lean towards imputing the missing values in the dataset, but that is beyond the scope of this discussion for now. So, for the time being, we choose to drop the missing values and proceed with the remaining data.
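A minimal sketch of how such a heatmap can be drawn and the missing rows dropped:

```python
# Visualize missing values: each highlighted cell marks a NaN in that column.
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.title('Missing values per column')
plt.show()

# For now, drop rows containing missing values instead of imputing them.
df = df.dropna()
```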

The yellow bars in the heatmap show the values that are missing (and for the time being will be dropped) in the respective columns of the dataframe.

Visualizing the data for better insight:

Graphical tools provide a general but foundational insight into the data and outline the basic nature and dimensions of the problem. A quick visual grasp of the subject is quite helpful when planning the structure and flow of the study.

  1. Firstly, let’s take a quick look at the proportion of the employees under study who are stressed (a code sketch for this plot follows the figure caption).
Graph depicting the number of employees in stress.
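One way this plot might be produced; the `Stressed` target column name is an assumption about the dataset:

```python
# Count of stressed vs. non-stressed employees; 'Stressed' is a
# placeholder for the study's actual target column.
sns.countplot(x='Stressed', data=df)
plt.title('Number of employees in stress')
plt.show()

print(df['Stressed'].value_counts(normalize=True))  # stressed ratio
```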

2. Secondly, let us look at the relationship between the level of Job Satisfaction amongst employees who are and aren’t stressed, and the Education Field they are associated with:
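A sketch of one way to draw this comparison, assuming hypothetical `JobSatisfaction`, `EducationField`, and `Stressed` column names:

```python
# Job Satisfaction and Education Field, each split by stress status.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.countplot(x='JobSatisfaction', hue='Stressed', data=df, ax=axes[0])
sns.countplot(x='EducationField', hue='Stressed', data=df, ax=axes[1])
axes[1].tick_params(axis='x', rotation=45)  # keep long field names legible
plt.tight_layout()
plt.show()
```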

Although the level of Job Satisfaction does seem to bear a relation to stress amongst employees, the Education Field doesn’t seem to play an important role.

Similarly, we could go on finding visual relationships amongst the various attributes present in the data; however, that would be quite time-consuming (which is one reason we perform feature selection later on), so we leave it to your curiosity and to the nature of the data you might be using.

Next, let’s define some utility functions to automate the task of processing our data for the study. Some attributes in the data are categorical, so we choose to encode them with labels (using a Label Encoder) and prepare the data for the model; a sketch of such a utility follows the note below.

Note: You can use other encoding techniques as well (apart from the one used here). However, some machine learning algorithms may treat the encoded labels as ordinary numeric values and give them unintended weight, thereby altering the model’s performance. We have made sure that doesn’t happen here, but variations may invite differences!
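A minimal sketch of such an encoding utility, assuming the categorical attributes are the pandas `object`-typed columns:

```python
def encode_labels(data):
    """Label-encode every categorical (object-typed) column.

    A sketch of the kind of utility described above; your version may
    also separate out the target column or scale numeric features.
    """
    encoded = data.copy()
    for col in encoded.select_dtypes(include='object').columns:
        encoded[col] = LabelEncoder().fit_transform(encoded[col])
    return encoded

df_encoded = encode_labels(df)
```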

Now, as a best-practice measure, we should try to find the correlation between the Dependent and Independent variables of the data. A proper technique involves more than just a visual representation, but that would be out of the scope of this study.
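A sketch of the visual check, using the encoded dataframe from the utility above:

```python
# Visual check of pairwise correlations across the encoded dataset,
# including the target column.
plt.figure(figsize=(16, 12))
sns.heatmap(df_encoded.corr(), cmap='coolwarm', annot=False)
plt.title('Correlation heatmap')
plt.show()
```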

Heatmap to visually find correlations amongst the Dependent and Independent variables

Let us now split our data into training and validation sets for training the models.
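A sketch of the split; again, `Stressed` stands in for the dataset’s actual target column, and the split ratio is illustrative:

```python
X = df_encoded.drop('Stressed', axis=1)   # independent variables
y = df_encoded['Stressed']                # dependent (target) variable

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)
```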

  1. The first algorithm used in the study is Logistic Regression. The parameters passed to it and the results achieved are displayed below, followed by a code sketch:
Logistic Regression
Jaccard Similarity Score
Logarithmic Loss
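A hedged sketch of this step; the hyperparameter values are illustrative, not necessarily those used in the repository:

```python
# Logistic Regression, evaluated with the Jaccard similarity score
# and logarithmic loss as described above.
logreg = LogisticRegression(C=0.01, solver='liblinear')
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_val)
y_prob = logreg.predict_proba(X_val)

print('Jaccard score:', jaccard_score(y_val, y_pred))
print('Log loss:', log_loss(y_val, y_prob))
```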

2. The second algorithm used in the study is the Decision Tree. The parameters passed to it and the results achieved are displayed below, followed by a code sketch:

Decision Tree
Accuracy Score
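A sketch of this step, with illustrative parameters:

```python
# Decision Tree classifier, scored with plain accuracy.
tree = DecisionTreeClassifier(criterion='entropy', max_depth=4)
tree.fit(X_train, y_train)

print('Accuracy:', accuracy_score(y_val, tree.predict(X_val)))
```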

3. The third algorithm used in the study is Support Vector Machines. The parameters passed to it and the results achieved are displayed below, followed by a code sketch:

Support Vector Machines
Classification Report and Accuracy Score
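A sketch of this step; the kernel and regularization values are assumptions:

```python
# Support Vector Machine, evaluated with a full classification report
# alongside the accuracy score.
svm = SVC(kernel='rbf', C=1.0)
svm.fit(X_train, y_train)

y_pred = svm.predict(X_val)
print(classification_report(y_val, y_pred))
print('Accuracy:', accuracy_score(y_val, y_pred))
```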

4. The fourth algorithm used in the study is K-Nearest Neighbors. The parameters passed to it and the results achieved are displayed below, followed by a code sketch:

K-Nearest Neighbors
Error rate vs k-value
Accuracy
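A sketch of the error-rate-versus-k analysis and the final fit; the range of k values tried is an assumption:

```python
# Try a range of k values and plot the validation error rate for each,
# then train the final model with the best-looking k.
error_rate = []
for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    error_rate.append(np.mean(knn.predict(X_val) != y_val))

plt.plot(range(1, 40), error_rate, marker='o')
plt.xlabel('k')
plt.ylabel('Error rate')
plt.title('Error rate vs. k-value')
plt.show()

best_k = int(np.argmin(error_rate)) + 1
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_val, knn.predict(X_val)))
```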

5. The fifth algorithm used in the study is the Random Forest Classifier. The parameters passed to it and the results achieved are displayed below, followed by a code sketch:

Random Forest Classifier
Accuracy
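A sketch of this step; the number of trees is illustrative:

```python
# Random Forest classifier, scored with plain accuracy.
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

print('Accuracy:', accuracy_score(y_val, forest.predict(X_val)))
```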

Once we have our models, we can tune them with Grid Search to optimize training and improve their accuracy. This way, we end up with the best models to experiment with further in this study (instead of carrying all of them forward) and from which to derive the final, applicable result.
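To illustrate, here is how one such grid search might look for the SVM; the parameter grid shown is an assumption rather than the one from the repository, and the same pattern applies to the other models with their own grids:

```python
# Grid search over SVM hyperparameters with 5-fold cross-validation.
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf']}

grid = GridSearchCV(SVC(), param_grid, cv=5, verbose=1)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Best CV score:', grid.best_score_)
```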

Optimized parameters from the Grid Searches are as follows:

Logistic Regression
Decision Tree
Support Vector Machines
k-Nearest Neighbors
Random Forest

From the above results, we can conclude that the Logistic Regression, Support Vector Machines, and Random Forest algorithms have outperformed the other models in the study.

Hence, we now proceed to feature selection, in order to find the attributes of the data that contribute the most to the model’s decision-making and accuracy.

We have used Recursive Feature Elimination (RFE) in our case, which is a wrapper-type feature selection method.

The selected features have the value True in the list of boolean values returned by the method, and each feature has a ranking as well (selected features carry a ranking of 1, while discarded features are ranked higher). We have kept the 30 best features in this study, but this number might vary.
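A sketch of this step; wrapping a Logistic Regression estimator inside RFE is an assumption, and the repository may wrap a different one:

```python
# Recursive Feature Elimination, keeping the 30 best features.
rfe = RFE(estimator=LogisticRegression(solver='liblinear'),
          n_features_to_select=30)
rfe.fit(X, y)

print(rfe.support_)    # True for each selected feature
print(rfe.ranking_)    # selected features are ranked 1
selected_cols = X.columns[rfe.support_]
```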

After this, we again prepare our data using the utility functions defined above and split it into training and validation sets.

Thereafter, we apply our best-performing algorithms to the selected features, using the hyperparameters optimized previously via Grid Search.
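A sketch of that final step, reusing the selected feature list and the grid-search results from the earlier sketches (shown for the SVM; the other shortlisted models follow the same pattern):

```python
# Re-split the data using only the selected features, then retrain the
# shortlisted models with the grid-search hyperparameters.
X_sel = X[selected_cols]
X_train, X_val, y_train, y_val = train_test_split(
    X_sel, y, test_size=0.2, random_state=42)

final_svm = SVC(**grid.best_params_)
final_svm.fit(X_train, y_train)
print('SVM accuracy:', accuracy_score(y_val, final_svm.predict(X_val)))
```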

The final results of the training in this study are shown below:

Logistic Regression
Random Forest
Support Vector Machines

Here, we can conclude that the Support Vector Machine outperformed all the other algorithms in this study (even though Logistic Regression performed the best before feature selection)!

We could continue by testing this model’s efficiency on a held-out test dataset, but we’ll leave that for you to explore.

Key takeaways from this study:

  1. The flow of this study can be replicated on any comparable dataset to build practically applicable models that address this issue within an organization.
  2. More classification algorithms (such as XGBoost) can be applied in the study as well; the only caveat is that the encoding of categorical values has to be chosen in the context of the algorithm being used.
  3. As organizations keep adapting to unconventional working styles and patterns, the need for robust solutions for employees grows with each passing day. This study can serve as a foundational base from which organizations can scale up such applications.
