Data Science

Has Gold been a Good Investment Over the Stock Market?

An Exploratory Data Analysis project to compare Gold vs S&P 500 index

Vinod Dhole

--

Image source : https://www.pexels.com/photo/coins-on-top-of-a-laptop-6770774/
Index Of Contents
· Introduction
What is EDA?
What is the S&P 500?
How to run the code
Outline of Project
Import and Install the required libraries
Download real-world dataset
Perform Data Preprocessing and Cleaning
Remove unwanted Columns
Transform the DataType
Handle Missing Values
Duplicate data
Set Proper Index
Create additional Required Columns
Perform Exploratory Analysis and Visualization
Analyze Closing Price
Analyze Trade Volume
Analyze the distribution
Calculate & Analyze Returns
Moving Average
Ask and Answer Interesting Questions
Q1: What was the impact on prices during COVID 19 Outbreak, Does gold crash with the stock market?
Q2: Is there any Correlation between S&P 500 and Gold
Q3: Does Gold Beat S&P 500?
Q4: Calculate all Year end to Date Returns
Q5: Which asset is more Volatile
Inferences
Future Work
References
Conclusion

Introduction

In this article we are going to analyze historical prices of Gold and S&P 500 index and try to find some useful insights using Exploratory Data Analysis (EDA).

Disclaimer: The purpose of this article is only to demonstrate the EDA techniques with comparative analysis of Gold and S&P 500. This is not investment advice.

What is EDA?

Exploratory Data Analysis (EDA) is the process of exploring, investigating and gathering insights from data using statistical measures and visualizations. The objective of EDA is to develop an understanding of data, by uncovering trends, relationships and patterns.

EDA is both a science and an art. On the one hand it requires the knowledge of statistics, visualization techniques and data analysis tools like Numpy, Pandas, Seaborn etc. On the other hand, it requires asking interesting questions to guide the investigation and interpreting numbers & figures to generate useful insights.

What is the S&P 500?

The S&P 500 is a market-capitalization-weighted stock market index that tracks the stock performance of about 500 of some of the largest U.S. public companies.

Investors and economists use the S&P 500 as a benchmark for the overall U.S. stock market and the U.S. economy as a whole. The S&P 500 is a key economic indicator because it largely reflects investors’ collective expectations for the future, unlike other economic data that reflects economic conditions in the present or recent past. When investors or economists discuss the performance of the U.S. “stock market,” they commonly refer to the S&P 500 as shorthand. That’s the reason I am using ‘S&P 500’ to represent the US stock market and we will do a comparative analysis against Gold.

How to run the code

You can execute the code using “Google Colab” or “Run Locally”

The code is available on Github: https://github.com/vinodvidhole/gold_vs_sp500_eda_project

Setup and Tools

Run on Colab : You will need to provide the Google login to run this notebook on Colab.
Run Locally : Download and install Anaconda framework. We will be using Jupyter Notebook for writing & executing code.

Outline of Project

Here is the outline of the project

  • Import and Install the required libraries
  • Download real-world dataset
  • Perform data Preprocessing & cleaning
  • Perform Exploratory analysis & Visualization
  • Ask and answer Interesting questions about the data
  • Summarize your inferences & write a conclusion

Import and Install the required libraries

The first step is to install & Import required Python Libraries.

Download real-world dataset

We are going to use historical data available in Yahoo! Finance. The following 2 urls will point us to the required dataset.

GOLD : https://finance.yahoo.com/quote/GC%3DF/history?period1=967593600&period2=1660003200&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true

S&P 500 : https://finance.yahoo.com/quote/%5EGSPC/history?period1=-1325635200&period2=1660003200&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true

You can use web scraping techniques shown here to download this data. But in this tutorial we are using Python module yahoofinancials to get data into the pandas DataFrame.

Let’s create a helper function to download data into Pandas DataFrame

Let’s begin by downloading the data, we will call the helper function for Gold and S&P 500

The dataset has been downloaded into Pandas DataFrame sp500_df & gold_df. Both the dataset has 8 columns and around 5.5 K rows each, so total 11 K rows.

Perform Data Preprocessing and Cleaning

This is one of the most Important stage in Data Science. Real world data could be Incomplete, Noisy or Inconsistent and this could affect negatively in the further steps.

Remove unwanted Columns

Most of the time Datasets may contain columns which may not be useful for the Analysis, in that case we should remove the unused columns. This would save memory usage. This would be handy when we deal with huge datasets of the size of GBs.

We can get rid of date column from our dataset as we already have formatted_date

Transform the DataType

By default, Pandas uses large data types like int64 and float64 for numerical data and object for datetime. Depending on the situation we can use smaller data types like int32, float32, int16 to save memory. Also, we should parse the date field to date time data type.

Let’s check if there is any scope of data type modifications.

formatted_date column represents the date but the date type is object, Lets convert this into the DateTime.

Handle Missing Values

There are several techniques to handle missing data. Choosing the right one is of the most importance. The choice of technique to deal with missing data depends on the problem statement & context of the data.

Let’s check if we have any missing values.

S&P 500 Dataset doesn’t have any missing values.

In the Gold Dataset for 84 items data is missing across 6 columns, Let’s check if there is a sequence or pattern of missing data and columns.

Looks like there are total 84 rows in which data is missing for all columns except date.

Now to handle the missing values there are lots of techniques, but since this is daily stock price data we can use forward fillna method, meaning we will replace null value with next not null item from the same column.

Duplicate data

If required, duplicate rows can be removed using the pd.drop_duplicates method. But this should be done carefully. Sometimes duplicate data might be valid data (e.g. Runs Scored in any game).

Our Dataset doesn’t have any duplicates.

Set Proper Index

This may or may not be applicable in all cases, I am setting formatted_date as Index of the DataFrame.

Create additional Required Columns

This may or may not be applicable in all cases. In our analysis for identification purposes, I am creating a new column asset_class in both Data Frame. This would be useful when we concatenate both the Data frames.

Computing year , month etc. from formatted_date (Index Column) and adding it as a separate column.

This is how the final data will look like.

Data Preprocessing and Cleaning is done, let’s move ahead with the next topic.

Perform Exploratory Analysis and Visualization

In this phase we will check each column and try to find some useful insides using Visualization. Let’s begin by importing mattplotlib.pyplot, seaborn & plotly.express

Analyze Closing Price

Here, first I am combining both ‘S&P 500’ & ‘Gold’ data and then we will use a line chart to study the trend of Historical closing price.

By looking at the Closing Price trend you can see that the price of Gold and S&P 500 has increased in this century. Although there are some instances where prices are declining, overall prices have increased as compared to the start of this century. It would be interesting to check the time-line and corresponding events when there is a price fall, we will cover it in the next section.

Analyze Trade Volume

Let’s check the volume field, what it has to offer. Here I am creating one more data frame volume_df This will provide us the monthly sum of the volume for Gold and S&P 500.

You can clearly observe the S&P 500 is more popular than Gold. There is Literally No Comparison in terms of volume. The S&P 500 is the winner in this category.

Volume of S&P 500 is in Billions & Volume of Gold is in Thousands and In the above visualization x & y axis is shared hence we could not visualize the gold trend properly.

Lets try to visualize them Individually.

It seems events like the 2008 Crash & Pandemic tend to increase the trade volume of the Stock Market, but the same is not true for Gold.

Analyze the distribution

Now let’s try to find the min and max values of various columns, Pandas provides a method df.describe() which provides descriptive statistics that summarize the central tendency, dispersion and shape of a dataset distribution.

But isn’t it nice to Visualize it, let’s see if we can show this in a Plot.

Above Box plot displays Min, Max, Median & Quartiles of Closing price for S&P 500 & Gold. We can easily compare the data by placing box plots side by side.

It’s interesting to notice that the Average Price of Gold and S&P 500 is almost the same, however Prices are fluctuating more in the Stock market vs Gold. The Gold Prices are more stable than the S&P 500 in this century.

Calculate & Analyze Returns

Stock daily returns indicate the gain or loss per day for a given stock. We get it by subtracting the opening price from the closing price. In Pandas there is a built-in function available pct_change() to calculate daily returns.

I have used adjclose column to calculate the returns.

Lets plot the graph for the Daily Returns.

Looking at the daily returns chart we can conclude that the Stocks returns are quite volatile and it can move more than 5% on any given day. To get a sense of the distribution of the returns we can plot a histogram.

Returns for both Gold and S&P 500 are normally distributed, so a daily buy-sell strategy may not produce high profits.

We can also see that Gold has a narrow distribution as compared to the S&P 500 has a wider distribution this indicates more volatility.

Moving Average

A “Moving Average” is an indicator which removes the “noise” from a chart by smoothing it. It makes it easier to see a pattern forming over time and helps predict future prices. This is commonly used in technical analysis.

Here I am going to show the Simple Moving Average. Below function will add moving average columns.

Lets calculate 1 month & 1 Year moving average

Plotting Moving Average vs Closing Price.

The moving average helps to level the price data over a specified period by creating a constantly updated average price.

Let’s move onto the next section.

Ask and Answer Interesting Questions

In this section based on the Visualization and data we will try to answer some interesting questions.

Q1: What was the impact on prices during COVID 19 Outbreak, Does gold crash with the stock market?

We will use classic candlestick plot to visualize this event

Looks like S&P 500 crashed when COVID 19 outbreak started but after certain point prices keep on increasing at higher rate

There was no major impact on Gold prices, This shows that the gold did not crash with the stock market.

Q2: Is there any Correlation between S&P 500 and Gold

Lets prepare data in the required form, we will try to find correlation between daily returns of Gold and S&P 500.

If we need to check the correlation between 2 columns we can use the built-in .corr() function. If there is a perfect correlation then this will return 1.

In the above example to show perfect correlation I used the same column. Now let’s try to find a correlation between S&P 500 Returns & Gold Returns

Negative value indicate no or very little correlation.

Now let’s try to visualize this.

By looking at the red graph we can say that there is no correlation between S&P 500 & Gold. And we should apply independent investment strategies for both.

Q3: Does Gold Beat S&P 500?

We will use cumulative returns to find this

Lets plot this in a graph

Looks like growth of Gold is way more than S&P 500, and it would not be wrong to say that Gold has beaten S&P 500 so far in this century.

Q4: Calculate all Year end to Date Returns

We will calculate the returns from each year end to the current date.

Wow… this is a very interesting finding. Although overall this century Gold has outperformed the S&P 500, it seems after the first 6 years (2007 onwards) the S&P 500 started to lead the race. In simple terms if someone invests equal amounts in Gold and S&P 500 back in 2000 by 2022 his Gold portfolio would be more than S&P 500, but this would be the opposite if someone invests after 2007.

2020 & 2021 seems to be a bad time for someone to start investing in both assets.

Q5: Which asset is more Volatile

Volatility is the variation in its price over a period of time, we will calculate Daily, Monthly & Annual volatility.

Daily volatility: The standard deviation of the daily returns.

Monthly volatility: Assuming 21 trading days in the month so we multiply the daily volatility by the square root of 21.

Annual volatility: Similarly let’s assume 252 trading days in a calendar year and we multiply the daily volatility by the square root of 252.

Lets try plot Rolling Annual Volatility

Looking at the graph & calculations you can see S&P 500 is more volatile than Gold but at some point Gold was more volatile.

Inferences

From the above exploratory data analysis I have drawn many inferences and here’s a brief summary.

(This summary is prepared based on data captured from 08–31–2000 to 08–12–2022)

  • Overall prices of Gold and S&P 500 have increased throughout this century.
  • 2008 Crash & Pandemic tend to increase in the trade volume for the Stock Market, but the same is not true for Gold. S&P 500 seems to be more popular than Gold
  • Prices are fluctuating more in the Stock market vs Gold. Gold Price is more stable than S&P 500.
  • Stock returns are quite volatile and it can move more than 5% on any given day.
  • Returns for both Gold and S&P 500 are normally distributed, so a daily buy-sell strategy may not produce high profits.
  • S&P 500 is more volatile than Gold.
  • Gold does not crash with the stock market.
  • There is no correlation between S&P 500 & Gold. And we should apply independent investment strategies for both.
  • Gold has Beaten S&P 500 so far in this century.
  • Year 2000–2006 Gold has dominated over S&P 500 but from 2007–2019 S&P 500 leads the race.
  • The years 2020 & 2021 seem to be a bad time for someone to start investing in Gold and S&P 500.

Future Work

Ideas for future work

  • You can perform a similar EDA to compare Gold, Crypto currency or any Individual stock.
  • You can try to predict future prices using this EDA and applying different Machine learning models.

Conclusion

Hope I was able to teach you EDA methods, and now you can use this knowledge in your future projects.

If you have any questions or feedback, feel free to post a comment or contact me on LinkedIn. Thank you for reading and if you liked this post, please consider following me. Until next time… Happy coding !!

--

--

Vinod Dhole

Sr. Technical Architect | Machine Learning | Python