Data Science
Has Gold been a Good Investment Over the Stock Market?
An Exploratory Data Analysis project to compare Gold vs S&P 500 index
Index Of Contents
· Introduction
∘ What is EDA?
∘ What is the S&P 500?
∘ How to run the code
∘ Outline of Project
∘ Import and Install the required libraries
∘ Download real-world dataset
∘ Perform Data Preprocessing and Cleaning
∘ Remove unwanted Columns
∘ Transform the DataType
∘ Handle Missing Values
∘ Duplicate data
∘ Set Proper Index
∘ Create additional Required Columns
∘ Perform Exploratory Analysis and Visualization
∘ Analyze Closing Price
∘ Analyze Trade Volume
∘ Analyze the distribution
∘ Calculate & Analyze Returns
∘ Moving Average
∘ Ask and Answer Interesting Questions
∘ Q1: What was the impact on prices during COVID 19 Outbreak, Does gold crash with the stock market?
∘ Q2: Is there any Correlation between S&P 500 and Gold
∘ Q3: Does Gold Beat S&P 500?
∘ Q4: Calculate all Year end to Date Returns
∘ Q5: Which asset is more Volatile
∘ Inferences
∘ Future Work
∘ References
∘ Conclusion
Introduction
In this article we are going to analyze historical prices of Gold and S&P 500 index and try to find some useful insights using Exploratory Data Analysis (EDA).
Disclaimer: The purpose of this article is only to demonstrate the EDA techniques with comparative analysis of Gold and S&P 500. This is not investment advice.
What is EDA?
Exploratory Data Analysis (EDA) is the process of exploring, investigating and gathering insights from data using statistical measures and visualizations. The objective of EDA is to develop an understanding of data, by uncovering trends, relationships and patterns.
EDA is both a science and an art. On the one hand it requires the knowledge of statistics, visualization techniques and data analysis tools like Numpy, Pandas, Seaborn etc. On the other hand, it requires asking interesting questions to guide the investigation and interpreting numbers & figures to generate useful insights.
What is the S&P 500?
The S&P 500 is a market-capitalization-weighted stock market index that tracks the stock performance of about 500 of some of the largest U.S. public companies.
Investors and economists use the S&P 500 as a benchmark for the overall U.S. stock market and the U.S. economy as a whole. The S&P 500 is a key economic indicator because it largely reflects investors’ collective expectations for the future, unlike other economic data that reflects economic conditions in the present or recent past. When investors or economists discuss the performance of the U.S. “stock market,” they commonly refer to the S&P 500 as shorthand. That’s the reason I am using ‘S&P 500’ to represent the US stock market and we will do a comparative analysis against Gold.
How to run the code
You can execute the code using “Google Colab” or “Run Locally”
The code is available on Github: https://github.com/vinodvidhole/gold_vs_sp500_eda_project
Setup and Tools
Run on Colab : You will need to provide the Google login to run this notebook on Colab.
Run Locally : Download and install Anaconda framework. We will be using Jupyter Notebook for writing & executing code.
Outline of Project
Here is the outline of the project
- Import and Install the required libraries
- Download real-world dataset
- Perform data Preprocessing & cleaning
- Perform Exploratory analysis & Visualization
- Ask and answer Interesting questions about the data
- Summarize your inferences & write a conclusion
Import and Install the required libraries
The first step is to install & Import required Python Libraries.
Download real-world dataset
We are going to use historical data available in Yahoo! Finance. The following 2 urls will point us to the required dataset.
You can use web scraping techniques shown here to download this data. But in this tutorial we are using Python module yahoofinancials to get data into the pandas DataFrame.
Let’s create a helper function to download data into Pandas DataFrame
Let’s begin by downloading the data, we will call the helper function for Gold and S&P 500
The dataset has been downloaded into Pandas DataFrame sp500_df
& gold_df
. Both the dataset has 8 columns and around 5.5 K rows each, so total 11 K rows.
Perform Data Preprocessing and Cleaning
This is one of the most Important stage in Data Science. Real world data could be Incomplete
, Noisy
or Inconsistent
and this could affect negatively in the further steps.
Remove unwanted Columns
Most of the time Datasets may contain columns which may not be useful for the Analysis, in that case we should remove the unused columns. This would save memory usage. This would be handy when we deal with huge datasets of the size of GBs.
We can get rid of date
column from our dataset as we already have formatted_date
Transform the DataType
By default, Pandas uses large data types like int64 and float64 for numerical data and object
for datetime. Depending on the situation we can use smaller data types like int32, float32, int16 to save memory. Also, we should parse the date field to date time data type.
Let’s check if there is any scope of data type modifications.
formatted_date column represents the date but the date type is object
, Lets convert this into the DateTime.
Handle Missing Values
There are several techniques to handle missing data. Choosing the right one is of the most importance. The choice of technique to deal with missing data depends on the problem statement & context of the data.
Let’s check if we have any missing values.
S&P 500 Dataset doesn’t have any missing values.
In the Gold Dataset for 84 items data is missing across 6 columns, Let’s check if there is a sequence or pattern of missing data and columns.
Looks like there are total 84 rows in which data is missing for all columns except date.
Now to handle the missing values there are lots of techniques, but since this is daily stock price data we can use forward fillna
method, meaning we will replace null value with next not null item from the same column.
Duplicate data
If required, duplicate rows can be removed using the pd.drop_duplicates
method. But this should be done carefully. Sometimes duplicate data might be valid data (e.g. Runs Scored in any game).
Our Dataset doesn’t have any duplicates.
Set Proper Index
This may or may not be applicable in all cases, I am setting formatted_date
as Index of the DataFrame.
Create additional Required Columns
This may or may not be applicable in all cases. In our analysis for identification purposes, I am creating a new column asset_class
in both Data Frame. This would be useful when we concatenate both the Data frames.
Computing year
, month
etc. from formatted_date
(Index Column) and adding it as a separate column.
This is how the final data will look like.
Data Preprocessing and Cleaning is done, let’s move ahead with the next topic.
Perform Exploratory Analysis and Visualization
In this phase we will check each column and try to find some useful insides using Visualization. Let’s begin by importing mattplotlib.pyplot
, seaborn
& plotly.express
Analyze Closing Price
Here, first I am combining both ‘S&P 500’ & ‘Gold’ data and then we will use a line chart to study the trend of Historical closing price.
By looking at the Closing Price
trend you can see that the price of Gold
and S&P 500
has increased in this century. Although there are some instances where prices are declining, overall prices have increased as compared to the start of this century. It would be interesting to check the time-line and corresponding events when there is a price fall, we will cover it in the next section.
Analyze Trade Volume
Let’s check the volume
field, what it has to offer. Here I am creating one more data frame volume_df
This will provide us the monthly sum of the volume
for Gold and S&P 500.
You can clearly observe the S&P 500 is more popular than Gold. There is Literally No Comparison in terms of volume. The S&P 500 is the winner in this category.
Volume of S&P 500 is in Billions & Volume of Gold is in Thousands and In the above visualization x & y axis is shared hence we could not visualize the gold trend properly.
Lets try to visualize them Individually.
It seems events like the 2008 Crash & Pandemic tend to increase the trade volume of the Stock Market, but the same is not true for Gold.
Analyze the distribution
Now let’s try to find the min and max values of various columns, Pandas provides a method df.describe()
which provides descriptive statistics that summarize the central tendency, dispersion and shape of a dataset distribution.
But isn’t it nice to Visualize it, let’s see if we can show this in a Plot.
Above Box plot displays Min, Max, Median & Quartiles of Closing price for S&P 500 & Gold. We can easily compare the data by placing box plots side by side.
It’s interesting to notice that the Average Price of Gold and S&P 500 is almost the same, however Prices are fluctuating more in the Stock market vs Gold. The Gold Prices are more stable than the S&P 500 in this century.
Calculate & Analyze Returns
Stock daily returns indicate the gain or loss per day for a given stock. We get it by subtracting the opening price from the closing price. In Pandas there is a built-in function available pct_change()
to calculate daily returns.
I have used adjclose
column to calculate the returns.
Lets plot the graph for the Daily Returns.
Looking at the daily returns chart we can conclude that the Stocks returns are quite volatile and it can move more than 5% on any given day. To get a sense of the distribution of the returns we can plot a histogram.
Returns for both Gold and S&P 500 are normally distributed, so a daily buy-sell strategy may not produce high profits.
We can also see that Gold has a narrow distribution as compared to the S&P 500 has a wider distribution this indicates more volatility.
Moving Average
A “Moving Average” is an indicator which removes the “noise” from a chart by smoothing it. It makes it easier to see a pattern forming over time and helps predict future prices. This is commonly used in technical analysis.
Here I am going to show the Simple Moving Average. Below function will add moving average columns.
Lets calculate 1 month & 1 Year moving average
Plotting Moving Average vs Closing Price.
The moving average helps to level the price data over a specified period by creating a constantly updated average price.
Let’s move onto the next section.
Ask and Answer Interesting Questions
In this section based on the Visualization and data we will try to answer some interesting questions.
Q1: What was the impact on prices during COVID 19 Outbreak, Does gold crash with the stock market?
We will use classic candlestick plot to visualize this event
Looks like S&P 500 crashed when COVID 19 outbreak started but after certain point prices keep on increasing at higher rate
There was no major impact on Gold prices, This shows that the gold did not crash with the stock market.
Q2: Is there any Correlation between S&P 500 and Gold
Lets prepare data in the required form, we will try to find correlation between daily returns of Gold and S&P 500.
If we need to check the correlation between 2 columns we can use the built-in .corr()
function. If there is a perfect correlation then this will return 1.
In the above example to show perfect correlation I used the same column. Now let’s try to find a correlation between S&P 500 Returns
& Gold Returns
Negative value indicate no or very little correlation.
Now let’s try to visualize this.
By looking at the red graph we can say that there is no correlation between S&P 500 & Gold. And we should apply independent investment strategies for both.
Q3: Does Gold Beat S&P 500?
We will use cumulative returns to find this
Lets plot this in a graph
Looks like growth of Gold is way more than S&P 500, and it would not be wrong to say that Gold has beaten S&P 500 so far in this century.
Q4: Calculate all Year end to Date Returns
We will calculate the returns from each year end to the current date.
Wow… this is a very interesting finding. Although overall this century Gold has outperformed the S&P 500, it seems after the first 6 years (2007 onwards) the S&P 500 started to lead the race. In simple terms if someone invests equal amounts in Gold and S&P 500 back in 2000 by 2022 his Gold portfolio would be more than S&P 500, but this would be the opposite if someone invests after 2007.
2020 & 2021 seems to be a bad time for someone to start investing in both assets.
Q5: Which asset is more Volatile
Volatility is the variation in its price over a period of time, we will calculate Daily, Monthly & Annual volatility.
Daily volatility: The standard deviation of the daily returns.
Monthly volatility: Assuming 21 trading days in the month so we multiply the daily volatility by the square root of 21.
Annual volatility: Similarly let’s assume 252 trading days in a calendar year and we multiply the daily volatility by the square root of 252.
Lets try plot Rolling Annual Volatility
Looking at the graph & calculations you can see S&P 500 is more volatile than Gold but at some point Gold was more volatile.
Inferences
From the above exploratory data analysis I have drawn many inferences and here’s a brief summary.
(This summary is prepared based on data captured from 08–31–2000 to 08–12–2022)
- Overall prices of Gold and S&P 500 have increased throughout this century.
- 2008 Crash & Pandemic tend to increase in the trade volume for the Stock Market, but the same is not true for Gold. S&P 500 seems to be more popular than Gold
- Prices are fluctuating more in the Stock market vs Gold. Gold Price is more stable than S&P 500.
- Stock returns are quite volatile and it can move more than 5% on any given day.
- Returns for both Gold and S&P 500 are normally distributed, so a daily buy-sell strategy may not produce high profits.
- S&P 500 is more volatile than Gold.
- Gold does not crash with the stock market.
- There is no correlation between S&P 500 & Gold. And we should apply independent investment strategies for both.
- Gold has Beaten S&P 500 so far in this century.
- Year 2000–2006 Gold has dominated over S&P 500 but from 2007–2019 S&P 500 leads the race.
- The years 2020 & 2021 seem to be a bad time for someone to start investing in Gold and S&P 500.
Future Work
Ideas for future work
- You can perform a similar EDA to compare Gold, Crypto currency or any Individual stock.
- You can try to predict future prices using this EDA and applying different Machine learning models.
References
References to some useful links.
- https://github.com/vinodvidhole/gold_vs_sp500_eda_project
- https://medium.com/@vinodvidhole
- https://money.usnews.com/investing/term/sp500
- https://finance.yahoo.com/
- https://pypi.org/project/yahoofinancials/
- https://www.quantconnect.com/tutorials/introduction-to-financial-python/rate-of-return,-mean-and-variance
Conclusion
Hope I was able to teach you EDA methods, and now you can use this knowledge in your future projects.
If you have any questions or feedback, feel free to post a comment or contact me on LinkedIn. Thank you for reading and if you liked this post, please consider following me. Until next time… Happy coding !!