DATA SCIENCE

Web Scraping Yahoo! Finance using Python

A detailed guide for web scraping https://finance.yahoo.com using Requests, BeautifulSoup, Selenium, HTML tags, and embedded JSON data.

Vinod Dhole

--

Table Of Contents
· Introduction
What is “web scraping”?
Objective
The problem statement
Prerequisites
How to run the Code
Setup and Tools
· 1. Web Scraping Stock Market News
1.1 Download & Parse webpage using Requests and BeautifulSoup
1.2 Exploring and locating Elements
1.3 Extract & Compile the information into a Python list
1.4 Save the extracted information to a CSV file
· 2. Web Scraping Cryptocurrencies
2.1 Introduction to Selenium
2.2 Download & Set-up
2.3 Install & Import libraries
2.4 Create Web Driver
2.5 Exploring and locating Elements
2.6 Extract & Compile the information into a Python list
2.7 Save the extracted information to a CSV file
· 3. Web Scraping Market Events Calendar
3.1 Install & Import libraries
3.2 Download & Parse web page
3.3 Get Embedded JSON data
3.4 Locating JSON Keys
3.5 Pagination & Compiling the information into a Python list
3.6 Save the extracted information to a CSV file
· References
· Future Work
· Conclusion

Introduction

What is “web scraping”?

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It’s a useful technique for creating datasets for research and learning.

Objective

The main objective of this tutorial is to showcase different web scraping methods that can be applied to any web page. This is for educational purposes only. Please read a website's terms and conditions carefully to see whether you can legally use its data.

In this project, we will perform web scraping using the following 3 techniques:

  • Use Requests, BeautifulSoup, and HTML tags to extract web page content.
  • Use Selenium to scrape data from dynamically loading websites.
  • Use embedded JSON data to scrape the website.

The problem statement

  1. Web Scraping Stock Market News (URL: https://finance.yahoo.com/topic/stock-market-news/)
    This web page shows the latest news related to the stock market. We will try to extract data from this web page and store it in a CSV (Comma-separated values) file. The file layout would be as mentioned below.

2. Web Scraping Cryptocurrencies (URL: https://finance.yahoo.com/cryptocurrencies)
This Yahoo! Finance web page shows a list of trending cryptocurrencies in table format. We will perform web scraping to retrieve the first 10 columns for the top 100 cryptocurrencies in the following CSV format.

3. Web Scraping Market Events Calendar (URL: https://finance.yahoo.com/calendar)
This page shows date-wise market events. Users can select a date and choose any one of the following market events: Earnings, Stock Splits, Economic Events & IPO. Our aim is to create a script that can be run for any single date and market event, grabs the corresponding data, and loads it into a CSV file as shown below.

Prerequisites

  • Basic knowledge of Python
  • Basic knowledge of HTML (helpful, but not required)

How to run the Code

You can execute the code on Google Colab or run it locally.

The code is available on Github: https://github.com/vinodvidhole/yahoo-finance-scraper

Setup and Tools

  • Run on Google Colab: You will need to sign in with a Google account.
  • Run on Local Machine: Download and install the Anaconda framework; we will use Jupyter Notebook for writing and executing the code.

1. Web Scraping Stock Market News

In this section, we will learn a basic Python web scraping technique using Requests, BeautifulSoup, and HTML tags. The objective here is to perform web scraping of Yahoo! Finance Stock Market News.

Let’s kick-start with the first objective. Here’s an outline of the steps we’ll follow.

1.1 Download & Parse web page using Requests & BeautifulSoup
1.2 Exploring and locating Elements
1.3 Extract & Compile the information into a Python list
1.4 Save the extracted information to a CSV file

1.1 Download & Parse webpage using Requests and BeautifulSoup

The first step is to install the requests and beautifulsoup4 libraries using pip.

To download the web page we can use the requests.get() function which returns a response object. This object contains the data from the web page and some other information.

The response.ok and response.status_code attributes can be used for error trapping and tracking.

We can get the contents of the page using response.text.

Finally, we can use BeautifulSoup to parse the HTML data. This returns a bs4.BeautifulSoup object, which lets us get hold of the required data using the various methods BeautifulSoup offers. We will learn some of these methods in the next subsection.

Let’s put all this together into a function.
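A minimal sketch of such a function might look like this (the browser-like User-Agent header and the exception message are illustrative choices, not part of the original code):

    import requests
    from bs4 import BeautifulSoup

    def get_page(url):
        """Download a web page and return it as a parsed BeautifulSoup object."""
        # A browser-like User-Agent is an assumption; some sites block the
        # default python-requests agent.
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers)
        # response.ok / response.status_code are used for error trapping
        if not response.ok:
            raise Exception(f'Failed to load page {url} (status {response.status_code})')
        # Parse the HTML content of the response
        return BeautifulSoup(response.text, 'html.parser')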

Calling the get_page function and analyzing the output.

You can access different properties, data, and images of an HTML web page using methods like .find(), .find_all(), etc. The following example displays the title of the web page.
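For instance, assuming doc is the object returned by get_page for the news page, a quick sanity check might be:

    doc = get_page('https://finance.yahoo.com/topic/stock-market-news/')
    print(doc.find('title').text)   # prints the page title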

We can use the function get_page to download any web page and parse it using beautiful soup.

1.2 Exploring and locating Elements

Now it’s time to explore the elements linked to the required data points from the web page. Web pages are written in a language called HTML (HyperText Markup Language). HTML is a fairly simple language composed of tags (also called nodes or elements), e.g. <a href="https://finance.yahoo.com/" target="_blank">Go to Yahoo! Finance</a>. An HTML tag has three parts:

  1. Name: (html, head, body, div, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
  2. Attributes: (href, target, class, id, etc.) Properties of the tag, used by the browser to customize how the tag is displayed and decide what happens on user interactions.
  3. Children: A tag can contain some text or other tags or both between the opening and closing segments, e.g., <div>Some content</div>.

Let’s inspect the source code of the web page by right-clicking and selecting the “Inspect” option. First, we need to identify the tag that represents a news listing.

In this case, we can see that the <div> tag with the class name "Ov(h) Pend(44px) Pstart(25px)" represents a news listing. We can use the find_all function to grab these tags, as sketched below.
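A hedged sketch of this lookup (the obfuscated class string is what the markup showed at the time of writing and may change):

    def get_news_tags(doc):
        """Return the list of <div> tags that hold individual news items."""
        # Class name observed while inspecting the page; adjust if Yahoo changes it.
        news_class = 'Ov(h) Pend(44px) Pstart(25px)'
        return doc.find_all('div', {'class': news_class})

    news_tags = get_news_tags(doc)
    print(len(news_tags))   # compare with the number of news items on the page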

The total number of elements in the <div> tag list matches the number of news items displayed on the web page, so we are heading in the right direction.

The next step is to inspect a single <div> tag and dig for more information. I am using Visual Studio Code, but you can use any tool, even a plain text editor.

I copied the above output, pasted it into Visual Studio Code, and then identified the tags and properties representing the news source, headline, news content, etc.

Luckily, most of the required data points are available in one single <div> tag, so now we can use the find method to grab each item.

If any tag is not accessible directly, then you can use methods like findParent() or findChild() to point to the required tag.

The key takeaway from this exercise is to identify the optimal tag/element that provides the required information. This is mostly straightforward, but sometimes you will have to do a little more research.

1.3 Extract & Compile the information into a Python list

We’ve identified all the required tags and information. Let’s put this together in a helper function.

We will create one more function to parse an individual <div> tag and return the information as a dictionary, as sketched below.
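A sketch of that parser, assuming the source, headline, summary, and link sit in the child tags observed during inspection (the exact child tags are assumptions and may need adjusting):

    def parse_news(news_tag):
        """Parse one news <div> tag and return its details as a dictionary."""
        # Child tags below reflect the markup seen while inspecting the page:
        # a <div> for the source, an <a> for the headline/link, a <p> for the summary.
        anchor = news_tag.find('a')
        return {
            'source': news_tag.find('div').text,
            'headline': anchor.text,
            'content': news_tag.find('p').text,
            'url': 'https://finance.yahoo.com' + anchor['href'],
        }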

1.4 Save the extracted information to a CSV file

This is the last step. We are going to use the Python library pandas to save the data in CSV format. Let’s install and then import pandas.

Let’s create a wrapper function. The first step is to use the get_page function to download the HTML page; then we pass the output to get_news_tags to identify the list of <div> tags for news.

After that, we use a list comprehension to parse each <div> tag with parse_news; the output is a list of dictionaries.

Finally, we use the pd.DataFrame() constructor to create a pandas DataFrame and its to_csv() method to store the required data in CSV format.
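A sketch of the wrapper, with the output filename as an illustrative default:

    import pandas as pd

    def scrape_yahoo_news(url, path='stock-market-news.csv'):
        """Download the news page, parse every news <div>, and save a CSV file."""
        doc = get_page(url)                                  # 1. download & parse
        news_list = get_news_tags(doc)                       # 2. locate news <div> tags
        news_data = [parse_news(tag) for tag in news_list]   # 3. parse each tag
        df = pd.DataFrame(news_data)                         # 4. build a DataFrame
        df.to_csv(path, index=False)                         # 5. write the CSV file
        return df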

Scraping the news using the scrape_yahoo_news function.

The “stock-market-news.csv” file should be available in the File → Open menu. You can download the file or open it directly in a browser. Please verify the file content and compare it with the actual information available on the web page.

You can also check the data by grabbing a few rows from the data frame returned by the scrape_yahoo_news function.

Summary: Hopefully I was able to explain this simple but very powerful Python technique for scraping Yahoo! Finance market news. These steps can be used to scrape any web page; you just have to do a little research to identify the required tags/elements and use the relevant Python methods to collect the data.

2. Web Scraping Cryptocurrencies

In phase one we were able to scrape the Yahoo! Finance market news web page. However, as you may have noticed, more news items appear at the bottom of the page as you scroll down. This is called dynamic page loading. The previous technique is a basic Python method useful for scraping static content; to scrape dynamically loaded data, we will use a different method: web scraping with Selenium. Let’s move ahead with this topic. The goal of this section is to extract the top-listed cryptocurrencies from Yahoo! Finance.

Here’s an outline of the steps we’ll follow.

2.1 Introduction to Selenium
2.2 Download & Set-up
2.3 Install & Import libraries
2.4 Create Web Driver
2.5 Exploring and locating Elements
2.6 Extract & Compile the information into a Python list
2.7 Save the extracted information to a CSV file

2.1 Introduction to Selenium

Selenium is an open-source browser automation tool. Python and other languages can be used with Selenium for testing as well as web scraping. Here we will use the Chrome browser, but you can try it with any browser.

You can find the official Selenium documentation here

Why should you use Selenium? It lets you automate browser actions such as:

  • Clicking on buttons
  • Filling forms
  • Scrolling
  • Taking a screenshot
  • Refreshing the page

The following methods are helpful for finding elements on a web page (these methods return a list; if you are looking for a single element, just remove the ‘s’ from the method name, e.g. find_element_by_<…>):

  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector

In this tutorial we will locate elements only by XPath and by tag name (find_elements_by_xpath and find_elements_by_tag_name, or their find_elements(By.XPATH, …) and find_elements(By.TAG_NAME, …) equivalents in newer Selenium releases). You can find complete documentation of these methods here

2.2 Download & Set-up

In this section we’ll do some prep work to implement this method: we need to install Selenium and the appropriate web browser driver.

Google Colab:

If you are using the Google Colab platform, execute the following code to perform the initial installation. The expression 'google.colab' in str(get_ipython()) is used to detect whether the code is running on Google Colab.
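A commonly used Colab setup sketch is shown below; the package names reflect the Ubuntu image Colab provided at the time of writing and may need updating:

    # Detect whether this notebook is running on Google Colab
    if 'google.colab' in str(get_ipython()):
        # Install Selenium plus a Chromium build and matching driver (assumed package names)
        !pip install selenium --quiet
        !apt-get update --quiet
        !apt install chromium-chromedriver --quiet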

Local Machine:

To run it locally, you will need ChromeDriver (the WebDriver for Chrome) on your machine. You can download it from https://chromedriver.chromium.org/downloads and simply copy the driver into the folder where you will execute the Python file (no installation needed). Make sure the driver version matches the version of the Chrome browser installed on your machine.

2.3 Install & Import libraries

Installation of the required libraries.

Once the libraries are installed, the next step is to import all the required modules. Note that for the local platform we need to import a few additional modules.
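The imports used in the rest of this section typically look like the following (grouped here for brevity):

    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC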

So all the necessary prep work is done. Let’s move ahead to implement this method.

2.4 Create Web Driver

In this step, we first create an instance of Chrome WebDriver using the webdriver.Chrome() method. After that, the driver.get() method loads the page at the given URL. Here too, there is a slight variation based on the platform.

We have used a few Options parameters; for example, the --headless option runs the browser in the background. Check this link for more details.
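A sketch of the driver factory, assuming Chrome and a headless session (the extra flags are common choices for notebook environments, not requirements):

    def get_driver(url):
        """Create a headless Chrome WebDriver and open the given URL."""
        options = Options()
        options.add_argument('--headless')            # run the browser in the background
        options.add_argument('--no-sandbox')          # often needed on Colab
        options.add_argument('--disable-dev-shm-usage')
        driver = webdriver.Chrome(options=options)
        driver.get(url)                               # load the requested page
        return driver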

Test run of get_driver

2.5 Exploring and locating Elements

This step is similar to what we did in phase one: we identify the relevant information, such as tags, class names, and XPaths, from the web page.

Get Table Headers (Column names):

Right-click and select "Inspect" to analyze the page further. Since the web page shows cryptocurrency information in a table, we can grab the table header using the <th> tag. Let’s use find_elements by tag name to get the table headers; these will represent the columns of the CSV file.

We create a helper function to get the first 10 columns from the header, using a list comprehension with a condition. Note the use of the enumerate function.
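A sketch of that helper:

    def get_table_headers(driver):
        """Return the first 10 table header names as a list of strings."""
        header_tags = driver.find_elements(By.TAG_NAME, 'th')
        # enumerate gives the position; keep only the first 10 headers
        return [tag.text for i, tag in enumerate(header_tags) if i < 10]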

Get Table Row count:

Next, we will find the number of rows available on a page. Table rows are placed in <tr> tags, and here we will use XPath to find them. We can capture the XPath by selecting the <tr> tag and then Right Click → Copy → Copy XPath.

We get the XPath value //*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]. Let's use it with find_element() and By.XPATH.

The above XPath points to the first row. If we remove the row index from the XPath and use it with find_elements, we get hold of all the available rows. Let's implement this as a function; check out the XPath variations and the output in both examples.
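A sketch of the row-count helper, using the copied XPath without the trailing row index:

    def get_table_rows(driver):
        """Return the number of data rows currently loaded in the table."""
        # Same XPath as above, minus the [1] row index, so it matches every row
        rows = driver.find_elements(By.XPATH, '//*[@id="scr-res-table"]/div[1]/table/tbody/tr')
        return len(rows)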

Get Table Column data:

Similarly, we can capture the XPath for any column value.

This is the XPath of a column: //*[@id="scr-res-table"]/div[1]/table/tbody/tr[1]/td[2]

Note that the number after tr & td represents the row_number and column_number. Let’s verify this with the find_element() method.

We can change row_number and column_number in the XPath and loop through the row and column counts to get the required column values. Let's generalize this and put it in a function that fetches one row at a time and returns the column values as a dictionary.
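A sketch of that row parser; it assumes the first 10 <td> cells line up with the first 10 headers:

    def parse_table_rows(row_num, driver, header_list):
        """Return one table row as a {header: cell value} dictionary."""
        row_dict = {}
        for col_num, header in enumerate(header_list, start=1):
            # tr[...] selects the row, td[...] selects the column (both 1-based)
            xpath = f'//*[@id="scr-res-table"]/div[1]/table/tbody/tr[{row_num}]/td[{col_num}]'
            row_dict[header] = driver.find_element(By.XPATH, xpath).text
        return row_dict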

Pagination:

The Yahoo! Finance web page shows only 25 cryptocurrencies per page, and users have to click the Next button to load the next set of cryptocurrencies. This is called pagination.

This is the main reason we are using Selenium: it can tackle scenarios like pagination. With the help of Selenium you can perform multiple actions on a web page, such as clicking, scrolling, and refreshing. The possibilities are endless, which makes this tool very powerful for web scraping.

Now we will grab the XPath of the Next button.

Then we get the element for the Next button using the find_element method, and after that we can perform the click action using the .click() method.
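A sketch of the click, using the Next button XPath that also appears in the wait example later in this section:

    next_button = driver.find_element(By.XPATH, '//*[@id="scr-res-table"]/div[2]/button[3]')
    next_button.click()   # loads the next set of 25 cryptocurrencies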

Now we check the first row on the web page to verify whether .click() really worked; you will see that the first row has changed, so the click action was successful.

In this section we learned how to get required data points, and how to perform events / actions on the web page.


2.6 Extract & Compile the information into a Python list

Let’s put all the pieces of the puzzle together. We will pass the total number of rows to be scraped (in this case 100) as an integer argument (total_crypto). After that, we parse each row of the page and append the data to a list until the number of parsed rows matches total_crypto. In addition, we click the Next button whenever we reach the last row of the current page.

Note: To identify the Next button element here, we have used the WebDriverWait class instead of the find_element() method. With this technique we can wait for some time before grabbing the element, which helps avoid a StaleElementReferenceException.

Code Sample:

element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="scr-res-table"]/div[2]/button[3]')))
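Building on that, a sketch of the full collection loop might look like this:

    def get_crypto_data(driver, total_crypto, header_list):
        """Parse rows across pages until total_crypto rows have been collected."""
        table_data = []
        page_row = 1
        while len(table_data) < total_crypto:
            table_data.append(parse_table_rows(page_row, driver, header_list))
            page_row += 1
            # Past the last row of the current page? Click Next and restart at row 1.
            if page_row > get_table_rows(driver) and len(table_data) < total_crypto:
                next_button = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located(
                        (By.XPATH, '//*[@id="scr-res-table"]/div[2]/button[3]')))
                next_button.click()
                page_row = 1
        return table_data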

2.7 Save the extracted information to a CSV file

This is the last step of this section. We create a placeholder function that calls all the previously created helper functions and then saves the data in CSV format using pandas' to_csv() method.
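A sketch of that placeholder (the URL argument and filename are illustrative defaults):

    def scrape_yahoo_crypto(url, total_crypto=100, path='crypto-currencies.csv'):
        """Scrape the top cryptocurrencies and save them to a CSV file."""
        driver = get_driver(url)
        header_list = get_table_headers(driver)
        crypto_data = get_crypto_data(driver, total_crypto, header_list)
        driver.quit()                        # always close the browser session
        df = pd.DataFrame(crypto_data)
        df.to_csv(path, index=False)
        return df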

Time to scrape some cryptos! We will scrape the top 100 cryptocurrencies on the Yahoo! Finance web page by calling scrape_yahoo_crypto.

The “crypto-currencies.csv” file should be available in the File → Open menu. You can download the file or open it directly in a browser. Please verify the file content and compare it with the actual information available on the web page.

You can also check the data by grabbing a few rows from the data frame returned by the scrape_yahoo_crypto function.

Summary: I hope you’ve enjoyed this tutorial. Selenium enables us to perform multiple actions in the web browser, which is very handy for scraping different types of data from any web page.

3. Web Scraping Market Events Calendar

This is the final segment of the tutorial. In this section, we will learn how to extract embedded JSON-formatted data from an HTML page, which can be easily converted to a Python dictionary. The problem statement for this section is to scrape date-wise market events from Yahoo! Finance.

Here’s an outline of the steps we’ll follow.

3.1 Install & Import libraries
3.2 Download & Parse web page
3.3 Get Embedded JSON data
3.4 Locating JSON Keys
3.5 Pagination & Compiling the information into a Python list
3.6 Save the extracted information to a CSV file

3.1 Install & Import libraries

The first step is to install and import the required Python libraries.

3.2 Download & Parse web page

This is exactly the same step that we performed to download the web page in section 1.1, except that here we pass custom headers to requests.get().

Most of the details are explained in section 1.1. Let’s create the helper function.
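A sketch of the helper with custom headers (the header values are illustrative):

    import requests
    from bs4 import BeautifulSoup

    def get_page(url):
        """Download a web page with browser-like headers and return a parsed document."""
        # Custom headers: a browser-like User-Agent helps avoid basic blocking
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            'Accept-Language': 'en-US,en;q=0.9',
        }
        response = requests.get(url, headers=headers)
        if not response.ok:
            raise Exception(f'Failed to load page {url} (status {response.status_code})')
        return BeautifulSoup(response.text, 'html.parser')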

3.3 Get Embedded JSON data

In this step we will locate the JSON-formatted data that stores all the information displayed on the web page.

Open the web page and do Right Click → View Page Source. If you scroll down the source page, you will find the JSON-formatted data; it is available inside a <script> tag. Fortunately, there is a very convenient way to grab this tag: locate the text /* -- Data -- */ in the page source.

We will use regular expressions to get the text inside the <script> tag.

The next step is to grab the JSON-formatted string from the <script> tag. Here we print the first and last 150 characters of the <script> tag.

On further analysis we can see that the JSON string starts at the context key and ends 12 characters before the end of the tag text, so we can grab it using Python slicing.

Lastly, we use the json.loads() method to convert the JSON string into a Python dictionary. Now let's wrap this logic in a function.
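A sketch of that function, assuming the /* -- Data -- */ marker and the trailing 12 characters behave as described above:

    import json
    import re

    def get_json_data(doc):
        """Extract the embedded JSON from the page and return it as a dictionary."""
        # Locate the <script> tag containing the "/* -- Data -- */" marker
        script_text = doc.find('script', string=re.compile('-- Data --')).text
        # The JSON string starts at the "context" key and ends 12 characters
        # before the end of the script text (as observed in the page source)
        start = script_text.find('{"context"')
        json_string = script_text[start:-12]
        return json.loads(json_string)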

3.4 Locating JSON Keys

The JSON text is a multi-level nested dictionary, and certain keys store all the data displayed on the web page. In this section we will identify the keys for the data we are trying to scrape.

We’ll need a JSON formatter tool to navigate through the nested keys. I am using the online tool https://jsonblob.com/, but you can choose any tool.

To simplify the analysis, we write the JSON text to a my_json_file.json file. After that, copy the file content and paste it into the left panel of JSON Blob, which formats it neatly. We can then easily navigate through the keys and search for any item.

The next step is to find the location of the required keys. Let’s search the JSON text for the company name 3D Systems Corporation from the web page using the JSON Blob formatter.

You can see that the table data is stored in the rows key, and we can track down its parent keys as shown in the screenshot above; check out the content of the rows key.

This sub-dictionary holds all the data displayed on the current page. You can do more research and exploration to get other useful information from the web page; a few examples are shown below.

Let's put this into helper functions.
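A sketch of those helpers; the key path below (context → dispatcher → stores → ScreenerResultsStore → results) is what the JSON inspection suggested at the time of writing and should be treated as an assumption:

    def get_store_results(json_dict):
        """Return the results sub-dictionary that holds the table data."""
        # Assumed key path observed via the JSON formatter; adjust if it changes
        return json_dict['context']['dispatcher']['stores']['ScreenerResultsStore']['results']

    def get_total_rows(json_dict):
        """Return the total number of rows reported for the current criteria."""
        return get_store_results(json_dict)['total']

    def get_rows(json_dict):
        """Return the list of row dictionaries shown on the current page."""
        return get_store_results(json_dict)['rows']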

3.5 Pagination & Compiling the information into a Python list

In the previous section we handled pagination using Selenium; here we'll learn a different technique for accessing multiple pages.

Most of the time, the web page URL changes at runtime depending on user actions. For example, in the screenshot below, I selected Earnings for 1-March-2022; notice how that information is passed in the URL.

Similarly, when I click the Next button, the offset and size values change in the URL.

So we can figure out the pattern and structure of the URL and how it affects page navigation.

In this case the web page URL pattern is mentioned below:

  • The following values are used for the calendar event types: event_types = ['splits','economic','ipo','earnings']
  • The date is passed in yyyy-mm-dd format
  • The page number is controlled by the offset value (offset=0 for the first page)
  • The maximum number of rows on a page is assigned to size

Based on the above information, we can build the URL at runtime, download the page, and then extract the information. This is how we handle pagination.
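A sketch of a URL builder based on that pattern (the query parameter names mirror what the address bar showed):

    def build_url(event_type, date, offset=0, size=100):
        """Build a Yahoo! Finance calendar URL for one event type, date, and page."""
        # The event type is part of the path; day, offset and size are query parameters
        return (f'https://finance.yahoo.com/calendar/{event_type}'
                f'?day={date}&offset={offset}&size={size}')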

Let’s create a function that takes event_type and date, and calculates the total number of rows for the matching criteria using the get_total_rows function. Since the maximum number of rows per page is constant (100), we can iterate page by page, incrementing the offset, and extract each page's data in a loop.
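A sketch of that pagination loop, built on the helpers above:

    def get_event_data(event_type, date):
        """Scrape every page of one event type for one date."""
        event_data = []
        offset, size = 0, 100                      # Yahoo shows at most 100 rows per page
        doc = get_page(build_url(event_type, date))
        total_rows = get_total_rows(get_json_data(doc))
        while offset < total_rows:
            doc = get_page(build_url(event_type, date, offset, size))
            event_data.extend(get_rows(get_json_data(doc)))   # rows on the current page
            offset += size                                     # move to the next page
        return event_data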

3.6 Save the extracted information to a CSV file

In this last section, we save the data in CSV format using pd.DataFrame() and to_csv(), and call everything from a single placeholder function.
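A sketch of that placeholder, following the event_type_yyyy-mm-dd.csv naming convention:

    import pandas as pd

    def scrape_yahoo_calendar(date):
        """Scrape all four calendar event types for one date and save one CSV per type."""
        for event_type in ['splits', 'economic', 'ipo', 'earnings']:
            event_data = get_event_data(event_type, date)
            df = pd.DataFrame(event_data)
            df.to_csv(f'{event_type}_{date}.csv', index=False)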

Executing the final function scrape_yahoo_calendar

A total of 4 CSV files named “event_type_yyyy-mm-dd.csv” (one per event type) should be available in the File → Open menu. You can download the files or open them directly in a browser. Please verify the file contents and compare them with the actual information available on the web page.

Summary: This is a very useful technique that is easily replicable. Without writing any customized code, we were able to extract data from multiple types of web pages just by changing one variable (in this case, event_type).

References

References to some useful links.

Future Work

Ideas for future work

  • Automate this process using AWS Lambda to download daily market calendar, crypto-currencies & market news in CSV format.
  • Move the old files to an Archive folder, append a date stamp to the file if required, and delete archived files older than two weeks.
  • Process the raw data extracted from the third technique using pandas.

Conclusion

In this tutorial, we have implemented the following web scraping techniques.

  • We have used Requests, BeautifulSoup, and HTML tags to extract data from a web page.
  • We used Selenium to perform clicks on dynamically loading websites and captured the information.
  • We extracted the existing embedded JSON data to scrape a website.

I hope I was able to teach you these web-scraping methods, and now you can use this knowledge to scrape any website.

If you have any questions or feedback, please feel free to post a comment or contact me on LinkedIn. Thank you for reading and if you liked this post, please consider following me. Until next time… Happy coding !!

