Web Scraping

Q1. What is Web Scraping?
In the simplest terms, web scraping is the process of extracting data from a website and saving it in a form that is easy to read, understand, and work with.

By ‘easy to work with’, we mean that the extracted data can yield useful insights and answer questions that would be hard to answer if the data were not stored in a simple, organized form, typically a CSV file, an Excel file or a database.

Q2. How does web scraping work?

What is Web Scraping

To understand web scraping, it’s important to first understand that web pages are built with text-based mark-up languages — the most common being HTML.

A mark-up language defines the structure of a website’s content. Because mark-up languages share universal components and tags, it is much easier for a web scraper to locate and pull the information it needs.
A scraper first downloads the page’s HTML and parses it. Once the HTML is parsed, the scraper extracts the necessary data and stores it.

Note: Not all websites allow web scraping, especially where users’ personal information is involved, so we should always make sure that we do not collect more than we need or get our hands on information that belongs to someone else.
Websites generally have protections in place, and they may block our access if they see us scraping a large amount of data.

About IMDB

IMDB Homepage

IMDb is an online database of information related to films, television programs, home videos, video games, and streaming content online — including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.
Almost all of us have, at some point, looked up a movie’s or show’s reviews and ratings on IMDb to decide whether to watch it.
As of December 2020, IMDb has approximately 7.5 million titles (including episodes) and 10.4 million personalities in its database, as well as 83 million registered users.

Project Idea

In this project, we will parse IMDb’s Top Rated Movies page to get details about the top-rated movies from around the world.

We will retrieve information from the ‘Top Rated Movies’ page using web scraping: a process of extracting information from a website programmatically.
Web scraping isn’t magic; many readers already gather information by hand every day. For example, a recent graduate may copy and paste information about the companies they applied to into a spreadsheet to manage their job applications. A scraper simply automates that kind of work.

Project Goal

The project goal is to build a web scraper that extracts all the desired information and assembles it into a single CSV file. The format of the output CSV file is shown below:

Desired Output Format

Project steps

Here is an outline of the steps we’ll follow:

  1. Download the webpage using requests
  2. Parse the HTML source code using the BeautifulSoup library and extract the desired information
  3. Build the scraper components
  4. Compile the extracted information into Python lists and dictionaries
  5. Convert the Python dictionaries into Pandas DataFrames
  6. Write the information to the final CSV file
  7. Future work and references

Packages Used

  1. Requests: for downloading the HTML code from the IMDb URL
  2. BeautifulSoup4: for parsing and extracting data from the HTML string
  3. Pandas: for gathering the data into a DataFrame for further processing

Let’s Begin →

We use the Jovian library and its commit() function throughout our work to save our progress as we move along.

Download the webpage using ‘requests’

Q. What is requests?

Requests is a Python HTTP library that lets us send HTTP requests to web servers directly from code, instead of using a browser to communicate with the web.
We use ‘pip’, a package-management system, to install and manage Python packages. You will see many ‘!pip’ commands when we install other packages.

To use the prewritten functions of a library, we use the ‘import’ statement. For example, once we run ‘import requests’ after installation, we can use any function from the ‘requests’ library.

requests.get()

To download a web page, we use requests.get() to send an HTTP request to the IMDb server. The function returns a response object, which represents the HTTP response.
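A minimal sketch of this step, assuming the Top Rated Movies URL listed in the references. The helper name `download_page`, its `get` parameter and the 30-second timeout are illustrative choices, not something taken from the original notebook.

```python
import requests

TOP_MOVIES_URL = "https://www.imdb.com/chart/top"  # listed in the references


def download_page(url, get=requests.get):
    """Send an HTTP GET request to `url` and return the response object.

    `get` is a hypothetical hook: it defaults to requests.get and exists only
    so the function can be tried with a stand-in that skips the network.
    """
    # The timeout keeps the call from hanging if the server never answers.
    return get(url, timeout=30)


# Typical use (performs a real network request):
# response = download_page(TOP_MOVIES_URL)
```

Some sites respond differently to scripted clients than to browsers, so the real response may not match what you see in Chrome.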

Status code

Next, we need to check whether our HTTP request succeeded and we got a valid HTTP response back. We check explicitly because we are NOT using a browser, so we get no direct feedback when a request fails.

In general, the way to check how the server responded is the status code. In the requests library, requests.get returns a response object that contains the page contents and a status code indicating whether the HTTP request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

If the request was successful, response.status_code is set to a value between 200 and 299.
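The 200 to 299 rule can be sketched as a small check. The function names `is_success` and `ensure_success` are hypothetical helpers for illustration.

```python
def is_success(status_code):
    """Return True for any status code in the 2xx (success) range."""
    return 200 <= status_code <= 299


def ensure_success(status_code, url):
    """Raise an error with a readable message if the request failed."""
    if not is_success(status_code):
        raise RuntimeError(f"Failed to load {url} (status code {status_code})")


# Typical use with a requests response object:
# ensure_success(response.status_code, url)
```

The requests library also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses and can be used instead of a hand-written check.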

The HTTP response contains HTML that is ready to be displayed in a browser. We can use response.text to retrieve the HTML document.

Wow! The HTML that we just downloaded in about a second contains roughly 540,000 characters (~5.4 lakh).

  • What we see above is the source code of the web page. It is written in a language called HTML.
  • It defines the content and structure of the web page, which browsers like Chrome then display.

Here, we save the text that we retrieved to an HTML file using a with open(...) statement.
An HTML file named top_rated_movies.html is now created.
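A minimal sketch of saving the page, assuming the HTML text is already in a string (response.text in the real flow). The helper name `save_html` is hypothetical.

```python
def save_html(html_text, filename="top_rated_movies.html"):
    """Write the HTML text to a file and return the number of characters written."""
    # utf-8 avoids encoding errors when titles contain non-ASCII characters.
    with open(filename, "w", encoding="utf-8") as f:
        return f.write(html_text)


# Typical use:
# save_html(response.text)
```

Keeping a local copy means the parsing code can be rerun without downloading the page again.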

top-rated-movies.html created

Parse the HTML source code using Beautiful Soup library

Q. What is Beautiful Soup?

Beautiful Soup is a ‘Python package’ for ‘parsing HTML and XML documents’. Beautiful Soup enables us to get data out of sequences of characters. It creates a parse tree for parsed pages that can be used to extract data from HTML. It’s a handy tool when it comes to web scraping.
You can read more on their documentation site : https://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help

To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library.
Let’s install the library and import ‘the BeautifulSoup class’ from ‘the bs4 module.’

Inspecting the HTML source code of a web page

When we create a BeautifulSoup object, we can specify `html.parser` so that the page is parsed into its components instead of being read as one long string.
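As a small illustration, here is how a BeautifulSoup object can be created from an HTML string. The sample HTML is a made-up stand-in for the downloaded page; the variable name `doc` follows the name used later in this article.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the downloaded page (response.text in the real flow).
sample_html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"

# "html.parser" is the parser that ships with Python, so nothing extra is installed.
doc = BeautifulSoup(sample_html, "html.parser")

print(type(doc))    # a BeautifulSoup object, not a plain string
print(doc.p.text)   # the text inside the first <p> tag
```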

Q. What is HTML?
Before we dive into how to inspect HTML, we should cover the basics of HTML.
The HyperText Markup Language, or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets and scripting languages such as JavaScript.

An HTML File

An HTML tag comprises three parts:

  1. Name: (html, head, body, div, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
  2. Attributes: (href, target, class, id, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
  3. Children: A tag can contain some text or other tags or both between the opening and closing segments, e.g., <div>Some content</div>.

Common tags and attributes :

Tags in HTML
There are around 100 types of HTML tags, but on a day-to-day basis only around 15 to 20 of them are in common use, such as the <div>, <p>, <section>, <img> and <a> tags.
Among these, I want to highlight the <a> tag, which can carry attributes such as href (hypertext reference). The <a> tag lets users click through to another page, which is why it is called the anchor tag.

Attributes

Each tag supports several attributes. The following are some common attributes used to modify the behavior of tags:

  • id
  • style
  • class
  • href (used with <a>)
  • src (used with <img>)

With a BeautifulSoup object, we can get a specific type of tag in the HTML by referring to the tag by its name.

Here, we use the find() function of BeautifulSoup to find the first <title> tag in the HTML document and display its content.
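A short sketch of both ideas, using a made-up HTML snippet: `find()` returns the first matching tag, and a tag's name, attributes and text can then be read from it.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration.
html = (
    "<html><head><title>Demo</title></head>"
    '<body><a id="home" class="nav" href="https://example.com">Home <b>page</b></a></body></html>'
)
doc = BeautifulSoup(html, "html.parser")

title_tag = doc.find("title")   # first <title> tag in the document
print(title_tag.text)

link = doc.a                    # accessing a tag by name gives the first <a> tag
print(link.name)                # the tag's name
print(link["href"])             # one attribute, read like a dictionary key
print(link.attrs)               # all attributes as a dictionary
print(link.get_text())          # text of the tag and of its children
```

Note that Beautiful Soup returns the `class` attribute as a list, because an HTML element can have several classes.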

Inspecting HTML in the Browser

To view the source code of any webpage right within your browser, you can right-click anywhere on the page and select the “Inspect” option. This opens the “Developer Tools” mode, where you can see the source code as a tree. You can expand and collapse various nodes and find the source code for a specific portion of the page.

As shown in the photo above, I’ve hovered over one of the movie names to show how its content is marked up.
I found that each movie name sits inside an ‘<a>’ tag. Since that tag does not have any specific class or other distinguishing attribute, I will have to pick out the desired ‘<a>’ tags from among all the ‘<a>’ tags present on the page.

Since we have downloaded a single page and turned it into a BeautifulSoup object, we can start using functions from the Beautiful Soup library to extract the pieces of information we want.

Movie Name

Now we will use BeautifulSoup to extract the Names and URLs of the top 250 Movies from the HTML Page.

Fetching Movie Names

Since the movie name is written directly as the text of the <a> tag, we can access it using the find_all() function of the BeautifulSoup object (doc here).
For the movie URL, however, we have to read one of the attributes of the <a> tag, namely href, which contains the desired link.
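The sketch below shows one way to filter all `<a>` tags down to the movie links. It runs on a hypothetical, heavily simplified snippet, and the filtering rule (an href starting with `/title/` plus non-empty link text) is an assumption for illustration; the original notebook may have selected the tags differently.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.imdb.com"

# Hypothetical, simplified markup; the real page is far larger.
sample_html = """
<table>
  <tr><td><a href="/title/tt0111161/"><img src="poster1.jpg"></a></td>
      <td><a href="/title/tt0111161/">The Shawshank Redemption</a></td></tr>
  <tr><td><a href="/title/tt0068646/"><img src="poster2.jpg"></a></td>
      <td><a href="/title/tt0068646/">The Godfather</a></td></tr>
</table>
<a href="/chart/boxoffice">Box Office</a>
"""

doc = BeautifulSoup(sample_html, "html.parser")

movie_names = []
movie_urls = []
for a_tag in doc.find_all("a"):
    href = a_tag.get("href", "")   # "" if the tag has no href attribute
    name = a_tag.text.strip()
    # Keep only links that point to a title page and carry visible text,
    # which skips poster links and navigation links.
    if href.startswith("/title/") and name:
        movie_names.append(name)
        movie_urls.append(BASE_URL + href)

print(movie_names)
print(movie_urls)
```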

Fetching Movie URLs

Creating a DataFrame using Pandas for Lists derived till now

Q1. What is Pandas?
Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Q2. What is a DataFrame?
A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array or a table with rows and columns. A DataFrame makes it easier for us to work with tabular data and analyse it.

First, we will create a Python dictionary with the movie names and movie URLs that we extracted above, and then convert it into a DataFrame.
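A minimal sketch of this step with two movies instead of 250. The column names `Movie Name` and `Movie URL` are assumptions chosen for readability.

```python
import pandas as pd

# Two entries stand in for the full lists of 250 names and URLs.
movie_names = ["The Shawshank Redemption", "The Godfather"]
movie_urls = [
    "https://www.imdb.com/title/tt0111161/",
    "https://www.imdb.com/title/tt0068646/",
]

# Each dictionary key becomes a column; each list supplies that column's rows.
movies_dict = {"Movie Name": movie_names, "Movie URL": movie_urls}
movies_df = pd.DataFrame(movies_dict)

print(len(movies_df))   # number of rows, one per movie
print(movies_df.head())
```

All lists in the dictionary must have the same length, otherwise `pd.DataFrame` raises an error.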

We can see that the DataFrame consists of 250 items, which equals the number of movies on the Top Rated Movies page.
Therefore, we can be confident that we have extracted all the information we intended to.

Original Data Source with 250 Items

Next Steps

Now, we will go into each individual Movie's page and extract the rest of the required information

Movie Page for the movie ‘Shawshank Redemption’

Let’s start with extracting all the information for the movie Shawshank Redemption, which is the first movie in our list.

As we can see, this page has about 230,000 characters (2.3 lakh).

Now, we will use BeautifulSoup to extract the required info from the page.

Now we have all the required information for the movie Shawshank Redemption. Let us see what we got:

Now, we will write functions to combine what we have done above and get all the details at once for any given Movie URL
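One possible shape for such functions is sketched below. Everything about the page structure here is hypothetical: the tag and class names in the sample HTML are invented and will not match the real IMDb markup, and the `fetch_html` parameter and the helpers `get_text` and `parse_movie_page` are assumptions. Only the name `get_movie_info` comes from this article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tag and class names below are invented for this
# sketch and will not match the real IMDb page.
SAMPLE_MOVIE_HTML = """
<html><body>
  <h1 class="movie-title">Example Movie</h1>
  <span class="release-year">2000</span>
  <span class="rating">8.5</span>
  <ul class="genres"><li>Crime</li><li>Drama</li></ul>
  <p class="summary">A short plot summary goes here.</p>
</body></html>
"""


def get_text(doc, tag_name, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = doc.find(tag_name, class_=class_name)
    return tag.text.strip() if tag is not None else None


def parse_movie_page(html_text):
    """Pull the fields we care about out of one movie page."""
    doc = BeautifulSoup(html_text, "html.parser")
    genre_list = doc.find("ul", class_="genres")
    genres = []
    if genre_list is not None:
        genres = [li.text.strip() for li in genre_list.find_all("li")]
    return {
        "Movie Name": get_text(doc, "h1", "movie-title"),
        "Year of Release": get_text(doc, "span", "release-year"),
        "Rating": get_text(doc, "span", "rating"),
        "Genre": ", ".join(genres),
        "Summary": get_text(doc, "p", "summary"),
    }


def get_movie_info(movie_url, fetch_html):
    """Download one movie page and return its details as a dictionary.

    `fetch_html` is any callable that takes a URL and returns HTML text,
    for example a small wrapper around requests.get(url).text.
    """
    return parse_movie_page(fetch_html(movie_url))
```

Returning None for a missing field, rather than raising an error, lets a long scraping run continue when one page is laid out differently.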

Let us test whether the functions above work correctly.

To do this, we will call the get_movie_info function for the movie The Godfather.

‘The Godfather’ movie page

As we can see, we have successfully extracted all the information for the said movie at once

Similarly, we will now call the function get_movie_info for all 250 movies in our list.
To do this, we will first create a dictionary movie_dict to store the results for all 250 movies:
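A sketch of the loop, assuming a dictionary of lists where each key is a field and each list holds one value per movie. The helper name `collect_movie_details`, the `get_info` parameter and the optional delay are hypothetical; in the real flow `get_info` would be get_movie_info.

```python
import time


def collect_movie_details(movie_urls, get_info, delay_seconds=0.0):
    """Call get_info(url) for every URL and gather the results column by column.

    Assumes every call returns a dictionary with the same keys, so that all
    lists in the result end up with the same length.
    """
    movie_dict = {}
    for url in movie_urls:
        info = get_info(url)
        for key, value in info.items():
            # Create the list for this field on first use, then append to it.
            movie_dict.setdefault(key, []).append(value)
        # A short pause between requests is gentler on the website.
        time.sleep(delay_seconds)
    return movie_dict
```

Pausing between requests ties back to the earlier note that websites may block clients that download a large amount of data quickly.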

Now that we have the information for all the movies, let us convert this dictionary to a DataFrame, just as we did previously, so that we can easily work with the tabular data using Pandas.

DataFrame with movie details

Having all the information from each individual movie page and from the Top Rated Movies page, let us combine both DataFrames into one single DataFrame:
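One way to combine the two DataFrames is sketched below with `pd.concat`, under the assumption that both have the same number of rows in the same movie order. The column names are illustrative, and the original notebook may have used a different method such as a merge on a shared column.

```python
import pandas as pd

movies_df = pd.DataFrame({
    "Movie Name": ["The Shawshank Redemption", "The Godfather"],
    "Movie URL": [
        "https://www.imdb.com/title/tt0111161/",
        "https://www.imdb.com/title/tt0068646/",
    ],
})
details_df = pd.DataFrame({
    "Year of Release": ["1994", "1972"],
    "Director": ["Frank Darabont", "Francis Ford Coppola"],
})

# axis=1 places the DataFrames side by side, matching rows by index.
final_df = pd.concat([movies_df, details_df], axis=1)

print(final_df.shape)
print(list(final_df.columns))
```

If the row order of the two DataFrames could differ, merging on a shared key column is the safer choice.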

Let us see what we have in the final DataFrame that we have come up with :

This is exactly what our desired output was when we began this project.

Let us now save this DataFrame as a CSV file
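A minimal sketch of the save step. The helper name `save_to_csv` and the default file name are assumptions.

```python
import pandas as pd


def save_to_csv(df, path="top_rated_movies.csv"):
    """Write the DataFrame to a CSV file."""
    # index=False leaves out the row-number column pandas would otherwise add.
    df.to_csv(path, index=False)


# Typical use:
# save_to_csv(final_df)
# pd.read_csv("top_rated_movies.csv")  # load it back later for analysis
```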

Summary

Finally, we have managed to parse ‘IMDB - Top Rated Movies’ and get our hands on interesting and insightful data from the world of entertainment.
We have saved all the information we extracted from the website in a CSV file, which we can use to answer many further questions, e.g. which director has directed the most top-ranked movies in the world?

The use of CSV File ( Data )

Let us look at the steps that we took from start to finish :

1. We downloaded the webpage using `requests`

2. We parsed the HTML source code using the `BeautifulSoup` library and extracted the desired information, i.e.
* The names of ‘Top Rated Movies’
* URLs of each of those movies

3. We created a `DataFrame` using `Pandas` for `Python Lists` that we derived from the previous step

4. We extracted detailed information for each movie among the list of `Top Rated Movies`, such as :
* Movie Name
* Summary of the movie
* Year of Release
* Genre
* Rating
* Number of Reviews
* Director Name
* Lead Actors
* Movie Poster Image

5. We then created a ‘Python Dictionary’ to save all these details

6. We converted the python dictionary into ‘Pandas DataFrames’

7. Now that we had ‘2 DataFrames’, we merged both these dataframes into one single DataFrame.

8. With one single DataFrame in hand, we then converted it into a single ‘CSV’ file, which was the goal of our project.

Future Work

We can now work forward to explore this data more and more to fetch meaningful information out of it.

With these insights and further analysis of the data, we can answer many questions, such as:

  • Which actor has worked in most top rated movies across the world?
  • Which are the top rated movies in a genre of our interest?
  • Which Director has directed the most top rated movies?
  • Which year gave us the most Top Rated Movies till date?

And the list goes on.
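As a taste of that analysis, here is a sketch of how two of these questions could be answered with Pandas. The rows are placeholder values, not scraped data, and the column names are assumptions; in the real flow the DataFrame would be loaded from the CSV file.

```python
import pandas as pd

# Placeholder rows standing in for the scraped CSV.
sample_df = pd.DataFrame({
    "Movie Name": ["Movie 1", "Movie 2", "Movie 3"],
    "Director": ["Director A", "Director B", "Director A"],
    "Year of Release": [1994, 1972, 1994],
})

# value_counts() counts how often each value appears in a column.
director_counts = sample_df["Director"].value_counts()
top_director = director_counts.idxmax()   # director with the most movies

year_counts = sample_df["Year of Release"].value_counts()
top_year = year_counts.idxmax()           # year with the most movies

print(top_director, director_counts[top_director])
print(top_year, year_counts[top_year])
```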

In the future, I would like to make this dataset even richer with data from other lists created by IMDb, such as Most Trending Movies, Top Rated Indian Movies and Lowest Rated Movies. I would then like to analyse the entire dataset to learn a lot more about movies than I currently know.

References

[1] Python official documentation. https://docs.python.org/3/

[2] Requests library. https://pypi.org/project/requests/

[3] Beautiful Soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[4] Aakash N S, Introduction to Web Scraping, 2021. https://jovian.ai/aakashns/python-web-scraping-and-rest-api

[5] Pandas library documentation. https://pandas.pydata.org/docs/

[6] IMDB Website. https://www.imdb.com/chart/top

[7] Web Scraping Article. https://www.toptal.com/python/web-scraping-with-python

[8] Web Scraping Image. https://morioh.com/p/431153538ecb

[9] Working with Jupyter Notebook. https://towardsdatascience.com/write-markdown-latex-in-the-jupyter-notebook-10985edb91fd


Harshit Gupta

Who am I? A budding Data Scientist. With 3 years of work experience in retail, I am now building my Data Science skills to get into the industry and make a mark of my own!