Scraping Amazon Best Sellers with Python

using Requests, Beautiful Soup, and Pandas

10 min read · Apr 29, 2021


Learn how to scrape data from a website using Python with the Requests and Beautiful Soup libraries.

If you are just starting with Python and wish to learn more, take a look at the Jovian Zero to Data Analyst Bootcamp, which includes courses such as Programming with Python and Data Analysis with Python.

What is web scraping?

Web scraping, or web content extraction, is an automated method for extracting large amounts of data from web pages and storing it in a suitable format.

Why do web scraping?

Web scraping is commonly used in marketing: businesses scrape data from competitors' websites to gather information such as articles, prices, and hot-selling products, then use it to adjust their services and strategy to make a profit. Applications of web scraping include, but are not limited to, market analysis, building large datasets for machine learning, and search engine optimization.

How do we scrape Amazon?

Amazon.com, known as Amazon, is an American online retail and cloud computing company. It is a vast Internet-based enterprise that sells a large variety of goods, and one of the most popular e-commerce platforms in the world, where people shop online for items such as electronics, accessories, and clothes.

Amazon best sellers department

Amazon lists its best sellers, in alphabetical order, on the Amazon Best Sellers page. The page provides item categories grouped by department (about 40 in all). In this project, we are going to retrieve Amazon's best-selling items across a variety of categories using web scraping. To achieve that, we will use the Python libraries Requests and BeautifulSoup to fetch, parse, and extract the information we need from the web pages.

Here is an outline of the steps we will follow:

  • Install and import libraries
  • Download and parse the Bestseller HTML page source code using Requests and BeautifulSoup to get the item category URLs
  • Repeat step 2 for each item category using the corresponding URL
  • Extract information from each page
  • Combine the information extracted from each page into a Python dictionary
  • Save the data to a CSV file using the Pandas library

By the end of the project, we’ll create a CSV file in the following format:

Topic,Topic_url,Item_description,Rating out of 5,Minimum_price,Maximum_price,Review,Item Url
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick 4K streaming device with Alexa Voice Remote | Dolby Vision | 2018 release,4.7,39.9,0.0,615699,"https://images-na.ssl-images-amazon.com/images/I/51CgKGfMelL._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick (3rd Gen) with Alexa Voice Remote (includes TV controls) | HD streaming device | 2021 release,4.7,39.9,0.0,1844,"https://images-na.ssl-images-amazon.com/images/I/51KKR5uGn6L._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,"Amazon Smart Plug, works with Alexa – A Certified for Humans Device",4.7,24.9,0.0,425090,"https://images-na.ssl-images-amazon.com/images/I/41uF7hO8FtL._AC_UL200_SR200,200_.jpg"
Amazon Devices & Accessories,https://www.amazon.com/Best-Sellers/zgbs/amazon-devices/ref=zg_bs_nav_0/131-6756172-7735956,Fire TV Stick Lite with Alexa Voice Remote Lite (no TV controls) | HD streaming device | 2020 release,4.7,29.9,0.0,151007,"https://images-na.ssl-images-amazon.com/images/I/51Da2Z%2BFTFL._AC_UL200_SR200,200_.jpg"

The full code for this post can be found here:

https://jovian.ai/landryroni/data-analyst-project-1-web-scraping

How to Run the Code

You can execute the code by clicking the “Run” button at the top of the page and selecting “Run on Binder”. You can make changes and save your own version of the notebook to Jovian by executing the following cells.

Note: the Best Sellers page has 40 item categories, and each category lists its top 100 items across 2 pages (50 items per page), for 80 pages in total. Due to captcha problems, a few pages could not be accessed.

Install and import the libraries we are going to use

Let's start with the necessary libraries:

Install the libraries via the pip command, then import the required packages for scraping the data from the website.
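
Here is a minimal sketch of this step (the exact versions used in the original notebook are not specified):

# Install the libraries (run once; the leading "!" is for notebook cells)
!pip install requests beautifulsoup4 pandas --quiet

import requests                # download web pages over HTTP
from bs4 import BeautifulSoup  # parse and navigate HTML
import pandas as pd            # tabular data handling and CSV export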

Download and parse the Bestseller HTML page source code using Requests and BeautifulSoup to get the item category URLs.

We use the get function from the Requests library to download the page. We define a User-Agent header string to let servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent. This helps avoid detection as a scraper.

requests.get returns a response object containing the data from the web page. The .status_code property is used to check whether the request was successful: a successful response has an HTTP status code between 200 and 299.

response.text can be used to look at the page contents we just downloaded, and we can check their length by applying len(response.text). Here we just print the first 500 characters of the page content.
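
Putting the download step together, a sketch might look like this (the User-Agent string below is an assumption; any common browser string should work):

# The Amazon Best Sellers landing page
base_url = 'https://www.amazon.com/Best-Sellers/zgbs'

# A browser-like User-Agent helps avoid being served a captcha page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/90.0.4430.93 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

response = requests.get(base_url, headers=headers)

# A successful response has a status code between 200 and 299
if not 200 <= response.status_code <= 299:
    raise Exception(f'Failed to load page {base_url}: {response.status_code}')

print(len(response.text))   # length of the downloaded HTML
print(response.text[:500])  # preview the first 500 characters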

Let’s save the contents to a file with the HTML extension.
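
For example:

# Write the downloaded HTML to disk so we can inspect it in Jupyter
with open('bestseller.html', 'w', encoding='utf-8') as f:
    f.write(response.text)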

We can view the file using the “File > Open” menu option within Jupyter and clicking on bestseller.html in the list of files displayed.

Best seller HTML file

We can also pay attention to the file size: for this task, a file size close to 250 kB means the page content was downloaded successfully, whereas a file size of about 6.6 kB means we failed to download the actual page content. Failures can be caused by a captcha or other security checks the web page performs.

Here is what we see when we open the file by clicking on it:

Amazon Best Sellers page

While this looks similar to the original web page, note that it’s simply a copy. You will notice that none of the links or buttons work. To view or edit the source code of the file, click “File > Open” within Jupyter, then select the file bestseller.html from the list and click the “Edit” button.

Best seller HTML source

We have just read the file content and printed the first 500 characters. Now we parse the web page with BeautifulSoup and check the type of the result.
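
A sketch of the read-and-parse step:

# Read the saved page back and parse it with BeautifulSoup
with open('bestseller.html', 'r', encoding='utf-8') as f:
    html_source = f.read()

print(html_source[:500])  # first 500 characters of the file

doc = BeautifulSoup(html_source, 'html.parser')
print(type(doc))          # <class 'bs4.BeautifulSoup'>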

Let's access the parent tag and find all the tags and attributes that hold the data we need.

Here we find the URL and title of each item topic (category) and store them in a dictionary.
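
A sketch of this step; the id 'zg_browseRoot' for the sidebar that lists the categories is an assumption about Amazon's markup at the time of writing, which changes frequently:

# Find the sidebar listing the categories and collect title -> URL
topic_dict = {}
sidebar = doc.find('ul', {'id': 'zg_browseRoot'})  # assumed container
for link in sidebar.find_all('a'):
    topic_dict[link.text.strip()] = link['href']

print(len(topic_dict))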

We obtain 40 topics (categories) from the Bestseller page, which is correct.

Repeat step 2 for each item category using the corresponding URL.

Let's import the time library. To avoid having access to several pages denied by captchas, we will wait a few seconds between page requests by applying the sleep function from the time library.

Here we defined a function called parse_page to fetch and parse each single page URL obtained from the topics in the Bestseller departments.
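
A sketch of parse_page, reusing the headers defined earlier; the sleep duration and the minimum-size check (based on the ~250 kB vs ~6.6 kB observation above) are assumptions, not the notebook's exact settings:

import time

def parse_page(url):
    """Download and parse one best-seller page; return None on failure."""
    time.sleep(3)  # be polite between requests
    response = requests.get(url, headers=headers)
    if not 200 <= response.status_code <= 299:
        return None
    # A very small response usually means we got a captcha page instead
    if len(response.text) < 20000:
        return None
    return BeautifulSoup(response.text, 'html.parser')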

Here we defined the reparse_failed_page function to fetch and parse, a second time, the pages that failed on the first attempt. We use a while loop to retry until parsing succeeds, but some pages still fail even when applying a sleep time.
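
A sketch of the retry logic; max_retries is an assumption added so the while loop cannot run forever:

def reparse_failed_page(url, max_retries=5):
    """Retry a page that failed on the first attempt."""
    doc, attempts = None, 0
    while doc is None and attempts < max_retries:
        time.sleep(5)  # wait a bit longer before retrying
        doc = parse_page(url)
        attempts += 1
    return doc  # may still be None if the captcha persists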

Here we defined the parse function, which combines the two levels of parsing to recover the maximum number of pages.
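
A sketch of the two-level parse; it assumes page_urls already contains both pages of every category (80 URLs in total):

def parse(page_urls):
    """First pass over every page, then a retry pass over the failures."""
    parsed, failed = {}, []
    for url in page_urls:
        doc = parse_page(url)
        if doc is None:
            failed.append(url)
        else:
            parsed[url] = doc
    for url in failed:  # second-level attempt
        doc = reparse_failed_page(url)
        if doc is not None:
            parsed[url] = doc
    return parsed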

We successfully downloaded 76 of the 80 pages (40 categories × 2 pages), which is not bad.

Extract information from each page.

Here we defined a function to extract information such as the item description, rating, maximum price, minimum price, review count, and image URL from the page. Each piece of information is extracted from the corresponding HTML element of the parsed page content.

What is an HTML element?

An HTML element is a unit consisting of a tag name, attributes, and child nodes (which include text nodes and other elements). Data can be extracted from an element, and elements can also be manipulated to change the HTML.

What are HTML tags, attributes, and child nodes?

One easy way to understand what an HTML tag means is by answering questions such as: how does a computer know what content to display, how to display it, and where to display it? What makes some text different from the rest, distinguishing a title from the main text or body paragraphs? Most of this is done using HTML tags, which are commands in a web page that tell the browser to do something. Tags need an opening <> and a closing </> bracket in order to work.

HTML tags have names (html, head, body, div, etc.) to indicate what the tag represents; they are like keywords that define how the browser will format and display the content. For example, <body> defines the body section of an HTML document, and <b> is used to make text bold.

HTML attributes (href, src, class, id, alt, etc.) modify an HTML element; they are used inside the opening tag to control the element's behavior and provide additional information about the element. Examples of attributes: class="intro", id="firstname".

HTML child nodes (or children) are elements that are direct children of another, i.e., nested exactly one level inside the given element. For example, <head> and <body> are children of the <html> element.

Several different types of tags and attributes can be used in a single HTML page. Here we are going to find the HTML elements that contain the information we need by right-clicking the specific part of the page and selecting Inspect. The picture below shows an example of finding the item price tags.

page inspection

Here we defined the get_topic_url_item_description function to extract the corresponding item description.
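
A sketch; item stands for one best-seller entry tag, and the class name 'p13n-sc-truncate' is an assumption about Amazon's markup at the time:

def get_topic_url_item_description(item):
    """Extract the item description text."""
    desc_tag = item.find('div', {'class': 'p13n-sc-truncate'})
    return desc_tag.text.strip() if desc_tag else ''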

Here we defined get_item_price to extract the corresponding item's minimum and maximum price.
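
A sketch; prices appear either as '$39.99' or as a range '$19.99 - $39.99', and when there is no range the maximum is recorded as 0.0, matching the CSV sample above. The class name 'p13n-sc-price' is an assumption:

def get_item_price(item):
    """Return (minimum_price, maximum_price) as floats."""
    price_tag = item.find('span', {'class': 'p13n-sc-price'})
    if price_tag is None:
        return 0.0, 0.0
    parts = price_tag.text.replace('$', '').split('-')
    min_price = float(parts[0])
    max_price = float(parts[1]) if len(parts) > 1 else 0.0
    return min_price, max_price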

Here we defined the get_item_rate and get_item_review functions to extract the corresponding item rating and customer review count.
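
A sketch; ratings render as text like '4.7 out of 5 stars' and review counts as '615,699'. The class names are assumptions:

def get_item_rate(item):
    """'4.7 out of 5 stars' -> 4.7 (float)."""
    rate_tag = item.find('span', {'class': 'a-icon-alt'})
    return float(rate_tag.text.split()[0]) if rate_tag else 0.0

def get_item_review(item):
    """'615,699' -> 615699 (int)."""
    review_tag = item.find('a', {'class': 'a-size-small'})
    return int(review_tag.text.strip().replace(',', '')) if review_tag else 0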

Here we defined the get_item_url function to extract the corresponding item's image URL.
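
A sketch, reading the src attribute of the item's img tag:

def get_item_url(item):
    """Extract the item's image URL."""
    img_tag = item.find('img')
    return img_tag['src'] if img_tag else ''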

The extracted item information is stored directly in practical data types, which is convenient for later data analysis.

The item description and image URL are stored as strings, the item price and rating as floats, and the customer review count as an integer.

Combine the information extracted from each page into a Python dictionary.

Here we defined a function called get_info to collect all the item information we need as lists and store them in a dictionary.
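
A sketch of get_info; the container class 'zg-item' for a single best-seller entry is an assumption, and info is a dictionary with one list per CSV column:

COLUMNS = ['Topic', 'Topic_url', 'Item_description', 'Rating out of 5',
           'Minimum_price', 'Maximum_price', 'Review', 'Item Url']
info = {col: [] for col in COLUMNS}

def get_info(topic, topic_url, doc, info):
    """Append every item on a parsed page to the dictionary of lists."""
    for item in doc.find_all('span', {'class': 'zg-item'}):
        min_price, max_price = get_item_price(item)
        info['Topic'].append(topic)
        info['Topic_url'].append(topic_url)
        info['Item_description'].append(get_topic_url_item_description(item))
        info['Rating out of 5'].append(get_item_rate(item))
        info['Minimum_price'].append(min_price)
        info['Maximum_price'].append(max_price)
        info['Review'].append(get_item_review(item))
        info['Item Url'].append(get_item_url(item))
    return info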

After the two parsing attempts we have the maximum number of pages from which to extract data, and now we store the data in a DataFrame using the Pandas library.

Save the data to a CSV file using the Pandas library.

Let’s save the obtained data to a pandas DataFrame.
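
For example, with the info dictionary filled by calling get_info on every parsed page:

bestseller_df = pd.DataFrame(info)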

Let's print the result and see the shape of the data (number of rows, number of columns):

print(bestseller_df)
print(bestseller_df.shape)

We have a DataFrame with more than 3,500 rows and 8 columns. Let's save the DataFrame as a CSV file using Pandas.
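
For example:

bestseller_df.to_csv('AmazonBestSeller.csv', index=False)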

The CSV file created can be accessed by clicking File > Open.

AmazonBestSeller.csv

Let's open the CSV file, read its lines, and print the first 5 lines of data.
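
For example:

with open('AmazonBestSeller.csv', 'r', encoding='utf-8') as f:
    lines = f.readlines()
for line in lines[:5]:
    print(line.strip())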

Let's do some simple processing by looking for the item with the highest number of customer reviews.
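
A sketch using the DataFrame built above:

# Row with the largest value in the 'Review' column
most_reviewed = bestseller_df.loc[bestseller_df['Review'].idxmax()]
print(most_reviewed)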

Summary

What we have done so far:

  • Install and import libraries
  • Download and parse the Bestseller HTML page source code using Requests and BeautifulSoup to get the item category URLs
  • Repeat step 2 for each item category using the corresponding URL
  • Extract information from each page
  • Combine the information extracted from each page into a Python dictionary
  • Save the data to a CSV file using the Pandas library

By the end of the project, we created a CSV file in the format shown at the beginning of this post.


Even though we have a promising result and covered interesting features that will be helpful in future web scraping projects with Python, Requests, and BeautifulSoup, there is still a long way to go. Extra work can be done, such as improving the scraping process by writing better code. Suggestions are also welcome on how to deal with the pages that failed to parse because of captchas. Please feel free to leave a comment to help improve this work.

Future Work

  • Additional work can be done to avoid captchas and get access to all of the pages.
  • Explore other, more complex websites.
  • Explore how we might go about scraping data using Selenium and Scrapy.

