Data Science

Web Scraping Data Science & Related Jobs

Successfully scraped 100+ job openings using Requests & Beautiful Soup

Pankaj Singh


Data Science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data. A Data Science practitioner applies machine learning algorithms to numbers, text, images, video, audio, and more to produce artificial intelligence (AI) systems that perform tasks which ordinarily require human intelligence. In return, these AI-based systems generate meaningful insights that analysts and business users can translate into business value.

Web scraping is an automated method of obtaining large amounts of data from websites. In this article, we’ll collect data on the job opportunities available in Data Science and its related fields from Indeed using the Python requests and Beautiful Soup libraries.

Index Of Contents
· Download the Webpage using requests
· Parse the HTML source code using Beautiful Soup
· Extract details about Job-postings
· Compile extracted information into Pandas DataFrame
· Create a Python list to store all the scraped data using a for-loop
· Save the extracted information to a CSV file.
· Summary
· Future Work

Download the Webpage using requests

Indeed Job Portal

Let’s now grab the URL, save it into a container, and use the requests library, which allows us to send HTTP requests to the website’s server and download the content.
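A minimal sketch of this step; the exact search URL and query parameters are illustrative, not necessarily the ones used for this scrape:

```python
import requests

# Search URL for Data Scientist openings on Indeed (illustrative)
url = 'https://www.indeed.com/jobs?q=data+scientist'

# Send an HTTP GET request to the server and download the page content
response = requests.get(url)
```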

Let’s now check the status of the downloaded content:

  • The status code of the response to requests.get() should be between 200 and 299 for a successful download.
  • The status can be verified through the response’s status_code attribute.
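A quick sketch of the check, reusing the response object from above (note that status_code is an attribute, not a method):

```python
# A status code in the 2xx range means the request succeeded
if 200 <= response.status_code < 300:
    print('Page downloaded successfully')
else:
    raise Exception(f'Failed to download page: status {response.status_code}')
```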

We have now successfully downloaded the web page.

Parse the HTML source code using Beautiful Soup

We’ll use the Beautiful Soup library to parse the HTML source code of the downloaded web page.

We will now extract details from the Beautiful Soup object using its attribute-style access. Let’s find out the title of the webpage.
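A minimal sketch of the parsing step, using the response downloaded earlier:

```python
from bs4 import BeautifulSoup

# Parse the downloaded HTML into a Beautiful Soup object
doc = BeautifulSoup(response.text, 'html.parser')

# Attribute-style access: .title returns the <title> tag
print(doc.title.text)
```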

We have successfully extracted the title of the web page.

Extract details about Job-postings

Let’s now extract the Job Titles, Organization Name, Job Location, Ratings, Salary, and Date of Job Posting.

Let’s now create a helper function get_title() to extract all the job titles using Beautiful Soup’s .find_all() functionality.
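A sketch of the helper; the “jobTitle” class name is illustrative, since Indeed’s markup changes frequently and the actual tag and class must be read off the live page:

```python
def get_title(doc):
    """Extract all job titles from a parsed search-results page."""
    # 'jobTitle' is an illustrative class name; inspect the page
    # in the browser to find the current one
    title_tags = doc.find_all('h2', class_='jobTitle')
    return [tag.get_text(strip=True) for tag in title_tags]
```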

We’ll repeat the same process for the organization name, job location, rating of the organization, job description, date of posting, and salary.
However, while extracting the rating and salary, we find that some postings do not include a rating or mention the salary offered.

Here we’ll use Python’s try/except mechanism to fill in the missing values with “NA”.
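A sketch of the pattern, assuming each posting has been isolated into a per-job card element; the class names are again illustrative:

```python
def get_rating(card):
    """Return the organization's rating, or 'NA' if it is missing."""
    try:
        # 'ratingNumber' is an illustrative class name
        return card.find('span', class_='ratingNumber').get_text(strip=True)
    except AttributeError:
        return 'NA'

def get_salary(card):
    """Return the offered salary, or 'NA' if it is missing."""
    try:
        # 'salary-snippet' is an illustrative class name
        return card.find('div', class_='salary-snippet').get_text(strip=True)
    except AttributeError:
        return 'NA'
```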

Let’s create the helper function download_web() to download a web page and parse it.
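A minimal sketch of the helper, built from the requests and Beautiful Soup calls shown above:

```python
def download_web(url):
    """Download a web page and return it as a parsed Beautiful Soup object."""
    response = requests.get(url)
    if not 200 <= response.status_code < 300:
        raise Exception(f'Failed to download {url}: status {response.status_code}')
    return BeautifulSoup(response.text, 'html.parser')

docs = download_web(url)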

As you can see, “docs” (a Beautiful Soup object) stores the parsed HTML data.

Compile extracted information into Pandas DataFrame

Let’s first create containers holding the search URL for each job role.

Now we’ll create a list of these containers.
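A sketch with illustrative search URLs for a few of the roles:

```python
# One search URL per job role (query strings are illustrative)
data_scientist_url = 'https://www.indeed.com/jobs?q=data+scientist'
data_analyst_url = 'https://www.indeed.com/jobs?q=data+analyst'
ml_engineer_url = 'https://www.indeed.com/jobs?q=machine+learning+engineer'
data_engineer_url = 'https://www.indeed.com/jobs?q=data+engineer'

# Collect the containers into a single list for iteration
urls = [data_scientist_url, data_analyst_url, ml_engineer_url, data_engineer_url]
```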

Let’s now create a final helper function details() to compile all the data under one hood, calling the previously created functions to extract all the details for a particular job role, e.g., Data Scientist.
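A condensed sketch of the function, showing only a few of the columns; the full version would add organization name, location, description, and date of posting in the same way, and the card class name is again illustrative:

```python
import pandas as pd

def details(url):
    """Scrape one search-results page and compile its postings into a DataFrame."""
    doc = download_web(url)
    # Each posting sits in its own card; 'job_seen_beacon' is an
    # illustrative class name for the per-job container
    cards = doc.find_all('div', class_='job_seen_beacon')
    # Each list must have the same length: one entry per posting
    rows = {
        'Job Title': get_title(doc),
        'Rating': [get_rating(card) for card in cards],
        'Salary': [get_salary(card) for card in cards],
    }
    return pd.DataFrame(rows)
```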

As you can see, we generate a pandas DataFrame using the pd.DataFrame constructor and return it.

Create a Python list to store all the scraped data using a for-loop

Here we’ll scrape and parse the data with a single for-loop over all the URLs we created for the different job roles, saving the results in a Python list.
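A sketch of the loop, using the urls list and the details() helper from above:

```python
# Scrape each role's search URL and collect one DataFrame per role
all_jobs = []
for url in urls:
    all_jobs.append(details(url))
```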

Let’s now check the details of the scraped data using the print function.

Here we can see that a list of per-role DataFrames has been created with the details. We’ll use the pd.concat function to combine them into a single pandas DataFrame.
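A sketch of the inspection and concatenation, assuming the all_jobs list from the loop above:

```python
# Inspect what was scraped, then stack the per-role DataFrames into one
print(len(all_jobs), 'job roles scraped')
for df in all_jobs:
    print(df.shape)

jobs_df = pd.concat(all_jobs, ignore_index=True)
```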

This is the complete and final pandas DataFrame, containing 105 open jobs in Data Science and related domains.

Save the extracted information to a CSV file.

Let’s save the extracted information in the .csv file format.
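A one-line sketch; the output filename is hypothetical:

```python
# 'data_science_jobs.csv' is a hypothetical filename; omit the index column
jobs_df.to_csv('data_science_jobs.csv', index=False)
```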

Summary

In this blog post, we scraped the different job opportunities in Data Science and its related domains across India, along with the job location, description, and salary offered.

  1. We downloaded the Data Science and related job postings from Indeed using the requests library.
  2. We parsed the HTML source code using the Beautiful Soup library.
  3. We extracted details about the job postings, such as job title, organization name, job location, and salary, as Python lists.
  4. We assembled the extracted information into a pandas DataFrame and saved it to a CSV file.

Future Work

Here are some ideas for future work:

  • Here we parsed only a single results page per job role, each containing 15 jobs. The code can be extended to parse all pages of non-dynamically loaded websites.
  • Since Indeed is a dynamically loaded webpage, we are limited to the first page. We could use Selenium to scrape the remaining data.
  • Similar code can be used to scrape Glassdoor, Naukri, and other popular job portals.
