Exploratory Data Analysis of Used Cars in the United States

using Python, Pandas, Seaborn and Plotly

14 min readSep 8, 2021

What is Exploratory Data Analysis?

Exploratory data analysis is a process for exploring datasets, answering questions, and visualizing results.

Here we explore the dataset, after which we make use of whatever data we can, by cleaning the data, i.e. converting it into the simplest form which can be used by our system and program to extract useful information. Once we have the cleaned data, we visualise it to get answers to some useful questions which then helps us to take meaningful actions.

EDA can help us deliver great business results, by improving our existing knowledge about our business as well as giving out new insights that we might not be aware of.

Tools and Libraries Used for Exploratory Data Analysis :

opendatasets ( Jovian library to download a 'Kaggle' dataset )
Data Cleaning :

Pandas
Numpy

Data Visualisation :

Matplotlib
Seaborn
Plotly
Heatmap

About the Project

In this project, we analyse the US Used Cars Dataset(3 million Cars), which has information about ‘3 million’ cars listed in the Used Car Market in US.

The dataset has 3 million rows and 66 columns, though for our analysis we will be making use of 1.5million rows and 15 columns.

Here, we assume that the analysis that we do for half the dataset would hold true and would be very close to the results that we would get even if we analyse the complete dataset.

Through the analysis, we aim to :

Get to know the ‘Used Car Market in US’ better
To understand from sellers/buyers perspective about how they can make the best decisions for their sale/purchase.
To answer some of the common questions that can help the party involved to have the best deal for their car

Steps Followed

Step 1: Selecting a real-world dataset

Step 2: Performing data preparation & cleaning

Step 3: Perform exploratory analysis & visualization and asking interesting questions

Step 4: Summarizing inferences & writing a conclusion

Downloading the Dataset

We will download the dataset from ‘Kaggle’, using the opendatasets library created by Jovian.

So let’s begin by downloading the data, and listing the files within the dataset : US Used Cars Dataset(3 million Cars)

Data Preparation and Cleaning

Data Cleaning is the method by which we make sure that the data that we are using for our analysis is completely ready, i.e. it does not have duplicates, or missing values, the data is in the right format, not corrupted and thus ready to be used for analysis.

Here we read 1M rows out of 3M, and 66 columns to save time. Later, we will use 1.5M rows and 15 columns for the purpose of our analysis and visualisation.

Let us have a look at a sample of the data that we have :

Let us now look at the numeric columns of our data set and try to have a rough idea about the data that we have in hand.

Now, Let us select only the required columns from the dataset to work upon :

Now we have finally read 1.5mn rows from the ‘usedcars_csv’ with our selected columns and the selected dataypes.

We can now proceed with Data Cleaning :

Let us begin by deleting the duplicate entries in the data and then finding and handling the missing values :

Finding and Deleting Duplicate entries

Number of Missing values

Now, that gives us a count of 9.81L missing values. That is quite a lot, so let us see how we can handle these missing values in the best way to move ahead.

As we can see, We have 6 columns with Missing Values.

We have the following options for dealing with missing values in numerical columns:

Leave them as is, if they won’t affect our analysis
Replace them with an average
Replace them with some other fixed value
Remove the rows containing missing values
Use the values from other rows & columns to estimate the missing value (imputation)

Now, Let us check column by column for all the columns where we have missing values to see what we can do about them and handle them in the best way possible

Column 1 : engine_cylinders

This column contains information about the type of engine cylinders that the car holds.

Since, there are quite a few types of engine cylinders each of which has its own features, we cannot randomly assign any value to the cars for which we do not have this information, as that would basically be false information for that car.

So the best way is to set the value of missing cars to ‘Unknown’.

Replaced Missing Values

Column 2 : frame_damaged

This column tells us if the frame of the car is damaged/dented or not.

Column 3: horsepower

This columns gives us the value of the horsepower that a car generates. Let us look at the stats of this data :

Now, we can see that the mean and median value of this column is quite close, i.e. ~245, that shows us that there is no Michael Jordan Fallacy here, i.e. we do not have a lot of outliers. Also, the standard deviation is not very high, which gives us the confidence that almost all cars with missing values can have a horsepower which is close to the mean/average

So, let us replace the ‘Nan’ values with the values in the range of our 25%ile — 70%ile values.

Count of missing values

Replacing missing values

Column 4: maximum_seating

This column tells us about the maximum seating capacity of the said car.

Here, we can see that out of ~14Lac Cars, ~9Lac cars are 5 Seater. Now, if we consider the missing 22k values as 5 seater, it would not really hamper our analysis as there is a very good chance that out of 67k cars, ~45k cars actually are a 5 seater car. So, let us replace all the NaN values, with ‘5 seats’.

Going one step ahead, what we should also do is to remove the word ‘seats’ from this data and then convert it into a numerical column. It is quite obvious that we are talking about the number of seats here, so removing the word ‘seats’ is logical and correct.

Also, it would help us in our analysis and visualisation ahead to have numerical values that can be compared.

Let us write a function for the same :

We have cleaned and prepared this column completely with just numerical values and a numerical datatype

Column 5: Seller Rating

This column gives the ratings of the various sellers that have listed the cars in the market.

Since there are just 16k missing values, and the values are also very close to each other, let us simply drop these 16k rows.

Column 5: wheel_system

This columns gives the information about the kind of wheel system that the car has.

We have 62k missing values here, let us delete these rows for this column.

Now we have worked on all the duplicate as well as missing values across all the columns of our dataset. Let us have a look at our final dataset :

As we can see, we now have 0 Null Values, 1 DateTime Column, 7 Numerical columns and 7 String columns, each with the correct datatype according to the data it holds

Exploratory Analysis and Visualization

Here, we explore the various columns in our data and understand the data better.

This will also help us to have answers to some interesting questions delivering meaningful insights from the data for a car buyer in US.

Let’s begin by importingmatplotlib.pyplot, seaborn, and plotly

These are the three visualisation libraries that we will be using to visualise our data.

Let us analyse the numeric columns and their statistics in our dataset.

Days on Market: Here we can see that on an average, a car stays in the market for approximately 1–2 months before selling. Some cars move as fast as within a day, but then some might also take a lot longer as well.
Horsepower: Most cars have a horspower ~245, where the most powerful car has a horsepower of 1000, and the least power a car has is 65hp.
Price : The average price of a second hand car in US is 26,000USD. We see that the minimum value is 165USD and the maximum value is 3299995USD, which seems unreal with a very wide range, so we can further analyse this to check for any manual entry errors or outliers.
Seller Rating : On an average most of the selleres have a good rating which is above 4 stars, with a minority with less ratings like 1, so we can safely say that most of the sellers in US do a good job when it comes to customer relations.

Let us start with visualising the daysonmarket column that gives us the number of days a used car stays in the market.

Here, we create a histogram and boxplot using plotly :

As we can see from the graph above, our initial idea that a car being on the market for ~3500days is probably an outlier was true.
We can clearly see that it is an outlier and 75% of the cars are sold within 80days of being in the market, with the median being 35 days.

2. Let us look at the variation of maximum_seating that these second hand cars have.

Here, we create a bar graph using matplotlib :

Very clearly, majority of the cars are 5 seater, followed by 7 and then 6 seater.

3. Now,lets analyse the different types of wheel systems that these cars have.

So we have 5 types of wheel drives, out of which most of the cars are a Front Wheel Drive, followed by an Automatic, and then a 4WD.

4. Let us look at the price distribution of the Used Cars in US.

As we can see, most of the cars are priced at less than 75k USD, so let us now closely look at the price distribution in that range excluding the outliers.

As we can see much better now that majority of the cars are priced betwen 18–25k USD, where the average price of a new car has always been around 38k USD for a light motor vehicle as per ‘Statista 2021’ Data as shown in the graph from their official website below.
Thus, buying a used car can be very beneficial and a wise decision!

5. Let us now look at the ratings that the Used Car sellers hold in US.

As we can see, the median value of the seller ratings in US is around 4.4, with 75% of the sellers having a rating above 4.6.
This shows us that the sellers are doing a pretty good job when it comes to customer satisfaction, be it a car seller or a buyer. Both the parties are rating the sellers quite high in general which is remarkable for these sellers considering how competitive and technical the domain is.

Asking and Answering Questions

Let us now ask and answer some interesting questions to further understand the Used Cars market in US.

Q1: How does the price of a car vary with the power it can deliver?

As we can see, there is no direct or proportional relation between the price and power of a car, which implies that there must be other factors which play a crucial role in deciding the price of a second hand car and not the horsepower it delivers.

Q2: Which brands are the ones mostly being sold in the second hand market?

As we can see certain brands like Ford, Chevrolet, Toyota definitely have high number of cars being sold in the market, and luxury brands like Janguar, Mini, Porsche etc have a very limited number of cars in the second hand market. This can be due to various reasons like-
- Luxury cars are very limited in number as compared to other cars on the roads and thus cars in the second hand market are also limited.
- People buying luxury cars do not sell them as much as the regular car buyers. Unlike regular middle class buyers they might keep more than one car with them, thus not bringing their cars in the second hand market.
- The quality of luxury cars is such that the users tend to use them for a longer period of time as compared to a regular car.

Q3: Which cities are the most active in the second hand car market?

As we can see, there is a good distribution of the number of cars across cities.
Here we have just visualised the top 30 cities, but going by the trend we can be certain that there are a lot of cities with a handsome number of cars in the used cars market.

Q4: Are car prices affected with seasonality? In which months do we see the highest and lowest prices for a second hand car?

Here, We can see that the prices have been increasing over the years owing to inflation, or other reasons just like the prices of new cars increase over the years. There is not a lot of differene among the prices of the cars sold in different months and the price is almost evenly distributed. So, it would be safe to say that seasonality does not play a crucial role in selling or buying a second hand car.

Q5: How many people Buy or Sell a dented Car?

Well, This is quite surprising because every day we see numerous cars on the street with dents and scratches. Yet, what we see here is quite different where majority of the cars in the Used market are not that way.
The reason for this can be that when someone goes to sell or buy a second hand car, both the parties ensure that there is no physical damage on the vehicle, as it is not desirable to any of the parties involved and thus most of the people get it fixed before it reaches the market.

Q6: Which brands are the most sought in the used cars market?

We can analyse this by having a look at how many days it takes for a used car of a particular brand to be sold or bought in the market in general.
As we can see, most of the brands are in the market for a similar amount of time. Yet, some brands such as Daewood, Karma, Maybach, Aston Martin stay in the market for quite a long time.

This can be due to multiple factors :

a. The cars in the sample were not in the best condition to be desirable by another user.

b. Cars like Rolls Royce are one of a kind and thus the people interested in buying such a car would prefer a new one instead of a used one as it is more of a status symbol for people than just a car.

At the same time, we have brands like Subaru, Toyota, Mazda, Chevrolet, Lexus, GMC, Toyota which are quite fast moving in the market.

The reasons can be as follows:

a. A majority of people buy and sell these cars, and thus the number of cars available in the market are huge in number thus they move fast.

b. These brands offer economical cars for people and thus the resale value becomes a lot more cheap for a user to buy. This results in a lot of people buying these cars as compared to an expensive one.

Q7: How do the Used Car Prices vary as per the Brands?

Here, we can see that brands like McLaren, RollsRoyce, Lamborghini, Buggati are very highly priced just like they are when new,and some of the brands with the least prices in the used market are — Geo, Saturn, Suzuki, Isuzu, Mercury, Oldsmobile.

Q8:How does the Used Car Market vary in US from one region to another???

So we can see that the used cars market is quite active on the West Coast of US, and not so much on the East Coast. This may be due to a higher population and thus a higher number of cars in the region, or just the behaviour of the population that resides there to change cars frequently and thus buying and selling cars more often.
So if you want a great deal for your vehicle, chances are that you will get it in the West Coast region!

Inferences and Conclusion

Here are the conclusions that we could draw about the Used Cars Market in US from our analysis :

A car would most likely stay in the the Used Cars market for 60 days at max, and if it stays there for more than that chances are that you either should reconsider the price offer or there is some or the other defect with the car which should be rectified to increase its chances of moving out fast.
70% of the cars in the Used Cars market are a 5-seater. So if you are not interested in that, you will have a lot less options to choose from.
Power that a car delivers is not a major factor in deciding the price of a car in the market. You may get a very powerful car at a lesser price and a less powerful car at a higher one. This also goes on to show that the users must consider other factors more important to decide the price at which they sell or buy a car.
Luxury cars are much less in number in the market as compared to regular cars. This observation is in line with how the new car market is, where the quantities sold of a luxury car are much lesser than the other cars.
The number of cars in the used cars market are pretty good across most of the cities in the West Coast of US, and not so much on the East Coast. So if you want to buy or sell a car, chances are that you will get the best deals/offers and product at a West Coast city.
Seasonality does not affect the pricing of cars in the market. So you csn buy your purchase/sell at any time convenient to you, not worrying about any specific months for the best offers.
Majority of the cars in the used market are not damaged/dented, which shows us that most users get their cars rectified before putting the out in the market, as only 1% of the cars are reported as Damaged.
Majority of the used cars are priced betwen 18–25k USD, where the average price of a new car has always been around 38k USD for a light motor vehicle. So buying a used car can be a wise decision, also considering the depriciating value and insurance costs of a new car.

Future Work

In the future, I would like to improve this project further taking following actions on this dataset :

Analysing more and different columns from the dataset to derive some more results.
Asking more questions, more specific to certain users or types of cars to go deep into the market analysis
Visualising some more relationships among the given data which can help us to take better actions for buying or selling a car in US.
Using the Pin codes of the sellers to know about the distribution of the sellers across the different United States.

References

[1] Matplotlib Documentation https://matplotlib.org

[2] Stackoverflow https://stackoverflow.com

[3] Folium Documentation http://python-visualization.github.io/folium/

[4] Aakash N S. Analyzing Tabular Data with Pandas, 2021. https://jovian.ai/aakashns/python-pandas-data-analysis

[5] Aakash N S. Data Visualization using Python Matplotlib and Seaborn, 2021. https://jovian.ai/aakashns/python-matplotlib-data-visualization

[6] Aakash N S. Advanced Data Analysis Techniques with Python & Pandas, 2021. https://jovian.ai/aakashns/advanced-data-analysis-pandas

[7] Aakash N S. Interactive Visualization with Plotly, 2021. https://jovian.ai/aakashns/interactive-visualization-plotly

[8] Plotly Documentation. https://plotly.com/python/

[9] Aakash N S. plotly-line-chart, 2021. https://jovian.ai/aakashns/plotly-line-chart