Data Science
Exploratory Data Analysis on Google Play Store Apps
Analyzing the apps found on the Google Play Store to gain insight into the present Android market
Index Of Contents
· Introduction
· Data Preparation and Cleaning
· Exploratory Analysis and Visualization
∘ Let’s have a look at the distribution of the ratings of the data frame.
∘ Let’s plot a visualization graph to view what portion of the apps in the play store are paid and free.
∘ Which category of apps has the most number of installs?
· Asking and Answering Questions
∘ What are the Top 10 installed apps in any category?
∘ Which are the top 10 expensive Apps in the play store?
∘ Which are the Apps with the highest number of reviews?
∘ What is the count of apps in different genres?
∘ Which are the apps with the highest earnings?
· Inferences and Conclusion
· References
Introduction
There are more than 3.04 million apps on the Google Play Store. In this project, I will take you through an analysis of various apps found on the Play Store with the help of different Python libraries.
Dataset: The dataset is taken from Kaggle and can be found here: Link
It consists of 13 columns (App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver) with 10841 rows.
Before starting make sure you have downloaded the dataset and placed it in the appropriate location.
Imports: Let us start by importing some of the libraries we will be working with.
Loading the dataset as a pandas data frame.
After loading the dataset, we can start the exploration. Before that, we need to check whether the dataset is ready for exploration, so let’s first look at its structure and at how the data is organized.
To find out whether the dataset contains any missing or NaN values, we can use the isnull() function.
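As a minimal sketch of this check, the snippet below counts missing values per column. The small toy_df frame and its values are made up for illustration; on the real data the same call would be applied to the loaded data frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data frame (values are made up).
toy_df = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Rating": [4.1, np.nan, 3.9],
    "Type": ["Free", "Paid", None],
})

# isnull() marks every missing cell as True; sum() then counts them per column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)
```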
So, we will need to prepare the dataset before performing exploratory data analysis on it.
Data Preparation and Cleaning
Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step prior to processing and often involves reformatting data, making corrections to data, and the combining of data sets to enrich data. Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a recordset, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
We saw that the dataset contains many null or missing values. The columns Rating, Type, Content Rating, Current Ver, and Android Ver contain 1474, 1, 1, 8, and 3 missing values respectively.
It would be better to define a function that gives us more useful information about the different attributes of the dataset. A function is also reusable, and we are going to call it several times later on.
Let’s call the function and see what it returns:
We now have some useful information about the dataset: for every attribute we can see the number of missing values, the unique count, and the data type.
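A helper along these lines could look like the sketch below. The name printinfo and the null_count column come from this article; the df parameter, the other two column names, and the toy data are assumptions made for illustration, not necessarily the notebook's exact code.

```python
import numpy as np
import pandas as pd

def printinfo(df):
    """Summarise each column: missing values, distinct values, and dtype."""
    return pd.DataFrame({
        "null_count": df.isnull().sum(),      # missing cells per column
        "unique_count": df.nunique(),         # distinct non-missing values
        "data_type": df.dtypes.astype(str),   # dtype name as text
    })

# Toy stand-in for the real data frame (values are made up).
toy_df = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Rating": [4.1, np.nan, 3.9],
})
info = printinfo(toy_df)
print(info)
```

Returning a data frame rather than printing makes the summary easy to inspect again after each cleaning step.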
Now we can start the process of data cleaning. Let’s start with the column Type:
Since there is only one missing value in this column, let’s fill it. After cross-checking on the Play Store, the missing value was found to be Free, so we can fill the missing value with Free.
After filling the value we can check and see if that has been correctly placed.
Now, we can move on to the column Content Rating:
By looking only at the row with the missing value, it is not easy to say what is actually missing. Let us have a look at the neighboring rows as well. For this purpose, we have the iloc and loc indexers.
We can clearly see that row 10472 is missing its Category value, so every subsequent value has shifted into the preceding column. The better option is to drop this row from our data frame.
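The inspection and the drop can be sketched as follows. The toy frame, its labels, and its values are made up so that the example runs on its own; only the idea of looking at neighboring rows with loc and iloc and then dropping the bad row is taken from the text.

```python
import pandas as pd

# Toy frame whose index labels mimic the area around the problem row.
toy_df = pd.DataFrame(
    {
        "App": ["App A", "App B", "App C", "App D", "App E"],
        "Content Rating": ["Everyone", "Teen", None, "Everyone", "Teen"],
    },
    index=[10470, 10471, 10472, 10473, 10474],
)

# Label of the row whose Content Rating is missing.
bad_label = toy_df[toy_df["Content Rating"].isnull()].index[0]

# loc slices by index label and includes both endpoints.
around_by_label = toy_df.loc[bad_label - 1 : bad_label + 1]

# iloc slices by integer position and excludes the end position.
pos = toy_df.index.get_loc(bad_label)
around_by_position = toy_df.iloc[pos - 1 : pos + 2]

# Drop the malformed row by its label.
cleaned_df = toy_df.drop(index=bad_label)
```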
Some of the columns will not be of much use in the analysis, so let’s drop them.
Now, we can fix the Rating column, which contains a total of 1474 missing values, by replacing the missing values with the mode of the entire column.
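A minimal sketch of a mode fill, using a made-up ratings series rather than the real column:

```python
import numpy as np
import pandas as pd

# Made-up ratings with two missing entries.
ratings = pd.Series([4.5, 4.5, np.nan, 3.0, np.nan])

# mode() ignores NaN and returns a Series, because ties are possible;
# taking the first entry picks the smallest of the most frequent values.
mode_value = ratings.mode()[0]

filled = ratings.fillna(mode_value)
```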
Finally, after fixing all the missing values, we should have a look at our data frame. We defined the function printinfo(), so it’s time to use it.
All the columns have a null_count of zero, which indicates that the data frame no longer contains any missing values.
Now we are done with the data cleansing part and can start on data preparation.
Columns like Reviews, Size, Installs, and Price should have an int or float datatype, but here they are of object type. So let’s convert them to their correct types.
Starting with the column Reviews, converting its type to int.
We can check whether the change has taken effect by calling our printinfo() function.
Now that the Reviews column has been converted to int type, we can move on to the column Size.
Converting the Size column from object to a numeric type: this column contains special characters such as the comma (,), +, M, and k, and some of its values are Varies with device. We need to remove all of these and then convert the column to int or float.
Removing the + symbol:
Removing the , (comma) symbol:
Replacing the M suffix by multiplying the value by 1000000:
Replacing the k suffix by multiplying the value by 1000:
Replacing the Varies with device value with NaN:
Now, finally converting all these values to numeric type:
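The steps above can be sketched on a small made-up series as shown below. Rewriting the M and k suffixes as scientific notation (e+6 and e+3) is one way to get the multiplication for free when parsing; whether the original notebook did it this way is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up values in the same formats as the Size column.
size = pd.Series(["19M", "8.7M", "201k", "1,000+", "Varies with device"])

# regex=False treats "+" as a literal character, not a regex quantifier.
size = size.str.replace("+", "", regex=False)
size = size.str.replace(",", "", regex=False)

# "19M" becomes "19e+6" and "201k" becomes "201e+3", which parse as numbers.
size = size.str.replace("M", "e+6", regex=False)
size = size.str.replace("k", "e+3", regex=False)

# The placeholder text carries no size information, so mark it as missing.
size = size.replace("Varies with device", np.nan)

# Parse the cleaned strings; the result is a float column with NaN.
size = pd.to_numeric(size)
```

The + symbol has to be removed before the suffixes are rewritten, otherwise the + inside e+6 would be stripped as well.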
After performing all of these operations, we should have a detailed look at that column, so we will again call our printinfo() function.
Since we converted the Varies with device value to NaN, we have to do something with those NaN values. It is better to drop the rows in which Size is NaN, because replacing them with the mean or mode would be misleading: some apps are very large and others are very small.
Column: Installs
To convert this column from object to integer type, we first need to remove the + symbol from these values.
Then let’s remove the , (comma) symbol from the numbers.
Lastly, we can now convert it from string type to numeric type, and then have a look at our dataset.
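A minimal sketch of the Installs conversion on made-up values in the same format:

```python
import pandas as pd

# Made-up values in the same format as the Installs column.
installs = pd.Series(["10,000+", "500,000+", "1,000,000,000+", "0"])

# Strip the trailing "+" and the thousands separators as literal characters.
installs = installs.str.replace("+", "", regex=False)
installs = installs.str.replace(",", "", regex=False)

# Every value is now a plain digit string, so this yields an integer column.
installs = pd.to_numeric(installs)
```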
Now we are only left with the Price column.
Column: Price
Converting this column from object to numeric type.
The values contain the special symbol $, which can be removed before converting to the numeric type.
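The same pattern works for Price. The values below are made up; note that $ is a regex anchor, so it is replaced as a literal character here.

```python
import pandas as pd

# Made-up values in the same format as the Price column.
price = pd.Series(["0", "$4.99", "$399.99"])

# regex=False makes "$" a literal dollar sign instead of an end-of-string anchor.
price = pd.to_numeric(price.str.replace("$", "", regex=False))
```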
After fixing all the issues, we should have a final look at the data frame.
We are finally done with the Data Preparation and Cleaning section. The original dataset contained 10841 rows and 13 columns: App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver. After cleansing the dataset and dropping the unwanted rows and columns containing null values and garbage data, we are left with 8434 rows and 10 columns.
Exploratory Analysis and Visualization
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images. This communication is achieved through the use of a systematic mapping between graphic marks and data values in the creation of the visualization. This mapping establishes how data values will be represented visually, determining how and to what extent the property of a graphic mark, such as size or color, will change to reflect changes in the value of a datum.
Let’s begin by importing matplotlib.pyplot and seaborn, and at the same time set our figure size, font size, etc.
Now it is time to unveil the real strength of data analysis: gaining insight, learning the trends and patterns, and getting answers to some of the questions related to the dataset.
Can we see which categories in the Play Store contain the highest number of apps? Well, let us try.
We have defined our x and y axes. Let us plot and see:
There are 33 categories in total in the dataset. From the above output we can conclude that most of the apps in the Play Store are in the Family and Game categories, and the fewest are in the Beauty and Comics categories.
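The counts behind such a bar plot can be computed with value_counts(). This is a sketch on made-up rows; the x and y names follow the text, everything else is illustrative.

```python
import pandas as pd

# Made-up rows; only the Category column matters here.
toy_df = pd.DataFrame({
    "Category": ["FAMILY", "GAME", "FAMILY", "BEAUTY", "GAME", "FAMILY"],
})

# value_counts() returns the number of apps per category, largest first.
category_counts = toy_df["Category"].value_counts()

x = category_counts.index    # category names for the x axis
y = category_counts.values   # number of apps for the y axis
```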
Which category from the Content Rating column is found most on the Play Store?
From the above plot, we can see that the Everyone category has the highest number of apps.
Let’s have a look at the distribution of the ratings of the data frame.
From the above graph, we can conclude that most of the apps in the Google Play Store are rated between 3.5 and 4.8.
Let’s plot a visualization graph to view what portion of the apps in the play store are paid and free.
From the above graph, we can see that approximately 92% of apps in the Google Play Store are free and approximately 8% are paid.
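The shares shown in such a chart can be computed directly, as in this sketch on made-up rows (the percentages here are not those of the real dataset):

```python
import pandas as pd

# Made-up rows: three free apps and one paid app.
toy_df = pd.DataFrame({"Type": ["Free", "Free", "Paid", "Free"]})

# normalize=True returns fractions instead of counts; scale them to percent.
type_share = toy_df["Type"].value_counts(normalize=True) * 100
print(type_share)
```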
Which category of apps has the most number of installs?
To answer this question, we need to create a separate data frame from our googlestore_df data frame, containing the Installs values grouped by Category.
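A grouped frame of this kind can be built as sketched below. Summing the installs per category is an assumption about the aggregation, and the rows are made up.

```python
import pandas as pd

# Made-up rows with already numeric install counts.
toy_df = pd.DataFrame({
    "Category": ["GAME", "FAMILY", "GAME", "TOOLS", "TOOLS"],
    "Installs": [1000, 800, 500, 200, 100],
})

# Total installs per category, largest first, as a two-column data frame.
category_installs = (
    toy_df.groupby("Category")["Installs"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
```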
Now, let us plot it out:
From the above visualization, it can be interpreted that the categories with the highest installs are Game, Family, Communication, News & Magazines, and Tools.
We have done a fair amount of exploratory data analysis by now and are in a position to answer some of the most common and in-demand questions that any app developer or business would love to know.
Asking and Answering Questions
With the help of data analysis, we can answer many questions that can’t be answered just by looking at the dataset. By querying a dataset and understanding the patterns and the rise and fall of its values, we can draw many conclusions and get insightful information from it. So let's start.
What are the Top 10 installed apps in any category?
We have to be able to answer this not just for a single category but for any of them, i.e., we need to define a function that returns a nice plot for any Category name provided by the user as an argument.
After defining the function, it’s time to check that everything is working fine. So let’s test it by passing the Sports category to the above-defined function.
From the above graph, we can see that in the Sports category, FIFA Soccer and Dream League Soccer 2018 have the highest installs. In the same way, by passing different category names to the function, we can get the top 10 installed apps for any category.
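The selection step of such a function can be sketched without the plotting part. The function name top_installed_apps, its parameters, and the toy rows are hypothetical; the article's own function also draws the plot.

```python
import pandas as pd

def top_installed_apps(df, category, n=10):
    """Return the n most installed apps of one category (exact name match)."""
    in_category = df[df["Category"] == category]
    # nlargest sorts by Installs in descending order and keeps the first n rows.
    return in_category.nlargest(n, "Installs")[["App", "Installs"]]

# Made-up rows for illustration.
toy_df = pd.DataFrame({
    "App": ["App A", "App B", "App C", "App D"],
    "Category": ["SPORTS", "SPORTS", "SPORTS", "GAME"],
    "Installs": [100, 500, 300, 900],
})
top_sports = top_installed_apps(toy_df, "SPORTS", n=2)
```

The returned frame can then be passed to any bar plot call, which keeps the filtering logic testable on its own.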
Which are the top 10 expensive Apps in the play store?
We will again need to create a separate data frame.
From the above data frame, we will need to drop one app, because its name would create a mess in the plot.
Finally, let’s plot and visualize the top 10 paid apps on the Play Store.
From the above graph, we can see that the app I am rich is the most expensive app in the Google Play Store, followed by I am Rich Premium. We also had to drop one row for this visualization because the app’s name was in Chinese and it was messing up the pie chart visualization.
Which are the Apps with the highest number of reviews?
From the above data frame, we can conclude that apps like Clash of Clans, Subway Surfers, Clash Royale, and Candy Crush Saga have the highest number of reviews on the Google Play Store.
What is the count of apps in different genres?
By creating a data frame, let’s define our x and y axes, which will be required for plotting the graph.
Finally, we are in a state to plot and gain an insight into our raised question.
From the above visualization, we can see that the highest number of apps is found in the Tools and Entertainment genres, followed by Education, Medical, and many more.
The last question which we are going to answer is:
Which are the apps with the highest earnings?
To answer this question, we need to perform some extra operations on the data frame: we will create a separate data frame, and then multiply the Price column by the Installs column to get the earnings of each app. So, let's start the process.
Now from the above data frame, we will need to separate out the columns which we will require.
We can now add a separate column Earnings to our new data frame, created by multiplying the two columns Price and Installs.
Now let us sort the above data by Earnings and Price.
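The column selection, the Earnings column, and the sort can be sketched as follows. The rows are made up and the descending sort order is an assumption. Since Installs values in this dataset are rounded-down buckets such as 10,000+, the product is only a rough estimate of earnings.

```python
import pandas as pd

# Made-up rows with already numeric Installs and Price columns.
toy_df = pd.DataFrame({
    "App": ["App A", "App B", "App C", "App D"],
    "Installs": [1000, 100, 10000, 500],
    "Price": [2.0, 5.0, 0.0, 4.0],
})

# Keep only the columns needed for the estimate.
earnings_df = toy_df[["App", "Installs", "Price"]].copy()

# Estimated earnings: price per install times number of installs.
earnings_df["Earnings"] = earnings_df["Installs"] * earnings_df["Price"]

# Highest earnings first; ties are broken by the higher price.
earnings_df = earnings_df.sort_values(by=["Earnings", "Price"], ascending=False)
```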
Finally, we can plot the graph and find out which apps have the highest earnings.
The top five apps with the highest earnings on the Google Play Store are:
I am Rich
I am Rich Premium
Hitman Sniper
Grand Theft Auto: San Andreas
Facetune - For Free
We have finally come to the end of our analysis. If you have read this far, I hope it has been interesting or useful to you.
Inferences and Conclusion
After analyzing the dataset, we have answers to some serious and interesting questions that any Android user would love to know.
The original dataset can be found here: Link
- Top categories on Google Playstore?
- Which category of Content is found more?
- Distribution of the ratings of the apps?
- What percentage of apps are Free and Paid?
- Which category of apps has the most number of installs?
- What are the Top 10 installed apps in different categories?
- Which are the top expensive Apps?
- Which are the Apps with the highest number of reviews?
- Count of Apps found in different genres?
- Which are the apps with the highest earnings?
You can also get answers to some more questions in the notebook, which can be found here: Link
I regularly post articles in the field of Data Science. If you found my work interesting, you can connect with me on:-