Introduction

More than 3.04 million apps are available on the Google Play Store. In this article I will take you through an analysis of various apps found on the Play Store with the help of different Python libraries.
Dataset: The dataset is taken from Kaggle and can be found here: Link
It consists of 10841 rows and 13 columns: App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver.

Before starting, make sure you have downloaded the dataset and placed it in the appropriate location.

Imports: Let us start by importing the libraries we will be working with.

Loading the dataset as a pandas data frame.
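A minimal sketch of the loading step. The file name googleplaystore.csv in the comment is an assumption (use whatever name your download has). So that the snippet runs on its own, it reads a two-row stand-in with made-up values from memory instead of the real file.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real file, so the snippet is runnable.
csv_text = """App,Category,Rating
Example App One,TOOLS,4.1
Example App Two,GAME,
"""

# With the real download you would pass its path instead, for example
# pd.read_csv("googleplaystore.csv")  (file name assumed, not from the article).
googlestore_df = pd.read_csv(io.StringIO(csv_text))

print(googlestore_df.shape)   # (rows, columns)
print(googlestore_df.head())  # first few rows
```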

After loading the dataset, we can start the exploration. Before that, however, we need to check whether the dataset is ready for exploration, so let’s first look at its structure and the way the data is organized.

To find out whether there are any missing (NaN) values in the dataset, we can use the isnull() function.
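As a small illustration, isnull() can be chained with sum() to count the missing values per column. The data frame below is a made-up stand-in, not the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real dataset (values are made up for illustration).
sample_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, np.nan, 3.9],
    "Type": ["Free", "Paid", None],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
missing_counts = sample_df.isnull().sum()
print(missing_counts)
```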

So, we will need to prepare the dataset before performing exploratory data analysis on it.

Data Preparation and Cleaning

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step prior to processing and often involves reformatting data, making corrections to data, and the combining of data sets to enrich data. Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a recordset, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

We saw that the dataset contains many null or missing values. The columns Rating, Type, Content Rating, Current Ver, and Android Ver contain 1474, 1, 1, 8, and 3 missing values respectively.

It would be better to define a function that gives us more useful information about the different attributes of the dataset. A function is also reusable, and we are going to call it several times later on.

Let’s call the function and see what it returns:
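One possible shape for such a helper is sketched below. Only the name printinfo() and the null_count column come from this article; the other column names and the choice to return a data frame are assumptions, and the sample data is made up.

```python
import numpy as np
import pandas as pd


def printinfo(df):
    """Return one row per column: data type, unique count, missing count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),    # data type of each column
        "unique_count": df.nunique(),      # distinct non-missing values
        "null_count": df.isnull().sum(),   # missing values
    })


# Made-up stand-in data for illustration.
sample_df = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9],
    "Type": ["Free", "Paid", None],
})

info = printinfo(sample_df)
print(info)
```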

We now have some useful information about the dataset: for each attribute we can see the number of missing values, the unique count, and the data type.

Now we can start the process of data cleaning. Let’s start with the column Type:

There is only one missing value in this column, so let’s fill it. After cross-checking on the Play Store, the missing value turns out to be Free, so we can fill the missing value with Free.

After filling the value we can check and see if that has been correctly placed.
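A minimal sketch of filling and then checking the value, using fillna() on a made-up stand-in frame:

```python
import pandas as pd

# Made-up stand-in: one app has no Type recorded.
sample_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Type": ["Free", None, "Paid"],
})

# Fill the single missing Type with "Free".
sample_df["Type"] = sample_df["Type"].fillna("Free")

# Check that no missing values remain in the column.
print(sample_df["Type"].isnull().sum())
```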

Now we can move on to the column Content Rating:

By looking only at this row, it is not easy to say what is actually missing in it. Let us have a look at the neighboring rows as well. For this purpose, pandas provides the iloc and loc indexers.

We can clearly see that row 10472 has missing data for the Category column, so all of its subsequent values have shifted into the preceding columns. A better idea is to drop this row from our data frame.
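The inspection and the drop can be sketched roughly as follows. The index labels imitate the row discussed above, but the frame and its values are a made-up stand-in.

```python
import pandas as pd

# Stand-in frame; the middle row imitates values shifted one column over.
sample_df = pd.DataFrame(
    {
        "App": ["App one", "Shifted app", "App three"],
        "Category": ["TOOLS", "1.9", "GAME"],
        "Rating": [4.2, 19.0, 4.5],
        "Content Rating": ["Everyone", None, "Teen"],
    },
    index=[10471, 10472, 10473],
)

# Find the rows whose Content Rating is missing.
bad_rows = sample_df[sample_df["Content Rating"].isnull()]

# loc slices by label and includes both end labels, so this shows the
# problem row together with its neighbors.
print(sample_df.loc[10471:10473])

# Drop the problem rows by index label.
cleaned_df = sample_df.drop(index=bad_rows.index)
```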

We also have some unwanted columns which will not be of much use in the analysis. So let’s drop those columns.

Now we can fix the Rating column, which contains a total of 1474 missing values, by replacing the missing values with the mode of the entire column.

Finally, after fixing all the missing values, we should have a look at our data frame. We defined a function called printinfo(), so it’s time to use it.
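A small sketch of the mode replacement on made-up ratings. mode() returns a Series because there can be ties, so the first entry is taken here.

```python
import numpy as np
import pandas as pd

# Made-up ratings with two missing entries.
ratings = pd.Series([4.4, 4.4, np.nan, 3.9, np.nan], name="Rating")

# Most frequent value; take the first one if several values tie.
mode_rating = ratings.mode()[0]

# Replace every missing rating with the mode.
ratings = ratings.fillna(mode_rating)
print(ratings.tolist())
```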

All the columns have the null_count as zero, which indicates that now the data frame doesn’t contain any missing values.

Now we are done with the data cleansing part and are ready to start the data preparation.

Columns like Reviews, Size, Installs, and Price should have an int or float datatype, but here we can see that they are of object type. So let’s convert them to their correct types.

Starting with the column Reviews, converting its type to int.

We can check whether the change has taken effect by calling our printinfo() function.
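For instance, assuming every value in the column is a plain digit string, astype(int) is enough (the values below are made up):

```python
import pandas as pd

# Review counts stored as text, as they would be in an object column.
reviews = pd.Series(["159", "967", "87510"], name="Reviews")

# Convert the strings to integers.
reviews = reviews.astype(int)
print(reviews.dtype)
```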

Now the Reviews column has been converted to int type, so we can move to the column Size.
We want to convert the Size column from object to a numeric type, but this column contains special characters such as the comma, +, M, and k, and some of its values are Varies with device. We need to remove all of these and then convert the column to int or float.

Removing the + symbol:

Removing the , symbol:

Replacing the M symbol by multiplying the value with 1000000:

Replacing the k by multiplying the value with 1000:

Replacing the Varies with device value with NaN:

Now, finally converting all these values to numeric type:
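The Size steps listed above could be combined along these lines. This is a sketch on made-up values; the helper name to_bytes is hypothetical and the original notebook may well have done each step differently.

```python
import numpy as np
import pandas as pd

# Made-up sizes covering each case described above.
size = pd.Series(["19M", "2.5M", "201k", "Varies with device", "1,000+"], name="Size")

# Remove the + and comma characters (literal match, not a regex).
size = size.str.replace("+", "", regex=False)
size = size.str.replace(",", "", regex=False)

# Mark sizes that are not given as missing.
size = size.replace("Varies with device", np.nan)


def to_bytes(value):
    """Turn '19M' into 19 * 1,000,000 and '201k' into 201 * 1,000."""
    if isinstance(value, str):
        if value.endswith("M"):
            return float(value[:-1]) * 1_000_000
        if value.endswith("k"):
            return float(value[:-1]) * 1_000
    return value  # NaN and plain numbers pass through unchanged


# Apply the suffix conversion, then convert everything to a numeric type.
size = pd.to_numeric(size.map(to_bytes))
print(size)
```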

After performing all of these operations, we should have a detailed look at that column, so once again we will call the function we defined, printinfo().

Since we converted the Varies with device values to NaN, we have to do something with those NaN values. It is better to drop the rows where Size is NaN, because replacing those values with the mean or mode would not be meaningful: some apps are very large and some are very small.
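Dropping only the rows where Size is missing can be done with dropna() and its subset argument, sketched here on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up stand-in: app B has no usable size.
sample_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Size": [19_000_000.0, np.nan, 201_000.0],
})

# subset limits the check to the Size column; other columns are ignored.
sample_df = sample_df.dropna(subset=["Size"])
print(len(sample_df))
```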

Column: Installs
To convert this column from object to integer type, we first need to remove the + symbol from these values.

Then let’s remove the comma from the numbers.

Lastly, we can now convert it from string type to numeric type, and then have a look at our dataset.
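The three Installs steps, sketched on made-up values:

```python
import pandas as pd

# Install counts as they appear in text form.
installs = pd.Series(["10,000+", "500,000+", "1,000,000+"], name="Installs")

# Strip the + and the thousands separators, then convert to numbers.
installs = installs.str.replace("+", "", regex=False)
installs = installs.str.replace(",", "", regex=False)
installs = pd.to_numeric(installs)
print(installs.tolist())
```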

So, now we are only left with the Price column.
Column: Price
Converting this column from object to numeric type.

The values contain the special symbol $, which can be removed before converting the column to the numeric type.
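A short sketch of the Price conversion on made-up values. The $ is matched literally (regex=False), since $ has a special meaning in regular expressions.

```python
import pandas as pd

# Prices as text; free apps are stored as "0".
price = pd.Series(["0", "$4.99", "$399.99"], name="Price")

# Remove the literal dollar sign, then convert to a numeric type.
price = pd.to_numeric(price.str.replace("$", "", regex=False))
print(price.dtype)
```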

After fixing all the issues, we should have a final look at the data frame.

Now we are finally done with this section, Data Preparation and Cleaning. The original dataset contained 10841 rows and 13 columns: App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, Genres, Last Updated, Current Ver, and Android Ver. After cleansing the dataset and dropping the unwanted columns and the rows with null values or garbage data, we are left with 8434 rows and 10 columns.

Exploratory Analysis and Visualization

In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images. This communication is achieved through the use of a systematic mapping between graphic marks and data values in the creation of the visualization. This mapping establishes how data values will be represented visually, determining how and to what extent the property of a graphic mark, such as size or color, will change to reflect changes in the value of a datum.

Let’s begin by importing matplotlib.pyplot and seaborn, and at the same time set our figure size, font size, etc.

Now it is time to unveil the real strength of data analysis: gaining insight, learning the trends and patterns, and answering some questions about the dataset.

Which are the top categories in the Play Store, that is, the ones containing the highest number of apps? Let us find out.
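One way to get the values for the two axes is value_counts(), which counts the apps per category and sorts them in descending order. The categories below are made up, and the plotting call itself is left out of this sketch.

```python
import pandas as pd

# Made-up category labels, one per app.
categories = pd.Series(
    ["FAMILY", "GAME", "FAMILY", "TOOLS", "GAME", "FAMILY"], name="Category"
)

# Number of apps per category, largest first.
category_counts = categories.value_counts()

x = category_counts.index    # category names
y = category_counts.values   # number of apps in each category
print(list(x), list(y))
```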

We have defined our x and y axis. Let us plot and see:

There are 33 categories in total in the dataset. From the above output we can conclude that most of the apps in the Play Store fall under the Family and Game categories, and the fewest under the Beauty and Comics categories.

Which Content Rating category has the most apps on the Play Store?

From the above plot, we can see that the Everyone category has the highest number of apps.

Let’s have a look at the distribution of the ratings of the data frame.

From the above graph, we can conclude that most of the apps in the Google Play Store are rated between 3.5 and 4.8.

Let’s plot a chart to see what portion of the apps in the Play Store are paid and what portion are free.

From the above graph, we can see that approximately 92% of the apps in the Google Play Store are free and approximately 8% are paid.
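The shares behind such a chart can be computed with value_counts(normalize=True). The sketch below uses ten made-up apps and leaves the plotting out, so its percentages are not those of the real dataset.

```python
import pandas as pd

# Ten made-up apps: nine free, one paid.
types = pd.Series(["Free"] * 9 + ["Paid"], name="Type")

# normalize=True returns fractions; multiply by 100 for percentages.
type_share = types.value_counts(normalize=True) * 100
print(type_share)
```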

Which category of apps has the most installs?

To answer this question, we need to create a separate data frame out of our googlestore_df data frame, containing the Installs values grouped by Category.
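A sketch of such a grouped frame, assuming the installs are summed per category. The small googlestore_df below is a made-up stand-in, and the name category_installs is hypothetical.

```python
import pandas as pd

# Made-up stand-in for the cleaned data frame.
googlestore_df = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "FAMILY"],
    "Installs": [1000, 5000, 100, 2000],
})

# Total installs per category, largest first.
category_installs = (
    googlestore_df.groupby("Category", as_index=False)["Installs"]
    .sum()
    .sort_values("Installs", ascending=False)
)
print(category_installs)
```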

Now, let us plot it out:

From the above visualization, it can be interpreted that the top categories with the highest installs are Game, Family, Communication, News & Magazines, & Tools.

We have done a fair amount of exploratory data analysis so far and are now ready to answer some of the most common and in-demand questions that every app developer or business would love to know.

Asking and Answering Questions

With the help of data analysis, we can answer many questions that can’t be answered just by looking at the dataset. By querying a dataset and understanding the patterns and the rise and fall of its values, we can draw many conclusions and get insightful information from it. So let's start.

What are the Top 10 installed apps in any category?

We want to be able to answer this not just for a single category but for any of them, so we will define a function that returns a plot of the top 10 installed apps for whichever category name the user passes as an argument.
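The data-selection part of such a function might look like the sketch below. The function name top_installed_apps is hypothetical, the sample rows are made up, and the plotting step is left out so that the snippet stays self-contained.

```python
import pandas as pd


def top_installed_apps(df, category, n=10):
    """Return the n most-installed apps of one category, largest first."""
    in_category = df[df["Category"] == category]
    return in_category.nlargest(n, "Installs")[["App", "Installs"]]


# Made-up stand-in rows.
sample_df = pd.DataFrame({
    "App": ["Kick A", "Kick B", "Puzzle C", "Kick D"],
    "Category": ["SPORTS", "SPORTS", "GAME", "SPORTS"],
    "Installs": [5000, 100000, 70000, 1000],
})

top_sports = top_installed_apps(sample_df, "SPORTS")
print(top_sports)
```

The returned frame can then be handed to whichever plotting call you prefer, with App on one axis and Installs on the other.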

After we are done with defining the function, it’s time to check and see if everything is working fine. So let’s test it by passing Sports category to the above-defined function.

From the above graph, we can see that in the Sports category FIFA Soccer and Dream League Soccer 2018 have the highest installs. In the same way, by passing different category names to the function, we can get the top 10 installed apps of any category.

Which are the 10 most expensive apps in the Play Store?

We will again need to create a separate data frame.

From the above data frame, we will need to drop one app, because its name would create a mess in the plot.

So, finally, let’s plot and visualize the top 10 paid apps on the Play Store.

From the above graph, we can see that the app I am rich is the most expensive app in the Google Play Store, followed by I am Rich Premium. We also had to drop one row for this visualization because the app's name was in Chinese and it was messing up the pie chart.

Which are the Apps with the highest number of reviews?

From the above data frame, we can conclude that apps like Clash of Clans, Subway Surfers, Clash Royale, and Candy Crush Saga have the highest number of reviews on the Google Play Store.

What is the count of apps in different genres?

By creating a data frame, let’s define our x and y axis, which will be required for plotting the graph.

Finally, we are ready to plot and answer the question we raised.

From the above visualization, we can see that the highest number of apps is found in the Tools and Entertainment genres, followed by Education, Medical, and many more.

The last question which we are going to answer is:

Which apps have made the highest earnings?

To answer this question, we will need to perform some extra operations on the data frame: we will create a separate data frame and then multiply the Price column by the Installs column to get the earnings of each app. So, let's start the process.

Now from the above data frame, we will need to separate out the columns which we will require.

We can now add a separate column, Earnings, to our new data frame, which we create by multiplying the two columns Price and Installs.

Now let us sort the above data by Earnings and Price .
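These steps can be sketched as follows on made-up data. The name earnings_df is hypothetical, and, as in the article, every install is treated as one sale at the listed price.

```python
import pandas as pd

# Made-up stand-in for the cleaned data frame.
googlestore_df = pd.DataFrame({
    "App": ["Free app", "Cheap app", "Pricey app"],
    "Category": ["TOOLS", "GAME", "FINANCE"],
    "Price": [0.0, 2.0, 10.0],
    "Installs": [1_000_000, 5_000, 100],
})

# Keep only the columns needed for this question.
earnings_df = googlestore_df[["App", "Price", "Installs"]].copy()

# Earnings estimate: price multiplied by number of installs.
earnings_df["Earnings"] = earnings_df["Price"] * earnings_df["Installs"]

# Highest earnings first; Price breaks ties.
earnings_df = earnings_df.sort_values(["Earnings", "Price"], ascending=False)
print(earnings_df)
```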

Finally, we can plot the graph and find out which apps have the highest earnings.

The top five apps with the highest earnings on the Google Play Store are:

  • I am Rich
  • I am Rich Premium
  • Hitman Sniper
  • Grand Theft Auto: San Andreas
  • Facetune - For Free

We have finally come to the end of our analysis, and I hope that if you have read this far, it has been interesting or useful to you.

Inferences and Conclusion

After analyzing the dataset, we have answers to some serious and interesting questions that any Android user would love to know.
The original dataset can be found here: Link

  • Top categories on Google Playstore?
  • Which category of Content is found more?
  • Distribution of the ratings of the apps?
  • What percentage of apps are Free and Paid?
  • Which category of apps has the most number of installs?
  • What are the Top 10 installed apps in different categories?
  • Which are the most expensive apps?
  • Which are the Apps with the highest number of reviews?
  • Count of Apps found in different genres?
  • Which apps have made the highest earnings?

You can also get answers to some more questions in the notebook, which can be found here: Link

