When importing a new data set for the very first time, the first thing to do is to dive into the data and look for initial patterns and trends. This includes steps like:
- Determining the range of specific predictors.
- Identifying each predictor’s data type.
- Computing the number or percentage of missing values for each predictor.
- Establishing correlations between variables to surface some initial insights.
- Plotting some simple graphs.
- Looking at different descriptive statistics like min, max, median, average, etc.
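The steps above can be sketched with plain pandas. The tiny DataFrame below is a hypothetical stand-in for a real data set, just to make each step concrete:

```python
import pandas as pd

# Hypothetical toy data set standing in for a real one
df = pd.DataFrame({
    "age": [22, 38, None, 35],
    "fare": [7.25, 71.28, 8.05, None],
    "sex": ["male", "female", "female", "male"],
})

# Identify each predictor's data type
print(df.dtypes)

# Determine the range of a specific numeric predictor
print(df["age"].min(), df["age"].max())

# Percentage of missing values per predictor
print(df.isna().mean() * 100)

# Correlations between the numeric variables
print(df.select_dtypes("number").corr())

# Standard descriptive statistics (min, max, median, mean, ...)
print(df.describe())
```

Each of these is one line, but you end up retyping the same handful of lines for every new data set.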
These analysis steps are commonly part of EDA (Exploratory Data Analysis). In my opinion, data quality, description, shape, patterns, and relationships complete the EDA cycle.
The pandas library itself provides many extremely useful functions for EDA. However, you generally have to start with more general functions, like df.describe(). The functionality such functions provide is limited, and more often than not your initial EDA cycle looks very similar for each new data set and project. Since we do not want to waste our precious time on repetitive tasks, I recently searched for alternatives to make our lives easier.
Get all your standard data analysis done in 30 seconds or less: the magic of pandas-profiling.
pandas-profiling enables its users to quickly generate a broadly structured HTML file containing most of what you might need to know before diving into a more specific, individual data exploration. For each column, the following statistics - if relevant for the column type - are presented in an interactive HTML report:
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Correlations: highlighting of highly correlated variables; Spearman, Pearson, and Kendall matrices
- Missing values: matrix, count, heatmap, and dendrogram of missing values
(Source: https://pandas-profiling.github.io/pandas-profiling/docs/)
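The Spearman, Pearson, and Kendall matrices listed above correspond to what plain pandas computes with DataFrame.corr; a minimal sketch on a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical perfectly correlated columns for illustration
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})

# The three correlation methods pandas-profiling reports
for method in ("pearson", "spearman", "kendall"):
    print(method)
    print(df.corr(method=method))
```

pandas-profiling's value is that it runs all three for every numeric column pair and flags highly correlated variables for you.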
In the following paragraphs, I will apply pandas-profiling to the Titanic data set.
I used df.describe() for general descriptive statistics. While the output contains lots of information about the dataset, it doesn't tell you everything you might be interested in. For instance, you might read the per-column counts as the number of rows in the data frame, but describe() reports non-null counts, which differ for columns with missing values. If you want the true length, you have to add another line of code. While these computations are not very expensive, repeating them over and over again takes up time you could probably better use on other tasks.
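As a concrete illustration (using a small hypothetical frame rather than the Titanic data), describe() reports per-column non-null counts, so the true row count needs a separate call:

```python
import pandas as pd

# Five rows, two of them with a missing age
df = pd.DataFrame({"age": [22, 38, None, 35, None]})

# describe() shows count = 3 (non-null values), not the row count
print(df.describe())

# The actual number of rows requires an extra line of code
print(len(df))
```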
Now, let’s do the same analysis with pandas-profiling:
# installing the package inside Google Colab
!pip install pandas_profiling
While installing the package, make sure its dependencies are fulfilled.
import pandas as pd
import pandas_profiling

df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

%matplotlib inline
profile = pandas_profiling.ProfileReport(df)
profile
After running this single line of code, an HTML EDA report of your data will be created. The code displayed above creates an inline output of the result; however, you could also choose to save your EDA report as an HTML file to be able to share it more easily with others. I have taken a video of the report for better visuals.
Watch it at 2X speed
All in all, pandas-profiling provides some powerful features, especially if your main objective is either to get a quick, short understanding of your data or to share your initial EDA with others in a visual format. Nevertheless, it does not come close to fully automating EDA; you still have to do a lot of things yourself.
Our Other Post related to Machine Learning and Deep Learning:
- Top 5 Deep Learning Interview Questions
- A Complete Guide to Real-time Object Detection with TensorFlow
- How to use Google Colab (Free GPU for Deep Learning)
- Why Numpy Arrays are faster
- What is Machine Learning
Please tell us in the comments what we should cover in the next post.