Exploratory Data Analysis (EDA) and Data Visualization with Python

This blog post covers techniques such as univariate and bivariate analysis, missing value treatment, building a correlation matrix, and outlier analysis, with code samples.

Introduction

There is so much data in today’s world. Modern businesses and academics alike collect vast amounts of data on myriad processes and phenomena. While much of the world’s data is still processed in Excel or even manually, newer data analysis and visualization tools allow for a deeper understanding. The programming language Python, with its English-like commands and easy-to-follow syntax, offers a powerful (and free!) open-source alternative to traditional techniques and applications.

Data analytics allows businesses to understand their efficiency and performance, and ultimately helps them make more informed decisions. For example, an e-commerce company might analyze customer attributes in order to display targeted ads and improve sales. Data analysis can be applied to almost any aspect of a business if one understands the tools available to process information.

Defining Exploratory Data Analysis

Exploratory Data Analysis (EDA) plays a critical role in understanding the what, why, and how of the problem statement. It is the first step a data analyst performs when handed a new data source and problem statement.

Here’s a direct definition: exploratory data analysis is an approach to analyzing data sets by summarizing their main characteristics, often with visualizations. The EDA process is a crucial step prior to building a model, because it uncovers insights that later become important in developing a robust algorithmic model.

Let’s try to break down this definition and understand different operations where EDA comes into play:

  • First and foremost, EDA provides a stage for breaking down problem statements into smaller experiments which can help understand the dataset
  • EDA provides relevant insights which help analysts make key business decisions
  • The EDA step provides a platform to run all thought experiments and ultimately guides us towards making a critical decision

Overview

This post introduces the key components of Exploratory Data Analysis, along with a few examples to get you started on analyzing your own data. We’ll cover the relevant theory and walk through sample code so that, ultimately, you can apply these techniques to your own data set.

The main objective of this introductory article is to cover how to:

  • Read and examine a dataset and classify variables by their type: quantitative vs. categorical
  • Handle categorical variables with numerically coded values
  • Perform univariate and bivariate analysis and derive meaningful insights about the dataset
  • Identify and treat missing values and remove dataset outliers
  • Build a correlation matrix to identify relevant variables
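
To make these objectives concrete, here is a minimal sketch of how each step might look. It assumes the pandas package, which this excerpt does not name, and a small made-up dataset: the column names (age, income, gender, segment), the values, the median fill, and the 1.5 × IQR outlier rule are all illustrative choices, not necessarily the ones used in the full post.

```python
import pandas as pd

# Hypothetical toy dataset; names and values are invented for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, None, 38, 29, 120],
    "income": [40_000, 52_000, 81_000, 90_000, 61_000, None, 45_000, 70_000],
    "gender": [0, 1, 1, 0, 1, 0, 0, 1],  # categorical, but numerically coded
    "segment": ["a", "b", "b", "c", "a", "a", "c", "b"],
})

# 1. Examine the dataset: column types as pandas inferred them.
print(df.dtypes)

# 2. Handle the numerically coded categorical variable by giving the codes
#    labels (the 0/1 meaning is an assumption), then classify the columns.
df["gender"] = df["gender"].map({0: "female", 1: "male"}).astype("category")
df["segment"] = df["segment"].astype("category")
quantitative = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(include="category").columns.tolist()

# 3. Univariate analysis: one variable at a time.
age_summary = df["age"].describe()
segment_counts = df["segment"].value_counts()

#    Bivariate analysis: a quantitative variable across a categorical one.
income_by_segment = df.groupby("segment", observed=True)["income"].mean()

# 4. Identify missing values, then fill numeric gaps with the column median
#    (one simple treatment among several possible ones).
missing_per_column = df.isna().sum()
df[quantitative] = df[quantitative].fillna(df[quantitative].median())

# 5. Remove outliers in "age" using the 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[in_range]

# 6. Correlation matrix over the quantitative variables.
corr = cleaned[quantitative].corr()
print(corr)
```

On a real dataset you would inspect each intermediate result (age_summary, missing_per_column, and so on) before deciding how to treat missing values and outliers; the choices above are only defaults for the sketch.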

Above all, we’ll learn about the important APIs of the Python packages that help us perform these EDA techniques.

Read the rest of the blog post here.

Written on November 15, 2018