Exploratory Data Analysis (EDA) in Python:

Exploratory Data Analysis (EDA) is a crucial step in data science workflows. It involves summarizing datasets, identifying patterns, detecting anomalies, and gaining insights to prepare data for modeling. This post covers essential and advanced EDA techniques using Python, complete with code snippets and explanations.

Why Perform EDA?

  1. Understand Data Structure: Identify data types, shape, and size.
  2. Spot Anomalies: Find missing values, outliers, or inconsistent entries.
  3. Generate Hypotheses: Understand trends and relationships.
  4. Validate Assumptions: Ensure your dataset aligns with business objectives.

Sample Dataset

We’ll use a sample dataset for demonstration:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Alice', 'Frank'],
    'Age': [25, 30, 35, 40, None, 25, 50],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Chicago', None, 'Seattle'],
    'Salary': [70000, 80000, 120000, 90000, 60000, 70000, 110000],
    'Experience': [2, 5, 8, 10, 1, 2, 15]
}

df = pd.DataFrame(data)

The DataFrame:

Name     Age   City         Salary  Experience
Alice    25.0  New York     70000   2
Bob      30.0  Los Angeles  80000   5
Charlie  35.0  New York     120000  8
David    40.0  Chicago      90000   10
Eva      NaN   Chicago      60000   1
Alice    25.0  None         70000   2
Frank    50.0  Seattle      110000  15

1. Basic Dataset Overview

Dataset Structure

df.info()

This shows the number of non-null entries, data types, and memory usage:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         7 non-null      object
 1   Age          6 non-null      float64
 2   City         6 non-null      object
 3   Salary       7 non-null      int64  
 4   Experience   7 non-null      int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 408.0+ bytes
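
Age is reported as float64 even though every age is a whole number, because the missing entry is stored as NaN and NaN forces a float column. As a minimal sketch, assuming you would rather keep integer ages, pandas' nullable Int64 dtype can hold integers alongside a missing value. The name ages_int is only for illustration.

```python
import pandas as pd

# Age column from the sample dataset; None becomes NaN, so pandas infers float64
ages = pd.Series([25, 30, 35, 40, None, 25, 50], name='Age')
print(ages.dtype)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
ages_int = ages.astype('Int64')
print(ages_int.dtype)
print(ages_int.isna().sum())  # the missing entry is still counted as missing
```

Whether the conversion is worthwhile depends on the downstream library, since not every tool accepts pandas' nullable dtypes.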

Dataset Size

df.shape

Output:

(7, 5)

The dataset has 7 rows and 5 columns.


2. Missing Values

Check for Missing Values

df.isnull().sum()

Output:

Name          0
Age           1
City          1
Salary        0
Experience    0
dtype: int64
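
Raw counts are hard to compare across datasets of different sizes. A small sketch, assuming a percentage view is more useful to you, takes the mean of the boolean mask instead of the sum. The name missing_pct is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, None, 25, 50],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Chicago', None, 'Seattle'],
    'Salary': [70000, 80000, 120000, 90000, 60000, 70000, 110000],
})

# The mean of a True/False mask is the fraction of missing entries per column
missing_pct = (df.isnull().mean() * 100).round(1)
print(missing_pct)
```

Here Age and City each have one missing value out of seven rows, so both come out at roughly 14.3 percent.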

Visualize Missing Data

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False, cmap="viridis")
plt.show()

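This creates a heatmap with one cell per DataFrame entry. With the viridis colormap, missing values appear as light (yellow) cells against a dark background, so gaps in Age and City are easy to spot.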


3. Summarizing Data

Descriptive Statistics

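df.describe()

Output (rounded to two decimals):

       Age    Salary     Experience
count  6.00   7.00       7.00
mean   34.17  85714.29   6.14
std    9.70   22253.95   5.15
min    25.00  60000.00   1.00
25%    26.25  70000.00   2.00
50%    32.50  80000.00   5.00
75%    38.75  100000.00  9.00
max    50.00  120000.00  15.00

Age has a count of 6 because describe() skips the missing value.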
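
By default describe() reports only numeric columns when a DataFrame has any. As a hedged sketch, selecting just the text columns first makes describe() return count, unique, top, and freq instead. The name text_summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Alice', 'Frank'],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Chicago', None, 'Seattle'],
})

# On a text-only selection, describe() switches to categorical statistics
text_summary = df[['Name', 'City']].describe()
print(text_summary)
```

Note that City has two cities tied for the most frequent value, so which one is reported as top should not be relied on.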

4. Handling Duplicates

Detect and Count Duplicates

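df.duplicated().sum()

Output:

0

duplicated() compares entire rows, so the two Alice rows are not flagged: one has City 'New York' and the other has a missing City. Passing subset=['Name', 'Age', 'Salary', 'Experience'] to duplicated() ignores City and flags 1 row.

Remove Duplicates

df.drop_duplicates(inplace=True)

Because the sample data contains no exact duplicates, this call leaves the DataFrame unchanged.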
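
Rows can also repeat in every column except one. A minimal sketch, assuming Name, Age, Salary, and Experience are enough to identify a person (an assumption for illustration, not something the dataset guarantees), uses subset with keep=False to list every row in a matching group so you can inspect them before dropping anything. The names key_cols and near_dupes are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Alice', 'Frank'],
    'Age': [25, 30, 35, 40, None, 25, 50],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Chicago', None, 'Seattle'],
    'Salary': [70000, 80000, 120000, 90000, 60000, 70000, 110000],
    'Experience': [2, 5, 8, 10, 1, 2, 15],
})

# Columns assumed to identify a record; City is left out on purpose
key_cols = ['Name', 'Age', 'Salary', 'Experience']

# keep=False marks every member of a duplicate group, not just the repeats
near_dupes = df[df.duplicated(subset=key_cols, keep=False)]
print(near_dupes)
```

Reviewing the matched rows first helps you decide which copy to keep, for example the one with a known City.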

5. Correlation Analysis

Correlation Matrix

df.select_dtypes(include=['number']).corr()
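
Output (rounded to three decimals):

            Age    Salary  Experience
Age         1.000  0.737   0.998
Salary      0.737  1.000   0.821
Experience  0.998  0.821   1.000

Pairs that involve Age use only the six rows where Age is present. Age and Experience are almost perfectly correlated in this sample.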

Heatmap for Correlation

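# Restrict to numeric columns; df.corr() raises an error on text columns in pandas 2.0+
sns.heatmap(df.select_dtypes(include=['number']).corr(), annot=True, cmap="coolwarm")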
plt.show()
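
corr() computes Pearson correlation by default, which measures linear association. As a hedged sketch, assuming you also want a rank-based view that is less sensitive to extreme values, the method parameter accepts 'spearman'. The names pearson and spearman are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, None, 25, 50],
    'Salary': [70000, 80000, 120000, 90000, 60000, 70000, 110000],
    'Experience': [2, 5, 8, 10, 1, 2, 15],
})

# Pearson: linear association (the default)
pearson = df.corr(method='pearson')

# Spearman: correlation of ranks, so it captures any monotonic relationship
spearman = df.corr(method='spearman')

print(pearson.round(3))
print(spearman.round(3))
```

With only seven rows, either coefficient is a rough indication at best.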

6. Outlier Detection

Using IQR Method

q1 = df['Salary'].quantile(0.25)
q3 = df['Salary'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['Salary'] < (q1 - 1.5 * iqr)) | (df['Salary'] > (q3 + 1.5 * iqr))]

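Output:
For this dataset, q1 is 70,000 and q3 is 100,000, so the IQR is 30,000 and the bounds are 25,000 and 145,000. Every salary falls inside that range, so outliers is an empty DataFrame.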
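
The same rule can be applied to every numeric column. The sketch below wraps it in a helper; iqr_bounds and its k parameter are hypothetical names introduced here, with k=1.5 matching the multiplier used above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences for a numeric Series using the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, None, 25, 50],
    'Salary': [70000, 80000, 120000, 90000, 60000, 70000, 110000],
    'Experience': [2, 5, 8, 10, 1, 2, 15],
})

# Count rows outside the fences for each column (missing values never match)
outlier_counts = {}
for col in df.columns:
    lower, upper = iqr_bounds(df[col])
    flagged = df[(df[col] < lower) | (df[col] > upper)]
    outlier_counts[col] = len(flagged)
    print(f"{col}: lower={lower}, upper={upper}, outliers={len(flagged)}")
```

None of the three columns has a value outside its fences in this sample.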


7. Categorical Analysis

Value Counts

df['City'].value_counts()

Output:

New York    2
Chicago     2
Los Angeles 1
Seattle     1
Name: City, dtype: int64
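
value_counts() leaves out missing values by default, which is why the counts above add up to 6 rather than 7. A minimal sketch, assuming you want the missing City counted as its own bucket and a per-city salary comparison, follows. The names counts and mean_salary are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Chicago', None, 'Seattle'],
    'Salary': [70000, 80000, 120000, 90000, 60000, 70000, 110000],
})

# dropna=False adds a row for the missing City
counts = df['City'].value_counts(dropna=False)
print(counts)

# Average salary per city; rows with a missing City are skipped by default
mean_salary = df.groupby('City')['Salary'].mean()
print(mean_salary)
```

With one or two rows per city, these averages are only a demonstration of the call, not a meaningful comparison.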

8. Feature Engineering

Normalize a Column

df['normalized_salary'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())

Create a New Column Using Lambda

df['Salary_Category'] = df['Salary'].apply(lambda x: 'High' if x > 85000 else 'Low')
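
For more than two categories, nested conditions in a lambda get hard to read. As a sketch, pd.cut can assign labelled bands in one call. The bin edges of 75,000 and 100,000 and the three labels are hypothetical values chosen for illustration.

```python
import pandas as pd

salary = pd.Series([70000, 80000, 120000, 90000, 60000, 70000, 110000], name='Salary')

# Assumed band edges: up to 75,000, up to 100,000, and everything above
bins = [0, 75000, 100000, float('inf')]
labels = ['Low', 'Medium', 'High']

# Each interval includes its right edge, e.g. exactly 75,000 would be 'Low'
salary_band = pd.cut(salary, bins=bins, labels=labels)
print(salary_band.tolist())
```

The result is a categorical column whose categories keep their Low, Medium, High order, which is convenient for sorting and plotting.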

9. Skewness Check

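df[['Age', 'Salary', 'Experience']].skew()

Output (rounded to three decimals):

Age           0.839
Salary        0.630
Experience    0.820
dtype: float64

The numeric columns are selected explicitly because df.skew() raises an error on text columns in pandas 2.0+. All three values are positive, which indicates a longer tail toward higher values.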
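
A common follow-up to a positive skew is a log transform. The sketch below, which assumes the column is strictly non-negative, compares the skewness of Salary before and after np.log1p. The names skew_before and skew_after are illustrative.

```python
import numpy as np
import pandas as pd

salary = pd.Series([70000, 80000, 120000, 90000, 60000, 70000, 110000], name='Salary')

# Skewness of the raw values
skew_before = salary.skew()

# log1p computes log(1 + x); it compresses large values more than small ones
skew_after = np.log1p(salary).skew()

print(f"before: {skew_before:.3f}, after: {skew_after:.3f}")
```

On this sample the transformed column is less skewed than the original, though with seven values the estimate is noisy.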

Conclusion

EDA is an essential process to explore data and uncover patterns, relationships, and anomalies. By combining these techniques, you’ll be better prepared to preprocess data and build robust models.

Let me know how you perform EDA and whether I missed any must-have techniques!

Visit https://subhadip.ca/blog/ for more topics.
