Pandas - Cleaning Data

In the field of data science and analysis, the quality of your data often dictates the accuracy and reliability of your insights. Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It offers powerful data manipulation capabilities, making it an indispensable tool for data cleaning, exploration, and analysis tasks.

The Importance of Data Cleaning

Before diving into the analysis phase, it's crucial to ensure that the data is clean and structured. Data cleaning involves identifying and rectifying errors, inconsistencies, and missing values in the dataset. Neglecting this step can lead to inaccurate analysis, misleading insights, and flawed decision-making.

Common Data Cleaning Tasks with Pandas

1. Handling Missing Values

Missing data is a common issue encountered in datasets. Pandas provides various methods to handle missing values, including dropping missing values, filling missing values with a specific value or method (such as mean, median, or mode), and interpolation.

Note: By default, the dropna() method returns a new DataFrame, and will not change the original.

If you want to change the original DataFrame, use the inplace=True argument:

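# Option 1: drop rows that contain any missing value
df.dropna(inplace=True)

# Option 2: fill missing values in numeric columns with each column's mean
# (numeric_only=True skips text columns, which have no mean)
df.fillna(df.mean(numeric_only=True), inplace=True)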
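
The other strategies mentioned above, interpolation and filling with the mode, can be sketched as follows. This is a minimal illustration only: the DataFrame and its "Pulse" and "Kind" columns are made up for the example and are not part of any dataset used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "Duration": [60.0, np.nan, 45.0, 30.0],
    "Pulse": [110.0, 120.0, np.nan, 100.0],
    "Kind": ["run", None, "walk", "run"],
})

# Count missing values per column before choosing a strategy
missing_counts = df.isna().sum()
print(missing_counts)

# Linear interpolation fills a gap from its neighbouring values
df["Pulse"] = df["Pulse"].interpolate()

# Fill a categorical column with its most frequent value (the mode)
df["Kind"] = df["Kind"].fillna(df["Kind"].mode()[0])

print(df)
```

Interpolation generally only makes sense when the row order is meaningful (for example, a time series), so it is worth checking that assumption before using it.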

Replace Only For Specified Columns

To replace empty values in only one column, select that column by name. The example below fills missing values in "column1" with the number 130 and missing values in "column2" with that column's median:

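import pandas as pd

df = pd.read_csv('data.csv')

# Fill missing values in column1 with 130.
# Assigning the result back is more reliable than calling
# fillna(..., inplace=True) on a selected column.
df["column1"] = df["column1"].fillna(130)

# Fill missing values in column2 with the column's median
x = df["column2"].median()
df["column2"] = df["column2"].fillna(x)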
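
When several columns need different fill values, one option is to pass a dictionary to fillna(). The sketch below uses a small made-up DataFrame instead of data.csv so that it runs on its own; the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv
df = pd.DataFrame({
    "column1": [100.0, np.nan, 150.0],
    "column2": [1.0, 3.0, np.nan],
})

# Map each column name to the value that should replace its missing entries
fill_values = {
    "column1": 130,
    "column2": df["column2"].median(),
}

# fillna returns a new DataFrame; columns not in the dict are left untouched
df = df.fillna(value=fill_values)
print(df)
```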

2. Data Formatting

Cells with data of the wrong format can make it difficult, or even impossible, to analyze data. Pandas provides functions for converting data types, such as strings to datetime objects, and vice versa.

import pandas as pd

df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
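
Real files often contain values that cannot be converted at all. As a rough sketch, errors="coerce" turns unparseable dates into NaT (missing) so they can be handled like any other missing value. The data below is invented for illustration, and the "Date_text" column is a hypothetical name.

```python
import pandas as pd

# Hypothetical data with one bad date and numbers stored as text
df = pd.DataFrame({
    "Date": ["2020-12-01", "2020-12-02", "not a date"],
    "Duration": ["60", "45", "30"],
})

# Unparseable dates become NaT instead of raising an error
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce")

# Convert numeric strings to a numeric dtype
df["Duration"] = pd.to_numeric(df["Duration"])

# The reverse direction: datetime back to a formatted string
df["Date_text"] = df["Date"].dt.strftime("%Y-%m-%d")

print(df.dtypes)
print(df)
```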

3. Wrong Data

"Wrong data" does not have to mean "empty cells" or "wrong format"; it can simply be wrong. Sometimes you can spot wrong data just by looking at the data set, because you have an expectation of what the values should be.

Replacing Values

# If the value is higher than 120, set it to 120:

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.loc[x, "Duration"] = 120
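
The same capping can be written without an explicit loop by using a boolean mask, which is usually faster on large DataFrames. A minimal sketch with made-up "Duration" values:

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"Duration": [60, 450, 45, 120]})

# Select every row where Duration exceeds 120 and overwrite it in one step
df.loc[df["Duration"] > 120, "Duration"] = 120

# An equivalent alternative: df["Duration"] = df["Duration"].clip(upper=120)
print(df)
```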

Removing Rows

Another way of handling wrong data is to remove the rows that contain it.

for x in df.index:
  if df.loc[x, "Duration"] > 120:
    df.drop(x, inplace = True)
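
Row removal can also be expressed as a filter. The sketch below keeps every row that is not above the limit, which mirrors the loop (rows with a missing "Duration" are kept, because a comparison against NaN is False). The data is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({"Duration": [60.0, 450.0, np.nan, 120.0]})

# Keep the rows where Duration is NOT greater than 120
too_long = df["Duration"] > 120
df = df[~too_long].reset_index(drop=True)

print(df)
```

reset_index(drop=True) is optional; it renumbers the remaining rows so the index has no gaps.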

4. Removing Duplicates

Duplicate entries can skew analysis results and lead to incorrect conclusions. Pandas makes it easy to identify and remove duplicate rows from a DataFrame.

# Returns True for every row that is a duplicate, otherwise False:
print(df.duplicated())

# Remove duplicates
df.drop_duplicates(inplace = True)
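
By default, a row counts as a duplicate only if every column matches an earlier row. If duplicates should be judged on a few key columns instead, the subset and keep parameters can help. A small sketch, assuming made-up data where "Date" and "Duration" together identify a record:

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 share Date and Duration but differ in Pulse
df = pd.DataFrame({
    "Date": ["2020-12-01", "2020-12-01", "2020-12-02"],
    "Duration": [60, 60, 45],
    "Pulse": [110, 115, 100],
})

# Number of rows that repeat an earlier row in every column
full_dupes = int(df.duplicated().sum())

# Number of rows that repeat an earlier row in the key columns only
key_dupes = int(df.duplicated(subset=["Date", "Duration"]).sum())

# Keep the last occurrence of each key instead of the first
deduped = df.drop_duplicates(subset=["Date", "Duration"], keep="last")

print(full_dupes, key_dupes)
print(deduped)
```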

Best Practices for Data Cleaning with Pandas

  1. Understand Your Data: Before cleaning, thoroughly understand the structure, characteristics, and quirks of your dataset.

  2. Document Your Process: Keep track of the cleaning steps applied to maintain transparency and reproducibility.

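  3. Use Method Chaining: Leverage Pandas' method chaining to combine several cleaning steps into a single expression, which keeps the order of operations easy to read. Chaining relies on each method returning a new DataFrame, so it does not combine with inplace=True.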

  4. Handle Missing Values Strategically: Choose appropriate techniques for handling missing values based on the nature of the data and the analysis requirements.

  5. Visualize Data: Visualizing the data before and after cleaning can help identify patterns, outliers, and inconsistencies.
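
As an illustration of the method chaining practice, the cleaning steps from this post could be combined roughly as follows. This is a sketch under assumptions: the data and column names are invented, and the specific choices (capping at 120, filling with the median) are carried over from the earlier examples only for demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data, for illustration only
raw = pd.DataFrame({
    "Date": ["2020-12-01", "2020-12-01", "2020-12-02", None],
    "Duration": [60.0, 60.0, 450.0, 45.0],
    "Calories": [300.0, 300.0, np.nan, 200.0],
})

clean = (
    raw
    .drop_duplicates()                     # remove exact duplicate rows
    .dropna(subset=["Date"])               # drop rows with no date
    .assign(
        Date=lambda d: pd.to_datetime(d["Date"]),
        Duration=lambda d: d["Duration"].clip(upper=120),
        Calories=lambda d: d["Calories"].fillna(d["Calories"].median()),
    )
    .reset_index(drop=True)
)

print(clean)
```

Because every step returns a new DataFrame, raw is left unchanged, which makes it easy to compare the data before and after cleaning.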

Conclusion

In this post, we looked at the essential aspects of data cleaning using Pandas. We covered the most common tasks: handling missing values (by dropping or imputing them), fixing data formats, correcting or removing wrong values such as outliers, and removing duplicates. So, embrace Pandas, embrace data cleaning, and embark on your journey towards mastering the art of data analysis.