Rodney Gitonga
Python Automation
June 2024 - 2025

Automated Data Cleaning Pipeline

Python · Pandas · OpenPyXL

Challenge Overview

The Problem

Producing the monthly sales report required manually cleaning and merging 15 Excel files from different branches, roughly six hours of work each month.

My Approach

Wrote a Python script using Pandas to iterate through the folder, read each Excel file, standardize column names, handle missing values, and merge them into a single master dataset ready for analysis.

💡

Key Lesson

"Automation not only saves time but drastically reduces human error in data preparation tasks."

The Bottleneck

At the end of every month, branch managers would email Excel files with varying formats (merged cells, different headers) to the central team. Collating this took a full working day.

The Script

I developed a Python script using the `pandas` library and the `os` module to:

  1. Scan: Iterate through a designated directory for .xlsx files.
  2. Normalize: Map different column headers (e.g., "CustID", "CIF", "Client ID") to a standard schema.
  3. Clean: Drop empty rows, fill missing numerical values with 0, and validate date formats.
  4. Consolidate: Append all clean dataframes into a master list and export to a single CSV.
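The four steps above can be sketched roughly as follows. The header mapping, column names, and file paths here are illustrative assumptions, not the production schema:

```python
import os
import pandas as pd

# Hypothetical header variants mapped to a standard schema (illustrative only).
COLUMN_MAP = {
    "CustID": "customer_id",
    "CIF": "customer_id",
    "Client ID": "customer_id",
}

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename known header variants to the standard schema."""
    return df.rename(columns=COLUMN_MAP)

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop fully empty rows, zero-fill numeric gaps, and parse dates."""
    df = df.dropna(how="all")
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(0)
    if "date" in df.columns:  # assumed date column name
        df["date"] = pd.to_datetime(df["date"], errors="coerce")
    return df

def consolidate(folder: str, out_csv: str) -> pd.DataFrame:
    """Scan a folder for .xlsx files, normalize and clean each, export one CSV."""
    frames = []
    for name in os.listdir(folder):
        if name.endswith(".xlsx"):
            raw = pd.read_excel(os.path.join(folder, name))
            frames.append(clean(normalize(raw)))
    master = pd.concat(frames, ignore_index=True)
    master.to_csv(out_csv, index=False)
    return master
```

Keeping `normalize` and `clean` as separate functions makes each step testable on its own, so a new branch format only requires extending `COLUMN_MAP` rather than touching the merge logic.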

Outcome

The process now takes 14 seconds, down from roughly six hours. The team uses the saved time for deeper analysis of the sales data rather than just preparing it.