Automated Data Cleaning Pipeline
Challenge Overview
The Problem
Producing the monthly sales report required manually cleaning and merging 15 Excel files from different branches, consuming about 6 hours of work each month.
My Approach
I wrote a Python script using `pandas` to iterate through the folder, read each Excel file, standardize the column names, handle missing values, and merge everything into a single master dataset ready for analysis.
Key Lesson
"Automation not only saves time but drastically reduces human error in data preparation tasks."
The Bottleneck
At the end of every month, branch managers would email Excel files with varying formats (merged cells, different headers) to the central team. Collating this took a full working day.
The Script
I developed a Python script using the `pandas` and `os` libraries to:
- Scan: Iterate through a designated directory for `.xlsx` files.
- Normalize: Map different column headers (e.g., "CustID", "CIF", "Client ID") to a standard schema.
- Clean: Drop empty rows, fill missing numerical values with 0, and validate date formats.
- Consolidate: Append all cleaned DataFrames into a master list and export them as a single CSV.
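The steps above can be sketched roughly as follows. The header variants in `COLUMN_MAP`, the function names, and the folder/output paths are illustrative assumptions, not the original script:

```python
from pathlib import Path
import pandas as pd

# Header variants seen across branch files, mapped to one standard schema.
# These specific names are illustrative assumptions.
COLUMN_MAP = {"CustID": "customer_id", "CIF": "customer_id", "Client ID": "customer_id"}

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize headers, drop empty rows, fill missing numerics with 0."""
    df = df.rename(columns=COLUMN_MAP)             # map headers to the standard schema
    df = df.dropna(how="all")                      # drop rows that are entirely empty
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].fillna(0)          # fill missing numerical values with 0
    return df

def build_master(folder: str) -> pd.DataFrame:
    """Scan a folder for .xlsx files, clean each one, and consolidate them."""
    frames = [clean_frame(pd.read_excel(p)) for p in Path(folder).glob("*.xlsx")]
    return pd.concat(frames, ignore_index=True)

# Hypothetical usage:
# build_master("monthly_reports").to_csv("master_sales.csv", index=False)
```

Keeping the per-file cleaning in its own function makes it easy to unit-test the header mapping and missing-value rules on small in-memory DataFrames, without needing real Excel files.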
Outcome
The process now takes 14 seconds. The team uses the saved time for deeper analysis of the sales data rather than just preparing it.