Data Cleaning and Preprocessing with Python

My one-month journey as a data science SCA-c4 mentee

Florence Egwu
6 min read · Mar 28, 2021

Sometime last year, I started craving to become a data scientist the way you would crave your favorite meal. But it was hard to find the time to actually begin. Thanks to the Coronavirus lockdown in my country, I was able to use the family laptop at home to start researching and learning the fundamentals of Python and Microsoft Excel spreadsheets while school was on hold.

However, self-learning comes with a truckload of challenges, such as low motivation, a lack of curated resources, and so on. I couldn't really do much on my own, so I decided to jump on the opportunity offered by SheCodeAfrica through her cohort 4 mentee program in data science earlier this year. It's a three-month program running from March 1st, 2021 to May 31st, 2021.

So far, it's been 4 weeks (one month) of intensive learning and mentoring. This month, we learnt the foundational topics in data science, including programming with Python, mathematics, and data analysis using Python packages and tools.

In this article, we will apply the things I have learnt in the past month to cleaning and preprocessing datasets my mentor, Precious Kolawole, has rightly described as very dirty.


About the dataset

The data comes from a digital lending company that prides itself on its effective use of credit risk models to deliver profitable and high-impact loan alternatives. Its assessment approach is based on two main risk drivers of loan default prediction:

1) willingness to pay and

2) ability to pay.

Since not all customers pay back, the company has invested in us (inexperienced data scientists) to build robust models that effectively predict the odds of repayment.

We are to predict whether a loan was good or bad, i.e., accurately predict a binary outcome variable, where Good is 1 and Bad is 0.

But at the moment, we are not building any model. All we will do is clean and process the data into something an algorithm can take in.

Importing the datasets and useful libraries

For this task, we were provided with 3 datasets: traindemographics.txt, trainperf.txt and trainprevloan.txt.

First, we import all necessary libraries and packages for the entire process. Some of them include:

  1. Pandas: This is a high-level data manipulation tool in Python developed to provide fast, flexible, and expressive data structures. It is designed to make working with structured (tabular, multidimensional, etc.) and time series data both easy and intuitive. The pandas package is one of the most important tools at the disposal of data scientists working with Python, and it is built on top of NumPy.
  2. NumPy: This is a Python library whose name stands for 'Numerical Python'. It is used for working with arrays in core scientific computing. Arrays are used very frequently in data science because speed and resources are very important, and an array is much faster than a list.
  3. Matplotlib: This is an amazing visualization library in Python used for creating static, animated, and interactive 2D plots of arrays. Pyplot is a Matplotlib module which provides a MATLAB-like interface.
  4. Datetime Module: This is a module in Python that offers functions and classes for parsing, formatting, and working with dates and times.
  5. Seaborn: This is a Python data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.
Python libraries for data processing
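In code, the imports look roughly like this (the pd/np/plt/sns aliases are the usual conventions):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime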

To load the dataset, we use:

pandas.read_csv("filepath")

We can load each file and use the .info() method to view the properties of each dataset.

Reading csv file with pandas
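A sketch of the loading step (the variable names are my own, and I am assuming the .txt files are comma-separated; pass a different sep to read_csv if not):

# load the three datasets
demographics = pd.read_csv("traindemographics.txt")
performance = pd.read_csv("trainperf.txt")
prev_loans = pd.read_csv("trainprevloan.txt")

# inspect column names, datatypes and non-null counts
demographics.info()
performance.info()
prev_loans.info()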

Merging the datasets

Before merging the datasets, it is important that we fully understand the problem we are trying to solve. This understanding helps us intuitively pick which columns will be truly useful in model building. On careful consideration and observation of the datasets, our trainperf.txt dataset seems to contain the most useful columns. We can then pick out columns from the other sets and merge them with trainperf.txt.

Our understanding of functions becomes really useful here, as we create a simple function called "df_merger". The function takes 2 required arguments, "df" and "cols", where df is a data frame and cols is a list of columns to pull in from the other datasets. The function simply loops through the list, merges each column into the data frame using the .merge() method, and finally returns a dataset "df" on which we will perform further processing and analysis.
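Here is a minimal sketch of what such a helper could look like. The "customerid" join key is an assumption, and I am treating each item in cols as a slice of another dataset that carries the key plus one column of interest:

def df_merger(df, cols, key="customerid"):
    # cols: a list of dataframe slices, each containing the join key
    # plus one column to add, e.g. demographics[["customerid", "bank_account_type"]]
    for col_df in cols:
        df = df.merge(col_df, on=key, how="left")
    return df

# hypothetical usage: pull two columns from the demographics set into trainperf
df = df_merger(performance,
               [demographics[["customerid", "bank_account_type"]],
                demographics[["customerid", "birthdate"]]])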

We can then view our merged dataset (e.g., with df.head()) to see what it looks like.

Formatting Data types

Some of the features, such as the dates, are presented in the wrong datatype. We convert these columns to suitable datatypes. We can also convert the dates to timestamps (i.e., the time-in-seconds equivalent of the dates). Again, we apply our knowledge of functions to keep things simple and reduce unnecessary repetition.

# code snippet for converting specified column to datetime
df[col] = pd.to_datetime(df[col], format=timfmt, errors='coerce')

# code snippet for converting specified column from datetime to timestamp
df[col + "_ts"] = df[[col]].apply(lambda x: x[0].timestamp(), axis=1).astype(float)
A simple function that converts a data frame column to datetime and timestamp
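Putting the two snippets together, the helper might look like this (the function name and default format string are my own assumptions; the notna guard keeps values coerced to NaT from raising an error):

def to_datetime_and_ts(df, col, timfmt="%Y-%m-%d %H:%M:%S"):
    # convert the column to datetime, coercing unparseable values to NaT
    df[col] = pd.to_datetime(df[col], format=timfmt, errors='coerce')
    # add the time-in-seconds equivalent, leaving NaT rows as NaN
    df[col + "_ts"] = df[col].map(lambda d: d.timestamp() if pd.notna(d) else float("nan"))
    return df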

We also explore the statistical properties of our data using the .describe() and .corr() methods.
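For example:

df.describe()   # count, mean, std, min, quartiles and max for numeric columns
df.corr()       # pairwise correlations (newer pandas versions may need numeric_only=True)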

Handling Missing Values

Our data, like much real-world data, probably contains missing values. It is important that we find and handle these properly, as they could affect the accuracy of the model's predictions. There are several ways of handling missing values; these include:

  • dropping the missing values completely.
  • replacing missing values with the mean, mode or median
  • filling the missing values by interpolation

In our case, we will replace the categorical missing values with the mode of each variable and the numerical missing values with the median. This is because we do not want to lose any important data by dropping missing values.
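A minimal sketch of that strategy, assuming the object-dtype columns are the categorical ones:

def fill_missing(df):
    # fill categorical NaNs with the mode, numerical NaNs with the median
    for col in df.columns:
        if not df[col].isna().any():
            continue
        if df[col].dtype == "object":                   # categorical column
            df[col] = df[col].fillna(df[col].mode()[0])
        elif pd.api.types.is_numeric_dtype(df[col]):    # numerical column
            df[col] = df[col].fillna(df[col].median())
    return df

df = fill_missing(df)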

Handling duplicates

Sometimes, our data could contain duplicates. It is important that we find and remove these duplicates, as they could mess with our outcome. We can check for duplicates with:

df.duplicated()

Fortunately, our data contains no duplicates. However, if we ever did encounter duplicates, we can easily get rid of them using the syntax below:

df.drop_duplicates(subset=None, keep='first', inplace=False)

Bivariate Analysis of the Data (visualization)

We can visualize the features in pairs to see if there is any form of relationship between them using Matplotlib and Seaborn. Check out my notebook to see some of the visualizations performed.
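For instance, a quick look at how loan size relates to the outcome (the column names here are guesses at the dataset's schema; swap in the actual ones from the merged frame):

# hypothetical column names; replace with the real ones
sns.boxplot(data=df, x="good_bad_flag", y="loanamount")
plt.title("Loan amount by loan outcome")
plt.show()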

Convert Categorical Data to Numerical

Many machine learning algorithms cannot handle categorical data; they are only able to interpret numerical features. Therefore, it is important that we convert all our categorical values to numeric values. There are many methods by which we can achieve this, though it should be noted that each encoding method comes with its downsides. One-hot encoding and label encoding are the most popular methods.

Here, we will use label encoding, as we do not want to create additional columns the way one-hot encoding does.

Label encoding of categorical values using Pandas
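A minimal pandas-only sketch; treating every object-dtype column as categorical is an assumption:

# encode each object-dtype (categorical) column as integer codes
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes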

We can now consider our data fairly clean and ready to be received by a machine learning algorithm. Depending on the algorithm we choose for our modelling, though, it might be necessary to normalize and scale our data. We will talk about normalization, scaling, and feature engineering in a later article, as I do not wish for this article to be longer than it already is.

I hope you enjoyed wrangling and cleaning this data with me. It's been an amazing and educative journey thus far on the SheCodeAfrica Cohort 4 mentee program. I look forward to completing more projects and delving deeper into the world of data science in the next few months.

To learn more about the libraries or concepts mentioned here, you can check out some of these resources. I found them pretty helpful:

  1. https://www.udacity.com/course/introduction-to-python--ud1110
  2. https://www.dataquest.io/blog/learn-statistics-probability-data-science-course/
  3. https://seeing-theory.brown.edu/
  4. https://www.youtube.com/watch?v=8JfDAm9y_7s&t=509s
  5. https://www.youtube.com/watch?v=fwWCw_cE5aI&list=PLVlY_7IJCMJeRfZ68eVfEcu-UcN9BbwiX&index=2
  6. https://www.youtube.com/watch?v=yZTBMMdPOww&t=1124s
  7. https://www.pluralsight.com/guides/data-wrangling-pandas
  8. https://www.pluralsight.com/courses/pandas-data-wrangling-machine-learning-engineers

I would like to interact with you more often; how about checking out my Twitter and LinkedIn handles? You should click the clap button if you appreciate the effort I put into this post, and leave feedback or a question in the comments. Also, don't take this all in alone; let your buddies know what concepts you are exploring by sharing.


Florence Egwu

Data Scientist || Backend Python Developer || Food Scientist || Knowledge Enthusiast