Using Machine Learning to Predict Diabetes Likelihood

“Good health is not something we can buy. However, it can be an extremely valuable savings account.” - Anne Wilson Schaef

Stay safe! Be careful! Do this, do that. Can we actually control everything that goes on in our lives? If your answer is no, then we probably think alike. However, we can control some things, especially those on the inside (our health).

The word diabetes has become increasingly common over the years. Diabetes is a common chronic disease that poses a great threat to human life. When a person is diabetic, their blood glucose level is higher than normal. This is caused by defective insulin secretion, impaired biological effects of insulin, or both. Diabetes can lead to chronic damage and dysfunction of various tissues, especially the eyes, kidneys, heart, blood vessels and nerves.

Diabetes is generally divided into two categories: type 1 diabetes (T1D) and type 2 diabetes (T2D). Patients with type 1 diabetes are usually younger, mostly under 30 years old. Typical clinical symptoms are increased thirst, frequent urination and high blood glucose levels. This type of diabetes cannot be treated effectively with oral medications alone, and patients require insulin therapy. Type 2 diabetes occurs more commonly in middle-aged and elderly people and is often associated with obesity, hypertension, dyslipidemia, arteriosclerosis, and other diseases.

Data Source

The data comes from a hackathon hosted ahead of the first 24-hour virtual Women in Data Science (WiDS) Worldwide Conference. The challenge focuses on patient health, with an emphasis on the chronic condition of diabetes, on the reasoning that healthcare workers around the world struggle with hospitals overloaded by patients in critical condition. Meanwhile, understanding a patient’s overall health is often difficult because verified medical histories are lacking for patients arriving at the Intensive Care Unit (ICU).

The way out is to build models that determine whether a patient admitted to an ICU has been diagnosed with a particular type of diabetes, Diabetes Mellitus, because knowledge of chronic conditions such as diabetes can inform clinical decisions about patient care and ultimately improve patient outcomes.

Here is how I did it

Data Analysis and Preprocessing

Before delving into building robust models and making predictions, it is important that we explore our data and understand what it contains.

Code snippet for importing the necessary libraries

The snippet above shows all the libraries and dependencies used for the analysis and modeling. We then load the dataset with the popular Python library pandas.
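The loading step can be sketched roughly as follows. This is only an illustration: the tiny inline CSV and its column names are made up for the example, and with the real competition file the call would point at the downloaded CSV instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the competition CSV (the real file has about 181 columns).
csv_text = """encounter_id,age,bmi,icu_type,glucose_apache,diabetes_mellitus
1,68,22.7,Med-Surg ICU,168,1
2,77,27.4,Med-Surg ICU,145,1
3,25,31.9,CTICU,,0
4,81,,Neuro ICU,185,0
"""

# With the real data this would be something like pd.read_csv("path/to/train.csv").
train = pd.read_csv(io.StringIO(csv_text))

# First checks: how big is the table, and how many values are missing per column?
print(train.shape)
print(train.isna().sum())
```

Looking at the shape and the per-column missing counts right after loading gives an early idea of how much cleaning the data will need.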

Our dataset is typical of real-life data: it contains about 181 columns (features), many of them with a large number of missing values.

Categorical columns in the dataset

Of the 181 columns, 6 contain categorical data and the remaining 175 are numerical.
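One way to separate the two kinds of columns is pandas' select_dtypes. The sketch below uses a small made-up frame with assumed column names, not the competition data.

```python
import pandas as pd

# Toy frame with assumed column names; the real data has about 181 columns.
df = pd.DataFrame({
    "age": [68, 77, 25],
    "bmi": [22.7, 27.4, None],
    "icu_type": ["Med-Surg ICU", "CTICU", "Neuro ICU"],
    "gender": ["M", "F", None],
})

# Numerical columns are those with a numeric dtype; everything else is
# treated as categorical in this sketch.
numerical_cols = df.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = df.select_dtypes(exclude=["number"]).columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```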

Filling Missing Values

The large number of columns with missing values prompted us to fill every missing value with the integer “0”.
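A minimal sketch of this filling step, assuming a pandas DataFrame and using made-up numerical columns for illustration:

```python
import pandas as pd

# Toy numerical columns with gaps (column names and values are illustrative).
df = pd.DataFrame({
    "bmi": [22.7, None, 31.9],
    "glucose_apache": [168.0, 145.0, None],
})

missing_before = int(df.isna().sum().sum())

# Replace every missing value with 0, as described above.
filled = df.fillna(0)

print("missing before:", missing_before)
print("missing after:", int(filled.isna().sum().sum()))
```

Filling with 0 is quick, but it treats "unknown" as a real measurement of zero, so it is worth keeping in mind when interpreting the model later.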

Data Conversion

Since our dataset contains 6 categorical columns, they need to be converted into numerical data that a machine learning model can understand. LabelEncoder can be used to normalize labels and to transform non-numerical labels (as long as they are hashable and comparable) into integer codes from 0 to n_classes - 1.

Label encoding of the categorical values
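As a small illustration of what LabelEncoder does, here it is applied to a handful of made-up ICU type values (the values are assumptions, not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative category values for a single column.
icu_type = ["Med-Surg ICU", "CTICU", "Neuro ICU", "Med-Surg ICU"]

encoder = LabelEncoder()
codes = encoder.fit_transform(icu_type)

# Classes are stored in sorted order, and each value is replaced by its index.
print(list(encoder.classes_))  # ['CTICU', 'Med-Surg ICU', 'Neuro ICU']
print(list(codes))             # [1, 0, 2, 1]

# The mapping can be reversed when the original labels are needed again.
print(list(encoder.inverse_transform(codes)))
```

With several categorical columns, one encoder per column would be fitted so that each column keeps its own mapping.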

Data Visualization

Remember the saying that a picture is worth a thousand words. Let us look for the relationships between the variables in our data visually.

Did you know that:

  • Patients admitted to the Medical_Surgery_ICU are more often confirmed to have diabetes than those in the other ICUs?
  • Patients admitted to the hospital from the Acute Care/Floor section have a higher percentage of Diabetes Mellitus?

However, the dataset is imbalanced: the target variable contains more patients without Diabetes Mellitus than patients confirmed to have it.

Note: “0” means no Diabetes Mellitus, while “1” means Diabetes Mellitus is confirmed.
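A quick way to see this kind of imbalance is to count the classes in the target column. The values below are made up purely to show the call; they are not the real class proportions.

```python
import pandas as pd

# Made-up target values: 0 = no Diabetes Mellitus, 1 = confirmed.
target = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="diabetes_mellitus")

counts = target.value_counts()                # absolute count per class
shares = target.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```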

Data Modeling

This is where we get our hands on machine learning: we train a model to predict whether a patient admitted to an ICU has been diagnosed with Diabetes Mellitus.

Since the dataset is large, with about 130,157 rows in the training data, we use StratifiedKFold to split it into training and validation sets. This is a cross-validation resampling procedure: the data is divided into a chosen number of groups (folds), and the class proportions are preserved in each fold. We set n_splits to 2 and the shuffle parameter to True.
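The splitting described above can be sketched as follows, on a small synthetic array rather than the real data. The random_state value is an assumption added only to make the example repeatable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 20 rows, 16 negatives and 4 positives (20% positive).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

# n_splits=2 and shuffle=True as in the article; random_state is an assumption.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

fold_positive_rates = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    rate = y[valid_idx].mean()
    fold_positive_rates.append(rate)
    print(f"fold {fold}: {len(train_idx)} train rows, "
          f"{len(valid_idx)} validation rows, positive rate {rate:.2f}")
```

Each validation fold keeps the same share of positive cases as the full data, which matters when the classes are imbalanced.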

After some hyperparameter tuning, we arrived at a set of parameters for our CatBoostClassifier that improved the accuracy of the model.

On checking our score, hurray: we achieved a classification accuracy of 86.6%. Pretty straightforward and interesting.
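For readers who want to see how a cross-validated accuracy like this is computed, here is a rough sketch. It is not the model from this article: scikit-learn's GradientBoostingClassifier stands in for CatBoostClassifier so the example needs no extra install, the data is synthetic, and the tuned parameters are not reproduced.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced data (roughly 80% negatives) in place of the ICU data.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Stand-in gradient boosting model; parameters here are assumptions.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Same splitting scheme as described earlier: 2 stratified, shuffled folds.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

# One accuracy value per validation fold, then their average.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()

print("fold accuracies:", scores)
print("mean accuracy:", round(mean_accuracy, 3))
```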

On submitting the test predictions to the Kaggle leaderboard, we scored 85.25%. Pretty good, but we would all want a higher level of accuracy in matters concerning our health.

Hence, there are several ways our model can be improved, including:

  • Proper model hyperparameter tuning with GridSearchCV or RandomizedSearchCV.
  • Feature scaling.
  • Dropping outliers.
  • Class balancing with upsampling, downsampling or SMOTE.
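Two of these ideas, feature scaling and grid search, can be combined in a single scikit-learn pipeline. The sketch below is only an illustration under stated assumptions: the data is synthetic, LogisticRegression is a simple stand-in model, and the parameter grid is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data in place of the ICU data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Scaling lives inside the pipeline, so it is fitted on the training part
# of each fold only and never sees the validation rows.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Arbitrary example grid over the regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

RandomizedSearchCV follows the same pattern but samples a fixed number of parameter combinations, which is usually cheaper on a dataset of this size.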

I am amazed at how much I have learnt in the space of 3 months. Before joining the SheCodeAfrica mentee program, machine learning skills felt like a power reserved for superheroes. With the help of a great mentor, Precious Kolawole, and my teammates, I have learnt and grown tremendously.

As this article marks the end of my mentee journey, I would like to thank SheCodeAfrica for giving me and other African women the opportunity to take part in the 4th industrial revolution and technological trends.

Data Scientist || Backend Python developer || Food scientist || Knowledge enthusiast