Predicting Mortality — an approach towards Imbalanced Classification using CatBoost Classifier

Aayush Poddar
5 min read · Jun 2, 2021
Photo by National Cancer Institute on Unsplash

Ever wondered how ML tools can classify events like the ones below, where there is a severe imbalance and skew between the occurring classes:

  1. Banks recognizing faulty or fraudulent transactions, where most transactions fall on the legitimate end and only a very few turn out to be defective.
  2. A company inspecting its manufactured goods for defects, where it is quite unlikely for the firm to produce a majority of broken items.

This article attempts to address one of the most important concerns in applied classification, imbalanced class distribution, through a problem involving predicting the survival of a patient given his or her medical record. The competition was hosted on the Codalab platform under the title "To be or not to be": https://competitions.codalab.org/competitions/30715.

The challenge aims to devise an effective model that anticipates the survival of a patient based on his/her medical records, which carry information on heterogeneous diseases and medical conditions.

Background, Data Description, and Evaluation Metric

As is quite obvious from the problem statement, the number of people dying will hold only a very small share of total hospital admissions. Hence, this leads to an imbalanced classification situation. Class imbalance occurs when the instances of the different output classes are exceedingly disproportionate: to give you an idea, out of 10,000 occurrences, 19 might fall in the '1' class while the remaining 9,981 take the '0' class, creating a skewed distribution to begin with.

The training dataset contains information about 80,000 patients, represented by categorical, binary, and numerical features or variables. These features include age, gender, ethnicity, and marital status, as well as medical data such as blood pressure or glucose level. There are a total of 342 variables. The class (or label) to be predicted is a binary variable telling whether the patient died (represented as '1') or not (represented as '0') while in the hospital.

Accuracy, the fraction of correctly classified examples, is not a good measure for imbalanced classification; it can be quite misleading when the distribution is skewed. A model classifying every patient as "DIDN'T DIE" would score about 96% accuracy on this data while being clearly useless as a predictive model. The desirable metrics for such circumstances are the F1 score, the ROC AUC score (preferred by me), and the balanced accuracy score. The competition employed balanced accuracy as its evaluation criterion, given by the mean of sensitivity (true positive rate, or recall) and specificity (true negative rate, or 1 minus the false positive rate). To get more insight into balanced accuracy, visit https://statisticaloddsandends.wordpress.com/2020/01/23/what-is-balanced-accuracy/#:~:text=Balanced%20accuracy%20is%20a%20metric,the%20presence%20of%20a%20disease.
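As a quick illustration (a minimal sketch, not part of the competition code), balanced accuracy can be computed directly from a confusion matrix:

```python
from sklearn.metrics import confusion_matrix

def balanced_accuracy(y_true, y_pred):
    # For a binary problem, ravel() yields tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    return (sensitivity + specificity) / 2

# A model that always predicts "didn't die" scores only 0.5 here,
# even though its plain accuracy would look impressive
print(balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0]))  # 0.5
```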

Data Preprocessing:

Importing the necessary libraries and loading the training and testing datasets gives us a first glimpse of the data provided.
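A minimal sketch of this step (the file names here are assumptions, not the competition's exact ones):

```python
import pandas as pd

# Hypothetical file names; the competition provides its own data files
train_df = pd.read_csv('training_data.csv')
test_df = pd.read_csv('testing_data.csv')

print(train_df.shape)  # roughly (80000, 343): 342 features plus the label
train_df.head()        # a glimpse of the data provided
```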

We need to check the consistency of the training and testing datasets. Certain values of the categorical features present in the training dataset are absent from the testing dataset (e.g. the training dataset contains the language '*MAN' in the 'LANGUAGE' column, while no such instance exists in its testing counterpart).

On one-hot encoding the categorical variables, these extra values on the training side would produce separate columns that would be missing in the testing dataset. Hence, eliminating these records suffices.

Generating all records that need to be eliminated; once generated, the records can be dropped directly.
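One way to sketch this step (the exact notebook logic may differ):

```python
# Collect the indices of training rows whose categorical values
# never occur anywhere in the test set
cat_cols = train_df.select_dtypes(include='object').columns
rows_to_drop = set()
for col in cat_cols:
    unseen = set(train_df[col].dropna().unique()) - set(test_df[col].dropna().unique())
    rows_to_drop |= set(train_df.index[train_df[col].isin(unseen)])

train_df = train_df.drop(index=list(rows_to_drop))
```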
Next, all rows with NaNs are dropped from the training dataset. The test dataset is then checked for NaNs, which are filled with each column's mode value, since every test record still needs a prediction.
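A hedged sketch of both NaN-handling steps:

```python
# Training rows with missing values can simply be dropped...
train_df = train_df.dropna()

# ...but every test row needs a prediction, so check and fill instead
print(test_df.isna().sum().sum())                 # number of NaNs in the test set
test_df = test_df.fillna(test_df.mode().iloc[0])  # mode of each column
```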
Dropping unnecessary columns and One Hot Encoding categoricals
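A sketch of this step (the 'ID' column is a hypothetical example; the actual dropped columns come from the notebook):

```python
import pandas as pd

# Drop columns that carry no predictive signal (hypothetical example)
train_df = train_df.drop(columns=['ID'], errors='ignore')
test_df = test_df.drop(columns=['ID'], errors='ignore')

# One-hot encode the categoricals; after the earlier record filtering,
# train and test should now yield the same set of dummy columns
train_df = pd.get_dummies(train_df)
test_df = pd.get_dummies(test_df)
```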
Data and Feature Scaling
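A sketch under the assumption that the label column is named 'LABEL' (the real column name comes from the dataset):

```python
from sklearn.preprocessing import StandardScaler

# Separate the features from the binary target ('LABEL' is a placeholder name)
y = train_df['LABEL'].values
X = train_df.drop(columns=['LABEL'])

# Fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(test_df[X.columns])
```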
Splitting Data into Training and Validation Sets
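A minimal sketch of the split; stratifying on the label keeps the rare positive class represented in both parts:

```python
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42
)
```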

Modeling and Results

One very widely used practice for handling skewed classes is resampling the data by oversampling, undersampling, or a combination of both. There are numerous oversampling approaches, such as SMOTE and Random Over Sampler; these techniques adjust the disproportionate ratio between classes by introducing more minority-class instances. Undersampling, in contrast, simply discards occurrences of the majority class to balance the classes. However, no such technique is adopted in this study. XGBoost and other gradient boosting libraries have an inbuilt parameter called scale_pos_weight, which in my opinion is a more conservative choice than introducing and handling synthetic, non-existing data in such an extensively imbalanced scenario where ample records are already available.
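For readers who do want to experiment with resampling, a minimal sketch using the imbalanced-learn library (again, not used in this solution) could look like this:

```python
from imblearn.over_sampling import SMOTE

# Synthesize new minority-class samples until both classes are balanced
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
```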

The scale_pos_weight parameter is used to scale the gradient for the positive class. Citing an example, for a dataset with a 1 to 100 ratio for instances in the minority to majority classes, the scale_pos_weight can be set to 100. This will give classification errors made by the model on the minority class (positive class) 100 times more impact, and in turn, 100 times more correction than errors made on the majority class.

In our case, the ratio of majority (class '0') to minority (class '1') instances is about 27 (77158/2796). Therefore, the scale_pos_weight parameter takes the value 27.
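Rather than hard-coding it, the value can be derived from the training labels (a small sketch):

```python
# scale_pos_weight = (# negative examples) / (# positive examples)
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
print(round(scale_pos_weight))  # about 27 for this dataset (77158 / 2796)
```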

Keeping in mind the large number of features and records at our disposal, implementing the CatBoost classifier seems a beneficial option. CatBoost is a boosting algorithm with exceptional handling of categorical data. It is a robust computational recipe that offers easy implementation and fast training and prediction. CatBoost can improve model performance, reduce overfitting, and narrow down the need for extensive hyper-parameter tuning. Though I have introduced parameters into the CatBoost pipeline, the values used have simply worked well for me in the past; the default parameters ordinarily produce great results too. The scale_pos_weight parameter, however, needs to be assigned a relevant value for imbalanced classification. To gain more understanding of CatBoost, https://dataaspirant.com/catboost-algorithm/ and the official CatBoost documentation might help.
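A sketch of the model setup; the hyper-parameter values below are illustrative assumptions, not the exact ones from my notebook:

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=1000,       # illustrative values; the defaults also work well
    learning_rate=0.05,
    depth=6,
    scale_pos_weight=27,   # the one parameter that must reflect the imbalance
    random_seed=42,
    verbose=200,
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))
```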

Final results: balanced accuracy on the validation set.
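Evaluating the fitted model on the held-out validation set, using scikit-learn's built-in metric (a minimal sketch):

```python
from sklearn.metrics import balanced_accuracy_score

val_preds = model.predict(X_val)
print(balanced_accuracy_score(y_val, val_preds))  # about 0.7254 in this study
```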

The approach discussed above delivered a satisfactory balanced accuracy score of 0.7254 on the held-out validation set. The predictions were then made on the testing data, and the CatBoost classifier model achieved a decent balanced accuracy score of 0.74. My submission placed 10th on the public leaderboard, falling short of the leaderboard topper by a mere 2%.

The entire Python source code is provided in my GitHub repo: https://github.com/poddaraayush14/Mortality_Prediction-Codalab-

Follow for more insights on Machine Learning and Data Science

Happy Learning !!!
