Binary classification: heart disease prediction – 7 ideas on how to start and improve your model
This experiment is based on the original Heart Disease Prediction experiment created by Weehyong Tok from Microsoft, one of the Healthcare Industry solutions. It uses a data set from the UCI Machine Learning Repository to train and test a model for heart disease prediction. We will use it as a starting point for 7 ideas on how to get started with and improve the Cortana Intelligence Gallery examples. Thanks, Weehyong, for creating and sharing your experiment!
You can download the template for this experiment and try it out yourself!
1. Check your data
Always check your data before you start!
Data Set Information
The original database contains 76 attributes, but ML researchers tend to use a subset of only 14. The “goal” field (num) refers to the presence of heart disease in the patient; it is integer-valued from 0 (no presence) to 4. In this experiment we attempt to distinguish presence (values 1, 2, 3, 4) from absence (value 0).
We used the processed Cleveland data because of a warning from the author.
Attribute Information:
- age
- sex
- cp: chest pain type
— Value 1: typical angina
— Value 2: atypical angina
— Value 3: non-anginal pain
— Value 4: asymptomatic
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: resting electrocardiographic results
— Value 0: normal
— Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
— Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
- thalach: maximum heart rate achieved
- exang: exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
— Value 1: upsloping
— Value 2: flat
— Value 3: downsloping
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
- num (predicted attribute): diagnosis of heart disease (angiographic disease status)
— Value 0: < 50% diameter narrowing
— Value 1: > 50% diameter narrowing
Note that this is quite an old data set (1988).
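The experiment itself runs in ML Studio, but it never hurts to inspect the raw data yourself first. Here is a minimal pandas sketch; the UCI file location and the use of “?” as the missing-value marker are assumptions based on the repository’s documentation of the processed Cleveland file:

```python
import pandas as pd

# The 14 attributes, in the order listed above
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# Processed Cleveland file from the UCI repository; missing values are "?"
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.cleveland.data")
df = pd.read_csv(url, header=None, names=columns, na_values="?")

print(df.shape)                   # 303 rows, 14 columns
print(df.isna().sum())            # missing values per column
print(df["num"].value_counts())   # distribution of the target (0-4)
```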
2. Make a decision about what to do with missing data
In this case, we are only missing 4 values of ca (the number of major vessels colored by fluoroscopy) and 2 values of thal. In the original sample, all missing values were substituted with -1, but there are many more options. Here we decide to replace them with the mode (the most frequently occurring value). Replacing the missing values with the mean would be strange in this case: thal is a categorical variable, and although I’m not a doctor, I think it would be hard to have 0.67 major vessels colored. This is also why we treat ca as a categorical variable.
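In ML Studio this is what the “Clean Missing Data” module is for. Continuing from the pandas sketch above, mode imputation could look like this:

```python
# Replace the few missing values of ca and thal with the mode
for col in ["ca", "thal"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Treat ca and thal as categorical rather than numeric,
# for the reasons discussed above
df["ca"] = df["ca"].astype(int).astype("category")
df["thal"] = df["thal"].astype(int).astype("category")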
3. Split
We use a stratified split so that the class distribution is the same in the training and test sets [read more on MSDN], and we set a random seed so the experiment is reproducible.
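Outside ML Studio, the “Split Data” module with the stratified option and a fixed seed corresponds roughly to the scikit-learn sketch below; the 70/30 ratio and the seed value are our own assumptions, not settings taken from the original experiment:

```python
from sklearn.model_selection import train_test_split

# Binarize the target: presence (1-4) vs. absence (0)
df["target"] = (df["num"] > 0).astype(int)

X = df.drop(columns=["num", "target"])
y = df["target"]

# stratify=y keeps the class ratio identical in both sets;
# the fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```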
4. Decide what to do with your data
As we have a lot of categorical variables, we convert them into dummy variables. This is easily done with the “Convert to Indicator Values” module, and it lets us gain insight into the effect of each individual value of a categorical variable.
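“Convert to Indicator Values” is one-hot encoding, which in pandas is essentially a one-liner. Which columns to treat as categorical is our own reading of the attribute list above:

```python
# One-hot encode the categorical attributes; each value gets its own
# indicator column (e.g. one column per chest pain type)
categorical = ["cp", "restecg", "slope", "ca", "thal"]
X_train = pd.get_dummies(X_train, columns=categorical)
X_test = pd.get_dummies(X_test, columns=categorical)

# Align the columns in case a rare category is absent from one split
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)
```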
5. Make sure that you use the right modules
We suggest not using the “One-vs-All Multiclass” module, as it builds a multiclass classification model from an ensemble of binary classifiers. We are dealing with a binary classification problem, not a multiclass one, so the module adds no value here.
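For a two-class problem you can simply fit a binary learner directly. As a sketch, here is a plain logistic regression (our arbitrary choice for illustration, not the model from the original experiment):

```python
from sklearn.linear_model import LogisticRegression

# A plain binary classifier; no one-vs-all wrapper is needed
# for a two-class problem
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```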
6. Be aware of the model properties
If you use the “Tune Model Hyperparameters” module to train your model, please make sure you set the “Create trainer mode” property to “Parameter Range”.
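Selecting “Parameter Range” makes the module sweep over a grid of parameter values instead of training with one fixed setting. The scikit-learn counterpart is a grid search; the grid below is an illustrative assumption:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Sweep a range of regularization strengths instead of a single value
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", search.best_score_)
```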
7. Dive into the details
The “Permutation Feature Importance” module computes the permutation feature importance scores of the feature variables, given a trained model and a test data set. For the left model, we find that “oldpeak”, “sex” and “restecg” have relatively high feature importance scores.
Because we transformed the categorical values into dummies, we can gain even more information from the feature importance scores of the right model. Here “oldpeak” again has the highest score, followed by cp-4, which stands for asymptomatic chest pain. This chest pain type did not stand out when the categorical variables were used as-is, so transforming them can really help you gain more insight.
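scikit-learn ships a comparable permutation_importance function. Continuing with the tuned model from the previous sketch, you could reproduce such scores like this:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the test score drops;
# a large drop means the model relies heavily on that feature
result = permutation_importance(search.best_estimator_, X_test, y_test,
                                n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X_test.columns)
print(importances.sort_values(ascending=False).head(10))
```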
That’s it! We are looking forward to hearing about your experiences!