Redefining Cancer Treatment: The Memorial Sloan Way

Debadri Sengupta · Published in Analytics Vidhya · Sep 16, 2021

When a patient shows symptoms of cancer, the tumour is taken out and sequenced. The genetic information in a tumour cell is stored as DNA, which is transcribed into RNA and then translated into proteins. A mutation, i.e. a mistake in the DNA sequence, can alter the resulting amino acid sequence, giving rise to a variation of the particular gene.

Thousands of genetic mutations may be present in the sequence. We need to distinguish the malignant mutations (drivers leading to tumour growth) from the benign (passenger) ones.

Our goal will be to classify the given gene-variation pair into one of 9 categories of cancer mutations based on knowledge from text-based clinical literature.

For reference, each categorical class label corresponds to the following mutation-

1: Likely Loss-of-function

2: Likely Gain-of-function

3: Neutral

4: Loss-of-function

5: Likely Neutral

6: Inconclusive

7: Gain-of-function

8: Likely Switch-of-function

9: Switch-of-function

Data information-

Data was collected from https://www.kaggle.com/c/msk-redefining-cancer-treatment#evaluation.

Two data files are present- one contains the information about the genetic mutations and the other contains the clinical evidence (text) that human experts/pathologists use to classify the genetic mutations.

  • Both files share a common ID column on which they can be merged.

Columns for each file-

File 1: training_variants (ID, Gene, Variation, Class)

Its fields are

  • ID : the id of the row, used to link the mutation to the clinical evidence
  • Gene : the gene where this genetic mutation is located
  • Variation : the amino acid change for this mutation
  • Class : 1–9, the class this genetic mutation has been classified into

File 2: training_text (ID, Text)

Sample data points from training_variants dataset
A sample data point from the training_text dataset. (Of course, no one except a trained pathologist can understand this!)

Business constraints-

  • No low-latency requirement. (we need not return results as fast as, say, a search engine)
  • Interpretability is essential. (a pathologist needs to verify why the model predicted the particular class)
  • Errors can be life-threatening.
  • The probability of a data point belonging to each class is required. (to gauge the model's confidence and decide whether further human tests are needed)

Type of machine learning problem-

As we have 9 classes, it is a multi-class classification problem.

Ideal Performance metrics-

  • Multi-class log-loss
  • Confusion matrix

Objective-

To predict the probability of each data point belonging to each of the nine classes.

Text preprocessing-

As mentioned before, the data consists of two CSV files: one for variants and the other for text. They are later joined on the common ID column using the pd.merge() function.

The text CSV file is not comma-separated but separated by ||. This is handled by passing sep="\|\|" while reading the data (each pipe is escaped with a backslash, since pandas treats a multi-character separator as a regular expression).
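A minimal loading sketch (file names as in the Kaggle dataset; this is an illustration, not the exact notebook code):

import pandas as pd

data_variants = pd.read_csv("training_variants")
# the text file is "||"-separated: escape the pipes (multi-character separators
# are treated as regular expressions) and use the python parser engine
data_text = pd.read_csv("training_text", sep=r"\|\|", engine="python",
                        names=["ID", "Text"], skiprows=1)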

Preprocessing of the text data is done using the NLTK library:

  1. Every special character (any character other than a letter or digit, including the newline character) was replaced by a single space.
  2. Letters were lowercased.
  3. Multiple spaces were collapsed into a single space.
  4. Stopwords were removed.

We perform a left outer join, so some IDs are bound to have a blank text field. These blanks are replaced by the string gene + ' ' + variation, as shown in the sketch below.
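A minimal sketch of the preprocessing and merge steps, continuing the loading snippet above (the NLTK stopword corpus must be downloaded beforehand via nltk.download("stopwords")):

import re
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def preprocess_text(text):
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)  # special characters and newlines -> space
    text = text.lower()                        # lowercase everything
    text = re.sub(r"\s+", " ", text).strip()   # collapse multiple spaces
    return " ".join(w for w in text.split() if w not in STOP_WORDS)  # drop stopwords

data_text["Text"] = data_text["Text"].fillna("").apply(preprocess_text)
result = pd.merge(data_variants, data_text, on="ID", how="left")
# IDs with no matching text fall back to "<gene> <variation>"
blank = result["Text"].isnull() | (result["Text"] == "")
result.loc[blank, "Text"] = result["Gene"] + " " + result["Variation"]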

Data analysis-

A train / cross-validation / test split is done in the ratio 64:16:20, using stratify=y_labels to preserve the class distribution across the train, CV and test splits.
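A sketch of the two-step stratified split: holding out 20% for test and then 20% of the remainder for CV yields 64:16:20 overall (random_state is an arbitrary choice):

from sklearn.model_selection import train_test_split

y = result["Class"].values
X_train, X_test, y_train, y_test = train_test_split(
    result, y, test_size=0.2, stratify=y, random_state=42)
X_tr, X_cv, y_tr, y_cv = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)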

We plot bar plots to show the distribution of classes in each split. They have similar percentages for each class since the split was stratified.

From the above bar plots, the dataset is clearly imbalanced, with classes 7, 4, 1 and 2 occurring most frequently.

Random model-

Because our KPI is log loss, which is unbounded above and hard to interpret on its own, we need to judge our models against a random model.

For every data point in the cross-validation and test datasets, we generate an array of size 9 using np.random.rand(1,9) and divide each element by the array's sum so that the probabilities add up to 1.

In our case, the random model produced a log loss of about 2.5, which serves as our baseline.
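A minimal sketch of the baseline computation, assuming y_cv from the split above:

import numpy as np
from sklearn.metrics import log_loss

cv_preds = np.random.rand(len(y_cv), 9)
cv_preds = cv_preds / cv_preds.sum(axis=1, keepdims=True)  # each row now sums to 1
print("Random-model CV log loss:", log_loss(y_cv, cv_preds, labels=np.arange(1, 10)))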

The precision, recall and confusion matrices for the random model are plotted. Ideally, diagonal elements should be 1 and off-diagonal elements 0.

Precision matrix for our random model. Note that each column sums to 1 here because columns represent the predicted class, and precision asks: of all points predicted as class k, how many are actually class k?

Analysis of gene feature-

Univariate analysis of the categorical gene feature is done using the value_counts() function. There are 238 different genes, and a histogram of their frequencies shows a heavily skewed distribution.

Encoding gene feature-

Two types of encoding may be applied to a categorical feature: one-hot encoding and response coding.

In one-hot encoding, each feature value is represented by an n-dimensional vector, where n is the number of unique values of the feature. In response coding, each feature value is instead represented by a c-dimensional vector, where c is the number of classes (here 9). The vector is filled with the probabilities, estimated from the training data, of that feature value belonging to each of the 9 classes.

The code below builds the required response-coding dictionary, where keys are the categorical values and values are the n-dimensional (here n = 9) vectors of per-class probabilities. [Credits: Stack Overflow]

def get_gv_fea_dict(alpha, feature, df):
    # value_count holds the number of occurrences of each feature value, e.g.
    # train_df['Gene'].value_counts() -> BRCA1 174, TP53 106, EGFR 86, BRCA2 75, ...
    # train_df['Variation'].value_counts() -> Truncating_Mutations 63, Deletion 43, ...
    value_count = train_df[feature].value_counts()
    # gv_dict maps each gene/variation value to its 9-dimensional probability vector
    gv_dict = dict()
    # denominator: number of times this feature value occurs in the whole train data
    for i, denominator in value_count.items():
        # vec holds P(class == k | feature == i) for k = 1..9
        vec = []
        for k in range(1, 10):
            # rows where the feature value is i and the class label is k, e.g.
            # train_df.loc[(train_df['Class']==1) & (train_df['Gene']=='BRCA1')]
            cls_cnt = train_df.loc[(train_df['Class'] == k) & (train_df[feature] == i)]
            # cls_cnt.shape[0] is the numerator; Laplace-smoothed probability estimate
            vec.append((cls_cnt.shape[0] + alpha * 10) / (denominator + 90 * alpha))
        # add the gene/variation value as key and vec as value
        gv_dict[i] = vec
    return gv_dict

Notice that Laplace (additive) smoothing is also performed. It prevents zero probabilities for class/value combinations that never occur in the training data and keeps the model from overfitting on rare feature values.

Consider a value 'x' of the feature 'gene':

1st element of the vector corresponding to 'x' = (number of times 'x' occurs with class 1 + 10·alpha) / (number of times 'x' occurs in the whole train data + 90·alpha), where alpha is the smoothing hyperparameter and 90 = 9 classes × 10 pseudo-counts, matching the code above.
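A quick worked example with made-up numbers: if 'x' appears 174 times in the train data and 35 of those rows belong to class 1, then with alpha = 1 the first element is (35 + 10)/(174 + 90) = 45/264 ≈ 0.17.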

Code to get the response-coded matrix for our dataset-

# Get the Gene/Variation feature
def get_gv_feature(alpha, feature, df):
    # gv_dict maps each feature value to its probability vector, e.g.
    # {'BRCA1': [0.2008, 0.0379, 0.0682, 0.1364, 0.25, ...],
    #  'TP53':  [0.3214, 0.0612, 0.0612, 0.2704, ...], ...}
    gv_dict = get_gv_fea_dict(alpha, feature, df)
    # value_count is the same as in get_gv_fea_dict
    value_count = train_df[feature].value_counts()
    # gv_fea will hold the response-coded vector for every row of df
    gv_fea = []
    # if a row's feature value was seen in the train data, use its probability
    # vector; otherwise fall back to the uniform distribution [1/9, ..., 1/9]
    for index, row in df.iterrows():
        if row[feature] in dict(value_count).keys():
            gv_fea.append(gv_dict[row[feature]])
        else:
            gv_fea.append([1/9] * 9)
    return gv_fea

Simple one-hot encoding works well for Logistic Regression and Linear SVM, while response coding works better for Naïve Bayes, Random Forest, etc. This is because linear models handle high-dimensional features better than ensemble/forest models do.

We need to gauge how useful the gene feature is. For this, a model is built using only the gene feature, and we check whether its log loss beats the 2.5 random baseline.

First, an SGDClassifier (with log loss, L2 regularization and hyperparameter tuning) is used on the one-hot encoded gene feature, with a Calibrated Classifier on top. This essentially implements Logistic Regression.

Hyperparameter tuning is done using grid search, and the corresponding train, CV and test log losses are measured. They beat the random model with no overfitting, so the gene feature is quite useful.
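A minimal sketch of this setup (matrix names such as train_gene_ohe are assumptions; note that newer scikit-learn versions spell the loss "log_loss" instead of "log"):

from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

best_alpha = 0.001  # assumed; chosen by evaluating a grid of alphas on the CV set
clf = SGDClassifier(loss="log", penalty="l2", alpha=best_alpha, random_state=42)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid")  # calibrates the probabilities
sig_clf.fit(train_gene_ohe, y_tr)
print("CV log loss:", log_loss(y_cv, sig_clf.predict_proba(cv_gene_ohe)))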

Now, we check whether the gene feature is stable. A stable feature is one whose values overlap significantly across the train, CV and test sets; without such overlap, a model learned on the train data cannot generalize.

The train, CV and test log losses were calculated, and the lack of overfitting implies feature stability. We also observe that most (above 90%) of the gene values in the CV and test data are present in the train data too. Thus the gene feature is stable.

Analysis of variation feature-

Next, the categorical feature variation is analysed.

A histogram of the variation value counts shows extremely skewed data. Most values occur only once!

The CDF is roughly linear, which confirms that most variation values have a count of 1.

Now, one-hot encoding followed by Logistic Regression + Calibrated CV is done, with hyperparameter tuning for alpha. However, the model overfits, which stems from the feature's instability.

Only around 10% of the variation values in the test and CV dataframes are contained in set(train_df['Variation']). Thus the variation feature, though useful, is not stable.

Analysis of text feature-

Now, the text feature is analyzed. The same questions are asked, namely:

How many unique words are present in train data?

How are word frequencies distributed?

How to featurize text field?

Is the text feature useful in predicting y_i?

Is the text feature stable across train, test and CV datasets?

On training Logistic Regression + CalibratedCV, the log loss with optimal hyperparameters is good. The model does not really overfit, which suggests the text feature is stable.

Metric comparison for all our features

Thus the feature-importance ordering is text > gene > variation, as deduced from the CV and test log-loss values.

Above 95 percent of the words present in the test data were also present in the set of train-data words. Thus the text feature is stable too.

Modelling-

Next, we go for proper ML modelling.

The one-hot gene, variation and text features are horizontally stacked and converted to a CSR matrix. The same is done for the response-coded gene, variation and text features.
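A minimal stacking sketch (the per-feature matrix names are assumptions):

from scipy.sparse import hstack

train_x_onehot = hstack((train_gene_ohe, train_var_ohe, train_text_ohe)).tocsr()
cv_x_onehot = hstack((cv_gene_ohe, cv_var_ohe, cv_text_ohe)).tocsr()
test_x_onehot = hstack((test_gene_ohe, test_var_ohe, test_text_ohe)).tocsr()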

One-hot encoded features are 55,517-dimensional, while response-coded features are just 27-dimensional. Thus one-hot features suit Logistic Regression and LinearSVM, while response-coded features work better for KNN, boosting, Random Forest, Decision Trees, etc.

Multinomial Naive Bayes with one-hot-encoded features-

The horizontally stacked features are fed into a Multinomial Naïve Bayes classifier with a Calibrated Classifier on top, tuned via grid search.

The test log loss was inferior to that of the LR model using only text features. Moreover, from the precision and recall matrices, the model gets confused between classes 2 and 7, and between classes 1 and 4, which form the majority of data points.

As per the business requirements, for a particular query point from the test data, both the predicted label and the per-class probabilities were produced using predict_proba().

Note that it is always recommended to use a Calibrated Classifier on top of the base learner whenever output probabilities are required, because it calibrates the raw scores to match the expected probability distribution for each class. Raw model outputs can deviate greatly from the true class probabilities, especially for complex non-linear models.
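A minimal sketch of this model, reusing the calibration wrapper from before (best_alpha is assumed to come from the grid search):

from sklearn.naive_bayes import MultinomialNB
from sklearn.calibration import CalibratedClassifierCV

nb = MultinomialNB(alpha=best_alpha)
sig_nb = CalibratedClassifierCV(nb, method="sigmoid")
sig_nb.fit(train_x_onehot, y_tr)

# predicted label plus calibrated per-class probabilities for one test query point
idx = 0  # hypothetical test index
print("Predicted class:", sig_nb.predict(test_x_onehot[idx])[0])
print("Class probabilities:", sig_nb.predict_proba(test_x_onehot[idx]).round(4))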

KNN Classifier with response-coded features-

Next, KNN with response-coded features is tried. The model performs better than the previous Naive Bayes classifier. However, confusion still occurs between the majority classes 2–7 and 1–4, while performance (log loss) on the minority classes is better.

For interpretability of KNN, the kneighbors() method is used to get the classes of the n nearest neighbours of a query point.
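A minimal sketch (best_k and the response-coded matrix names are assumptions; note that scikit-learn spells the method kneighbors):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(train_x_resp, y_tr)

# classes of the nearest neighbours of one test point, for interpretability
_, nbr_idx = knn.kneighbors(test_x_resp[idx].reshape(1, -1))
print("Neighbour classes:", np.asarray(y_tr)[nbr_idx.ravel()])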

Logistic Regression with class balancing and one-hot-features-

Next, an SGDClassifier (loss='log', class_weight='balanced') is used with one-hot encoded features. This model gives us the lowest log loss so far. From the confusion matrices, we observe good results even for the low-density classes 8 and 9: the minority classes are effectively up-weighted by the class-balancing option. Thus balancing yields better precision and recall for minority classes.

Recall matrix for class-balanced logistic regression

LinearSVM Classifier with class balancing and one-hot-features-

Next, an SGDClassifier (loss='hinge', class_weight='balanced') is implemented. Its accuracy is lower than the Logistic Regression model's, but the other metrics for LinearSVM are comparable to LR.
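Both class-balanced models are a one-line change from the earlier SGD setup, reusing the same imports. A minimal sketch (hinge loss yields no probabilities on its own, which is one more reason the calibration wrapper is needed):

lr_bal = SGDClassifier(loss="log", class_weight="balanced",
                       alpha=best_alpha, random_state=42)   # "log_loss" in newer sklearn
svm_bal = SGDClassifier(loss="hinge", class_weight="balanced",
                        alpha=best_alpha, random_state=42)
for clf in (lr_bal, svm_bal):
    sig = CalibratedClassifierCV(clf, method="sigmoid")
    sig.fit(train_x_onehot, y_tr)
    print("CV log loss:", log_loss(y_cv, sig.predict_proba(cv_x_onehot)))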

Random Forest Classifier with both one-hot and response coded features-

Random Forests are trained on both one-hot and response-coded features, with the number of estimators and the maximum tree depth as hyperparameters. For this model, precision and recall for the minority classes are at or near 1; however, the majority classes are not classified as well.

Random Forest with response coding overfits and has the highest log-loss.
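A minimal sketch of the hyperparameter search (the grid values are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 200, 500, 1000], "max_depth": [5, 10]}
rf_search = GridSearchCV(RandomForestClassifier(random_state=42),
                         param_grid, scoring="neg_log_loss", cv=3)
rf_search.fit(train_x_resp, y_tr)  # response-coded features
print("Best params:", rf_search.best_params_)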

Stacking Classifier-

Using the MLXtend library, a Stacking Classifier was tried next. The probability outputs of the base learners (Logistic Regression, LinearSVM and Naïve Bayes) were passed to a Logistic Regression meta-classifier.

However, stacked models suffer in interpretability, and the training data must be split among the base models, which increases each individual model's error.

Thus, stacking is not optimal when we have fewer data points.

Next, a Majority Voting Classifier with voting='soft' (i.e. averaging the probability outputs of the base models) was also used.
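A minimal sketch of both ensembles (sig_lr, sig_svm and sig_nb are assumed names for calibrated versions of the LR, SVM and NB models above; soft voting needs every base model to expose predict_proba, hence the calibrated ones):

from mlxtend.classifier import StackingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression

# probability outputs of the base learners feed a logistic-regression meta-learner
stack = StackingClassifier(classifiers=[sig_lr, sig_svm, sig_nb],
                           meta_classifier=LogisticRegression(),
                           use_probas=True)
stack.fit(train_x_onehot, y_tr)

# soft voting: average the base models' predicted probabilities
vote = VotingClassifier(estimators=[("lr", sig_lr), ("svm", sig_svm), ("nb", sig_nb)],
                        voting="soft")
vote.fit(train_x_onehot, y_tr)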

Deployment-

A webpage was developed using Flask and deployed on Heroku for real-time predictions. Unfortunately, Heroku has since discontinued its free tier.

A video of the working of the webpage can be found on the project GitHub link:

https://github.com/Debadri3/Personalized-Cancer-Diagnosis

If you found the project useful, please be generous with your claps and follow me on Medium: https://medium.com/@debadri3. It would really help a budding author like me!
