Here I am using the K-Nearest Neighbors algorithm from Scikit-Learn. The code is below:
Code
"""
Model Creation, Testing and Prediction of Breast Cancer Data using K Nearest Neighbor Algorithm
INSTANCE BASED LEARNING
In this type of machine learning algorithms, rather than construct a set of rules as an intermediate stage,
the instances (or features or experiments) are themselves directly employed. We dont infer a rule set or a
decision tree. The work of classification is done at the time of classification and not when training is done.
This can therefore be performance intensive. Both in terms of speed and storage.
Knowledge representation structures (like trees or rules) are not created in Instance based learning.
Normally in humans we use something called 'rote learning' where we commit a set of learning examples
to memory and group similar items as a group or a class. Then when a new example comes we classify it
as one of the groups or classes depending on how closely it resembles a class.
K-Nearest Neighbor Algorithm is one such approach which calculates a distance from the new instance
to the k nearest previously known instances and then depending on a majority of which class's instances
are closet to this new instance, it assigns the class to be that class.
The distance calculated is a Eucledian Distance measure. It is then reported as a confidence measure.
The requirement would be that the attributes are normalized & of equal importance. When one attribute
is deemed to be more important than another suitable weighting can be employed when calculating a distance
measure.
"""
print(__doc__)
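# --- Illustrative aside, not part of the pipeline below: a minimal sketch of
# the distance-and-vote idea described in the docstring, in plain Python. The
# function name and k=3 are my own choices for illustration; the scikit-learn
# classifier used later is what actually does the work.
import math
from collections import Counter

def knn_vote_sketch(known_points, known_labels, new_point, k=3):
    # Euclidean distance from the new point to every known instance
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, new_point)))
             for p in known_points]
    # indices of the k closest instances, then a simple majority vote
    nearest = sorted(range(len(known_points)), key=lambda i: dists[i])[:k]
    return Counter(known_labels[i] for i in nearest).most_common(1)[0][0]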
import numpy as np
from sklearn import preprocessing, model_selection, neighbors
import pandas as pd
print("\nWe are using the breast-cancer-wisconsin data set. I am reading a csv formatted file into a panda dataframe called bc.")
bc = pd.read_csv('./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data')
print("\nLoaded data from ./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
# On examination of the data set we find some missing data items, which are marked with ?
# We will replace them with -99999
bc.replace('?',-99999, inplace=True)
print("Handled missing data by replacing ?s with -99999")
# Here the ID column does not impact how the individual experiments are classified so we will remove it
bc.drop(['id'], axis=1, inplace=True)
print("Dropped ID column as it doesn't contribute any useful information to help with model creation")
# Now I am printing the column names; slicing an empty row range (bc[1:1]) displays just the columns
print("\n%s" % (bc[1:1]))
# Now I have to define my X and y labels
# X is the features data so I am assigning the entire data array but dropping the class column
# Here I am using the pandas dataframe .drop function to drop a column. I specify that it's a column
# by using axis=1 and I specify the column name "class"
# the syntax is df = df.drop('column_name', axis=<0 for rows, 1 for columns>)
# Also note that the numpy.array function converts other Python structures, in this case
# a pandas dataframe, into a numpy array so I can use numpy processing functions on it
print("\nCreated X features array having all columns except class")
X = np.array(bc.drop(['class'], axis=1))
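# Aside: the docstring notes that KNN assumes attributes on comparable scales.
# The features here mostly share a 1-10 range, but a sketch of explicit scaling
# with the preprocessing module imported above would look like:
# X = preprocessing.StandardScaler().fit_transform(X)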
print("Created y known class array by assigning it the class column")
# Here I am just assigning the class column only to y
y = np.array(bc['class'])
# Now I create my training and test samples
# I will use my training set to create my model
# I will use my test set to test its accuracy of classification
# For this I am using sklearn's model_selection.train_test_split function
# This function takes arguments
# *arrays : sequence of indexables with same length / shape[0]
# Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
# So for that we are using the features numpy arrays X, the class y
# test_size : float, int, or None (default is None)
# If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
# If int, represents the absolute number of test samples.
# If None, the value is automatically set to the complement of the train size.
# If train size is also None, test size is set to 0.25.
print("\nSplitting of data set into test and train sets completed")
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
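# Note: train_test_split shuffles randomly, so the accuracy printed below can
# vary slightly between runs; passing random_state=<some integer> (a standard
# sklearn parameter) would make the split reproducible.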
# Now considering that this is a multivariate data set and I need to classify one column I will use
# an algorithm that is useful for such data sets and classifications : K Nearest Neighbors or KNN
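# KNeighborsClassifier defaults to n_neighbors=5 with uniform weights; passing
# n_neighbors explicitly, e.g. neighbors.KNeighborsClassifier(n_neighbors=5),
# makes the choice of k visible.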
myClassifier = neighbors.KNeighborsClassifier()
print("\nCreated KNN Classifier object")
# Next I need to use the training data to train the classifier
# It takes the X_train numpy array and the y_train numpy array to train the classifier
myClassifier.fit(X_train, y_train)
print("\nTraining Complete")
# Next I will run a test and score the accuracy using the test data that I set aside X_test and y_test
print("Testing Accuracy")
testAccuracy = myClassifier.score(X_test, y_test)
print("\nAccuracy = %s\n" % (testAccuracy))
# Ok now my model is trained and ready and I can use it to classify new incoming data
# Let's define some example new data
# predict expects a 2D array of shape (n_samples, n_features); new_exp below is already 2D,
# and the reshape that follows simply enforces that shape for any number of samples
new_exp = np.array([[4,2,1,1,1,2,3,2,1],[4,1,1,1,1,2,2,2,1]])
print("New Instances as input for prediction of class:\n %s" % (new_exp))
# By using len(new_exp) I can provide any number of new test samples
new_exp = new_exp.reshape(len(new_exp), -1)
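# The same idea applies to a single sample: np.array([4,2,1,1,1,2,3,2,1]).reshape(1, -1)
# turns one instance into the 2D shape (1, 9) that predict expects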
# Next I am going to use the predict function to classify the new experiment
new_class = myClassifier.predict(new_exp)
print("\nPredicted Classes of the instances:%s \n" % (new_class))
new_class_name = []
for i in range(len(new_class)):
    if new_class[i] == 2:
        new_class_name.append("Benign")
    else:
        new_class_name.append("Malignant")
for i in range(len(new_class_name)):
    print("Patient: %s, Predicted Classification: %s" % (i, new_class_name[i]))
Output
$ python3.6 knnclassify_bc.py
Model Creation, Testing and Prediction of Breast Cancer Data using the K Nearest Neighbor Algorithm
INSTANCE BASED LEARNING
In this type of machine learning algorithm, rather than constructing a set of rules as an intermediate
stage, the instances (or features or experiments) are themselves directly employed. We don't infer a
rule set or a decision tree; the work of classification is done at prediction time, not at training time.
This can therefore be performance intensive, both in terms of speed and storage.
Knowledge representation structures (like trees or rules) are not created in instance based learning.
Humans normally use something similar, called 'rote learning': we commit a set of learning examples to
memory and group similar items into a class. When a new example arrives, we classify it into one of those
classes depending on how closely it resembles each class.
The K-Nearest Neighbor algorithm is one such approach. It calculates the distance from the new instance
to the previously known instances and assigns the class to which the majority of the k closest instances
belong.
The distance calculated is the Euclidean distance, and the proportion of the k neighbors voting for the
winning class can be treated as a confidence measure.
The requirement is that the attributes are normalized and of equal importance. When one attribute is
deemed more important than another, a suitable weighting can be employed when calculating the distance
measure.
We are using the breast-cancer-wisconsin data set. I am reading a CSV-formatted file into a pandas dataframe called bc.
Loaded data from ./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data
Handled missing data by replacing ?s with -99999
Dropped ID column as it doesn't contribute any useful information to help with model creation
Empty DataFrame
Columns: [clump_thickness, uniform_cell_size, uniform_cell_shape, marginal_adhesion, single_epi_cell_size, bare_nuclei, bland_chromation, normal_nucleoli, mitoses, class]
Index: []
Created X features array having all columns except class
Created y known class array by assigning it the class column
Splitting of data set into test and train sets completed
Created KNN Classifier object
Training Complete
Testing Accuracy
Accuracy = 0.957142857143
New Instances as input for prediction of class:
[[4 2 1 1 1 2 3 2 1]
[4 1 1 1 1 2 2 2 1]]
Predicted Classes of the instances:[2 2]
Patient: 0, Predicted Classification: Benign
Patient: 1, Predicted Classification: Benign