Here I am using the K-Nearest Neighbors algorithm from Scikit-Learn. The code is below:
Code
"""
Model Creation, Testing and Prediction of Breast Cancer Data using K Nearest Neighbor Algorithm
INSTANCE BASED LEARNING
In this type of machine learning algorithms, rather than construct a set of rules as an intermediate stage,
the instances (or features or experiments) are themselves directly employed. We dont infer a rule set or a
decision tree. The work of classification is done at the time of classification and not when training is done.
This can therefore be performance intensive. Both in terms of speed and storage.
Knowledge representation structures (like trees or rules) are not created in Instance based learning.
Normally in humans we use something called 'rote learning' where we commit a set of learning examples
to memory and group similar items as a group or a class. Then when a new example comes we classify it
as one of the groups or classes depending on how closely it resembles a class.
K-Nearest Neighbor Algorithm is one such approach which calculates a distance from the new instance
to the k nearest previously known instances and then depending on a majority of which class's instances
are closet to this new instance, it assigns the class to be that class.
The distance calculated is a Eucledian Distance measure. It is then reported as a confidence measure.
The requirement would be that the attributes are normalized & of equal importance. When one attribute
is deemed to be more important than another suitable weighting can be employed when calculating a distance
measure.
"""
print(__doc__)
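# --- Illustrative aside, not part of the pipeline below: a minimal sketch of
# the distance-and-vote idea described in the docstring, in plain Python. The
# function name and k=3 are my own choices for illustration; the scikit-learn
# classifier used later is what actually does the work.
import math
from collections import Counter

def knn_vote_sketch(known_points, known_labels, new_point, k=3):
    # Euclidean distance from the new point to every known instance
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, new_point)))
             for p in known_points]
    # indices of the k closest instances, then a simple majority vote
    nearest = sorted(range(len(known_points)), key=lambda i: dists[i])[:k]
    return Counter(known_labels[i] for i in nearest).most_common(1)[0][0]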
import numpy as np
from sklearn import preprocessing, model_selection, neighbors
import pandas as pd
print("\nWe are using the breast-cancer-wisconsin data set. I am reading a csv formatted file into a panda dataframe called bc.")
bc = pd.read_csv('./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data')
print("\nLoaded data from ./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
# On examination of the data set we find some missing data items, which are marked with ?
# We will replace them with -99999
bc.replace('?',-99999, inplace=True)
print("Handled missing data by replacing ?s with -99999")
# Here the ID column does not impact how the individual experiments are classified so we will remove it
bc.drop(['id'], axis=1, inplace=True)
print("Dropped ID column as it doesn't contribute any useful information to help with model creation")
# Now I am printing the column names; slicing an empty row range (bc[1:1]) displays just the columns
print("\n%s" % (bc[1:1]))
# Now I have to define my X and y labels
# X is the features data so I am assigning the entire data array but dropping the class column
# Here I am using the pandas dataframe .drop function to drop a column. I specify that it's a column
# by using axis=1 and I specify the column name "class"
# the syntax is df = df.drop('column_name', axis=<0 for rows, 1 for columns>)
# Also note that the numpy.array function converts other Python structures, in this case
# a pandas dataframe, into a numpy array so I can use numpy processing functions on it
print("\nCreated X features array having all columns except class")
X = np.array(bc.drop(['class'], axis=1))
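# Aside: the docstring notes that KNN assumes attributes on comparable scales.
# The features here mostly share a 1-10 range, but a sketch of explicit scaling
# with the preprocessing module imported above would look like:
# X = preprocessing.StandardScaler().fit_transform(X)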
print("Created y known class array by assigning it the class column")
# Here I am just assigning the class column only to y
y = np.array(bc['class'])
# Now I create my training and test samples
# I will use my training set to create my model
# I will use my test set to test its accuracy of classification
# For this I am using sklearn's model_selection.train_test_split function
# This function takes arguments
# *arrays : sequence of indexables with same length / shape[0]
# Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
# So for that we are using the features numpy arrays X, the class y
# test_size : float, int, or None (default is None)
# If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split.
# If int, represents the absolute number of test samples.
# If None, the value is automatically set to the complement of the train size.
# If train size is also None, test size is set to 0.25.
print("\nSplitting of data set into test and train sets completed")
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
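# Note: train_test_split shuffles randomly, so the accuracy printed below can
# vary slightly between runs; passing random_state=<some integer> (a standard
# sklearn parameter) would make the split reproducible.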
# Now considering that this is a multivariate data set and I need to classify one column I will use
# an algorithm that is useful for such data sets and classifications : K Nearest Neighbors or KNN
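# KNeighborsClassifier defaults to n_neighbors=5 with uniform weights; passing
# n_neighbors explicitly, e.g. neighbors.KNeighborsClassifier(n_neighbors=5),
# makes the choice of k visible.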
myClassifier = neighbors.KNeighborsClassifier()
print("\nCreated KNN Classifier object")
# Next I need to use the training data to train the classifier
# It takes the X_train numpy array and the y_train numpy array to train the classifier
myClassifier.fit(X_train, y_train)
print("\nTraining Complete")
# Next I will run a test and score the accuracy using the test data that I set aside X_test and y_test
print("Testing Accuracy")
testAccuracy = myClassifier.score(X_test, y_test)
print("\nAccuracy = %s\n" % (testAccuracy))
# Ok now my model is trained and ready and I can use it to classify new incoming data
# Let's define some example new data
# predict expects a 2D array of shape (n_samples, n_features); new_exp below is already 2D,
# and the reshape that follows simply enforces that shape for any number of samples
new_exp = np.array([[4,2,1,1,1,2,3,2,1],[4,1,1,1,1,2,2,2,1]])
print("New Instances as input for prediction of class:\n %s" % (new_exp))
# By using len(new_exp) I can provide any number of new test samples
new_exp = new_exp.reshape(len(new_exp), -1)
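# The same idea applies to a single sample: np.array([4,2,1,1,1,2,3,2,1]).reshape(1, -1)
# turns one instance into the 2D shape (1, 9) that predict expects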
# Next I am going to use the predict function to classify the new experiment
new_class = myClassifier.predict(new_exp)
print("\nPredicted Classes of the instances:%s \n" % (new_class))
new_class_name = []
for i in range(len(new_class)):
    if new_class[i] == 2:
        new_class_name.append("Benign")
    else:
        new_class_name.append("Malignant")
for i in range(len(new_class_name)):
    print("Patient: %s, Predicted Classification: %s" % (i, new_class_name[i]))
Output
$ python3.6 knnclassify_bc.py
Model Creation, Testing and Prediction of Breast Cancer Data using the K Nearest Neighbor Algorithm
INSTANCE BASED LEARNING
In this type of machine learning algorithm, rather than constructing a set of rules as an intermediate
stage, the instances (or features or experiments) are themselves directly employed. We don't infer a
rule set or a decision tree; the work of classification is done at prediction time, not at training time.
This can therefore be performance intensive, both in terms of speed and storage.
Knowledge representation structures (like trees or rules) are not created in instance based learning.
Humans normally use something similar, called 'rote learning': we commit a set of learning examples to
memory and group similar items into a class. When a new example arrives, we classify it into one of those
classes depending on how closely it resembles each class.
The K-Nearest Neighbor algorithm is one such approach. It calculates the distance from the new instance
to the previously known instances and assigns the class to which the majority of the k closest instances
belong.
The distance calculated is the Euclidean distance, and the proportion of the k neighbors voting for the
winning class can be treated as a confidence measure.
The requirement is that the attributes are normalized and of equal importance. When one attribute is
deemed more important than another, a suitable weighting can be employed when calculating the distance
measure.
We are using the breast-cancer-wisconsin data set. I am reading a CSV-formatted file into a pandas dataframe called bc.
Loaded data from ./datasets/breast-cancer-wisconsin/breast-cancer-wisconsin.data
Handled missing data by replacing ?s with -99999
Dropped ID column as it doesn't contribute any useful information to help with model creation
Empty DataFrame
Columns: [clump_thickness, uniform_cell_size, uniform_cell_shape, marginal_adhesion, single_epi_cell_size, bare_nuclei, bland_chromation, normal_nucleoli, mitoses, class]
Index: []
Created X features array having all columns except class
Created y known class array by assigning it the class column
Splitting of data set into test and train sets completed
Created KNN Classifier object
Training Complete
Testing Accuracy
Accuracy = 0.957142857143
New Instances as input for prediction of class:
[[4 2 1 1 1 2 3 2 1]
[4 1 1 1 1 2 2 2 1]]
Predicted Classes of the instances:[2 2]
Patient: 0, Predicted Classification: Benign
Patient: 1, Predicted Classification: Benign