Using knn (K nearest neighbor) with R

In this post I would like to go over some basic prediction and analysis techniques using R. This article assumes you have R set up on your machine.
For our purposes, we will use knn (k-nearest neighbors) to predict diabetic patients in a data set. knn is a lazy, instance-based learner: it does not build a model up front, but instead classifies each new point by looking at the points most similar to it in the training data.

I am going to use a data set that ships with R's MASS package: the Pima Indians Diabetes set (Pima.tr).


> library(MASS)
> pima_data  <- Pima.tr
> summary(pima_data)
     npreg            glu              bp              skin            bmi             ped              age       
 Min.   : 0.00   Min.   : 56.0   Min.   : 38.00   Min.   : 7.00   Min.   :18.20   Min.   :0.0850   Min.   :21.00  
 1st Qu.: 1.00   1st Qu.:100.0   1st Qu.: 64.00   1st Qu.:20.75   1st Qu.:27.57   1st Qu.:0.2535   1st Qu.:23.00  
 Median : 2.00   Median :120.5   Median : 70.00   Median :29.00   Median :32.80   Median :0.3725   Median :28.00  
 Mean   : 3.57   Mean   :124.0   Mean   : 71.26   Mean   :29.21   Mean   :32.31   Mean   :0.4608   Mean   :32.11  
 3rd Qu.: 6.00   3rd Qu.:144.0   3rd Qu.: 78.00   3rd Qu.:36.00   3rd Qu.:36.50   3rd Qu.:0.6160   3rd Qu.:39.25  
 Max.   :14.00   Max.   :199.0   Max.   :110.00   Max.   :99.00   Max.   :47.90   Max.   :2.2880   Max.   :63.00  
  type    
 No :132  
 Yes: 68

The above is a summary of all the fields in our data set. The ‘type’ field is the target column that we are trying to predict. With the exception of the ‘ped’ (pedigree) field, the fields vary greatly in scale from minimum to maximum. Because knn measures distances between points, fields on larger scales would dominate the calculation, so these fields should be rescaled with normalization; we’ll get to that.

We can find out the percentage of occurrences for each of the types: Yes = positive for diabetes, No = they do not have diabetes. Let’s find out the percentage of each below:


> round(prop.table(table(pima_data$type))* 100, digits=1)

 No Yes 
 66  34

Alright: 66% of these cases are negative and 34% are positive. Let’s normalize the data fields, except our target column (type).


> summary(pima_n)
     npreg              glu               bp              skin             bmi              ped               age         
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.07143   1st Qu.:0.3077   1st Qu.:0.3611   1st Qu.:0.1495   1st Qu.:0.3157   1st Qu.:0.07649   1st Qu.:0.04762  
 Median :0.14286   Median :0.4510   Median :0.4444   Median :0.2391   Median :0.4916   Median :0.13050   Median :0.16667  
 Mean   :0.25500   Mean   :0.4753   Mean   :0.4619   Mean   :0.2415   Mean   :0.4751   Mean   :0.17057   Mean   :0.26452  
 3rd Qu.:0.42857   3rd Qu.:0.6154   3rd Qu.:0.5556   3rd Qu.:0.3152   3rd Qu.:0.6162   3rd Qu.:0.24103   3rd Qu.:0.43452  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000

We can tell our ‘normalize’ function worked because the maximum is now 1 and the minimum is now 0 for every field. Here is the normalize function code:


# rescale a numeric vector to the [0, 1] range (min-max normalization)
normalize <- function(x) {
       return ((x - min(x)) / (max(x) - min(x)))
}
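
One step the listing above skips is the call that actually builds pima_n. It was presumably something along these lines, applying normalize to each of the seven numeric columns (the 1:7 indexing leaves out the ‘type’ column):


> pima_n <- as.data.frame(lapply(pima_data[1:7], normalize))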

We are going to separate this data set into two parts: training and test. The data set has 200 rows, so we will give most of them to the training set and hold the rest out to test against:


> pima_train <- pima_n[1:150, ]

> pima_test <- pima_n[151:200, ]
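
A quick caveat: this takes the first 150 rows verbatim, which is fine when the rows are in no meaningful order, but if your data is sorted in any way a randomized split is safer. A minimal sketch (the seed value is arbitrary):


> set.seed(42)               # make the split reproducible
> idx <- sample(1:200, 150)  # draw 150 random row numbers for training
> pima_train <- pima_n[idx, ]
> pima_test <- pima_n[-idx, ]

If you go this route, build the label vectors in the next step with the same idx (pima_data[idx, 8] and pima_data[-idx, 8]) so the rows and labels stay aligned.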

Then we will pull the matching labels to check against from the original (un-normalized) data set:


> pima_trainlab <- pima_data[1:150, 8]
> pima_testlab <- pima_data[151:200, 8]

It is helpful to keep in mind that the general syntax for this is:


trainortest_label_variable <- original_data[range, column_to_predict]
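
Since column 8 of Pima.tr is ‘type’, indexing by name does the same thing and is harder to get wrong if the column order ever changes:


> pima_trainlab <- pima_data[1:150, "type"]
> pima_testlab <- pima_data[151:200, "type"]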

Now we will build our knn predictor with the following:


> library(gmodels) # for the CrossTable function
> library(class) # for knn function
> pima_pred <- knn(train = pima_train, test = pima_test, cl=pima_trainlab, k=39)
> CrossTable( x=pima_testlab, y=pima_pred, prop.chisq=FALSE)

In the above code, we set up our knn() function with a k value of 39. Sometimes the k value takes some adjusting, and it should be an odd number so the votes can never tie. Afterwards, we call the CrossTable() function to check the results of our knn() predictions, shown below. We correctly classified 80% of the diabetes-negative patients (24 of 30), but only 45% of the diabetes-positive ones (9 of 20). Not great, but one can obtain better results with less lazy algorithms or with different data sets. For example, with the UCI Breast Cancer set, I have predicted 95%+ of malignant and benign tumors. So play around with different k values and data sets and see what you find. Good luck!


             | pima_pred 
pima_testlab |        No |       Yes | Row Total | 
-------------|-----------|-----------|-----------|
          No |        24 |         6 |        30 | 
             |     0.800 |     0.200 |     0.600 | 
             |     0.686 |     0.400 |           | 
             |     0.480 |     0.120 |           | 
-------------|-----------|-----------|-----------|
         Yes |        11 |         9 |        20 | 
             |     0.550 |     0.450 |     0.400 | 
             |     0.314 |     0.600 |           | 
             |     0.220 |     0.180 |           | 
-------------|-----------|-----------|-----------|
Column Total |        35 |        15 |        50 | 
             |     0.700 |     0.300 |           | 
-------------|-----------|-----------|-----------|
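
Since finding a good k takes some experimentation, one quick way to compare candidates is a small loop that recomputes the predictions and prints the test-set accuracy for each value. A minimal sketch using the variables defined above:


> for (k in seq(1, 49, by = 2)) {  # odd values only, to avoid tied votes
+     pred <- knn(train = pima_train, test = pima_test, cl = pima_trainlab, k = k)
+     cat("k =", k, "accuracy =", mean(pred == pima_testlab), "\n")
+ }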

Here are my results with a similar approach using the UCI Breast Cancer data:


Total Observations in Table:  100 

 
              | wdbc_predz 
wdbc_testzlab |    Benign | Malignant | Row Total | 
--------------|-----------|-----------|-----------|
       Benign |        76 |         1 |        77 | 
              |     0.987 |     0.013 |     0.770 | 
              |     0.987 |     0.043 |           | 
              |     0.760 |     0.010 |           | 
--------------|-----------|-----------|-----------|
    Malignant |         1 |        22 |        23 | 
              |     0.043 |     0.957 |     0.230 | 
              |     0.013 |     0.957 |           | 
              |     0.010 |     0.220 |           | 
--------------|-----------|-----------|-----------|
 Column Total |        77 |        23 |       100 | 
              |     0.770 |     0.230 |           | 
--------------|-----------|-----------|-----------|
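
That works out to 98% overall accuracy ((76 + 22) / 100), with just one benign tumor flagged as malignant and one malignant tumor missed.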