In this post I would like to go over some basic prediction and analysis techniques using R. This article assumes you have R set up on your machine.
For our purposes, we will use kNN (k-nearest neighbors) to predict which patients in a data set are diabetic. kNN is a lazy, instance-based learner: it does not build a model up front, but instead classifies each new case by looking at the k most similar cases in the training data.
I am going to use a data set that ships with R's MASS package: the Pima Indians Diabetes set.
> library(MASS)
> pima_data <- Pima.tr
> summary(pima_data)
     npreg            glu              bp              skin            bmi             ped              age       
 Min.   : 0.00   Min.   : 56.0   Min.   : 38.00   Min.   : 7.00   Min.   :18.20   Min.   :0.0850   Min.   :21.00  
 1st Qu.: 1.00   1st Qu.:100.0   1st Qu.: 64.00   1st Qu.:20.75   1st Qu.:27.57   1st Qu.:0.2535   1st Qu.:23.00  
 Median : 2.00   Median :120.5   Median : 70.00   Median :29.00   Median :32.80   Median :0.3725   Median :28.00  
 Mean   : 3.57   Mean   :124.0   Mean   : 71.26   Mean   :29.21   Mean   :32.31   Mean   :0.4608   Mean   :32.11  
 3rd Qu.: 6.00   3rd Qu.:144.0   3rd Qu.: 78.00   3rd Qu.:36.00   3rd Qu.:36.50   3rd Qu.:0.6160   3rd Qu.:39.25  
 Max.   :14.00   Max.   :199.0   Max.   :110.00   Max.   :99.00   Max.   :47.90   Max.   :2.2880   Max.   :63.00  
  type    
 No :132  
 Yes: 68  
The above is a summary of all the fields in our data set. The 'type' field is the target column that we are trying to predict. With the exception of the 'ped' (pedigree) field, the fields vary greatly in scale from minimum to maximum (npreg tops out at 14 while glu reaches 199, for instance). Because kNN relies on distances between points, the larger-scaled fields would dominate, so these fields should be rescaled with normalization; we'll get to that.
We can find the percentage of occurrences of each type: 'Yes' means positive for diabetes, 'No' means the patient does not have diabetes. Let's compute the percentages below:
> round(prop.table(table(pima_data$type))* 100, digits=1)
No Yes
66 34
Alright: 66% of these cases are negative and 34% are positive. Now let's normalize the data fields, leaving out our target column (type).
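One way to create the normalized data frame pima_n, applying the normalize function shown a little further down to the seven numeric columns, is:
> pima_n <- as.data.frame(lapply(pima_data[1:7], normalize))   # leaves the 'type' column out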
> summary(pima_n)
     npreg              glu               bp              skin             bmi              ped               age         
 Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.07143   1st Qu.:0.3077   1st Qu.:0.3611   1st Qu.:0.1495   1st Qu.:0.3157   1st Qu.:0.07649   1st Qu.:0.04762  
 Median :0.14286   Median :0.4510   Median :0.4444   Median :0.2391   Median :0.4916   Median :0.13050   Median :0.16667  
 Mean   :0.25500   Mean   :0.4753   Mean   :0.4619   Mean   :0.2415   Mean   :0.4751   Mean   :0.17057   Mean   :0.26452  
 3rd Qu.:0.42857   3rd Qu.:0.6154   3rd Qu.:0.5556   3rd Qu.:0.3152   3rd Qu.:0.6162   3rd Qu.:0.24103   3rd Qu.:0.43452  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
We can tell our 'normalize' function worked because every field now has a minimum of 0 and a maximum of 1. Here is the normalize function code:
normalize <- function(x) {
    # min-max scaling: maps min(x) to 0 and max(x) to 1
    return ((x - min(x)) / (max(x) - min(x)))
}
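For example, applied to a toy vector:
> normalize(c(1, 2, 3, 4, 5))
[1] 0.00 0.25 0.50 0.75 1.00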
We are going to split this data set into two parts: training and test. There are 200 rows in total, so we will give most of them (150) to the training set and hold out the remaining 50 to test against:
> pima_train <- pima_n[1:150, ]
> pima_test <- pima_n[151:200, ]
Then we will pull the matching class labels out of the original (un-normalized) data set, where column 8 is 'type':
> pima_trainlab <- pima_data[1:150, 8]
> pima_testlab <- pima_data[151:200, 8]
It is helpful to keep in mind that the general syntax for this is:
trainortest_label_variable <- original_data[range, column_to_predict]
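One caveat: slicing rows 1-150 assumes the rows are in no particular order. If the data happened to be sorted, a sequential split would bias the two sets. An alternative (a sketch, not what was used for the results below) shuffles the row indices first; the seed value is arbitrary:
> set.seed(42)                        # arbitrary seed, for reproducibility
> idx <- sample(nrow(pima_n), 150)    # pick 150 random row numbers
> pima_train    <- pima_n[idx, ]
> pima_test     <- pima_n[-idx, ]
> pima_trainlab <- pima_data[idx, 8]
> pima_testlab  <- pima_data[-idx, 8]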
Now we will build our knn predictor with the following:
> library(gmodels) # for the CrossTable function
> library(class) # for knn function
> pima_pred <- knn(train = pima_train, test = pima_test, cl = pima_trainlab, k = 39)
> CrossTable(x = pima_testlab, y = pima_pred, prop.chisq = FALSE)
In the above code, we set up our knn() function with a k value of 39. The k value will sometimes take some adjusting, and it should be an odd number so that majority votes cannot end in a tie. Afterwards, we call the CrossTable() function to check the results of our knn() predictions, shown below. We correctly classified 80% of the diabetes-negative patients (24 of 30), but only 45% of the diabetes-positive ones (9 of 20). Not great, but one can obtain better results with less lazy algorithms or with different data sets. For example, with the UCI Breast Cancer set, I have predicted 95%+ of malignant and benign tumors. So just play around with different k values and data sets and see what you find. Good luck!
Total Observations in Table:  50 

             | pima_pred 
pima_testlab |        No |       Yes | Row Total | 
-------------|-----------|-----------|-----------|
          No |        24 |         6 |        30 | 
             |     0.800 |     0.200 |     0.600 | 
             |     0.686 |     0.400 |           | 
             |     0.480 |     0.120 |           | 
-------------|-----------|-----------|-----------|
         Yes |        11 |         9 |        20 | 
             |     0.550 |     0.450 |     0.400 | 
             |     0.314 |     0.600 |           | 
             |     0.220 |     0.180 |           | 
-------------|-----------|-----------|-----------|
Column Total |        35 |        15 |        50 | 
             |     0.700 |     0.300 |           | 
-------------|-----------|-----------|-----------|
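Since the k value takes some adjusting, a quick loop like this (a sketch, reusing the objects defined above) compares test-set accuracy across a range of odd k values:
> for (k in seq(1, 49, by = 2)) {
+     pred <- knn(train = pima_train, test = pima_test, cl = pima_trainlab, k = k)
+     cat("k =", k, " accuracy =", round(mean(pred == pima_testlab), 3), "\n")
+ }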
Here are my results with a similar approach using the UCI Breast Cancer data, where 98 of the 100 test cases were classified correctly:
Total Observations in Table:  100 

              | wdbc_predz 
wdbc_testzlab |    Benign | Malignant | Row Total | 
--------------|-----------|-----------|-----------|
       Benign |        76 |         1 |        77 | 
              |     0.987 |     0.013 |     0.770 | 
              |     0.987 |     0.043 |           | 
              |     0.760 |     0.010 |           | 
--------------|-----------|-----------|-----------|
    Malignant |         1 |        22 |        23 | 
              |     0.043 |     0.957 |     0.230 | 
              |     0.013 |     0.957 |           | 
              |     0.010 |     0.220 |           | 
--------------|-----------|-----------|-----------|
 Column Total |        77 |        23 |       100 | 
              |     0.770 |     0.230 |           | 
--------------|-----------|-----------|-----------|
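For the curious, here is a rough outline of how a run like that can be set up. The file name, the 469/100 split, and k = 21 are illustrative assumptions rather than details from my original run; the UCI wdbc.data file has 569 rows, with the B/M diagnosis in its second column:
> # illustrative sketch: file name, split sizes, and k are assumptions
> wdbc <- read.csv("wdbc.data", header = FALSE)              # 569 rows; V2 holds the diagnosis
> wdbc_lab <- factor(wdbc$V2, levels = c("B", "M"), labels = c("Benign", "Malignant"))
> wdbc_n <- as.data.frame(lapply(wdbc[3:32], normalize))     # normalize the 30 numeric features
> wdbc_pred <- knn(train = wdbc_n[1:469, ], test = wdbc_n[470:569, ],
+                  cl = wdbc_lab[1:469], k = 21)
> CrossTable(x = wdbc_lab[470:569], y = wdbc_pred, prop.chisq = FALSE)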