How to Use KNN in R to Predict Continuous Values
KNN Regression in R
06.24.2021
Intro
The KNN (k-nearest neighbors) model predicts a value for a new observation from the K closest samples in the training data. KNN is often used for classification, but it can also be used for regression, where the prediction is the average of the K nearest targets. In this article, we will learn how to run KNN regression in R.
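Before turning to any packages, here is a minimal sketch of that idea in base R, using a made-up toy predictor x and target y (not the Boston data), just to show what "average the K nearest targets" means:

# toy training data
x <- c(1, 2, 3, 10, 11, 12)
y <- c(1.1, 1.9, 3.2, 9.8, 11.1, 12.2)

new_x <- 2.5
k <- 3

# indexes of the k training points closest to new_x
nearest <- order(abs(x - new_x))[1:k]

# the KNN regression prediction is the mean of their targets
mean(y[nearest])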
Data
For this tutorial, we will use the Boston data set from the MASS package, which contains housing data with features of the houses and their prices. We would like to predict the medv column, the median home value.
library(MASS)
data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
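It can also help to take a quick look at the target before modeling, to put the error metrics we compute later in context. For example, a base R summary of medv (output omitted here):

summary(Boston$medv)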
Basic KNN Regression Model in R
To fit a basic KNN regression model in R, we can use the knnreg function from the caret package. We pass two parameters. First, we pass the formula for our model, medv ~ ., which models the median value on all predictors. Second, we pass our data set, Boston.
library(caret)

## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
## Loading required package: ggplot2

model = knnreg(medv ~ ., data = Boston)
model
## 5-nearest neighbor regression model
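With the fitted knnreg object, we can call predict to generate values for new observations. A quick sketch, predicting on the first few rows of Boston purely to illustrate the call:

# predictions for the first three rows (illustration only; these rows are also in the training data)
predict(model, newdata = Boston[1:3, ])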
Modeling KNN Regression in R with Caret
We will now see how to model KNN regression using the Caret package. We will use this library as it provides us with many features for real-life modeling.
To do this, we use the train method. We pass the same parameters as above, but in addition we pass method = 'knn' to tell Caret to use a KNN model.
library(caret)

set.seed(1)

model <- train(
  medv ~ .,
  data = Boston,
  method = 'knn'
)

model
## k-Nearest Neighbors 
## 
## 506 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 506, 506, 506, 506, 506, 506, ... 
## Resampling results across tuning parameters: 
## 
##   k  RMSE      Rsquared   MAE     
##   5  6.774213  0.4788519  4.616781
##   7  6.709875  0.4771239  4.635036
##   9  6.746559  0.4654866  4.690258
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 7.
Here we can see that Caret automatically trained over multiple values of the hyperparameter k. We can easily plot those results to visualize them.
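For example, Caret's plot method on the fitted train object draws RMSE against each candidate k (the figure is not reproduced here):

plot(model)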
Preprocessing with Caret
One feature that we use from Caret is preprocessing. Often in real-life data science we want to run some preprocessing before modeling. This matters for KNN in particular because it is distance-based: features on larger scales would otherwise dominate the distance calculation. We will center and scale our data by passing the following to the train method: preProcess = c("center", "scale").
set.seed(1)

model2 <- train(
  medv ~ .,
  data = Boston,
  method = 'knn',
  preProcess = c("center", "scale")
)

model2
## k-Nearest Neighbors 
## 
## 506 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 506, 506, 506, 506, 506, 506, ... 
## Resampling results across tuning parameters: 
## 
##   k  RMSE      Rsquared   MAE     
##   5  4.827696  0.7297751  3.048151
##   7  4.793191  0.7373525  3.043650
##   9  4.788986  0.7410578  3.070081
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
Splitting the Data Set
Often when we are modeling, we want to split our data into a train set and a test set. This way, we can check for overfitting. We can use the createDataPartition method to do this. In this example, we split on the target medv with an 80/20 split, p = .80.

This function returns the indexes of 80% of the data, which we should use for training. We then use those indexes to pull the training rows out of the data set.
set.seed(1)

inTraining <- createDataPartition(Boston$medv, p = .80, list = FALSE)
training <- Boston[inTraining, ]
testing  <- Boston[-inTraining, ]
We can then fit our model again using only the training data.
set.seed(1)

model3 <- train(
  medv ~ .,
  data = training,
  method = 'knn',
  preProcess = c("center", "scale")
)

model3
## k-Nearest Neighbors 
## 
## 407 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 407, 407, 407, 407, 407, 407, ... 
## Resampling results across tuning parameters: 
## 
##   k  RMSE      Rsquared   MAE     
##   5  4.919010  0.7135124  3.208421
##   7  4.899928  0.7177230  3.234322
##   9  4.895600  0.7205759  3.267220
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
Now, we want to check our model on the test set. We can use the subset method to separate the test features and the test target. We then use the predict method, passing in our model from above and the test features.

Finally, we calculate the RMSE and R2 to compare to the model above.
test.features = subset(testing, select = -c(medv))
test.target = subset(testing, select = medv)[, 1]

predictions = predict(model3, newdata = test.features)

# RMSE
sqrt(mean((test.target - predictions)^2))
# R2
cor(test.target, predictions)^2
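As an alternative to computing these by hand, Caret also provides the postResample helper, which reports RMSE, R squared, and MAE in a single call, using the same objects we defined above:

# RMSE, Rsquared, and MAE together
postResample(pred = predictions, obs = test.target)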
Cross Validation
In practice, we don't normally build our model on a single training set. It is common to use a data-partitioning strategy like k-fold cross-validation that resamples and splits our data many times. We then train the model on these samples and pick the best model. Caret makes this easy with the trainControl method.
We will use 10-fold cross-validation in this tutorial. To do this, we need to pass two parameters: method = "cv" and number = 10 (for 10-fold). We store this result in a variable.
set.seed(1)

ctrl <- trainControl(
  method = "cv",
  number = 10
)
Now, we can retrain our model and pass the trainControl result to the trControl parameter. Notice that our call has added trControl = ctrl.
set.seed(1)

model4 <- train(
  medv ~ .,
  data = training,
  method = 'knn',
  preProcess = c("center", "scale"),
  trControl = ctrl
)

model4
## k-Nearest Neighbors 
## 
## 407 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 367, 366, 367, 366, 365, 367, ... 
## Resampling results across tuning parameters: 
## 
##   k  RMSE      Rsquared   MAE     
##   5  4.616138  0.7518673  3.064657
##   7  4.734625  0.7404093  3.151517
##   9  4.677503  0.7508160  3.156651
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
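If you want the selected value of k programmatically (for example, to log it or reuse it), it is stored on the fitted train object:

# tuning parameter chosen by cross-validation
model4$bestTune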
These results seem to have improved the accuracy on our training data. Let's check the test data to see the results.
test.features = subset(testing, select = -c(medv))
test.target = subset(testing, select = medv)[, 1]

predictions = predict(model4, newdata = test.features)

# RMSE
sqrt(mean((test.target - predictions)^2))
# R2
cor(test.target, predictions)^2
Tuning Hyperparameters

To tune a KNN model, we can give the model different values of k. Caret will retrain the model with each value of k and select the best version.
set.seed(1)

tuneGrid <- expand.grid(
  k = seq(5, 9, by = 1)
)

model5 <- train(
  medv ~ .,
  data = training,
  method = 'knn',
  preProcess = c("center", "scale"),
  trControl = ctrl,
  tuneGrid = tuneGrid
)

model5
## k-Nearest Neighbors 
## 
## 407 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 367, 366, 367, 366, 365, 367, ... 
## Resampling results across tuning parameters: 
## 
##   k  RMSE      Rsquared   MAE     
##   5  4.616138  0.7518673  3.064657
##   6  4.754269  0.7386282  3.162237
##   7  4.734625  0.7404093  3.151517
##   8  4.656271  0.7508317  3.133727
##   9  4.677503  0.7508160  3.156651
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 5.
Finally, we can again plot the model to see how it performs over different tuning parameters.
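As before, a call like the following draws RMSE across the values of k in our grid (plot not shown here):

plot(model5)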
Source: https://koalatea.io/r-knn-regression/