--- title: "Using `computeError`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Using `computeError`} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( fig.width = 7, fig.height = 5, collapse = TRUE, comment = "#>" ) ``` The main function in the `cvwrapr` package is `kfoldcv` which performs K-fold cross-validation (CV). It does so in two parts: (i) computing the out-of-fold predictions, then (ii) using the resulting prediction matrix to compute CV error. The `computeError` function is responsible for the second task and is exposed to the user as well. (For those familiar with the `glmnet` package, `computeError` is similar in spirit to the `glmnet::assess.glmnet` function.) Sometimes you may only have access to the out-of-fold predictions; in these cases you can use `computeError` to compute the CV error for you (a non-trivial task!). Let's set up some simulated data: ```{r} set.seed(1) nobs <- 100; nvars <- 10 x <- matrix(rnorm(nobs * nvars), nrow = nobs) y <- rowSums(x[, 1:2]) + rnorm(nobs) biny <- ifelse(y > 0, 1, 0) ``` The code below performs 5-fold CV with the loss function being the default (deviance): ```{r message=FALSE} library(glmnet) library(cvwrapr) foldid <- sample(rep(seq(5), length = nobs)) cv_fit <- kfoldcv(x, biny, family = "binomial", train_fun = glmnet, predict_fun = predict, train_params = list(family = "binomial"), predict_params = list(type = "response"), foldid = foldid, keep = TRUE) plot(cv_fit) ``` The plot above is for binomial deviance. If we want the misclassification error for the out-of-fold predictions, we can compute it with `computeError`: ```{r} misclass <- computeError(cv_fit$fit.preval, biny, cv_fit$lambda, foldid, type.measure = "class", family = "binomial") misclass$cvm ``` The output returned by `computeError` has class "cvobj", and so can be plotted: ```{r} plot(misclass) ``` To see all possible `type.measure` values for each family, run `availableTypeMeasures()`: ```{r} availableTypeMeasures() ``` ### The special case of `family = "cox"`, `type.measure = "deviance"` and `grouped = TRUE` There is one special case where `computeError` will not be able to compute the CV error from the prediction matrix, and that is when we set the options `family = "cox"`, `type.measure = "deviance"` and `grouped = TRUE`. Let's set up a survival response and perform cross-validation with the error metric being the C-index: ```{r} library(survival) survy <- survival::Surv(exp(y), event = rep(c(0, 1), length.out = nobs)) cv_fit <- kfoldcv(x, survy, family = "cox", type.measure = "C", train_fun = glmnet, predict_fun = predict, train_params = list(family = "cox"), predict_params = list(type = "response"), foldid = foldid, keep = TRUE) plot(cv_fit) ``` Now, let's say we want to compute the deviance arising from these predictions instead. We might call `computeError` as below: ```{r error=TRUE} deviance_cvm <- computeError(cv_fit$fit.preval, survy, cv_fit$lambda, foldid, type.measure = "deviance", family = "cox") ``` That threw an error. What happened? In this special case of `family = "cox"`, `type.measure = "deviance"` and `grouped = TRUE` (`grouped = TRUE` is the default for `computeError`), we actually need more than just the out-of-fold fits to compute the deviance. In this setting, deviance is computed as follows: for each fold, 1. Fit the model on in-fold data. 2. Make predictions for *both* in-fold and out-of-fold data. 3. Compute the deviance for the full dataset, and compute the deviance for the *in-fold* data. 4. The CV deviance associated with this fold is the deviance for the full dataset minus the deviance for the in-fold data. As you can see from the above, we need *both* in-fold and out-of-fold predictions for each of the CV model fits. The way out is to call `kfoldcv` with `type.measure = "deviance"`. Internally, `kfoldcv` calls `buildPredMat` which computes a `cvraw` attribute and attaches to the prediction matrix. `computeError` uses this `cvraw` attribute to compute the deviance. ```{r} cv_fit2 <- kfoldcv(x, survy, family = "cox", type.measure = "deviance", train_fun = glmnet, predict_fun = predict, train_params = list(family = "cox"), predict_params = list(type = "response"), foldid = foldid, keep = TRUE) plot(cv_fit2) ``` This is a edge case that we don't expect to encounter often. This problem is not faced when `family = "cox"`, `type.measure = "deviance"` and `grouped = FALSE`. This is because computing deviance in this case only requires out-of-fold predictions: for each fold, 1. Fit the model on in-fold data. 2. Make predictions for out-of-fold data. 3. The CV deviance associated with this fold is the deviance for the *out-of-fold* data. ```{r} deviance_cvm <- computeError(cv_fit$fit.preval, survy, cv_fit$lambda, foldid, type.measure = "deviance", family = "cox", grouped = FALSE) plot(deviance_cvm) ```