Tree Models for Data Imputation

Frequently in our research we are in need of memory efficient algorithms that can scale to big data, and are also fast to compute. In big data, trees are one of the most popular classes of models. A dataset of $n$ observations and $p$ features can be processed quickly and constructed in $O(pn\log n)$.

Decision trees are robust to outliers and handle missing data very well using surrogate splits. The latter feature is a very useful one. Suppose we train a decision tree to predict variable $Y$ using features $A, B, C$. If we receive new data that only had valid values for $A,B$, we would be in trouble and might not be able to predict $Y$. Tree models are trained to include surrogate splits, so that if a variable $C$ is missing in a new data point, the decision defers to another variable $C'$ that is highly correlated with $C$ and proceeds with the prediction. One additional advantage of using splits is that we can define the prediction piecewise. In our time series data where variance is not constant over time, or when there are interventions, it would be nice to identify these regions and fit different models. Trees give such flexibility.

For a data matrix, we can use trees to impute missing data with the following simple strategy:

```X = data matrix
Repeat:
In parallel:
For each predictor p, train a tree model that predicts p using all other p-1 features
Predict missing values for predictor p, but do not write to X yet
sync threads
Fill X with predicted values
```

It’s important to repeat the procedure for stability, just as in SVD imputation. For the actual implementation, there are two popular choices. A simple one would just use CART and for a time series matrix with $T$ time points, $n$ observations and $p$ features, the runtime would be $O(np^2T\log T)$

A more powerful technique would be Gradient Boosting, which uses an ensemble of weak decision trees. We use the R package gbm to create our imputation function

First, let’s create some data and see how gbm works

?View Code RSPLUS
 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ``` ```set.seed(100) N <- 1000 X1 <- runif(N) X2 <- 2*runif(N) X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1]) X4 <- factor(sample(letters[1:6],N,replace=TRUE)) X5 <- factor(sample(letters[1:3],N,replace=TRUE)) X6 <- 3*runif(N) mu <- c(-1,0,1,2)[as.numeric(X3)]   SNR <- 10 # signal-to-noise ratio Y <- X1**1.5 + 2 * (X2**.5) + mu sigma <- sqrt(var(Y)/SNR) Y <- Y + rnorm(N,0,sigma)   data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6) head(data)   Y X1 X2 X3 X4 X5 X6 1 -0.09117769 0.30776611 0.1480966 d a c 2.31015454 2 -0.32203092 0.25767250 0.2237327 d a b 1.98513625 3 1.64481361 0.55232243 1.2478880 d e b 0.04731843 4 3.68823135 0.05638315 1.3421636 a c c 1.81793991 5 2.07303574 0.46854928 0.7317884 c e a 2.38320102 6 0.90415066 0.48377074 0.3663628 c c a 1.67086893```

Now we use gbm to fit a model to the data and compute the sum of squared errors.

?View Code RSPLUS
 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 ``` ```gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6, # formula data=data, # dataset var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease, # +1: monotone increase, # 0: no monotone restrictions distribution="gaussian", # bernoulli, adaboost, gaussian, # poisson, coxph, and quantile available n.trees=3000, # number of trees shrinkage=0.005, # shrinkage or learning rate, # 0.001 to 0.1 usually work interaction.depth=3, # 1: additive model, 2: two-way interactions, etc. bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best train.fraction = 0.5, # fraction of data for training, # first train.fraction*N used for training n.minobsinnode = 10, # minimum total weight needed in each node cv.folds = 5, # do 5-fold cross-validation keep.data=TRUE, # keep a copy of the dataset with the object verbose=TRUE) # print out progress   # check performance using an out-of-bag estimator # OOB underestimates the optimal number of iterations best.iter <- gbm.perf(gbm1,method="OOB") print(best.iter)   data.predict = predict(gbm1, n.trees = best.iter) SSE = sum((Y - data.predict)^2) print(SSE)   168.2677```

Now let’s see how this handles missing values

?View Code RSPLUS
 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 ``` ```X1[sample(1:N,size=500)] <- NA X4[sample(1:N,size=300)] <- NA   data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)   gbm2 <- gbm(Y~X1+X2+X3+X4+X5+X6, # formula data=data2, # dataset var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease, # +1: monotone increase, # 0: no monotone restrictions distribution="gaussian", # bernoulli, adaboost, gaussian, # poisson, coxph, and quantile available n.trees=3000, # number of trees shrinkage=0.005, # shrinkage or learning rate, # 0.001 to 0.1 usually work interaction.depth=3, # 1: additive model, 2: two-way interactions, etc. bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best train.fraction = 0.5, # fraction of data for training, # first train.fraction*N used for training n.minobsinnode = 10, # minimum total weight needed in each node cv.folds = 5, # do 5-fold cross-validation keep.data=TRUE, # keep a copy of the dataset with the object verbose=TRUE) # print out progress   # check performance using an out-of-bag estimator # OOB underestimates the optimal number of iterations best.iter <- gbm.perf(gbm2,method="OOB") print(best.iter)   data.predict = predict(gbm2, n.trees = best.iter) SSE = sum((Y - data.predict)^2) print(SSE)   225.7908```

Great! We’ll build our imputation function by looping over all columns of the data matrix and fitting a gbm for each column.

?View Code RSPLUS
 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 ``` ```  gbmImpute = function(x, max.iters = 2, cv.fold = 5, verbose=T) {   missing.matrix = is.na(x) numMissing = sum(missing.matrix) if(verbose) { print(paste("imputing on", numMissing, "missing values with matrix size", nrow(x)*ncol(x), sep=" ")) } if(numMissing == 0) { return (x) }   missing.cols.indices = which(apply(missing.matrix, 2, function(i) { any(i) }))   for (i in 1:max.iters) { if (verbose) print (paste("Begin iteration: ", i)) x[,missing.cols.indices] = sapply(missing.cols.indices, function(j) { good.data = which(!missing.matrix[,j]) gbm1 <- gbm(x[good.data,j] ~ ., data = as.data.frame(x[good.data,-j]), var.monotone = rep(0, ncol(x)-1), # -1: monotone decrease, # +1: monotone increase, # 0: no monotone restrictions distribution="gaussian", # bernoulli, adaboost, gaussian, # poisson, coxph, and quantile available n.trees=3000, # number of trees shrinkage=0.005, # shrinkage or learning rate, # 0.001 to 0.1 usually work interaction.depth=3, # 1: additive model, 2: two-way interactions, etc. bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best train.fraction = 0.5, # fraction of data for training, # first train.fraction*N used for training n.minobsinnode = 10, # minimum total weight needed in each node cv.folds = cv.fold, # do 5-fold cross-validation keep.data=TRUE, # keep a copy of the dataset with the object verbose=F) # print out progress best.iter <- gbm.perf(gbm1,method="OOB", plot.it = F) data.predict = predict(gbm1, newdata = as.data.frame(x[-good.data,-j]), n.trees = best.iter) x[-good.data,j] = data.predict x[,j] }) }   return ( list ( x=x, missing.matrix=missing.matrix )) }```